Losing the Keys to the Kingdom

Whatever is rightly done, however humble, is noble.
— Henry Royce

The Compromise

On July 11th, 2023, Microsoft announced the compromise of a cryptographic key that was used to protect hundreds of millions of customer accounts. This key should have been safely hidden away in the depths of a well-designed and meticulously monitored key management system, yet somehow it found its way into the hands of a hacker and was then used to gain access to customer accounts and emails.

This should never have happened.

As an industry we know how to design, build and operate secure key management systems. We know how to design minimal APIs to implement cryptographic operations without exposing the underlying keys. We know how to build in multiple layers of defense to ensure that no single software vulnerability, process flaw or simple human mistake can lead to a key being compromised. We have access to a wide range of specialist hardware and software that can help to ensure that keys are generated, used, and destroyed within a secure environment. We know how to design systems that keep sensitive key management components safely isolated from less trusted infrastructure. And we know how to monitor and audit key management systems to detect any potential signs of misuse or attack. 
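As a rough illustration of what such a minimal API looks like, the sketch below exposes signing operations on opaque key handles and never returns key material. It uses a symmetric HMAC purely to stay dependency-free; real token-signing systems use asymmetric keys held inside an HSM, and all of the names here are invented for illustration rather than taken from any specific product.

    import hashlib
    import hmac
    import secrets

    class SigningService:
        """A generic sketch of a minimal key management API: callers reference keys
        by ID and receive signatures, but no operation ever returns key material."""

        def __init__(self) -> None:
            self._keys: dict[str, bytes] = {}  # private material never leaves this class

        def generate_key(self) -> str:
            """Create a key inside the service and hand back only an opaque ID."""
            key_id = secrets.token_hex(8)
            self._keys[key_id] = secrets.token_bytes(32)
            return key_id

        def sign(self, key_id: str, message: bytes) -> bytes:
            """Sign a message with a held key; the key itself is never exposed."""
            return hmac.new(self._keys[key_id], message, hashlib.sha256).digest()

        def verify(self, key_id: str, message: bytes, signature: bytes) -> bool:
            return hmac.compare_digest(self.sign(key_id, message), signature)

        # Deliberately absent: any export_key()-style operation.

    svc = SigningService()
    kid = svc.generate_key()
    sig = svc.sign(kid, b"token payload")
    assert svc.verify(kid, b"token payload", sig)

The design point is the shape of the interface, not the cryptography: if no call can ever hand back the raw key, a whole class of leaks becomes impossible by construction.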

So what went wrong? How did Microsoft manage to lose something that should have been so well protected? Let's take a look through the details that Microsoft released and see if there are any indications as to what really went wrong here.

Digging for Clues

Most of the details were released in the document: Analysis of Storm-0558 techniques for unauthorized email access. This covers a mix of comments from Microsoft on the key compromise and the specific tools and techniques used by the attacker. We will focus on some of the points that relate to Microsoft’s internal key management practices.

Looking through this document it appears that Microsoft were operating two internal key management systems: one for "Consumer Keys" (also known as Microsoft Account or MSA keys) that are used to sign tokens for personal Microsoft accounts, and one for "Enterprise keys" that are used to sign tokens for Azure AD accounts. It was one of the Consumer / MSA keys that was compromised.

"We have substantially hardened key issuance systems since the acquired MSA key was initially issued. This includes increased isolation of the systems, refined monitoring of system activity, and moving to the hardened key store used for our enterprise systems"

This is an interesting comment. If the system has been "substantially hardened" then it follows that the system was, at one point, substantially less hardened (less secure) than it is now. That is, of course, a good thing: systems do need to grow and evolve to counter new and evolving threats. The comment on isolation and monitoring is a little concerning, though. Given the security-critical nature of the system, it really should have been very strongly isolated and closely monitored from day one; if there was room for "substantial" improvement in this area then it casts some doubt over the level of operational due diligence Microsoft really applies to these systems.

There is also a hint of a bigger problem here: Microsoft are stating that the compromised key was generated using a system that was substantially less secure than their current system. If Microsoft already knew that was the case, then why hadn't the potentially less secure keys already been revoked and replaced? Security is a weakest-link problem: if you make substantial improvements to a key issuance system but then continue to use and trust older, weaker keys, you don't really gain anything. This is a worrying symptom of a lack of operational due diligence.

"In-depth analysis of the Exchange Online activity discovered that in fact the actor was forging Azure AD tokens using an acquired Microsoft account (MSA) consumer signing key. This was made possible by a validation error in Microsoft code."

Although it seems that Microsoft were operating two separate key management systems, the front-end identity systems that use these keys were not separate. More specifically, the compromised key could be used by applications that need to validate both personal accounts and Azure AD accounts. Any validating application would then need to carefully check which specific key was used and decide whether it should trust the token or not. It turned out that Outlook was not doing this validation correctly, so it permitted a token signed with the MSA key to represent an Azure AD identity.

This was a validation error in the code, but it is arguable that Microsoft created this problem in the first place by trying to use keys with different levels of security in the same identity system. They essentially made an architectural decision that meant extra validation code was needed in order for everything to be secure, and this decision weakened the whole system significantly. A single code validation error was then all it took to enable a compromised MSA key to provide access to Azure AD accounts.
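To make the missing check concrete, here is a minimal sketch of the kind of key-scoping rule a validator has to enforce when consumer and enterprise keys can reach the same identity system. The key IDs, scope names and Token structure are invented for illustration; this is not Microsoft's actual validation code.

    from dataclasses import dataclass

    # Hypothetical key scopes: which class of identity each signing key may vouch for.
    CONSUMER = "consumer"      # personal Microsoft accounts (MSA)
    ENTERPRISE = "enterprise"  # Azure AD accounts

    # Illustrative registry mapping key IDs to the scope they are allowed to sign.
    # In a real deployment this would come from the published key metadata.
    KEY_SCOPES = {
        "msa-key-2016": CONSUMER,
        "aad-key-2023": ENTERPRISE,
    }

    @dataclass
    class Token:
        kid: str           # ID of the key that signed the token (signature assumed already checked)
        account_type: str  # the identity class the token claims to represent

    def is_token_trusted(token: Token) -> bool:
        """Accept a token only if its signing key is scoped to the claimed account type.

        A valid signature alone is not enough: a correctly signed consumer token
        must never be accepted as proof of an enterprise identity.
        """
        allowed_scope = KEY_SCOPES.get(token.kid)
        if allowed_scope is None:
            return False  # unknown or revoked key
        return allowed_scope == token.account_type

    # A forged enterprise token signed with a consumer (MSA) key is rejected.
    assert not is_token_trusted(Token(kid="msa-key-2016", account_type=ENTERPRISE))
    assert is_token_trusted(Token(kid="aad-key-2023", account_type=ENTERPRISE))

The point of the sketch is simply that the trust decision has to live in every validator; forget this one check, as Outlook apparently did, and the two key populations collapse into one.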

"Storm-0558 acquired an inactive MSA consumer signing key and used it to forge authentication tokens for Azure AD enterprise and MSA consumer to access OWA and Outlook.com."

This one is something of a smoking gun. Cryptographic keys have a well-defined and well-understood lifecycle. To borrow a quote from Thales: "Keys have a life cycle; they're created, live useful lives, and are retired". There should be no such thing as an "inactive" key. Inactive simply means: "a key that we forgot to revoke".
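To illustrate the point, in a well-run key store every trust decision is tied to the key's lifecycle state, so a retired key simply cannot be used. The states and fields below are assumptions made for this sketch, not a description of Microsoft's key store.

    from datetime import datetime, timezone
    from enum import Enum

    class KeyState(Enum):
        ACTIVE = "active"    # within its useful life, may be used for signing and verification
        RETIRED = "retired"  # past its lifetime, must no longer be trusted
        REVOKED = "revoked"  # explicitly withdrawn, must no longer be trusted

    def may_trust_signature(state: KeyState, not_after: datetime) -> bool:
        """A signature is only trusted while its key is active and unexpired.

        There is deliberately no 'inactive but still trusted' state: once a key
        leaves ACTIVE, or its validity window ends, verification must fail and
        the key should be scheduled for revocation and replacement.
        """
        now = datetime.now(timezone.utc)
        return state is KeyState.ACTIVE and now <= not_after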

"The method by which the actor acquired the key is a matter of ongoing investigation."

In other words: Microsoft do not yet know how, or when, the key was compromised. This is not really surprising. One of the main design goals of a secure key management system is to ensure that you have full traceability of the keys and know exactly what hardware and software components could have accessed them. Without such a system it is next to impossible to retrace all of the direct and indirect touch points at which a key could potentially have been compromised, especially if you need to look many years into the past.

All we can say here is that, at some point in time, a breach of some kind occurred and the key found its way into the wrong hands. It could have been an internal network breach, an employee who gained access to the key, or a copy recovered from an old hard drive that got recycled. These are all threats that would have been mitigated by a well-operated secure key management system.

Assembling the Jigsaw

If we put all of these together we can start to assemble a potential chain of events that could have collectively enabled this compromise:

  • The initial system used to generate the MSA key may not have had a good level of isolation and monitoring.

  • System improvements were made but the risk from older keys being compromised was not addressed as existing keys were not revoked.

  • Even when the older, potentially less-secure key became inactive, it still wasn't revoked and replaced.

  • At some point in time the less-secure key was compromised.

There were two further events that served to widen the scope of the attack to Azure AD enterprise accounts:

  • The design decision to accept MSA and enterprise keys within a single identity system extended the risk from the potentially insecure keys to enterprise accounts.

  • A code-level flaw turned the design risk into an exploitable vulnerability, allowing the weak key to be used to access enterprise accounts.

Although this chain of events was derived from a combination of reading between the lines and a certain amount of technical speculation, it does paint a plausible picture of how Microsoft ended up putting so much trust in a compromised key.

What Went Wrong?

Operating and maintaining any production system generally involves making a series of technical decisions that balance a complex set of performance, scalability, reliability, usability and security constraints. In this case it appears as though Microsoft took a series of steps, each of which may have seemed at the time to be a reasonable compromise between conflicting factors, but the compound impact of all these steps resulted in the whole system being much less secure than it should have been.

This essentially boils down to a risk management failure: somebody really should have noticed that the system had evolved into a state where it was putting a very high level of trust in a key that was at real risk of being compromised. If this had been spotted, the key could have been revoked and replaced a long time ago.

The issues above also suggest that the actual keys were not being treated with an appropriate level of care. This is an easy trap to fall into. If you apply a standard threat-model-based approach to assessing the risk of any new system feature or update, then cryptographic keys will essentially appear as "just another asset" that needs to be protected by an appropriate set of technical or procedural mitigations. The problem is that these activities are always forward looking: they evaluate the likelihood of future threats and the ways those threats can be mitigated.

When it comes to cryptographic keys, the past is just as important as the future. If you can't establish high-assurance provenance for a cryptographic key then it shouldn't be trusted. Adding a whole pile of additional mitigations to the current system doesn't fix this. If the key was compromised years ago, as may have happened here, then you still end up with an insecure system.

What Next?

There are now billions of people whose accounts and data are secured using a handful of cryptographic keys. These keys are stored and managed in systems that are designed and operated behind closed doors so we essentially need to trust our identity and cloud service providers to get this right. If the issues noted here reflect the current state of the industry then this trust may well be misplaced.

Following this compromise, the Department of Homeland Security’s Cyber Safety Review Board announced that they will be conducting a review into issues relating to cloud-based identity and authentication. There are still a lot of unanswered questions relating to the root cause, timing and scope of this compromise so it will be interesting to see the results of this review.
