Friday, November 13, 2020

Oops, I missed it again!

Written by Brandon Azad, when working at Project Zero

This is a quick anecdotal post describing one of the more frustrating aspects of vulnerability research: realizing that you missed a bug that was staring you in the face only once you see the patched version!

Some suspicious code

After writing the oob_timestamp exploit, I spent some time trying to find another vulnerability to exploit. Typically, it's a lot easier to develop an exploit when you already have a research platform (read: another exploit) available to help with your analysis, for example by dumping kernel memory to ensure that your heap spray is placing objects at their intended locations. Developing an exploit blind, as I had done with voucher_swap, is much trickier. (For oob_timestamp, I relied on checkra1n to bootstrap the exploit on A11, and later expanded it to A13.) So, I thought it might be nice to chain my next exploit off of oob_timestamp to avoid having to re-bootstrap later.

As I had already spent a fair amount of time reversing the iOS 13.3 (17C54) kernelcache for oob_timestamp, I decided to continue that effort on a new user client. I wrote a small program to enumerate IOUserClient classes reachable from the app sandbox (inadvertently discovering another bug in the process) and looked for classes that I had not researched previously.

A quick primer for those less familiar with Apple kernels: Apple's kernel is called XNU, and IOKit is XNU's C++ framework for implementing drivers. An app in userspace can call IOServiceGetMatchingServices() to get handles to the drivers, but the app can't actually do much with the raw driver handle. Instead, the app needs to direct the driver to create a "user client" by calling IOServiceOpen(), passing the type of user client it wants. Since the user client is what provides most of the functionality to userspace, this is the step that is subject to a sandbox check, ensuring that the app is allowed to open the requested type of user client. Once the app has a handle to a user client for the driver, the app can interact with the user client by calling functions like IOConnectCallMethod() on the user client handle, specifying the "selector" (index) of the method the app wants to invoke. In the kernel, IOConnectCallMethod() will use the selector to index a table of methods provided by the user client, invoking the one requested.

As I was scanning for user clients I could open, one reachable class stood out: H11ANEInDirectPathClient, a user client of the H11ANEIn driver. I hadn't seen this class before, but some quick Googling showed that it wasn't open source, which suggested to me that the code had probably undergone substantially less security review, and hence probably had more low-hanging bugs in it, than the open-source parts of the kernel.

I discovered several interesting things in the process of reversing. First, H11ANEIn appeared to actually have 2 user clients: H11ANEInDirectPathClient (the one I had opened) and H11ANEInUserClient (which I could not open in the sandbox). Reading the strings in the method H11ANEIn::newUserClient(), it appeared that H11ANEInDirectPathClient is the less privileged version of H11ANEInUserClient, so it made sense that I could open the former but not the latter.

if ( type == 1 ) // H11ANEInDirectPathClient
{
    ...(
        "%s : ... : Creating direct evaluate client\n",
        "virtual IOReturn H11ANEIn::newUserClient(...)");
}
else // H11ANEInUserClient
{
    ...(
        "%s : ... : Creating default full-entitlement client\n",
        "virtual IOReturn H11ANEIn::newUserClient(...)");
}

The traditional starting point when looking for bugs in IOKit user clients is to look at the external methods that are provided. These are usually identifiable as tables of function pointers near the user client's vtable in the kernelcache image. Here are the external method tables I identified for the two user clients, curiously laid out back-to-back in the kernelcache rather than each near their respective vtable:

Also, I noticed something interesting when I looked at the cross-references to these two tables: it seemed like since the classes were basically identical except for one being a less-privileged version of the other, Apple had made the rather unusual decision to share the parts of the external method tables corresponding to shared functionality between the two user client types!

This was evident from how the ::externalMethod() methods of each user client accessed the overlapping parts of the external method tables. The H11ANEInDirectPathClient version:

int H11ANEInDirectPathClient::externalMethod(H11ANEInDirectPathClient *this, u32 selector, IOExternalMethodArguments *args, IOExternalMethodDispatch *method, void *target)
{
    if ( !target )
        target = this;
    if ( selector <= 33 )
        method = &H11ANEInDirectPathClient_ExternalMethods_34[selector];
    return IOUserClient::externalMethod(this, selector, args, method, target);
}

And the H11ANEInUserClient version:

int H11ANEInUserClient::externalMethod(H11ANEInUserClient *this, u32 selector, IOExternalMethodArguments *args, IOExternalMethodDispatch *method, void *target)
{
    if ( !target )
        target = this;
    if ( selector <= 33 )
        method = &H11ANEInUserClient_ExternalMethods_34[selector];
    return IOUserClient::externalMethod(this, selector, args, method, target);
}

Since each can access 34 methods and the first 3 in the array are reserved for H11ANEInDirectPathClient, this meant that the last 3 would be reserved for H11ANEInUserClient, which seemed to check out since there were 37 methods total. Neat.

So, I started digging into the methods accessible by H11ANEInDirectPathClient, and very quickly adopted the opinion that the code quality in this driver was not very high. For example, I found that the 3500-line method H11ANEIn::ANE_ProgramSendRequest_gated(), reachable through selectors 2 and 33, exhibited some pretty trivial out-of-bounds reads right at the top of the function:

Here, the content of args is fully controlled, so the args->totInputBuffers count can be arbitrarily high, past the ends of the inputBufferSymbolIndex and inputBufferSurfaceId arrays.

Since the code quality seemed to be low, and since I was not particularly keen on untangling multi-thousand-line functions, I also tried to perform some very trivial fuzzing. My fuzzing experience was quite limited, but I had long ago written a dumb fuzzer that just blindly calls IOConnectCallMethod() from userspace passing randomly generated values; surprisingly, this had been sufficient before to find real kernel vulnerabilities. So, I decided to revive that old fuzzer and point it at H11ANEInDirectPathClient.

Within one second of launching the fuzzer app, the device panicked.

I was of course quite excited at this development, but it turned out that the bug was a pretty trivial NULL pointer dereference; not exploitable on iOS. And further fuzzing didn't seem to trigger anything else interesting. So, with other more interesting projects mounting, I sent a quick non-security report to Apple alerting that this area of the code could be problematic and then turned away from H11ANEInDirectPathClient.

Once more, with symbols

Fast forward to the end of August.

As had happened before with the iOS 12 beta, Apple had accidentally included a symbolicated kernelcache in some of the iOS 14 beta releases. I hadn't had a chance to dig into them yet, but I figured that the addition of symbols (and in particular the limited type information that could be inferred from mangled C++ method names) would make reversing the web of multi-thousand-line H11ANEIn functions faster and thus more worthwhile. So, I opened IDA and jumped once again to the external method tables to see if there were any obvious changes.

But almost immediately, something about the external method tables caught my eye:

Oddly, the external method tables for both H11ANEInDirectPathClient and H11ANEInUserClient had defined symbols. This was weird: I had expected the code would consist of a single array of IOExternalMethodDispatch structs, so that H11ANEInDirectPathClient could claim the 34 methods starting at index 0 while H11ANEInUserClient could claim the 34 methods starting at index 3. In such an arrangement, there should only be one symbol, that for the array as a whole.

Then it dawned on me: my notion of overlapping external method arrays was nonsense, and the "sharing" of external methods was a simple out-of-bounds access by H11ANEInDirectPathClient! The less privileged client was supposed to only have 3 methods, but it just so happened that there was a typo in the bounds check, allowing H11ANEInDirectPathClient to access and call external methods from the more privileged client. And in so doing, each call by H11ANEInDirectPathClient to an H11ANEInUserClient method was implicitly triggering a type confusion on the this pointer!

In hindsight, I realized that the "sharing external method arrays" arrangement made no sense: any such use would have to be careful to avoid type confusion between the two classes of user clients, and no such precaution was taking place. This conviction was confirmed when I decompiled H11ANEInDirectPathClient::externalMethod() in the new kernelcache and saw that the bounds check on the selector had decreased from 33 to 2, meaning the bug was now patched.
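To make the arithmetic concrete, here is a toy Go model of the situation, not the driver's actual code. It assumes the two tables sit back-to-back in one flat slice (3 direct-path entries followed by 34 user-client entries, matching the 37 total counted earlier). The names buildMemory and lookupDirectPath are invented for this sketch.

```go
package main

import "fmt"

// Hypothetical model: one flat "memory" slice holds both method tables
// back-to-back, the way they happened to be laid out in the kernelcache.
const (
	directPathCount = 3  // methods intended for the less privileged client
	userClientCount = 34 // methods intended for the privileged client
)

// buildMemory returns a slice where indices 0..2 are the direct-path table
// and indices 3..36 are the user-client table.
func buildMemory() []string {
	mem := make([]string, 0, directPathCount+userClientCount)
	for i := 0; i < directPathCount; i++ {
		mem = append(mem, fmt.Sprintf("DirectPath[%d]", i))
	}
	for i := 0; i < userClientCount; i++ {
		mem = append(mem, fmt.Sprintf("UserClient[%d]", i))
	}
	return mem
}

// lookupDirectPath mimics the direct-path client's selector check.
// maxSelector is 33 in the buggy kernelcache and 2 in the patched one.
func lookupDirectPath(mem []string, selector, maxSelector uint32) (string, bool) {
	if selector <= maxSelector {
		return mem[selector], true // the direct-path table starts at index 0
	}
	return "", false
}

func main() {
	mem := buildMemory()
	for _, sel := range []uint32{2, 3, 33} {
		buggy, okBuggy := lookupDirectPath(mem, sel, 33)
		patched, okPatched := lookupDirectPath(mem, sel, 2)
		fmt.Printf("selector %2d: buggy=%q (%v) patched=%q (%v)\n",
			sel, buggy, okBuggy, patched, okPatched)
	}
}
```

With the bound at 33, selectors 3 through 33 resolve to the first 31 entries of the other client's table; with the bound at 2 they are rejected.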

So, I had missed an issue staring me in the face the whole time, whose existence I had justified by inventing a concept of overlapping method tables. And of course, to add insult to injury, the NULL pointer dereferences I had reported as a non-security issue were only reached by calling two of the out-of-bounds methods.

Another recipe for copypasta

How might this bug have come to exist in the first place? Since the buggy version included the same bounds check for both ::externalMethod() implementations, I suspect this was another case of a copy-paste bug. Here's my guess for what H11ANEInUserClient::externalMethod() actually looks like in Apple's source:

IOReturn H11ANEInUserClient::externalMethod(
    u32 selector, IOExternalMethodArguments *args,
    IOExternalMethodDispatch *method, void *target)
{
    if ( !target )
        target = this;
    if ( selector < H11ANEInUserClient::sMethodCount )
        method = &H11ANEInUserClient::sMethods[selector];
    return super::externalMethod(this, selector, args, method, target);
}

My guess is that this code was copy-pasted to create the H11ANEInDirectPathClient version, but the author accidentally forgot to change the type name in the selector check:

IOReturn H11ANEInDirectPathClient::externalMethod(
    u32 selector, IOExternalMethodArguments *args,
    IOExternalMethodDispatch *method, void *target)
{
    if ( !target )
        target = this;
    if ( selector < H11ANEInUserClient::sMethodCount )
        method = &H11ANEInDirectPathClient::sMethods[selector];
    return super::externalMethod(this, selector, args, method, target);
}

Aside from that, it's mostly a convenient accident that the compiler laid the external method tables back-to-back, making this bug plausibly exploitable (as opposed to past cases of out-of-bounds external methods that I'm aware of). That said, I have not examined the actual exploitability of this issue.


So, what are the takeaways from this story?

First, it's really easy to miss bugs, even ones that you feel should have been obvious. I kicked myself for missing this, given the mental gymnastics I went through to justify why a code pattern like this could exist in the first place. If there's one lesson I've had to teach myself again and again, it's to be inherently suspicious of code and to never assume that it's doing what it does on purpose.

Second, copy-paste is a really quick way to create code, but it's also a quick way to create subtle bugs that, by their nature, are tricky to spot by glancing at the source code. It's easy to tell that 2 arrays are "overlapping" by looking in a disassembler, but it's harder to see that the wrong one of two very similar class names was used in copy-pasted code. While it doesn't solve the problem 100%, it can help to decompose copy-pasted code patterns into reusable helper functions.
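As a small illustration of that last point, here is a hypothetical Go helper. The names Method and lookup are invented for this sketch, and this is not how IOKit is structured. Because the bound is taken from the table being indexed, there is no second class name that a copy-paste could leave stale.

```go
package main

import "fmt"

// Method stands in for an external method table entry (hypothetical type).
type Method struct{ Name string }

// lookup derives the bound from the table it indexes, so a caller cannot
// pair one class's method count with another class's table.
func lookup(table []Method, selector uint32) (*Method, bool) {
	if uint64(selector) < uint64(len(table)) {
		return &table[selector], true
	}
	return nil, false
}

// Table sizes follow the counts discussed in this post.
var (
	directPathMethods = make([]Method, 3)
	userClientMethods = make([]Method, 34)
)

func main() {
	_, ok := lookup(directPathMethods, 3)
	fmt.Println("direct path, selector 3 allowed:", ok)
	_, ok = lookup(userClientMethods, 33)
	fmt.Println("user client, selector 33 allowed:", ok)
}
```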

Finally, even though I only realized that there was a bug when I looked at the symbolicated kernelcache, I don't want Apple to get the impression that releasing symbols is a security risk. Security researchers rejoice when Apple accidentally releases symbolicated kernelcaches or development libraries, but this is just because it saves time reversing, not because it makes things newly reversible. Any capable attacker will find bugs regardless of the presence or absence of symbols; all the lack of symbols does is keep the bug away from eyes (like mine) that might report it. Hence, withholding symbols is an incredibly weak protection, only deterring the lowest tiers of attackers and serving to make the bugs that have been found last longer.

Tuesday, October 6, 2020

Enter the Vault: Authentication Issues in HashiCorp Vault

 Posted by Felix Wilhelm, Project Zero


In this blog post I'll discuss two vulnerabilities in HashiCorp Vault and its integration with Amazon Web Services (AWS) and Google Cloud Platform (GCP). These issues can lead to an authentication bypass in configurations that use the aws and gcp auth methods, and demonstrate the type of issues you can find in modern “cloud-native” software. Both vulnerabilities (CVE-2020-16250/16251) were addressed by HashiCorp and are fixed in Vault versions 1.2.5, 1.3.8, 1.4.4 and 1.5.1 released in August.

Vault is a widely used tool for securely storing, generating and accessing secrets such as API keys, passwords or certificates. It can be used as a shared password manager for human users, but its feature set is optimized for API based access by other services. An example use case for Vault is to provide one of your services, such as your webserver, short lived credentials to your database or a third-party resource like an AWS S3 bucket.

Using a central secret storage like Vault offers security benefits such as centralized auditing, enforced credentials rotation or encrypted data storage. However, a central storage is also a very interesting target for an attacker. Exploiting a vulnerability in Vault could give an attacker full access to a wide range of important secrets and large parts of the target's infrastructure.

Before diving into the technical details of the vulnerabilities, the next section gives an overview of Vault’s authentication architecture and the way it integrates with cloud providers. Readers familiar with Vault can feel free to skip this section.

Authenticating to Vault

Interfacing with Vault requires authentication and Vault supports role-based access control to govern access to stored secrets. For authentication, it supports pluggable auth methods ranging from static credentials, LDAP or RADIUS, to full integration into third-party OpenID Connect (OIDC) providers or Cloud Identity and Access Management (IAM) platforms. For infrastructure that runs on a supported cloud provider, using the provider's IAM platform for authentication is a logical choice.

Take AWS as an example: Almost every workload you can run in AWS executes in the context of a specific AWS IAM user. By enabling and configuring the aws auth method, you can create a mapping from certain IAM users or roles to Vault roles.

Imagine that you have an AWS Lambda function and want to give it access to a database password stored in Vault. Instead of storing hard coded credentials in the function code, a Vault administrator can assign a vault role to the Lambda function execution role using the vault CLI:

vault write auth/aws/role/dbclient auth_type=iam \
    bound_iam_principal_arn=arn:aws:iam::123456789012:role/lambda-role policies=prod,dev max_ttl=10m

This will create a mapping between a vault role named dbclient and the AWS IAM role lambda-role. A vault policy can now be used to grant the dbclient role access to the database secret.

When the lambda function executes, it authenticates to Vault by sending a request to the /v1/auth/aws/login API endpoint. I’ll go into the exact layout of this request later in the post, but for now just assume that the request allows Vault to verify the AWS IAM role of the caller. If authentication succeeds, Vault returns a short-lived API token for the dbclient role back to the lambda function. This token can now be used to fetch the database secret from Vault. Depending on the database backend, this secret could be a static user-password combination, a short lived client certificate or even a dynamically created credential pair.

Using Vault in this way has some nice security benefits: The lambda function itself does not need to contain bootstrap credentials and every access to the database credentials is auditable. Rotating old or compromised database credentials is straightforward and can be centrally enforced.

However, this operational simplicity is only possible because of hidden complexity in the AWS iam auth method. How does the /v1/auth/aws/login API endpoint actually work, and is there a way an unauthenticated attacker can impersonate a random AWS IAM role? Let’s take a look.


Vault’s aws auth method supports two different authentication mechanisms internally: iam and ec2. We are interested in the iam mechanism, which is the recommended variant and also used in our previous Lambda example. iam auth is built on top of an AWS API method called GetCallerIdentity, part of the AWS Security Token Service (STS).

As its name implies, GetCallerIdentity returns details about the IAM role or user whose credentials were used to call the API. To understand how Vault uses this method to authenticate clients we need to understand how AWS APIs perform authentication:

Instead of attaching some form of authentication token or credential to API requests, AWS requires clients to calculate an HMAC signature for the (canonicalized) request using the caller's secret access key and attach this signature to the request. This mechanism makes it possible to pre-sign a request and forward it to another party to allow a limited form of impersonation. A popular example use case is to give clients the ability to upload a file to S3 without giving them access to credentials with write permissions.
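For readers who have not seen it, the key-derivation part of this scheme (AWS Signature Version 4) can be sketched in a few lines of Go. This is a simplified illustration: it derives the signing key and signs an arbitrary string, but skips building the canonical request and the string to sign. The secret, date, region and placeholder string are made-up values.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hmacSHA256 returns HMAC-SHA256(key, msg).
func hmacSHA256(key []byte, msg string) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(msg))
	return mac.Sum(nil)
}

// signingKey derives a Signature Version 4 signing key by chaining HMACs
// over the date, region, service and the fixed string "aws4_request".
func signingKey(secret, date, region, service string) []byte {
	kDate := hmacSHA256([]byte("AWS4"+secret), date)
	kRegion := hmacSHA256(kDate, region)
	kService := hmacSHA256(kRegion, service)
	return hmacSHA256(kService, "aws4_request")
}

// sign returns the hex-encoded signature over an already-built string to sign.
func sign(key []byte, stringToSign string) string {
	return hex.EncodeToString(hmacSHA256(key, stringToSign))
}

func main() {
	// All inputs below are placeholders, not real credentials.
	key := signingKey("EXAMPLE-SECRET", "20200801", "us-east-1", "sts")
	fmt.Println(sign(key, "placeholder string to sign"))
}
```

Since the signature covers the request rather than the connection, whoever holds the signed request can submit it, which is the property the pre-signing trick relies on.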

The Vault aws authentication mechanism is a simple variant of this technique. 

The client pre-signs an HTTP request to the STS GetCallerIdentity method and sends a serialized version of it to the Vault server. The Vault server sends the pre-signed request to the STS host and extracts the AWS IAM information out of the result. The server-side part of this flow is implemented in pathLoginUpdateIam in builtin/credential/aws/path_login.go:

func (b *backend) pathLoginUpdateIam(ctx context.Context, req *logical.Request, data *framework.FieldData) (*logical.Response, error) {
    method := data.Get("iam_http_request_method").(string)

    // In the future, might consider supporting GET
    if method != "POST" {
        return logical.ErrorResponse(...), nil
    }

    rawUrlB64 := data.Get("iam_request_url").(string)

    rawUrl, err := base64.StdEncoding.DecodeString(rawUrlB64)

    parsedUrl, err := url.Parse(string(rawUrl))
    if err != nil {
        return logical.ErrorResponse(...), nil
    }

    bodyB64 := data.Get("iam_request_body").(string)

    bodyRaw, err := base64.StdEncoding.DecodeString(bodyB64)

    body := string(bodyRaw)
    headers := data.Get("iam_request_headers").(http.Header)

    endpoint := ""

    callerID, err := submitCallerIdentityRequest(ctx, maxRetries, method, endpoint, parsedUrl, body, headers)

The function extracts the HTTP method, URL, body and headers out of the supplied request body, which is stored in data. It then calls submitCallerIdentityRequest to forward the request to the STS server and to fetch and parse the result in parseGetCallerIdentityResponse:

func submitCallerIdentityRequest(ctx context.Context, maxRetries int, method, endpoint string, parsedUrl *url.URL, body string, headers http.Header) (*GetCallerIdentityResult, error) {

    request := buildHttpRequest(method, endpoint, parsedUrl, body, headers)
    retryableReq, err := retryablehttp.FromRequest(request)

    response, err := retryingClient.Do(retryableReq)
    responseBody, err := ioutil.ReadAll(response.Body)

    if response.StatusCode != 200 {
        return nil, fmt.Errorf(..)
    }

    callerIdentityResponse, err := parseGetCallerIdentityResponse(string(responseBody))
    if err != nil {
        return nil, fmt.Errorf("error parsing STS response")
    }

    return &callerIdentityResponse.GetCallerIdentityResult[0], nil
}

func buildHttpRequest(method, endpoint string, parsedUrl *url.URL, body string, headers http.Header) *http.Request {

    targetUrl := fmt.Sprintf("%s/%s", endpoint, parsedUrl.RequestURI())
    request, err := http.NewRequest(method, targetUrl, strings.NewReader(body))

    request.Host = parsedUrl.Host
    for k, vals := range headers {
        for _, val := range vals {
            request.Header.Add(k, val)
        }
    }

    return request
}

buildHttpRequest creates a http.Request object based on the user supplied parameters, but uses the hardcoded constant to build the target URL. 

Without this restriction, we could simply trigger a request to a server under our control and return a fake caller identity.

However, the complete lack of validation for URL path, query, POST body and HTTP headers still looks like a promising attack surface. The next section describes how we can turn this gap into a full authentication bypass.

STS (Caller) Identity Theft 

Our goal is to trick Vault’s submitCallerIdentityRequest function into returning an attacker controlled caller identity. One way to achieve this is to manipulate the Vault server into sending a request to a host we control, bypassing the hardcoded endpoint host. Looking at the buildHttpRequest method, two approaches come to mind:

  • The code for calculating targetUrl (targetUrl := fmt.Sprintf("%s/%s", endpoint, parsedUrl.RequestURI())) doesn't look very robust against URL parsing issues. However, tricks like embedding a fake userinfo component and similar ideas do not work against the robust Go URL parser.

  • Even though Vault will always create an HTTPS request pointing at the hardcoded endpoint, the attacker has full control over the Host HTTP header (request.Host = parsedUrl.Host). This could be a problem if a load balancer in front of the STS API makes routing decisions based on the Host header, but blind testing against the STS host did not lead to any success.
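The split between what is fixed and what is attacker-controlled can be checked with a short Go sketch that mirrors the shape of buildHttpRequest without sending anything. The endpoint sts.example.invalid and the host attacker.example are placeholders chosen for this example, and build is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// Hypothetical stand-in for the hardcoded STS endpoint.
const endpoint = "https://sts.example.invalid"

// build mirrors the shape of buildHttpRequest: the endpoint is fixed, while
// path, query, body, headers and the Host header come from the caller.
func build(rawURL, body string, headers http.Header) (*http.Request, error) {
	parsed, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	target := fmt.Sprintf("%s/%s", endpoint, parsed.RequestURI())
	req, err := http.NewRequest("POST", target, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Host = parsed.Host
	for k, vals := range headers {
		for _, v := range vals {
			req.Header.Add(k, v)
		}
	}
	return req, nil
}

func main() {
	h := http.Header{"Accept": {"application/json"}}
	req, err := build("https://attacker.example/?Action=AssumeRoleWithWebIdentity", "", h)
	if err != nil {
		panic(err)
	}
	fmt.Println("connects to:", req.URL.Host)            // fixed by the endpoint
	fmt.Println("Host header:", req.Host)                // chosen by the caller
	fmt.Println("action:", req.URL.Query().Get("Action")) // chosen by the caller
	fmt.Println("accept:", req.Header.Get("Accept"))      // chosen by the caller
}
```

The connection target stays pinned to the endpoint, but the Host header, the query string and every request header are taken from the caller's input.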

After ruling out the easy way forward, we still have another approach available: Vault does not restrict our URL query parameters. This means we are not limited to pre-signing requests to GetCallerIdentity and can create requests to any action of the STS API. STS supports 8 different actions, but none gives us the ability to completely control the response. At this point I was slowly getting frustrated and decided to take a look at Vault’s response parsing code:

func parseGetCallerIdentityResponse(response string) (GetCallerIdentityResponse, error) {
    decoder := xml.NewDecoder(strings.NewReader(response))
    result := GetCallerIdentityResponse{}
    err := decoder.Decode(&result)
    return result, err
}

type GetCallerIdentityResponse struct {
    XMLName                 xml.Name                  `xml:"GetCallerIdentityResponse"`
    GetCallerIdentityResult []GetCallerIdentityResult `xml:"GetCallerIdentityResult"`
    ResponseMetadata        []ResponseMetadata        `xml:"ResponseMetadata"`
}


parseGetCallerIdentityResponse is called on every response received from STS as long as the status code is 200. The function uses the Golang standard XML library to decode an XML response into a GetCallerIdentityResponse structure and returns an error if decoding fails. 

There is an easy-to-miss problem with this code: Vault never enforces or verifies that the STS response is actually XML encoded. While STS responses are XML encoded by default, it also supports JSON encoding for clients that send an Accept: application/json HTTP header.

For Vault, this turns into a security issue due to a somewhat surprising feature of the Go XML decoder: The decoder silently ignores non-XML content before and after the expected XML root. This means that calling parseGetCallerIdentityResponse with a (JSON encoded) server response such as '{"abc" : "xzy<GetCallerIdentityResponse></GetCallerIdentityResponse>}' will succeed and return an (empty) GetCallerIdentityResponse structure.
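This decoder behavior is easy to reproduce with the standard library alone. The sketch below uses cut-down versions of the structs (only Arn and UserId, with the hypothetical names response, result and parse) and a made-up JSON wrapper with a "Subject" key; it is an illustration of the quirk, not Vault's code.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// result and response are trimmed-down stand-ins for Vault's structs.
type result struct {
	Arn    string `xml:"Arn"`
	UserId string `xml:"UserId"`
}

type response struct {
	XMLName xml.Name `xml:"GetCallerIdentityResponse"`
	Result  []result `xml:"GetCallerIdentityResult"`
}

// parse decodes body the same way parseGetCallerIdentityResponse does.
func parse(body string) (response, error) {
	var r response
	err := xml.NewDecoder(strings.NewReader(body)).Decode(&r)
	return r, err
}

func main() {
	// A JSON document with an XML document embedded in a string value.
	body := `{"Subject": "<GetCallerIdentityResponse><GetCallerIdentityResult>` +
		`<Arn>arn:aws:iam::example</Arn><UserId>XYZ</UserId>` +
		`</GetCallerIdentityResult></GetCallerIdentityResponse>"}`
	r, err := parse(body)
	fmt.Println("error:", err)
	if err == nil && len(r.Result) > 0 {
		fmt.Println("decoded Arn:", r.Result[0].Arn)
	}
}
```

The leading JSON text is consumed as character data while the decoder looks for the first start element, and decoding stops at the matching end element, so the trailing text is never read.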

This brings us really close to our goal of spoofing an arbitrary caller identity: We just need to find an STS action that reflects attacker-controlled text as part of its API response, serialize a request to it that includes an Accept: application/json header, and put an arbitrary GetCallerIdentityResponse XML blob into the reflected payload.

Finding a reflected parameter that is not constrained to alphanumeric characters turns out to be tricky. After some trial and error, I decided to target the AssumeRoleWithWebIdentity action and its SubjectFromWebIdentityToken response element. AssumeRoleWithWebIdentity is used to translate JSON Web Tokens (JWT) signed by an OpenID Connect (OIDC) provider into AWS IAM identities.

Sending a request to this action with a valid signed JWT will return the sub field of the token in the SubjectFromWebIdentityToken field.

Of course, a normal OIDC provider won’t sign a JWT with an XML payload in the subject field. Still, an attacker can just create their own OIDC Identity Provider (IdP), register it on an AWS account they own and sign arbitrary tokens with their own keys.

Let's put all of this together and walk through the full attack step-by-step.

  1. Create a minimal OIDC IdP. This boils down to generating an RSA key pair, creating an OIDC discovery.json and key.json document and hosting the JSON files on a web server (an S3 bucket works, for example).

  2. Use your own AWS account to register an OIDC IdP -> AWS IAM role mapping. It is important to note that the AWS account used for this does not need to have any relationship with our target.

  3. We can now use our OIDC IdP to sign a JWT that contains an arbitrary GetCallerIdentityResponse as part of its subject claim. In the decoded example token below, iss, azp and aud match the details specified in step 2, and sub contains our spoofed response, identifying us as the AWS IAM account arn:aws:iam::superprivileged-aws-account:

{'iss': '',
 'azp': 'abcdef', 'aud': 'abcdef',
 'sub': '<GetCallerIdentityResponse><GetCallerIdentityResult><Arn>arn:aws:iam::superprivileged-aws-account</Arn><UserId>XYZ</UserId></GetCallerIdentityResult></GetCallerIdentityResponse>',
 'exp': 1595120834, 'iat': 1594207895}

  4. We can test if everything is set up correctly by sending a direct request to the STS AssumeRoleWithWebIdentity action using the (signed) token from step 3 and the RoleArn used in step 2:

curl -H "Accept: application/json"


If everything goes as planned, STS will reflect the token subject as part of its JSON encoded response. As discussed above, the Go XML decoder will skip all of the content before and after the GetCallerIdentityResponse object, leading Vault to consider this a valid STS GetCallerIdentity response.






  5. The final step is to convert this request into the form expected by Vault (e.g. base64-encoding all required headers, the URL and an empty POST body) and to send it to the target Vault server as a login request on /v1/auth/aws/login. Vault will deserialize the request, send it to STS and misinterpret the response. If the AWS ARN/UserID in our fake GetCallerIdentityResponse has privileges on the Vault server, we get a valid session token back, which we can use to interact with the Vault server to fetch some secrets.

curl -X POST "https://vault-server/v1/auth/aws/login" -d '{"role":"dev-role-iam",
"iam_http_request_method": "POST", "iam_request_body": "encoded-body", "iam_request_headers" :
"encoded-headers", "iam_request_url" : "encoded-url"}'


[...] of \"768h\" exceeded the effective max_ttl of \"500h\"; TTL value is capped [...]
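The serialization in step 5 can be sketched in Go as follows. The field names come from the login request above; the assumption that the headers are JSON-serialized before being base64-encoded, the helper names loginPayload and decodeField, and the example URL are mine.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// loginPayload builds the JSON body for /v1/auth/aws/login from the pieces
// of a pre-signed request. Assumption: headers travel as base64(JSON).
func loginPayload(role, rawURL, body string, headers map[string][]string) ([]byte, error) {
	hdrJSON, err := json.Marshal(headers)
	if err != nil {
		return nil, err
	}
	b64 := base64.StdEncoding.EncodeToString
	return json.Marshal(map[string]string{
		"role":                    role,
		"iam_http_request_method": "POST",
		"iam_request_url":         b64([]byte(rawURL)),
		"iam_request_body":        b64([]byte(body)),
		"iam_request_headers":     b64(hdrJSON),
	})
}

// decodeField reverses the encoding of one field, for inspection.
func decodeField(payload []byte, field string) (string, error) {
	var m map[string]string
	if err := json.Unmarshal(payload, &m); err != nil {
		return "", err
	}
	raw, err := base64.StdEncoding.DecodeString(m[field])
	return string(raw), err
}

func main() {
	headers := map[string][]string{"Accept": {"application/json"}}
	// Placeholder URL; the real request would target the STS action from step 4.
	payload, err := loginPayload("dev-role-iam", "https://sts.example.invalid/?Action=AssumeRoleWithWebIdentity", "", headers)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
	decoded, _ := decodeField(payload, "iam_request_headers")
	fmt.Println(decoded)
}
```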




I wrote a proof-of-concept exploit that takes care of most of the busy work around JWT creation and serialization. While the OIDC provider setup adds some complexity, we end up with a nice authentication bypass for arbitrary AWS-enabled roles. The only requirement is that the attacker knows the name of a privileged AWS role in the target Vault server.

What went wrong here? Looking at it from an attacker perspective, the whole authentication mechanism seems clever but error-prone. Putting HTTP request forwarding into the unauthenticated external attack surface of a security product requires strong confidence in the implementation and the underlying HTTP libraries. This becomes even more difficult as the security depends on implementation details of the Security Token Service, which might change at any point in the future. For example, AWS might decide to put STS behind a load balancing frontend, which uses the Host header for routing decisions. Without any change to the Vault codebase, this could severely degrade the security of this authentication mechanism from one moment to another. 

Of course, there is a reason why the authentication works as described: AWS IAM doesn’t have a straightforward way of proving a service’s identity to other non-AWS services. Third-party services can’t easily verify pre-signed requests and AWS IAM doesn’t offer any standard signing primitives that could be used to implement certificate based authentication or JWTs.

In the end, HashiCorp fixed the vulnerability by enforcing an allowlist of HTTP headers, restricting requests to the GetCallerIdentity action and validating the STS response more strictly, which is hopefully enough to protect against unexpected changes to the STS implementation or HTTP parser differences between STS and Golang.
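To give a feel for what such checks might look like, here is a hedged Go sketch. It is not HashiCorp's patch: the allowlist contents, the function names validateRequest and looksLikeXML, and the exact rules are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// Hypothetical allowlist; the headers actually permitted by Vault may differ.
var allowedHeaders = map[string]bool{
	"Authorization":        true,
	"Content-Type":         true,
	"Content-Length":       true,
	"User-Agent":           true,
	"X-Amz-Date":           true,
	"X-Amz-Security-Token": true,
}

// validateRequest rejects unknown headers and any action other than
// GetCallerIdentity, whether it appears in the URL query or the POST body.
func validateRequest(rawURL, body string, headers map[string][]string) error {
	for k := range headers {
		if !allowedHeaders[http.CanonicalHeaderKey(k)] {
			return fmt.Errorf("header %q is not allowed", k)
		}
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	form, err := url.ParseQuery(body)
	if err != nil {
		return err
	}
	actions := append(u.Query()["Action"], form["Action"]...)
	if len(actions) != 1 || actions[0] != "GetCallerIdentity" {
		return errors.New("exactly one Action=GetCallerIdentity is required")
	}
	return nil
}

// looksLikeXML is a coarse response check: the body must start with '<'.
func looksLikeXML(resp string) bool {
	return strings.HasPrefix(strings.TrimSpace(resp), "<")
}

func main() {
	hdrs := map[string][]string{"Accept": {"application/json"}}
	fmt.Println(validateRequest("https://sts.example.invalid/?Action=AssumeRoleWithWebIdentity", "", hdrs))
	fmt.Println(looksLikeXML(`{"abc": "<GetCallerIdentityResponse/>"}`))
}
```

Either check alone would have stopped the attack described above: the Accept header is rejected by the allowlist, and the reflected JSON response does not start with an XML element.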

After finding this issue in the AWS authentication module, I decided to review its GCP equivalent. The next section describes how GCP authentication for Vault is implemented and how a simple logic flaw can lead to an authentication bypass in many configurations.

Exploiting Vault-on-GCP

Vault supports the gcp auth method for deployments on Google Cloud. Similar to its AWS counterpart, the auth method supports two different authentication mechanisms: iam and gce. Whereas the iam mechanism supports arbitrary service accounts and can be used from services such as App Engine or Cloud Functions, gce can only be used to authenticate virtual machines running on Google Compute Engine. Still, it has some interesting advantages. Instead of only making authentication decisions based on a service account identity, gce can also grant access based on a number of VM attributes. For example, a configuration could give only VMs in a specific region (europe-west-6) access to certain secrets, allow all VMs in the xyz-prod GCP project access or restrict it even further using instance-groups.

Both iam and gce are built on top of JWT. A Vault client that wants to authenticate creates a signed token to prove its identity and sends it to the Vault server to get a session token back. For the iam mechanism, the client signs the token directly using a service account private key under their control or with the projects.serviceAccounts.signJwt IAM API method.

For gce, the client is expected to run on an authorized GCE VM. It fetches a signed token by sending a request to the instance identity endpoint of the GCP metadata server. In contrast to service account tokens, this token is signed by an official Google certificate. In addition to the normal JWT claims (sub, aud, iat, exp), the tokens returned from the metadata server also contain a special compute_engine claim that lists details about the instance, which are processed as part of the auth process:



JWT has a number of design choices that make it very prone to implementation errors (see this blog post by securitum for an overview of typical issues), so I decided to spend a day reviewing Vault’s token processing.

The function parseAndValidateJwt is responsible for processing both gce and iam tokens. It first parses the token without verifying the signature and passes the decoded token into the getSigningKey helper method:

// Process JWT string.
signedJwt, ok := data.GetOk("jwt")
if !ok {
        return nil, errors.New("jwt argument is required")
}

// Parse 'kid' key id from headers.
jwtVal, err := jwt.ParseSigned(signedJwt.(string))
if err != nil {
        return nil, errwrap.Wrapf("unable to parse signed JWT: {{err}}", err)
}

key, err := b.getSigningKey(ctx, jwtVal, signedJwt.(string), loginInfo.Role, req.Storage)
if err != nil {
        return nil, errwrap.Wrapf("unable to get public key for signed JWT: %v", err)
}

getSigningKey extracts the key ID (kid) from the token header and tries to find a Google-wide OAuth key with the same identifier. This will work for GCE metadata tokens, but not for tokens signed by a service account:

func (b *GcpAuthBackend) getSigningKey(...) (interface{}, error) {
b.Logger().Debug("Getting signing Key for JWT")

if len(token.Headers) != 1 {
        return nil, errors.New("expected token to have exactly one header")
}

kid := token.Headers[0].KeyID
b.Logger().Debug("kid found for JWT", "kid", kid)

// Try getting Google-wide key
k, gErr := gcputil.OAuth2RSAPublicKey(ctx, kid)
if gErr == nil {
        b.Logger().Debug("Found Google OAuth2 provider key", "kid", kid)
        return k, nil
}

If this approach fails, the Vault server extracts the Subject (sub) claim from the supplied token. For valid tokens, this claim contains the email address of the signing service account. Knowing the key id and subject of the token, Vault fetches the public key used for signing using the service account GCP API:

// If that failed, try to get account-specific key
b.Logger().Debug("Unable to get Google-wide OAuth2 Key, trying service-account public key")

saId, err := getJWTSubject(rawToken)
if err != nil {
        return nil, err
}

k, saErr := gcputil.ServiceAccountPublicKey(saId, kid)
if saErr != nil {
        return nil, errwrap.Wrapf(fmt.Sprintf("unable to get public key %q for JWT subject %q: {{err}}", kid, saId), saErr)
}

return k, nil
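
Both lookups are driven by values read from the token before any signature has been checked. The following standalone sketch shows what such an unverified read looks like with only the standard library. It is not the library code Vault uses: peekUnverified, decodeSegment, encodeSegment and the sample values are hypothetical names introduced for illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// unverifiedFields holds values read from a JWT before its signature is
// checked. They are attacker-controlled and only good for selecting a key.
type unverifiedFields struct {
	KeyID   string // 'kid' from the header
	Subject string // 'sub' from the claims
}

// decodeSegment base64url-decodes one JWT segment and unmarshals its JSON.
func decodeSegment(seg string, v interface{}) error {
	raw, err := base64.RawURLEncoding.DecodeString(seg)
	if err != nil {
		return err
	}
	return json.Unmarshal(raw, v)
}

// peekUnverified extracts kid and sub without touching the signature segment.
func peekUnverified(token string) (unverifiedFields, error) {
	var out unverifiedFields
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return out, errors.New("expected three dot-separated segments")
	}
	var header struct {
		KeyID string `json:"kid"`
	}
	var claims struct {
		Subject string `json:"sub"`
	}
	if err := decodeSegment(parts[0], &header); err != nil {
		return out, fmt.Errorf("header: %w", err)
	}
	if err := decodeSegment(parts[1], &claims); err != nil {
		return out, fmt.Errorf("claims: %w", err)
	}
	out.KeyID, out.Subject = header.KeyID, claims.Subject
	return out, nil
}

// encodeSegment is a demo helper that builds a JWT segment from a string map
// (marshalling a map of strings cannot fail, so the error is ignored).
func encodeSegment(v map[string]string) string {
	raw, _ := json.Marshal(v)
	return base64.RawURLEncoding.EncodeToString(raw)
}

func main() {
	token := encodeSegment(map[string]string{"alg": "RS256", "kid": "some-key-id"}) + "." +
		encodeSegment(map[string]string{"sub": "sa@some-project.iam.gserviceaccount.com"}) + "." +
		"signature-is-not-checked-here"
	f, err := peekUnverified(token)
	if err != nil {
		panic(err)
	}
	fmt.Printf("kid=%q sub=%q\n", f.KeyID, f.Subject)
}
```

Because the sender picks both values, the sender also decides which of the two key sources ends up verifying the token.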

In both cases, the Vault server now has access to a public key that can verify the signature of the JWT:

// Parse claims and verify signature.
baseClaims := &jwt.Claims{}
customClaims := &gcputil.CustomJWTClaims{}

if err = jwtVal.Claims(key, baseClaims, customClaims); err != nil {
        return nil, err
}

if err = validateBaseJWTClaims(baseClaims, loginInfo.RoleName); err != nil {
        return nil, err
}

If verification succeeds, Vault fills out the loginInfo struct that is later used to grant or deny access. If the token contains a compute_engine claim, it is copied into the loginInfo.GceMetadata field:

loginInfo.JWTClaims = baseClaims

if len(baseClaims.Subject) == 0 {
        return nil, errors.New("expected JWT to have non-empty 'sub' claim")
}

loginInfo.EmailOrId = baseClaims.Subject

if customClaims.Google != nil && customClaims.Google.Compute != nil && len(customClaims.Google.Compute.InstanceId) > 0 {
        loginInfo.GceMetadata = customClaims.Google.Compute
}

if loginInfo.Role.RoleType == gceRoleType && loginInfo.GceMetadata == nil {
        return nil, errors.New("expected JWT to have claims with GCE metadata")
}

return loginInfo, nil

As mentioned above, all of this code is shared between the iam and gce auth methods. The issue here is that no check enforces that a token signed by an arbitrary service account doesn’t contain GCE compute_engine claims. While the content in a GCE metadata token is trustworthy and controlled by Google, service account tokens are completely controlled by the owner of the service account and can therefore contain arbitrary claims.
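
One way to express the missing check is to remember which kind of key verified the token and to accept compute_engine claims only when that key came from the Google-wide set. The sketch below is an illustration under that assumption and not necessarily how Vault's actual fix is structured. The names keySource, tokenClaims and trustedGCEMetadata are hypothetical, and the claim layout follows the fields named in this post.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// keySource records where the key that verified the signature came from.
type keySource int

const (
	googleWideKey     keySource = iota // Google's OAuth2 key set (metadata server tokens)
	serviceAccountKey                  // key owned by an arbitrary service account
)

// computeEngineClaim mirrors the instance details named in the post.
type computeEngineClaim struct {
	ProjectID    string `json:"project_id"`
	Zone         string `json:"zone"`
	InstanceID   string `json:"instance_id"`
	InstanceName string `json:"instance_name"`
}

// tokenClaims is the subset of the JWT payload this sketch cares about.
type tokenClaims struct {
	Subject string `json:"sub"`
	Google  *struct {
		Compute *computeEngineClaim `json:"compute_engine"`
	} `json:"google"`
}

var errUntrustedMetadata = errors.New("compute_engine claim in a token that was not signed by Google")

// trustedGCEMetadata returns the compute_engine claim only when the token was
// verified with a Google-wide key. It returns (nil, nil) when the payload
// carries no such claim at all.
func trustedGCEMetadata(src keySource, payload []byte) (*computeEngineClaim, error) {
	var c tokenClaims
	if err := json.Unmarshal(payload, &c); err != nil {
		return nil, err
	}
	if c.Google == nil || c.Google.Compute == nil {
		return nil, nil
	}
	if src != googleWideKey {
		return nil, errUntrustedMetadata
	}
	return c.Google.Compute, nil
}

func main() {
	// Hypothetical payload of an already signature-verified token.
	payload := []byte(`{"sub":"sa@other-project.iam.gserviceaccount.com",
		"google":{"compute_engine":{"project_id":"xyz-prod","zone":"europe-west6-a",
		"instance_id":"123","instance_name":"prod-vm"}}}`)

	_, err := trustedGCEMetadata(serviceAccountKey, payload)
	fmt.Println("service account key:", err)

	meta, err := trustedGCEMetadata(googleWideKey, payload)
	fmt.Println("google-wide key:", meta.InstanceName, err)
}
```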

If we follow the control flow of the gce method to the end we can see that Vault uses loginInfo.GceMetadata as part of its auth decision in pathGceLogin if two conditions are met:

  • The VM described in the metadata section needs to exist. This is verified using the GCE API and requires an attacker to impersonate an actively running VM. In practice, only project_id, zone and instance_name are verified and need to be set to valid values.

  • The service account in the subject claim of the JWT needs to exist. This is verified using the ServiceAccount GCP API, which requires the iam.serviceAccounts.get permission in the project hosting the service account. As the attacker can use a service account in their own project, it is straightforward to grant this permission to the GCP identity Vault is running under, or even to allUsers.

Finally, AuthorizeGCE is called to grant or deny access. If the attacker impersonated a GCE instance with the right attributes (project, labels, zone, and so on), the checks pass and the attacker gets a valid session token back. The only auth restriction that can’t be bypassed is a hardcoded service account name, as this value will be equal to the attacker account and not the expected VM account name.
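
The difference between attribute-based restrictions and a pinned service account can be illustrated with a simplified model of the final decision. This is not Vault's AuthorizeGCE: gceRole, loginAttempt, authorize and all sample values are hypothetical, and only project, zone and service account constraints are modelled.

```go
package main

import "fmt"

// gceRole is a simplified, hypothetical stand-in for a Vault gce role.
// An empty list means "no restriction of this kind".
type gceRole struct {
	BoundProjects        []string
	BoundZones           []string
	BoundServiceAccounts []string
}

// loginAttempt combines instance attributes taken from the token's
// compute_engine claim with the identity taken from its 'sub' claim.
type loginAttempt struct {
	ProjectID      string
	Zone           string
	ServiceAccount string
}

// contains reports whether v is an element of list.
func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// authorize returns nil when every non-empty constraint on the role matches.
func authorize(role gceRole, in loginAttempt) error {
	if len(role.BoundProjects) > 0 && !contains(role.BoundProjects, in.ProjectID) {
		return fmt.Errorf("project %q is not allowed", in.ProjectID)
	}
	if len(role.BoundZones) > 0 && !contains(role.BoundZones, in.Zone) {
		return fmt.Errorf("zone %q is not allowed", in.Zone)
	}
	if len(role.BoundServiceAccounts) > 0 && !contains(role.BoundServiceAccounts, in.ServiceAccount) {
		return fmt.Errorf("service account %q is not allowed", in.ServiceAccount)
	}
	return nil
}

func main() {
	// Instance attributes are copied from claims the signer chose; the
	// service account is the signer's own identity.
	forged := loginAttempt{
		ProjectID:      "xyz-prod",
		Zone:           "europe-west6-a",
		ServiceAccount: "attacker@attacker-project.iam.gserviceaccount.com",
	}
	attributeOnly := gceRole{BoundProjects: []string{"xyz-prod"}}
	pinned := gceRole{
		BoundProjects:        []string{"xyz-prod"},
		BoundServiceAccounts: []string{"vm-sa@xyz-prod.iam.gserviceaccount.com"},
	}
	fmt.Println("attribute-only role:", authorize(attributeOnly, forged))
	fmt.Println("pinned role:", authorize(pinned, forged))
}
```

In this model the forged attempt passes the attribute-only role and fails the pinned one, because the subject of the token is the one value the signer cannot choose freely.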

An end-to-end attack against a vulnerable configuration will look like this:

  1. Create a service account in a GCP project you control and generate a private key using gcloud: gcloud iam service-accounts keys create key.json --iam-account

  2. Sign a JWT with a fake compute_engine claim describing an existing and privileged VM. See here for a simple proof-of-concept script that takes care of most of the details.

  3. Now simply use the token to sign in to Vault: curl --request POST --data '{"role": "my-gce-role", "jwt" : "...."}' http://vault:8200/v1/auth/gcp/login

This is an interesting bug that requires some knowledge of GCP IAM to spot. The root cause seems to be the merging of two separate authentication flows into a single code path in the parseAndValidateJwt function, which makes it difficult to reason about all security requirements when writing or reviewing the code. At the same time, GCP makes it easy to shoot yourself in the foot by offering two types of JWT tokens with completely different security properties.


This blog post describes two authentication vulnerabilities in HashiCorp Vault, a “cloud-native” secret management tool. While Vault was clearly developed with security in mind and benefits from the memory safety and high-quality standard library of its implementation language Go, I was still able to identify two critical vulnerabilities in its unauthenticated attack surface.

In my experience, tricky vulnerabilities like these often exist where developers have to interact with external systems and services. A strong developer might be able to reason about all security boundaries, requirements and pitfalls of their own software, but it becomes very difficult once a complex external service comes into play. Modern cloud IAM solutions are powerful and often more secure than comparable on-premises solutions, but they come with their own security pitfalls and a high implementation complexity. As more and more companies move to the big cloud providers, familiarity with these technology stacks will become a key skill for security engineers and researchers, and it is safe to assume that there will be a lot of similar issues in the next few years.

Finally, both vulnerabilities discussed here demonstrate how difficult it is to write secure software. Even with memory-safe languages, strong cryptographic primitives, static analysis and large fuzzing infrastructure, some issues can only be discovered by manual code review and an attacker mindset.