The Sneaky Weakness Behind AWS’ Managed KMS Keys

Lambda is growing rapidly in popularity as a compute platform. After delegating a whole range of operational decisions to AWS, we are free, we’re told, to focus on our application logic while AWS tries to make the supporting machinery as transparent as possible.

That holds until the function’s execution role is deleted and re-created. At that point the Lambda function is unable to run, your application stops working, and AWS provides almost no feedback on why.

The problem with Lambda’s KMS keys

Let’s take a look at this issue. If you want to play along at home, I have a GitHub repository with a CloudFormation stack and step-by-step instructions on how to replicate the behavior.

When you create a Lambda function, you can configure a raft of attributes, but three matter here: the function has an execution role, it has at least one environment variable, and no customer-managed KMS key is specified to encrypt those variables. When you include environment variables, Lambda very helpfully encrypts them for you. If you’ve specified a KMS key, it will use that. But if you haven’t, it will use the AWS-managed key with the alias aws/lambda.

If Lambda uses the default key, it will create a KMS grant on that key, allowing your function’s execution role to use it for decrypting the environment variables. You can even see Lambda making the “CreateGrant” API call in CloudTrail.
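
To find those “CreateGrant” calls without clicking through the console, something like the following sketch might help. It assumes boto3 with working credentials, and it assumes the event's request parameters use the field names granteePrincipal, keyId and operations; the helper names and the role name in the usage note are mine, not part of any AWS API.

```python
import json
from datetime import datetime, timedelta, timezone


def grants_for_role(events, role_name):
    """Return CreateGrant events whose grantee mentions role_name.

    `events` is a list shaped like the "Events" list returned by
    CloudTrail's LookupEvents API, where "CloudTrailEvent" is a JSON string.
    The requestParameters field names used here are an assumption.
    """
    matches = []
    for event in events:
        detail = json.loads(event["CloudTrailEvent"])
        params = detail.get("requestParameters") or {}
        grantee = params.get("granteePrincipal", "")
        if role_name in grantee:
            matches.append(
                {
                    "time": detail.get("eventTime"),
                    "key": params.get("keyId"),
                    "grantee": grantee,
                    "operations": params.get("operations", []),
                }
            )
    return matches


def lookup_create_grant_events(hours=24):
    """Fetch recent CreateGrant events (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so grants_for_role works without AWS access

    client = boto3.client("cloudtrail")
    start = datetime.now(timezone.utc) - timedelta(hours=hours)
    events = []
    paginator = client.get_paginator("lookup_events")
    for page in paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "CreateGrant"}
        ],
        StartTime=start,
    ):
        events.extend(page["Events"])
    return events
```

Calling grants_for_role(lookup_create_grant_events(), "my-function-role") with your own role name should list the grants Lambda created for that role in the last day.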

At this point, your function can be invoked. Your code will start running, logs will be generated, and even if your code errors you’ll see invocations and duration metrics populated. Of particular interest to us, if you check CloudTrail, you’ll be able to see Lambda assume the execution role and then decrypt the environment variables using the aws/lambda key.

But what happens if the execution role is accidentally deleted? Well, your function will break and won’t start. But let’s say you re-create the role exactly as it was, whether through your chosen infrastructure-as-code tool or during a panic attack using the console.

Your function still won’t work.

AWS doesn’t offer much in the way of troubleshooting assistance here: No logs will be created and every attempt at invocation will just be marked as an error with no further details provided. If you check CloudTrail again, you’ll see that when the assumed execution role makes the API call to KMS to decrypt the environment variables, it receives an Access Denied error.

If you try to invoke the function manually, you will get a significantly more useful error message:

An error occurred (KMSAccessDeniedException) when calling the Invoke operation (reached max retries: 2): Lambda was unable to decrypt the environment variables because KMS access was denied.
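
If you invoke functions from your own tooling, you could surface this error code instead of letting it vanish into a generic failure. The following is a minimal sketch, assuming boto3 and credentials allowed to invoke the function; the helper names and hint wording are my own.

```python
def explain_invoke_failure(error_code):
    """Map a Lambda Invoke error code to a troubleshooting hint."""
    hints = {
        "KMSAccessDeniedException": (
            "Lambda could not decrypt the environment variables. Check "
            "whether the execution role was deleted and re-created."
        ),
        "KMSDisabledException": "The KMS key used by the function is disabled.",
        "KMSNotFoundException": "The KMS key used by the function was not found.",
    }
    return hints.get(error_code, f"Unrecognized error code: {error_code}")


def invoke_and_explain(function_name):
    """Invoke a function and translate API errors (needs boto3 and credentials)."""
    import boto3  # imported lazily so explain_invoke_failure works offline
    from botocore.exceptions import ClientError

    client = boto3.client("lambda")
    try:
        client.invoke(FunctionName=function_name)
    except ClientError as err:
        return explain_invoke_failure(err.response["Error"]["Code"])
    return "Invoke call was accepted by the Lambda API."
```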

This immediately provides us with a lot more context on what the issue is, and we can start to put two and two together. The documentation tells us that a default key will be used for encrypting and decrypting our environment variables, and now our function’s execution role isn’t allowed to use it. We never had to declare an identity-based policy giving access to that key, so we can infer that there was a grant on the aws/lambda key that’s broken — and that’s where we need to start investigating.

Why we need resource-based policies and IAM unique IDs

A KMS grant is not literally a policy document, but it behaves very much like a resource-based policy. You’ve almost certainly seen resource-based policies before in the form of IAM role trust policies or S3 bucket policies. They are attached to a specific resource and let us specify who is allowed to access that resource, what they’re allowed to do with it, and under what conditions they can do it.

For our purposes, the really critical part of a resource-based policy is the “who,” formally known as the “principal.” AWS recognizes a few different types of principals, but for now we want to focus on IAM principals.

When an IAM identity is created, it is given a unique identifier that is different from its name or Amazon Resource Name. Even though we might put an ARN into the principal section of our policy, behind-the-scenes AWS resolves this to the unique identifier.

Recall that when the environment variable for our function is set, Lambda creates a KMS grant on the AWS-managed key, allowing the function’s execution role to decrypt the environment variables at runtime. When we delete the role and re-create it, the grant no longer applies to the re-created role because it has a different unique identifier from the original role.
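
You can watch the identifier change for yourself. This is a small sketch, assuming boto3 and permission to call IAM’s GetRole; the function names and the idea of recording the RoleId somewhere beforehand are my own illustration.

```python
def role_was_recreated(recorded_role_id, current_role):
    """Compare a previously recorded RoleId with the role as it exists now.

    `current_role` is the "Role" dict from IAM's GetRole response.
    A different RoleId under the same ARN means the role was re-created.
    """
    return current_role["RoleId"] != recorded_role_id


def fetch_role(role_name):
    """Fetch the role from IAM (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so role_was_recreated works offline

    return boto3.client("iam").get_role(RoleName=role_name)["Role"]
```

Record fetch_role("my-function-role")["RoleId"] before deleting the role, re-create it, and the comparison flips to True even though the ARN is identical.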

This is absolutely the intended behavior of a resource-based policy and a pretty important security feature. Even though AWS enforces uniqueness for account numbers and IAM principal names at any one time, there is no guarantee that they belong to the same person today as they will tomorrow. As an example, your senior DBA with the IAM user “paul.allen” might leave today and you might onboard a junior developer with the same username in the future. You probably don’t want PostgreSQL superuser secrets accessible by the new “paul.allen” IAM user, even if they do have the same ARN.

Fortunately, AWS makes it apparent when the trust in a resource-based policy has been broken, if you know where to look. When the ARN of the principal in a resource-based policy still resolves to the same unique identifier, then it will report the principal by the ARN. However, if the ARN no longer resolves to the same unique identifier, it will show the trusted unique ID followed by the friendly name.

If you check the grants on the aws/lambda KMS key after deleting and re-creating the execution role for your function, you’ll see that the “GranteePrincipal” now shows a unique identifier starting with “AROA” rather than the ARN of the execution role. Without permission to use the key to decrypt the environment variables, your function errors (almost) silently.
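
A rough way to spot those broken grants in bulk is sketched below. It assumes boto3 and credentials allowed to call DescribeKey and ListGrants on the aws/lambda key. The helper names are hypothetical, and treating the “AROA” and “AIDA” prefixes as the marker of an orphaned grantee is my heuristic, not an official rule.

```python
# Prefixes of IAM unique IDs for roles (AROA) and users (AIDA).
UNIQUE_ID_PREFIXES = ("AROA", "AIDA")


def orphaned_grants(grants):
    """Return grants whose grantee is shown as a unique ID instead of an ARN."""
    return [
        grant
        for grant in grants
        if grant.get("GranteePrincipal", "").startswith(UNIQUE_ID_PREFIXES)
    ]


def list_lambda_key_grants():
    """List grants on the aws/lambda key (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so orphaned_grants works offline

    kms = boto3.client("kms")
    # ListGrants needs a key ID or key ARN, so resolve the alias first.
    key_id = kms.describe_key(KeyId="alias/aws/lambda")["KeyMetadata"]["KeyId"]
    grants = []
    for page in kms.get_paginator("list_grants").paginate(KeyId=key_id):
        grants.extend(page["Grants"])
    return grants
```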

Options for ‘fixing’ the KMS key issue

Something very important to understand about the AWS-managed keys is that while you can see them and use them if your IAM credentials have the right permissions, you cannot manage them. Because you cannot create a grant on the key yourself, you need to get Lambda to do it for you. You’ve got two options: switch the function to a different execution role and then back again, or delete and re-create the function.

Neither option is particularly good. Each time the execution role is updated, Lambda creates a new grant for the newly assigned role, so switching away and back restores access. But depending on the security setup of your accounts, swapping roles this way might not be possible. Deleting and re-creating the function means that you lose any previously published versions that you might have aliases pointed at. Or worse still, you might have event sources pointed toward specific published versions of your function.
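
The role swap could be scripted along these lines. This is only a sketch: it assumes a boto3 Lambda client, a second role that Lambda is allowed to assume (the temporary_role_arn parameter is hypothetical), and that your credentials are permitted to pass both roles.

```python
def refresh_grant_by_swapping_role(lambda_client, function_name, temporary_role_arn):
    """Switch the function to a temporary role and back to the original.

    `lambda_client` is expected to behave like boto3.client("lambda").
    Returns the ARN of the role the function ends up with.
    """
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    original_role_arn = config["Role"]
    waiter = lambda_client.get_waiter("function_updated")

    # First update: move to the temporary role and wait for it to settle.
    lambda_client.update_function_configuration(
        FunctionName=function_name, Role=temporary_role_arn
    )
    waiter.wait(FunctionName=function_name)

    # Second update: move back, which is what prompts a fresh grant
    # for the re-created role.
    lambda_client.update_function_configuration(
        FunctionName=function_name, Role=original_role_arn
    )
    waiter.wait(FunctionName=function_name)
    return original_role_arn
```

Passing the client in as an argument keeps the function easy to exercise against a stub before pointing it at a real account.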

Either choice will get your function running, but you’ll probably want to try to stop it from happening again.

Do we even have the power to stop this issue in the first place?

AWS won’t stop you from deleting a role that’s being used as an execution role by a Lambda function. This is different from some other ways you can try to shoot yourself in the foot. For instance, if you try to delete a security group still actively associated with EC2 instances, AWS will stop you. Because AWS won’t stop you from deleting the role, there’s nothing to stop someone from deleting it by accident, either because they misread it or because they didn’t know it was being used.

Infrastructure as code tools won’t necessarily save you. They will gladly delete and re-create IAM roles without consideration for resource-based policies that aren’t controlled by those same tools. Worse still, they won’t identify the drift in the unique identifier, so they might actually point you away from the cause of the errors. Well-intentioned refactoring of a CloudFormation stack or Terraform module can have these unintended consequences.

Because the KMS grant is created behind the scenes, most users won’t even know it exists, let alone think to guard against this failure. And because we cannot manage grants on AWS-managed keys, we can’t use resource dependencies in our infrastructure-as-code tool of choice to protect us.

AWS, what’s your fix here?

The only programmatic mitigation I can think of is to use a customer-managed key. Those cost you $1 per key per month, and you pay extra for usage on top of that. It may not sound like much, but it shouldn’t be necessary.
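
One way the customer-managed key route might look is sketched here. It assumes the key’s policy delegates access control to IAM policies in the account, so that a kms:Decrypt statement attached to the role itself is honored and is re-applied whenever the role is re-created by the same tooling. The function names and the policy name are hypothetical, and the clients are expected to behave like boto3’s IAM and Lambda clients.

```python
import json


def decrypt_policy_for_key(key_arn):
    """Build an identity-based policy allowing Decrypt on one KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": key_arn,
            }
        ],
    }


def use_customer_managed_key(iam_client, lambda_client, role_name, function_name, key_arn):
    """Attach the decrypt policy to the role and point the function at the key."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="decrypt-lambda-env",  # hypothetical policy name
        PolicyDocument=json.dumps(decrypt_policy_for_key(key_arn)),
    )
    lambda_client.update_function_configuration(
        FunctionName=function_name, KMSKeyArn=key_arn
    )
```

The point of this design is that the permission travels with the role definition rather than living in a grant you cannot see or manage.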

If AWS says it’s going to use a managed key to do it for me, I’d like to think it would do it with a bit of resilience or warn me if I’m going to break something. AWS has given us the opportunity to cede control but has missed a critical edge case that can have catastrophic consequences.