Wrapping static API keys with IAM roles and beating Shai-Hulud

In view of the recent Shai-Hulud supply chain attack, where a malicious package install script exfiltrated the current user’s environment variables and the static secrets found in their files, I describe here a way to replace static tokens with dynamic, machine- or workload-specific credentials that are limited in both time and scope.

Note that this does not aim to prevent malware attacks per se; an attacker obtaining remote code execution on a privileged node is still able to do damage. However, this technique should reduce the impact of any attack by allowing you to convert any long-lived API tokens into shorter-lived ones without relying on the API vendor’s cooperation.

Avoiding static secrets for cloud API consumption

In a modern cloud environment, your services and developers no longer need static tokens to interact with the cloud provider’s services.

You can use your workload’s identity to assume a certain role (assigned to it by an IAM policy configured separately) and talk to the service using said role, and the service’s access policy can similarly be configured to allow said role.
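As a rough illustration of the role assumption described above, here is a minimal Python sketch. It assumes AWS and an object that behaves like `boto3.client("sts")`; the helper names `assume_role` and `is_expired`, the 60-second refresh margin, and the role ARN in the usage note are hypothetical and only serve the example.

```python
from datetime import datetime, timedelta, timezone


def assume_role(sts_client, role_arn, session_name, duration_seconds=900):
    """Exchange the caller's current identity for short-lived role credentials.

    `sts_client` is assumed to behave like boto3.client("sts").
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,  # 900 seconds is the minimum STS accepts
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken and Expiration.
    return response["Credentials"]


def is_expired(credentials, now=None, skew_seconds=60):
    """Return True when the credentials expire within `skew_seconds`."""
    now = now or datetime.now(timezone.utc)
    return now + timedelta(seconds=skew_seconds) >= credentials["Expiration"]
```

With boto3 installed, a call could look like `assume_role(boto3.client("sts"), "arn:aws:iam::123456789012:role/example", "local-debug")`. The returned credentials are still bearer secrets, but they stop working on their own once `Expiration` passes.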

For developer access, developers use your existing SSO system (ideally with hardware authentication) to obtain a set of time-limited credentials with which they can assume specific roles, ultimately allowing them to access the cloud provider’s services (they can, for example, use this identity to run their app locally for debugging and have the app interact with the same services it would in production).

Those ephemeral credentials sadly still rely on static secrets (instead of keys + client certificates) that can be exfiltrated. However, they are time-limited, after which another SSO round (ideally with hardware-backed auth) is required to renew them, and they only grant access to a staging environment with no sensitive data. Production access would require another, heavily guarded role that is obtainable only upon approval and should not be needed day-to-day, thus limiting its exposure.

An aside: client certificates

Client certificates are interesting because, unlike static tokens (which even if ephemeral and time-limited can ultimately be exfiltrated), they rely on public key cryptography and the key can be stored in hardware in a way that it can be used but never extracted.

The hardware can be a YubiKey, a smartcard (fun fact: the PKI functionality of YubiKeys is exposed as a USB smartcard reader with an always-inserted smartcard), a TPM or other secure element (T2- and Apple Silicon-equipped MacBooks have one), an HSM, or an isolated process; anything that lets the user perform cryptographic operations with the key without allowing the key itself to be exfiltrated.

The key can then be used during every TLS connection establishment to cryptographically prove your identity to the server, based on a previously issued certificate (an internal CA is fine, as you control the server and thus the certificates it decides to trust).
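To make the server side of this concrete, here is a minimal sketch using Python’s standard `ssl` module. It builds a TLS context that refuses any client unable to present a certificate signed by your internal CA. The function name and the file-path parameters are hypothetical; all three paths are optional here only so the sketch can be exercised without real files.

```python
import ssl


def make_mtls_server_context(server_cert=None, server_key=None, internal_ca=None):
    """Build a server-side TLS context that requires a client certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Abort the handshake unless the client presents a certificate we trust.
    context.verify_mode = ssl.CERT_REQUIRED
    if internal_ca:
        # Trust only certificates issued by our own (internal) CA.
        context.load_verify_locations(cafile=internal_ca)
    if server_cert:
        # The server's own certificate and key, presented to clients.
        context.load_cert_chain(certfile=server_cert, keyfile=server_key)
    return context
```

A server would then wrap its listening socket with `context.wrap_socket(sock, server_side=True)`. Keeping the client’s private key in hardware is a client-side concern and is not shown here.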

If malware manages to get on the machine, it can misuse the key just like the legitimate user would; however, the difference is that the malware is unable to extract the key, and thus must be present and use the machine as a real-time “signing proxy”, relaying the authentication attempts to the YubiKey/HSM/etc.

This would defeat any attack where the malware is unable to maintain real-time, two-way communication to perform the misuse. Remediation is also easier as the key does not actually need to be rotated, since it was never exfiltrated, merely temporarily misused. Your HSM may even provide you a tamper-proof audit log to ascertain whether the key was even used to begin with.

Unfortunately, client certificates and mutual TLS do not appear to get much love in the technical community.

Some projects are embracing them for machine-to-machine communication, but even then, they are mostly used as glorified shared secrets (trust is often bootstrapped using one) and otherwise kept around as files. This gives up the isolation that a separate process (or a TPM) holding the key would offer, which would preserve the key if the “key user” process were compromised.

For end-user-to-machine, or even software-to-API communication, they are very seldom used, and when they are (mostly in banking-related applications), they’re treated as an inconveniently-large shared secret with both the private key and certificate managed the same way other static secrets are.

I think part of the reason why they get little love is that the tooling around them was terrible for a long time; the Swiss-army-knife for anything to do with certificates was OpenSSL, and while powerful, it has an arcane and inconvenient configuration language, with some parameters supplied exclusively via the config file and others via the CLI or environment variables, and is not easy to interact with programmatically (unless you fancy calling its internal routines from C).

Thankfully we’ve got better tooling since then, but old habits die hard.

However, client certificates alone don’t help in the short term, as you still need a way to manage their initial provisioning; if you merely provide an API where a static (time-limited?) token can be exchanged for a certificate, the malware can just exfiltrate that token the same way.

Avoiding static secrets for third-party APIs

Now, we’ve already established that all the tooling exists to skip using static shared secrets for consuming cloud APIs; however, the biggest remaining use case for static secrets is calling external APIs.

Most services still expect you to authenticate to them using a shared secret (commonly known as an API key), and you rarely have any choice in the matter; furthermore, when debugging and running the application locally, it’s common to stick a valid key in your environment variables/env file, up for grabs for any malware; most API keys stolen by the latest attack were there for this reason.
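To illustrate why a key sitting in the environment is up for grabs, the following sketch shows that any child process, such as a package install script, inherits the parent’s environment by default. The variable name and value are made up for the example.

```python
import os
import subprocess
import sys


def child_sees(variable, value):
    """Start a child process and report what it reads from its environment."""
    # The child gets a copy of our environment plus the "secret" variable,
    # which is what happens when a key is exported in a shell or env file.
    env = dict(os.environ, **{variable: value})
    result = subprocess.run(
        [sys.executable, "-c",
         f"import os; print(os.environ.get({variable!r}, ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


leaked = child_sees("EXAMPLE_API_KEY", "sk-not-a-real-key")
```

Nothing special is required of the child: reading the variable takes one line, and sending it elsewhere would take only a few more.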

How do we protect that key and avoid distributing it to all developers and services?

We can use the same approach we’d use for actual cryptographic authentication and essentially build an “HSM”. This “HSM” accepts ephemeral authentication (using our already-established chain of trust via the cloud provider’s IAM system) and lets you call the API without ever seeing the shared secret: it stamps your request with the secret and proxies it to the API hostname. It can also log the requests (and responses, if necessary) along with the originating IAM role, providing a tamper-proof audit log.

This is just a reverse-proxy that will stamp the request with an Authorization: Bearer <secret> before forwarding it to the actual third-party API server. That’s actually the easy part.
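The stamping step can be sketched in a few lines of standard-library Python. The upstream URL, the helper name and the list of stripped headers are assumptions made for the example; the important detail is that any `Authorization` header supplied by the caller is dropped before the real one is added.

```python
import urllib.request

UPSTREAM = "https://api.example.com"  # hypothetical third-party API

# Headers that should not be passed through from the caller.
STRIPPED = {"authorization", "host", "content-length", "connection"}


def build_upstream_request(method, path, headers, body, secret, upstream=UPSTREAM):
    """Build the outgoing request, stamped with the shared secret."""
    forwarded = {
        name: value
        for name, value in headers.items()
        if name.lower() not in STRIPPED
    }
    # The caller never sees this value; only the proxy knows it.
    forwarded["Authorization"] = f"Bearer {secret}"
    return urllib.request.Request(
        url=upstream.rstrip("/") + "/" + path.lstrip("/"),
        data=body,
        headers=forwarded,
        method=method,
    )
```

The resulting request can be sent with `urllib.request.urlopen(request, timeout=10)` and the response relayed back to the caller.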

The harder part is: how do we make this proxy accept and verify IAM authentication? AWS, at least, does not provide any way for your own server to accept and verify such authentication; only their own services can do so.

One promising solution is the API Gateway; it’s essentially a REST-aware server and reverse proxy, can be configured to append a header (such as our Authorization: Bearer <secret>), proxy the requests to a specific upstream, and can accept IAM authentication.

However, one limitation is that AWS disallows tampering with the Authorization header there, even when making a call to an external API. My understanding is that AWS may internally use such a header to carry their own inter-service authentication context, but they should allow it for external API consumption.

So we can’t use the API Gateway directly to act as our proxy; however, we can still make use of it as an IAM role validator, and then forward the request to an AWS Lambda target, which will finally call our upstream API server with the Authorization header set (the secret is obtained from Secrets Manager using the Lambda’s own IAM role). A basic HTTP proxy can be trivially written in your language of choice; here’s an example to get started.
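As one possible starting point, here is a minimal sketch of such a Lambda in Python. It assumes a REST API Gateway proxy integration (whose event carries `httpMethod`, `path`, `headers`, `body` and `requestContext`), and the `UPSTREAM_URL` and `SECRET_ID` environment variables and their defaults are hypothetical. Query strings, retries and response headers are left out for brevity.

```python
import base64
import json
import os
import urllib.error
import urllib.request

UPSTREAM = os.environ.get("UPSTREAM_URL", "https://api.example.com")  # hypothetical
SECRET_ID = os.environ.get("SECRET_ID", "third-party/api-key")  # hypothetical
STRIPPED = {"authorization", "host", "content-length", "connection"}
_cached_secret = None


def get_secret():
    """Fetch the API key from Secrets Manager using the Lambda's own role."""
    global _cached_secret
    if _cached_secret is None:
        import boto3  # available in the AWS Lambda Python runtime
        client = boto3.client("secretsmanager")
        _cached_secret = client.get_secret_value(SecretId=SECRET_ID)["SecretString"]
    return _cached_secret


def send(request):
    """Call the upstream API and return (status, body), including error statuses."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        return error.code, error.read()


def handler(event, context, fetch_secret=get_secret, transport=send):
    body = event.get("body") or ""
    data = base64.b64decode(body) if event.get("isBase64Encoded") else body.encode()
    # Drop the caller's SigV4 material (Authorization, X-Amz-*) so that their
    # session token is never forwarded to the third party.
    headers = {
        name: value
        for name, value in (event.get("headers") or {}).items()
        if name.lower() not in STRIPPED and not name.lower().startswith("x-amz-")
    }
    headers["Authorization"] = f"Bearer {fetch_secret()}"
    request = urllib.request.Request(
        UPSTREAM.rstrip("/") + event["path"],
        data=data or None,
        headers=headers,
        method=event["httpMethod"],
    )
    # With IAM authorization, API Gateway passes the caller's ARN in the event.
    caller = event.get("requestContext", {}).get("identity", {}).get("userArn")
    print(json.dumps({"caller": caller, "method": event["httpMethod"], "path": event["path"]}))
    status, payload = transport(request)
    return {"statusCode": status, "body": payload.decode("utf-8", "replace")}
```

The `fetch_secret` and `transport` parameters exist only so the handler can be exercised locally without AWS or network access; Lambda itself calls `handler(event, context)`.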

The Lambda will then only be callable by the API Gateway (enforced again by IAM policies), and the consuming services and developers can use short-term credentials and IAM roles to call the API Gateway instead of the external API directly, allowing them to use the static secret without ever being able to extract it.
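The two IAM pieces described above could look roughly like the following, expressed as Python data for consistency. The first function returns an identity policy allowing a role to invoke the API; the second returns keyword arguments in the shape boto3’s Lambda `add_permission` call expects, restricting invocation to that API Gateway. The function names, statement ID, and every account, region and API identifier are placeholders.

```python
import json


def invoke_api_policy(region, account_id, api_id, stage="*"):
    """Identity policy letting a role call the API Gateway endpoints."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "execute-api:Invoke",
            # Format: api-id/stage/HTTP-VERB/resource-path
            "Resource": f"arn:aws:execute-api:{region}:{account_id}:{api_id}/{stage}/*/*",
        }],
    }


def lambda_permission_for_gateway(function_name, region, account_id, api_id):
    """Arguments for add_permission so only this API Gateway may invoke the Lambda."""
    return {
        "FunctionName": function_name,
        "StatementId": "allow-api-gateway",  # placeholder
        "Action": "lambda:InvokeFunction",
        "Principal": "apigateway.amazonaws.com",
        "SourceArn": f"arn:aws:execute-api:{region}:{account_id}:{api_id}/*/*/*",
    }


policy_json = json.dumps(
    invoke_api_policy("eu-west-1", "123456789012", "abc123", "prod"), indent=2
)
```

In practice you would narrow the `Resource` to the specific methods and paths each consumer needs, rather than the wildcards used here.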



 