Stay secure: Strategies and tooling for updating container images
One of the most critical actions to keep systems secure is to apply updates. In modern, containerized infrastructures, that often means updating container images. A casual observer might expect such a standard and important task to have agreed-upon best practices and standardized tooling, but they are likely to be surprised by the multitude of different solutions and opinions on this problem.
This post will delve into some of the options and try to steer the reader towards a path that works for them and keeps their systems both stable and secure.
Updates mean risk
The core issue is that applying updates is a fundamentally risky endeavor; any software update risks changing behavior and breaking systems. For this reason, operations teams (especially ones that have been burned in the past) can be reluctant to upgrade. Common wisdom is often to hold off on major updates for weeks (or even months) to ensure bugs have been worked out before upgrading.
Larger software projects (like Postgres, Java, and Node.js) often have multiple versions in support at the same time. This means users can stay on an older version and avoid the riskier updates while still getting security patches. Although this approach is undoubtedly helpful to ops teams, it is only practical for large projects with paid maintainers who are willing to spend time backporting fixes. Smaller projects struggle just to keep the main version up to date. The outcome is that, for the majority of software, the only safe, supported version is the most recent release.
Not updating means more risk
I am a firm believer in upgrading as soon as practicable. You never want to be on an End of Life (EOL) version that is no longer officially supported. That way lies pain, huge expenses (take a look at the bills when Windows XP went EOL), and unnecessary exposure to risk. Yes, upgrading requires constant work and attention. Most of all, it requires proper testing to be in place. But it is much better than placing your systems at risk and accruing technical debt.
If my application has an automated test suite with good coverage, I can be confident that any breakages caused by upgrades will be caught before deployment to production. Of course, whenever a test failure occurs, there will be work required to address it. Putting off that work by delaying upgrades will only mean that more work is required in the future as further changes pile up.
Another way organizations reduce the risk of breaking changes introduced by updates is by using staging environments, where changes are tried out before being pushed to production. An alternative approach is "testing in production," which typically involves techniques like feature flags and staged rollouts to verify the effect of changes before they hit the majority of users.
Strategies and tools for keeping your images updated
As mentioned in the introduction, there are multiple solutions for keeping your container images up to date. There are essentially two parts to the problem that have to be solved:
Knowing when an update is available
Applying that update
Most solutions tackle both, but more bespoke setups may use separate tooling and techniques.
I'm assuming in this article that you are familiar with semantic versioning (semver) and image tagging; if not, please check out this guide on Considerations for Keeping Images Up to Date.
Knowing when updates are available
In the container world, the primary way of knowing when a new image is available is via the registry itself. Many registries will offer a webhook callback service (e.g. Docker Hub), but this is typically only for your own repositories. If you want to get notified when a public repository is updated, you'll either have to use a third-party service like NewReleases or implement polling (at a slow enough pace to avoid being banned by the registry).
Visualizing your status
If you're trying to find out how badly out of date the images in your Kubernetes cluster are, take a look at the version-checker project. This is a Kubernetes utility that will create an inventory of your current images and produce a chart showing how out of date they are. The dashboard can form part of a full solution with notifications for out-of-date software being sent to cluster admins for mitigation.
Updating solutions
These are the major options for keeping images up to date. We're only really looking at automation here — you can argue that kubectl set image is an updating solution, but it would only be scalable as part of an automated pipeline.
The lazy option: just use "latest" or major version tags
This means having something like the following in your Dockerfile:
FROM cgr.dev/chainguard/redis:3
Or your Kubernetes manifest:
image: cgr.dev/chainguard/redis:latest
The tag used determines how big a jump an update can make: "latest" will jump across major versions, so typically a major or minor version tag is chosen to limit the size of changes.
The plan (hope?) is that the image will be updated whenever it is rebuilt or redeployed. The reality is that this still depends on caching and configuration (especially the image pull policy), and the time to redeployment may be considerable. It's also hard to force an update in Kubernetes; if the image name in the manifest stays the same and nothing else changes, the Kubernetes API will assume everything is up to date.
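If you do go down this route, at a minimum make sure the kubelet actually checks the registry rather than reusing a cached image. A minimal, illustrative pod template fragment might look like this:

spec:
  containers:
    - name: myapp
      image: cgr.dev/chainguard/redis:latest
      # Always contact the registry when creating the pod instead of reusing a cached image
      imagePullPolicy: Always

Even then, new pods are only created when something triggers a rollout (kubectl rollout restart is the usual blunt instrument), which is exactly why floating tags alone don't guarantee you're running the latest image.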
The major issue with this approach is that you lose control over, and reproducibility of, the images that get deployed. In Kubernetes you can end up with different pods in the same deployment running different versions of the application, because nodes pulled the image at slightly different times. Debugging becomes difficult as you can't easily recreate the system. You can't even say for sure what is running in the cluster, which is never great and is particularly bad when you need to respond to a security incident.
The advantage is that it is relatively simple, requires little maintenance, and will keep up to date with changes over time (until it breaks). This can be appropriate for simple projects or example code, so don't write it off entirely. But please don't deploy the "latest" tag to production unless you really know what you're doing.
Keel
Keel is a Kubernetes Operator that will automatically update Kubernetes manifests and Helm charts. It has multiple options for finding updates — typically using webhooks from registries and falling back to polling for new versions. Updates are controlled via policies which cover the typical cases (update to major, minor, patch etc.).
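As an illustration, a Deployment might opt in to minor updates with registry polling using something like the following sketch; the keys are based on my reading of Keel's docs, so double-check them before relying on this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  annotations:
    # Allow Keel to apply minor and patch version updates
    keel.sh/policy: minor
    # Fall back to polling the registry when webhooks aren't available
    keel.sh/trigger: poll
    keel.sh/pollSchedule: "@every 10m"
spec:
  # ...the rest of the Deployment as usual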
However, there are some drawbacks with this approach:
As it stands, it doesn't seem compatible with GitOps, although there's no real reason it couldn't be (as discussed here). Either way, most people using GitOps will likely be using different tools.
If you require approvals, you need to work this into the workflow somehow. This perhaps isn't a disadvantage as much as a fact of life.
I'm unsure how it deals with tests; ideally an update would only be applied if it passed the tests. I believe the idea is to automatically apply updates to a staging or dev cluster and retag them for deployment to production once tested, which may cause issues for organizations that employ "test in production" techniques.
At the time of this writing, there doesn't seem to be support for automatic (or semi-automatic) rollbacks.
I have to admit to having no first-hand experience with Keel. However, with 2.4k stars on GitHub, it seems a popular choice. I also appreciate the quote on the website:
kubectl is the new SSH. If you are using it to update production workloads, you are doing it wrong.
GitOps: Flux and ArgoCD
Which brings us to GitOps, a set of practices that use Git as the single source of truth for infrastructure automation. GitOps emerged as a reaction to teams using kubectl to administer clusters in ways that were neither auditable nor repeatable; instead, Git becomes the source of truth and the declarative interface for making changes to clusters.
If you're using GitOps of some flavor, you probably already have an updating solution in place. If you're not, migrating will require some work (though likely worthwhile) and buy-in from the team. Even if you're not on GitOps, it's still worth looking at the typical GitOps approaches to see whether you want to replicate or learn from them. The two leading GitOps solutions are Flux and ArgoCD (though newer solutions, including Fleet and kluctl, are gaining traction).
Flux
Flux uses an ImageRepository custom resource that polls image repositories for new tags. There is also support for webhooks via the Notification Controller. An ImagePolicy custom resource defines which tags we're interested in; typically you will use a semver policy, such as "range: 5.0.x" to pick up patch releases within 5.0 (or "range: 5.x" for minor updates).
An ImageUpdateAutomation resource then defines how to handle updates: for example, committing the update directly, or committing to a new branch and opening a GitHub pull request for manual approval.
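To make that concrete, here is a sketch of an ImageRepository and ImagePolicy pair (the names and image are illustrative; check the Flux image automation docs for the current API versions):

apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: myapp
  namespace: flux-system
spec:
  image: registry.example.com/myapp
  interval: 10m   # how often to scan the registry for new tags
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: myapp
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: myapp
  policy:
    semver:
      range: 5.0.x   # only consider tags in the 5.0 series

The manifests in Git are then marked up with a setter comment (e.g. image: registry.example.com/myapp:5.0.1 # {"$imagepolicy": "flux-system:myapp"}) so the ImageUpdateAutomation controller knows which fields to rewrite when it commits changes back.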
There is also support for reverting updates and suspending automation to support incident response.
The separation of responsibilities between controllers and the wide range of features suggest that Flux has one of the most thought-out and battle-tested approaches to updates.
ArgoCD
ArgoCD has a separate Image Updater project that can be used to automate updates. This is an interesting deviation from the Flux approach; presumably ArgoCD wanted to keep the core of ArgoCD clean and simple and therefore moved updating to a separate project, whereas Flux felt image updating was core to the whole concept of GitOps and needed to be part of the base offering.
Rather than creating new resources, ArgoCD relies on annotations being added to existing manifests. Update "strategies" are similar to Flux, with support for semver and regexps to filter tags. Unlike Flux, there is currently no support for webhooks, but this is on the roadmap. Integration with ArgoCD's rollback feature is being worked on.
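For illustration, the annotations on an Application look roughly like this (the alias, image, and constraint are made up; consult the Image Updater docs for the full annotation set):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
  annotations:
    # Which images to track, in alias=image[:constraint] form
    argocd-image-updater.argoproj.io/image-list: myapp=registry.example.com/myapp:5.0.x
    # How to choose the new version for the "myapp" alias
    argocd-image-updater.argoproj.io/myapp.update-strategy: semver
spec:
  # ...the usual Application source and destination configuration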
ImageStreams
Next we're changing streams (ho ho ho) to look at OpenShift, which has the concept of ImageStreams for handling updates.
ImageStreams are a "virtual view" over the top of images. Deployments and builds can listen for ImageStream notifications and automatically update when new versions appear. The underlying data for an ImageStream comes from registries, but decoupling this data means it is possible to have different versions in the ImageStream and on the registry. This in turn allows for processes such as rolling back a deployment without retagging images on the registry. The ImageStream itself is represented as a custom resource containing a history of previous digests, ensuring that rollbacks are possible even when tags are overwritten (assuming the image isn't deleted).
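A minimal ImageStream that tracks an external image and re-imports it on a schedule looks something like the following sketch (based on the OpenShift docs; the image is illustrative):

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: myapp
spec:
  tags:
    - name: "7"
      from:
        kind: DockerImage
        name: registry.example.com/myapp:7
      importPolicy:
        # Periodically re-import the tag so the ImageStream notices new versions
        scheduled: true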
You're only going to be using ImageStreams if you're on OpenShift, and in that case you will likely be forced to use them. I asked some people why ImageStreams never made the jump to mainstream Kubernetes, and the consensus seemed to be that they weren't worth the effort; they add a confusing extra layer for little extra benefit.
Frizbee and digestabot
A best practice in supply chain security is to specify GitHub Actions and container images by their digest. The digest is a content-based SHA, so it is guaranteed to always refer to exactly the same version of the action or image, and it cannot have been changed underneath you. In other words, digests are immutable; they can't be changed to point to something else. The disadvantages are that digests aren't very human readable and you need to keep updating them to stay up to date.
However, it is possible to get something approaching the best of both worlds: something human readable yet immutable. The following are valid image references that specify both a meaningful tag and an immutable digest:
cgr.dev/chainguard/wolfi-base:latest@sha256:3eff851ab805966c768d2a8107545a96218426cee1e5cc805865505edbe6ce92
redis:7@sha256:01afb31d6d633451d84475ff3eb95f8c48bf0ee59ec9c948b161adb4da882053
There is even a tool, frizbee from Stacklok, that will update image references to the most up-to-date digest; for the first example above, it will ask the registry for the current digest of the cgr.dev/chainguard/wolfi-base:latest image and update the reference if it doesn't match.
[As an aside, the tag part between the : and @ isn't actually used by Docker (so docker pull redis:nonsense@sha256:01afb31d6d633451d84475ff3eb95f8c48bf0ee59ec9c948b161adb4da882053 works), but that doesn't mean other tools can't use it.]
At Chainguard we take a similar approach with the digestabot tool. Digestabot is a GitHub action that will look up digests in the above format and open a PR to update them.
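A workflow using it might look something like the sketch below; the schedule, the ref, and any required inputs are assumptions on my part, so take the exact configuration from the digestabot README:

name: update-digests
on:
  schedule:
    - cron: "0 6 * * 1"   # e.g. weekly; pick a cadence that suits your team
  workflow_dispatch: {}
permissions:
  contents: write
  pull-requests: write
jobs:
  digestabot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: chainguard-dev/digestabot@main   # pin to a release (ideally by digest!) in practice
        # see the digestabot README for any required inputs (e.g. a token that can open PRs)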
The digest approach is particularly useful when using the free developer tier of Chainguard Images, where only :latest and :latest-dev images are available. These images are constantly updated and will jump across minor and major versions. By pinning to a digest, you can hold back the update until your tests all pass and you're ready to make the change. (Of course, in the meantime you'll be stuck on an old image that isn't getting updates; talk to us if you need to stay on an older supported version and still get updates.)
Dependabot
Dependabot is GitHub's tool for monitoring dependencies. It really has two related but separate functions:
Alerting when vulnerable packages are found
Updating packages to the latest version
Both are relevant to this discussion, but we will focus on version updates.
Dependabot is used with GitHub, but can be self-hosted. It's designed to work with a range of package ecosystems (npm, Python, Ruby, and others), including container images referenced in Dockerfiles and Kubernetes manifests. Dependabot runs on a schedule (e.g. daily) and will open pull requests (PRs) to update dependencies when it finds out-of-date versions.
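For container images, the configuration lives in .github/dependabot.yml and is pleasantly short. For example, to check a Dockerfile in the repository root every day:

version: 2
updates:
  - package-ecosystem: "docker"   # covers image references in Dockerfiles
    directory: "/"                # where the Dockerfile lives
    schedule:
      interval: "daily"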
As Dependabot doesn't have access to Kubernetes clusters, it is essential that Git represents the current state of the system for this to work. If you're doing this, you're likely already using GitOps, which suggests you already have a solution in place. Dependabot is still a fantastic solution for helping keep software dependencies up to date (including references inside Dockerfiles), but is likely less used for updating infrastructure like clusters.
Renovate
Renovate is a similar solution to Dependabot and again opens PRs to update out-of-date dependencies. The major differences are that Renovate can be self-hosted and supports platforms beyond GitHub (e.g. GitLab, Azure DevOps, Bitbucket). It also comes with more configurability and a nice dashboard.
Conclusion
Something as seemingly simple and important as keeping packages up to date turns out to have more approaches and tooling than you might expect.
It would be remiss to write this whole article and not offer any recommendations, but this is a situation where context is king and different organizations will choose different solutions. That being said, I would personally lean towards a GitOps solution for control and oversight into the cluster. If this doesn't work for your team, just be aware of the tradeoffs you're making and the solutions and tooling available — don't roll your own solution from scratch; build on the work of others (and ideally contribute back!).
This is an area of continuing research for me; please let me know if you have a different approach or want to suggest a tool I've not mentioned!