You’ve done it: you’ve secured your software supply chain. You’ve implemented SLSA, started signing commits, adopted minimal, reproducible container base images, and ensured compliance. All your developers are following best practices—and they’re even happy with the tooling! Time to take a well-earned vacation.
Fast-forward six months: you’ve been hauled in front of Congress to testify about the breach that led to the theft of your customers’ financial data. How? A data scientist was running some statistics on customer data when they pulled some weights for a neural net off the internet and loaded them with `pickle.load()`. This compromised their Jupyter notebook, which had a live connection to a production database. From there, the attackers had won.
Just as “MLOps” applies techniques and processes from the software lifecycle to the data lifecycle, the practice of “ML Supply Chain Security” applies techniques and processes for securing the software lifecycle to the data lifecycle. In this post, we’ll learn:
Machine Learning Supply Chain Security
According to Microsoft, a typical data science lifecycle for a team involves a step called “Data acquisition and understanding.” This sounds suspiciously like what software engineers do when they engage with open source software. But there are a few factors that make the data science setting scarier:
You still have all of the dangers of a standard software supply chain, because you take dependencies on open source code to even do the analysis:
Further, models are really just code:
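To make that concrete, here is a minimal, deliberately harmless sketch of why loading a pickled "model" amounts to running someone else's code. The class name `MaliciousWeights` and the arithmetic payload are hypothetical stand-ins; a real attacker would choose a far more damaging callable.

```python
import pickle


class MaliciousWeights:
    """Hypothetical stand-in for a 'weights' object crafted by an attacker."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it at load time.
        # Harmless arithmetic here; an attacker would run something else.
        return (eval, ("6 * 7",))


# What the attacker publishes as "weights".
payload = pickle.dumps(MaliciousWeights())

# What the victim runs. The MaliciousWeights class is not needed on their
# side: the payload only references the built-in eval and its argument.
loaded = pickle.loads(payload)

print(type(loaded).__name__, loaded)
```

Here `loaded` is the integer 42 rather than anything resembling weights: the "data" chose which function to call, and the loader called it.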
Finally, data from unknown sources poses all the same threats:
To sum it up: the data you depend on can introduce vulnerabilities and run attacker code just like software dependencies. However, it’s totally opaque and you can’t tell by looking. Further, the state of the art in operations for data science is about a decade behind that of software engineering. Once an attacker has a foothold, they can do a ton of damage:
Securing the Machine Learning Supply Chain
Fortunately, we know how to secure the ML supply chain: it’s the same way we secure the software supply chain! Specifically:
Many of these are already best practices in the MLOps world, and follow directly from applying frameworks like SLSA to this problem space. They even have tangential benefits, like reproducibility—which is just good science!
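As one illustration of a software supply chain practice carried over to ML artifacts, the sketch below checks a downloaded file against a pinned SHA-256 digest before anything tries to load it. The function names and the idea of keeping the expected digest in version control are assumptions made for this example, not a prescribed workflow.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Return the path only if the file matches the pinned digest.

    expected_sha256 is assumed to come from a trusted place, such as a
    lock file reviewed alongside the code (hypothetical setup).
    """
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(
            f"digest mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return Path(path)
```

A caller would run `verify_artifact` on the downloaded file and hand it to a loader only if no exception is raised. A digest check shows only that the file is the one that was pinned; it says nothing about whether that file was safe to begin with.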
However, the tooling for solving these problems for data science is currently immature. In some cases, software tools can be applied as-is, but ML-specific implementations might be required to support workflows that data scientists have come to expect (for instance, using interactive notebooks). This brief from Georgetown University’s Center for Security and Emerging Technology provides policy recommendations, and this Transatlantic Cyber Forum report recommends a “security approach rooted in conventional information security” and outlines the many steps that will be required to implement it, like “[i]ncreas[ing] transparency, traceability, validation, and verification,” which projects like Sigstore are doing for software!
It will be a long journey to secure ML supply chains, but we can follow the tracks laid in the software world; the sooner we start, the better.