
Pairing security advisories with vulnerable functions using LLMs

Trevor Dunlap, Principal Researcher, John Speed Meyers, Head of Chainguard Labs

TL;DR: 


Existing Common Vulnerabilities and Exposure (CVE) data is messy, wasting organizations’ time and hindering technological progress. We’re focusing on enhancing vulnerability data by pairing security advisories with vulnerable functions. Using large language models (LLMs), we achieved a 173% increase in precision with only an 18% decrease in recall compared to naive methods, demonstrating LLMs' potential for aiding vulnerability management.


Where’s the vulnerability?


Vulnerability data has been all the rage lately, from bogus CVEs and half-day vulnerabilities to the NVD halting enrichment of MITRE’s CVE List. The reality is that dirty CVE data is a significant issue in cybersecurity. Inaccurate or inconsistent information leads to wasted effort and missed threats for downstream data users. Dirty data not only wastes organizations’ time, it also impedes newer technologies aimed at solving the holy grail of security: automatically determining whether a vulnerability affects a piece of software and then fixing it.


One example is reachability analysis, which aims to determine whether functions within dependencies (either direct or transitive) that contain vulnerable code are ever called. The intuition is simple: if the vulnerable portion of code is never called, you’re most likely not affected. However, reachability analysis has an Achilles’ heel: you need to know exactly which functions you are “reaching” for, and existing vulnerability reports generally do not provide that information.


To address this challenge, we set out to understand how to identify the vulnerable functions associated with a security advisory efficiently. Given the promise of large language models (LLMs) in various domains, particularly their coding capabilities, we hypothesize that LLMs can significantly aid in the accuracy of pairing security advisories with vulnerable functions. 


Existing approaches


Today, there are two common approaches for pairing security advisories with vulnerable functions: (1) manual curation and (2) using a naive automated approach that considers all changed functions in a security patch link as vulnerable. 


The Go Vulnerability Database (GoVulnDB), supported by Google’s Go security team, is at the forefront of enriching its data with vulnerable functions. GoVulnDB refers to these vulnerable functions as affected symbols, defined as “a string array with the names of the symbols (function or method) that contains the vulnerability.” These vulnerable functions power its reachability tool, Govulncheck.
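For concreteness, here is a minimal sketch of pulling those affected symbols out of a GoVulnDB OSV report. It assumes the affected[].ecosystem_specific.imports[].symbols layout used by GoVulnDB entries; the field names and the example report filename are assumptions worth checking against the current schema.

```python
import json

def affected_symbols(osv_path: str) -> dict[str, list[str]]:
    """Collect the vulnerable symbols listed in a GoVulnDB OSV report,
    keyed by the import path they belong to."""
    with open(osv_path) as f:
        report = json.load(f)

    symbols: dict[str, list[str]] = {}
    for affected in report.get("affected", []):
        ecosystem = affected.get("ecosystem_specific", {})
        for imp in ecosystem.get("imports", []):
            # "symbols" is the string array of vulnerable functions/methods.
            symbols[imp["path"]] = imp.get("symbols", [])
    return symbols

# Example usage (the report filename is illustrative):
# for path, syms in affected_symbols("GO-2021-0113.json").items():
#     print(path, syms)
```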


Originally, GoVulnDB paired security advisories with vulnerable functions manually, a reliable and precise technique, but a time-consuming one. Recently, GoVulnDB has switched to experimenting with an automated approach.


Figure 1: The naive automated approach of assuming all modified functions within a patch link are vulnerable. Note, function g() only refactors code and has nothing to do with fixing the vulnerability.

Alternatively, the naive automated approach (Figure 1), often seen in academic literature, is intuitive but flawed: it assumes that all modified functions in a security patch are vulnerable. The underlying assumption is that developers only address the vulnerability and make no additional changes during the patching process.
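To make that baseline concrete, below is a rough sketch of the naive labeling under the assumption that git's hunk headers carry the enclosing func declaration (they frequently do for Go, but not always); a production implementation would parse the Go AST rather than scrape diff text. The repository path and commit are placeholders.

```python
import re
import subprocess

# Matches hunk headers like "@@ -10,6 +10,8 @@ func (d *Decoder) Decode(..."
# and captures the function name; misses hunks with no func context.
HUNK_FUNC = re.compile(r"^@@ .+ @@ func (?:\([^)]*\) )?(\w+)")

def naive_vulnerable_functions(repo: str, fix_commit: str) -> set[str]:
    """Naive labeling: every function touched by the fix commit is treated
    as vulnerable, including pure refactors like g() in Figure 1."""
    diff = subprocess.run(
        ["git", "-C", repo, "show", "--unified=0", fix_commit],
        capture_output=True, text=True, check=True,
    ).stdout

    touched: set[str] = set()
    for line in diff.splitlines():
        match = HUNK_FUNC.match(line)
        if match:
            touched.add(match.group(1))
    return touched
```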


However, our preliminary analysis found that out of all the functions the automated approach labeled as vulnerable, only 22% were correct when using GoVulnDB as ground truth. This means that 78% of the functions modified in a security patch were unrelated to the actual vulnerability, underscoring the significant issue of overestimation and highlighting the need for improved automated methods.


Using LLMs to identify vulnerable functions


We hypothesize that LLMs can significantly improve the accuracy of pairing security advisories with vulnerable functions over the existing approach. Our high-level intuition of using LLMs can be seen in Figure 2:


Figure 2: A high-level approach of using LLMs to pair security advisories with vulnerable functions.

We take the vulnerability description from the security advisory and the associated changes from the security patch link and ask the model whether the changes fix the underlying vulnerability. To further evaluate this overarching technique, we explore various model sizes (i.e., CodeLlama 7 billion (B) parameters, 13B, and 34B versions built on top of Llama 2) using a variety of prompting strategies. We develop three prompts: a simple standard prompt, a detailed prompt, and a chain-of-thought prompt that involves the model providing an explanation. We evaluate these prompts in two learning paradigms: zero-shot and few-shot. Additionally, we evaluate the computation times for each strategy.
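As a rough illustration of this pipeline (a sketch, not the code used in the paper), the core loop asks the model one question per modified function and keeps the functions the model ties to the fix. The query_llm callable and the way the description and hunk are laid out in the prompt are assumptions; the actual prompts we used appear in Table 1 below.

```python
PROMPT_TEMPLATE = """{prompt}

Fixed Vulnerability Description: {description}

GIT-HUNK:
{hunk}
"""

def pair_advisory_with_functions(prompt: str, description: str,
                                 hunks_by_function: dict[str, str],
                                 query_llm) -> set[str]:
    """Label a function as vulnerable when the model says its hunk is part
    of the fix described in the advisory."""
    vulnerable: set[str] = set()
    for func_name, hunk in hunks_by_function.items():
        reply = query_llm(PROMPT_TEMPLATE.format(
            prompt=prompt, description=description, hunk=hunk))
        # For the standard/detailed prompts, the reply should be just
        # 'True' or 'False'.
        if reply.strip().lower().startswith("true"):
            vulnerable.add(func_name)
    return vulnerable
```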


We use the GoVulnDB data collected until October 2023 as ground truth data. In total, we identified 280 reports listing vulnerable functions. From the associated patch link in each advisory, we identified 2,370 modified functions. Of these, 528 were labeled as vulnerable, with the remaining 1,842 considered non-vulnerable. 


Prompting matters


Ask, and thou shalt receive — as long as you're specific with what you ask for. To test this, we designed three prompts. The first is what we define as a “standard” prompt. It’s straightforward and simple. The second is our “detailed” prompt, which adds the security advisory description and is a bit more specific in asking if a certain type of vulnerability was fixed. Finally, we have the “chain-of-thought” (CoT) prompt, which is designed to let the model “think” or “reason” before concluding.

Table 1: Prompt design of our standard, detailed, and chain-of-thought prompts to pair security advisories with vulnerable functions.


Standard Prompt

I want you to act as a vulnerability fix detection system. Determine if the following git-hunk fixed a vulnerability. Respond with only ‘True’ or ‘False.’

Detailed Prompt

Your task is to analyze the provided code changes (GIT-HUNK) to determine if they target a specific vulnerability (Fixed Vulnerability Description). Review each line, considering its direct relevance to the vulnerability description. Ignore any new vulnerabilities that may have been introduced. Answer ‘True’ if changes are directly related to fixing or even somewhat partially related to fixing the vulnerability. Only provide a ‘False’ conclusion if the GIT-HUNK changes are absolutely unrelated to the vulnerability. Respond with only ‘True’ or ‘False.’

Chain-of-Thought (CoT) Prompt

{Detailed Prompt} + Provide a brief description of what the GIT-HUNK changes have done, then conclude by labeling ‘True’ if changes are directly related to fixing or even somewhat partially related to fixing the vulnerability. Only provide a ‘False’ conclusion if the GIT-HUNK changes are absolutely unrelated to the vulnerability. Justify your decision before ending with a clear ‘True’ or ‘False’ decision. Do not answer right away. Answer in the following format:

Explanation — {Your Explanation}

Final Decision — {True/False}
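Because the CoT prompt asks the model to explain itself before answering, the verdict has to be pulled from the end of the response rather than its first token. Below is a minimal parsing sketch, assuming responses follow the requested format; the answer-extraction logic used in practice may need to be more forgiving.

```python
import re

# The CoT prompt requests a closing "Final Decision" line; accept either
# the dash character from the prompt or a plain hyphen.
DECISION = re.compile(r"Final Decision\s*[—-]\s*(True|False)", re.IGNORECASE)

def parse_cot_response(response: str) -> bool | None:
    """Return the model's final True/False verdict, or None if the response
    does not follow the requested output format."""
    matches = DECISION.findall(response)
    return matches[-1].lower() == "true" if matches else None
```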


Figure 3 displays the results across the prompts from Table 1. The prompts have the largest impact on the smaller 7B model. For instance, switching from the standard to the detailed prompt resulted in a 66.7% increase in F1 score (0.27 to 0.45), and switching from the detailed to the CoT prompt yielded another 26.7% increase (0.45 to 0.57). These F1 gains come with tradeoffs in precision and recall: for the 7B model, the detailed and CoT prompts decreased precision and increased recall. With the standard prompt, the model’s high precision and low recall indicate it was predominantly responding False; the detailed and CoT prompts pushed it toward more True responses, raising recall at the cost of precision.


Figure 3: Comparison of prompts across the CodeLlama Family of Models.

A consistent trend was higher performance using the CoT prompt. However, the greatest takeaway from Figure 3 is that the CoT prompt allows the smaller 7B model to perform similarly to the larger 13B and 34B models. While prompts are important, we find guiding examples can help as well.


Examples help too


We test two common prompting paradigms: zero-shot and few-shot.


Zero-shot: The zero-shot learning paradigm prompts an LLM without examples (as in the Figure 3 results). Zero-shot is useful when end users have little to no example data.


Few-shot: In contrast to zero-shot, few-shot learning incorporates a set of examples within the prompt. These examples serve as a guide, allowing the LLM to quickly grasp the context and specifics of the new task. This approach is beneficial in limited-data scenarios, but a set of known ground-truth labels must exist. We rely on a retrieval system to obtain the appropriate examples for few-shot learning, as sketched below.
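As one way to make that retrieval step concrete, the sketch below picks the k labeled hunks most lexically similar to the query hunk using TF-IDF and cosine similarity. Our actual retrieval system may rank examples differently, so treat this as an assumption-laden stand-in rather than the method from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_examples(query_hunk: str,
                      labeled_hunks: list[tuple[str, bool]],
                      k: int = 3) -> list[tuple[str, bool]]:
    """Return the k labeled (hunk, is_fix) pairs most similar to the query
    hunk; these get prepended to the prompt as few-shot examples."""
    vectorizer = TfidfVectorizer().fit(
        [hunk for hunk, _ in labeled_hunks] + [query_hunk])
    example_vecs = vectorizer.transform([hunk for hunk, _ in labeled_hunks])
    query_vec = vectorizer.transform([query_hunk])

    # Rank labeled examples by cosine similarity to the query hunk.
    sims = cosine_similarity(query_vec, example_vecs).ravel()
    top = sims.argsort()[::-1][:k]
    return [labeled_hunks[i] for i in top]
```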


We compare zero-shot and few-shot performance across the CodeLlama models in Table 2.


Table 2: Zero-shot vs Few-Shot in terms of F1 across CodeLlama (7B, 13B, 34B) using the standard, detailed, and CoT prompt. The best metrics are bolded.


            CodeLlama-7B            CodeLlama-13B           CodeLlama-34B
# Examples  0     3     5     10    0     3     5     10    0     3     5     10
Standard    0.27  0.38  0.39  0.35  0.53  0.57  0.54  0.53  0.59  0.53  0.54  0.60
Detailed    0.45  0.53  0.58  0.52  0.54  0.56  0.56  0.54  0.62  0.57  0.59  0.62
CoT         0.57  0.56  0.59  0.60  0.59  0.60  0.60  0.55  0.62  0.63  0.62  0.61

Introducing a few-shot paradigm had the largest impact on the smaller CodeLlama 7B model. On average, each prompt in a few-shot setting increased the F1 score of the CodeLlama 7B model by 26.2%. Few-shot learning increased the CodeLlama 13B version’s F1 on average by 4.1% and the CodeLlama 34B version’s by only 1.1%. 


The overall trend for the standard and detailed prompt after adding few-shot examples was an increase in recall and a decrease in precision. The CoT prompt had the opposite reaction, typically seeing a minor reduction in recall and increased precision. The few-shot approach in the detailed and CoT prompts allows the smaller CodeLlama 7B model to perform similarly to the 13B and 34B versions.


Additional models


We also evaluated other open-source models: Mixtral 8x7B, WizardCoder 15B, and DeepSeek 33B. Notably, Mixtral 8x7B has the highest overall F1 score. Specifically, its precision is a 173% improvement over the naive automated approach (0.60 vs. 0.22), at the cost of an 18% decrease in recall (0.74 vs. 0.90). In contrast, CodeLlama 34B has a slightly higher recall than Mixtral 8x7B but significantly lower precision (0.48 vs. 0.60). DeepSeek and WizardCoder achieved higher precision scores but at the expense of lower recall rates than CodeLlama and Mixtral, suggesting they were more prone to producing False responses than the other two models. We note that Mixtral 8x7B is the largest model we evaluated (47B total parameters), which could be a factor in its higher performance.
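For clarity, the headline percentages are simple relative changes between the naive baseline and the Mixtral 8x7B scores reported above (a quick worked check, not new data):

```python
# Relative change between the naive baseline and Mixtral 8x7B.
naive_precision, llm_precision = 0.22, 0.60
naive_recall, llm_recall = 0.90, 0.74

precision_gain = (llm_precision - naive_precision) / naive_precision  # ~1.73
recall_drop = (naive_recall - llm_recall) / naive_recall              # ~0.18

print(f"precision +{precision_gain:.0%}, recall -{recall_drop:.0%}")
# precision +173%, recall -18%
```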


Final thoughts


In summary, this study highlights the challenge of automating the pairing between security advisories and vulnerable functions, demonstrating the potential of LLMs as a promising initial solution. Our findings suggest that while CoT prompting enhances accuracy, it incurs significant computational costs. Alternatively, strategies using few-shot learning and concise prompts achieve comparable accuracy with reduced computational overhead.


Manual analysis provides high-quality results but does not scale effectively. Previous automated approaches often introduce messy data by mispairing non-vulnerable functions with security advisories. Although LLMs are not perfect, they represent a step forward. To keep progressing, we must evaluate LLMs rather than ignore or unquestioningly trust new technologies. Until automated approaches reach 100% accuracy, LLMs can assist analysts in making quicker decisions when populating vulnerability databases. Inaccurate data in these databases impacts the effectiveness of downstream tools that rely on them. As the saying goes: garbage in, garbage out.

What's next? So far, we’ve only evaluated a handful of smaller, open-source models. Could larger proprietary models, such as OpenAI's GPT-4, Google Gemini, or Anthropic's Claude, yield better results? Most likely! Could incorporating other aspects of vulnerability reports, such as PoCs, additional links, PRs, and issue threads, also improve data quality? Again, most likely! If you’re working on a similar topic, we’d also like to hear more from you. 


Interested in the complete details? Check out the full paper, Pairing Security Advisories with Vulnerable Functions Using Open-Source LLMs, which we authored with friends from North Carolina State University, Dr. William Enck and Dr. Brad Reaves, and which appeared at the 2024 Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA).
