Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited few-shot prompting performance for code vulnerability detection, primarily due to the difficulty of selecting high-quality, semantically relevant in-context examples. Method: This paper proposes a fine-tuning-free, retrieval-augmented few-shot prompting approach. It retrieves code snippets matching known vulnerability patterns via semantic similarity search and integrates them into prompts with tailored prompt engineering, enabling efficient inference with Gemini-1.5-Flash; a retrieval-based labeling baseline, which assigns labels directly from retrieved examples without model inference, is also evaluated. Contribution/Results: With only 20 retrieved examples, the method achieves a 74.05% F1 score and 83.90% partial-match accuracy, substantially outperforming zero-shot and random few-shot baselines and approaching the performance of fine-tuned models like CodeBERT, while eliminating costly training. To our knowledge, this is the first systematic study demonstrating the effectiveness and practicality of retrieval-augmented prompting for security-critical code analysis.
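The retrieval step the summary describes can be sketched minimally. The bag-of-tokens embedding below is a toy stand-in for a real semantic code-embedding model, and the prompt wording and function names are illustrative, not the paper's:

```python
import math
from collections import Counter

def embed(code: str) -> Counter:
    # Toy bag-of-tokens "embedding"; a real pipeline would use a
    # learned code-embedding model for semantic similarity.
    return Counter(code.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(query: str, corpus: list[tuple[str, str]], k: int = 20) -> str:
    # Rank labeled snippets by similarity to the query and prepend
    # the top k as in-context examples before the query itself.
    q = embed(query)
    ranked = sorted(corpus, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    parts = ["Identify the vulnerability categories in the final snippet."]
    for code, label in ranked[:k]:
        parts.append(f"Code:\n{code}\nVulnerabilities: {label}")
    parts.append(f"Code:\n{query}\nVulnerabilities:")
    return "\n\n".join(parts)
```

The resulting prompt string would then be sent to the LLM (Gemini-1.5-Flash in the paper); the retrieval itself requires no model training.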

📝 Abstract
Few-shot prompting has emerged as a practical alternative to fine-tuning for leveraging the capabilities of large language models (LLMs) in specialized tasks. However, its effectiveness depends heavily on the selection and quality of in-context examples, particularly in complex domains. In this work, we examine retrieval-augmented prompting as a strategy to improve few-shot performance in code vulnerability detection, where the goal is to identify one or more security-relevant weaknesses present in a given code snippet from a predefined set of vulnerability categories. We perform a systematic evaluation using the Gemini-1.5-Flash model across three approaches: (1) standard few-shot prompting with randomly selected examples, (2) retrieval-augmented prompting using semantically similar examples, and (3) retrieval-based labeling, which assigns labels based on retrieved examples without model inference. Our results show that retrieval-augmented prompting consistently outperforms the other prompting strategies. At 20 shots, it achieves an F1 score of 74.05% and a partial match accuracy of 83.90%. We further compare this approach against zero-shot prompting and several fine-tuned models, including Gemini-1.5-Flash and smaller open-source models such as DistilBERT, DistilGPT2, and CodeBERT. Retrieval-augmented prompting outperforms both zero-shot (F1 score: 36.35%, partial match accuracy: 20.30%) and fine-tuned Gemini (F1 score: 59.31%, partial match accuracy: 53.10%), while avoiding the training time and cost associated with model fine-tuning. On the other hand, fine-tuning CodeBERT yields higher performance (F1 score: 91.22%, partial match accuracy: 91.30%) but requires additional training, maintenance effort, and resources.
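The two metrics reported in the abstract can be illustrated for the multi-label setting it describes. The paper's exact definitions are not reproduced here, so this sketch assumes partial match counts an example as correct when the predicted and gold category sets overlap, and that F1 is micro-averaged over labels:

```python
def partial_match_accuracy(preds: list[list[str]], golds: list[list[str]]) -> float:
    # Assumed definition: an example is correct if at least one
    # predicted category appears in the gold set.
    hits = sum(1 for p, g in zip(preds, golds) if set(p) & set(g))
    return hits / len(golds)

def micro_f1(preds: list[list[str]], golds: list[list[str]]) -> float:
    # Micro-averaged F1 over all (example, label) decisions.
    tp = sum(len(set(p) & set(g)) for p, g in zip(preds, golds))
    fp = sum(len(set(p) - set(g)) for p, g in zip(preds, golds))
    fn = sum(len(set(g) - set(p)) for p, g in zip(preds, golds))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Partial match is deliberately lenient, which is why it can sit above F1 (83.90% vs. 74.05% for the retrieval-augmented approach) when predictions recover some but not all categories of an example.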
Problem

Research questions and friction points this paper is trying to address.

Compares retrieval-augmented prompting with fine-tuning for code vulnerability detection.
Evaluates methods to improve few-shot performance in identifying security weaknesses in code.
Assesses strategies to reduce dependency on training while maintaining detection accuracy.
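On the training-dependency question, the abstract's third baseline, retrieval-based labeling, goes furthest: labels come straight from retrieved neighbors with no model inference at all. A minimal sketch, assuming a token-Jaccard similarity and a neighbor-frequency threshold (both illustrative choices, not the paper's):

```python
from collections import Counter

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieval_label(query: str, corpus: list[tuple[str, list[str]]],
                    k: int = 5, min_frac: float = 0.5) -> list[str]:
    # Assign every category that appears in at least min_frac of the
    # k nearest labeled neighbors -- no LLM call is made.
    ranked = sorted(corpus, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]
    counts = Counter(lbl for _, labels in ranked for lbl in labels)
    return sorted(l for l, c in counts.items() if c / len(ranked) >= min_frac)
```

In the paper's evaluation this inference-free baseline underperforms retrieval-augmented prompting, suggesting the LLM still adds value beyond nearest-neighbor lookup.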
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented prompting improves few-shot code vulnerability detection.
Semantically similar examples enhance performance over random selection.
This approach avoids fine-tuning costs while outperforming zero-shot methods.