Distilling Lightweight Language Models for C/C++ Vulnerabilities

πŸ“… 2025-10-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the low detection accuracy and high computational overhead in identifying complex logical vulnerabilities (e.g., TOCTOU, race conditions) in C/C++ code, this paper proposes FineSecβ€”a novel framework that systematically introduces knowledge distillation into code vulnerability detection. It employs a large language model (LLM) as the teacher to guide a lightweight student model in learning fine-grained vulnerability representations. FineSec integrates end-to-end components: automated data preparation, task-specific fine-tuning, multi-dimensional evaluation, and continual learning. Experiments across multiple real-world C/C++ codebases demonstrate that FineSec significantly outperforms baseline models and even larger LLMs on complex vulnerability detection, achieving a 12.7% F1-score improvement, an 83% reduction in inference latency, and a 76% decrease in memory footprint. All code, datasets, and evaluation results are publicly released, ensuring strong reproducibility and practical deployability in industrial settings.
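TOCTOU (time-of-check-to-time-of-use) flaws of the kind the summary highlights arise when a permission check and the operation that relies on it are separate system calls, leaving a window in which the filesystem state can change. A minimal illustration in Python (not taken from the paper; function names are hypothetical):

```python
import os

def read_if_allowed_racy(path):
    """TOCTOU anti-pattern: the check (os.access) and the use (open)
    are separate syscalls, so the file at `path` can be replaced,
    e.g. with a symlink, between the two."""
    if os.access(path, os.R_OK):   # time of check
        with open(path) as f:      # time of use: path may now differ
            return f.read()
    return None

def read_if_allowed_safe(path):
    """Safer pattern: attempt the operation and handle failure (EAFP),
    so the check and the use are the same single syscall."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None
```

The second form closes the race window, which is why detecting the first form requires reasoning about ordering across calls rather than flagging any single line, the kind of logical flaw the paper targets.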

πŸ“ Abstract
The increasing complexity of modern software systems exacerbates the prevalence of security vulnerabilities, posing risks of severe breaches and substantial economic loss. Consequently, robust code vulnerability detection is essential for software security. While Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing, their potential for automated code vulnerability detection remains underexplored. This paper presents FineSec, a novel framework that harnesses LLMs through knowledge distillation to enable efficient and precise vulnerability identification in C/C++ codebases. FineSec utilizes knowledge distillation to transfer expertise from large teacher models to compact student models, achieving high accuracy with minimal computational cost. By integrating data preparation, training, evaluation, and continuous learning into a unified, single-task workflow, FineSec offers a streamlined approach. Extensive evaluations on C/C++ codebases demonstrate its superiority over both base models and larger LLMs in identifying complex vulnerabilities and logical flaws, establishing FineSec as a practical and scalable solution for real-world software security. To facilitate reproducibility, the datasets, source code, and experimental results are made publicly available at: https://github.com/yangxiaoxuan123/FineSec_detect.
Problem

Research questions and friction points this paper is trying to address.

Detecting security vulnerabilities in complex C/C++ codebases
Reducing computational costs of large language models for vulnerability identification
Developing lightweight models through knowledge distillation for software security
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation transfers LLM expertise
Compact student models enable efficient detection
Unified workflow integrates training and evaluation
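The paper's exact training objective is not reproduced here, but soft-label knowledge distillation is conventionally formulated as a temperature-softened KL divergence between the teacher's and student's output distributions, scaled by T². A dependency-free sketch under that assumption (all names hypothetical, binary safe/vulnerable classification):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T yields softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep their magnitude as T grows."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Teacher is confident the snippet is vulnerable (class 1); the
# student partially agrees, so the loss is small but nonzero.
teacher = [0.5, 3.0]   # logits for [safe, vulnerable]
student = [1.0, 2.0]
loss = distillation_loss(teacher, student)
```

In practice this term is usually mixed with the ordinary cross-entropy on ground-truth labels; the soft targets carry the teacher's relative confidence across classes, which is the "fine-grained representation" a hard label discards.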
πŸ”Ž Similar Papers
No similar papers found.
Zhiyuan Wei
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Xiaoxuan Yang
University of Virginia
In-Memory Computing · Computer-Aided Design · Machine Learning Acceleration
Jing Sun
School of Computer Science, University of Auckland, Auckland, New Zealand
Zijian Zhang
School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China