Distilling Lightweight Language Models for C/C++ Vulnerabilities

πŸ“… 2025-10-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the low detection accuracy and high computational overhead in identifying complex logical vulnerabilities (e.g., TOCTOU, race conditions) in C/C++ code, this paper proposes FineSecβ€”a novel framework that systematically introduces knowledge distillation into code vulnerability detection. It employs a large language model (LLM) as the teacher to guide a lightweight student model in learning fine-grained vulnerability representations. FineSec integrates end-to-end components: automated data preparation, task-specific fine-tuning, multi-dimensional evaluation, and continual learning. Experiments across multiple real-world C/C++ codebases demonstrate that FineSec significantly outperforms baseline models and even larger LLMs on complex vulnerability detection, achieving a 12.7% F1-score improvement, an 83% reduction in inference latency, and a 76% decrease in memory footprint. All code, datasets, and evaluation results are publicly released, ensuring strong reproducibility and practical deployability in industrial settings.
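TOCTOU (time-of-check-to-time-of-use) flaws of the kind the summary highlights arise when a permission check and the operation that relies on it are separate system calls, leaving a window in which the filesystem state can change. A minimal illustration in Python (not taken from the paper; function names are hypothetical):

```python
import os

def read_if_allowed_racy(path):
    """TOCTOU anti-pattern: the check (os.access) and the use (open)
    are separate syscalls, so the file at `path` can be replaced,
    e.g. with a symlink, between the two."""
    if os.access(path, os.R_OK):   # time of check
        with open(path) as f:      # time of use: path may now differ
            return f.read()
    return None

def read_if_allowed_safe(path):
    """Safer pattern: attempt the operation and handle failure (EAFP),
    so the check and the use are the same single syscall."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None
```

The second form closes the race window, which is why detecting the first form requires reasoning about ordering across calls rather than flagging any single line, the kind of logical flaw the paper targets.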

πŸ“ Abstract
The increasing complexity of modern software systems exacerbates the prevalence of security vulnerabilities, posing risks of severe breaches and substantial economic loss. Consequently, robust code vulnerability detection is essential for software security. While Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing, their potential for automated code vulnerability detection remains underexplored. This paper presents FineSec, a novel framework that harnesses LLMs through knowledge distillation to enable efficient and precise vulnerability identification in C/C++ codebases. FineSec utilizes knowledge distillation to transfer expertise from large teacher models to compact student models, achieving high accuracy with minimal computational cost. By integrating data preparation, training, evaluation, and continuous learning into a unified, single-task workflow, FineSec offers a streamlined approach. Extensive evaluations on C/C++ codebases demonstrate its superiority over both base models and larger LLMs in identifying complex vulnerabilities and logical flaws, establishing FineSec as a practical and scalable solution for real-world software security. To facilitate reproducibility, the datasets, source code, and experimental results are made publicly available at: https://github.com/yangxiaoxuan123/FineSec_detect.
Problem

Research questions and friction points this paper is trying to address.

Detecting security vulnerabilities in complex C/C++ codebases
Reducing computational costs of large language models for vulnerability identification
Developing lightweight models through knowledge distillation for software security
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation transfers LLM expertise
Compact student models enable efficient detection
Unified workflow integrates training and evaluation
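The paper's exact training objective is not reproduced here, but soft-label knowledge distillation is conventionally formulated as a temperature-softened KL divergence between the teacher's and student's output distributions, scaled by T². A dependency-free sketch under that assumption (all names hypothetical, binary safe/vulnerable classification):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T yields softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep their magnitude as T grows."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Teacher is confident the snippet is vulnerable (class 1); the
# student partially agrees, so the loss is small but nonzero.
teacher = [0.5, 3.0]   # logits for [safe, vulnerable]
student = [1.0, 2.0]
loss = distillation_loss(teacher, student)
```

In practice this term is usually mixed with the ordinary cross-entropy on ground-truth labels; the soft targets carry the teacher's relative confidence across classes, which is the "fine-grained representation" a hard label discards.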
πŸ”Ž Similar Papers
No similar papers found.
Zhiyuan Wei
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Xiaoxuan Yang
University of Virginia
In-Memory Computing · Computer-Aided Design · Machine Learning Acceleration
Jing Sun
School of Computer Science, University of Auckland, Auckland, New Zealand
Zijian Zhang
School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China