🤖 AI Summary
This work addresses the limited performance of large language models (LLMs) on code reasoning tasks—such as vulnerability detection—which stems from insufficient deep semantic understanding of code. The authors propose ConceptCoder, the first framework to formally define "semantic concepts" in the code domain, and introduce a two-stage fine-tuning approach: models first identify human-interpretable semantic concepts within code, then leverage these concepts for downstream reasoning. Evaluated across nine open-source LLMs, ConceptCoder improves the average F1 score from 66.32 to 72.15, outperforming current state-of-the-art methods—including both fine-tuned open models and closed-source counterparts such as GPT-5.2 and Claude-Opus-4.5. It also generalizes strongly to a dataset spanning 134 Common Weakness Enumeration (CWE) vulnerability types, effectively emulating human-like code review.
📝 Abstract
Large language models (LLMs) have shown promising results for software engineering applications, but they still struggle with code reasoning tasks such as vulnerability detection (VD). We introduce ConceptCoder, a fine-tuning method that simulates human code inspection: models are trained to first recognize code concepts and then reason on top of those concepts. In prior work, concepts have been extracted by multimodal models or LLMs to explain vision and natural-language models; our work is the first to formulate concepts for code. We define code concepts as human-understandable semantic properties of code and train models to learn them. Our evaluation shows that this approach significantly improves VD accuracy, from 66.32 to 72.15 F1 on average across 9 open-source LLMs. ConceptCoder achieves the best VD performance among state-of-the-art (SOTA) baselines, including fine-tuned SOTA open-source LLMs and prompted proprietary models such as GPT-5.2 and Claude-Opus-4.5. Our approach also scales: concepts defined from four vulnerability types benefit general vulnerability datasets spanning 134 CWEs. We further demonstrate that concept-based fine-tuning generalizes beyond VD and improves branch prediction. We release our code and datasets at https://figshare.com/s/1decab8232c653b44f71.