🤖 AI Summary
This study examines how code representation affects false positives in cross-language large language model (LLM)-based vulnerability detection. By systematically comparing raw source code with pruned abstract syntax trees (ASTs) at both training and inference time, the authors argue that the model's reliance on surface-level cues, such as C/C++-specific API names and syntactic idioms, is the primary cause of false positives. To mitigate this without retraining, they propose a cross-representation probe that applies text-trained weights to AST-encoded input. With Qwen3-8B and Llama 3.1-8B-Instruct fine-tuned on C/C++ and evaluated on Java and Python benchmarks, the probe reduces Qwen3-8B's false positive rate on the Java benchmark from 0.866 to 0.583, with 37.2% of false positives reverting to true negatives. On Python it yields a false positive rate of 0.554, within 2.9 percentage points of the Java result, which indicates that the effect holds across target languages.
📝 Abstract
How code representation format shapes false positive behaviour in cross-language LLM vulnerability detection remains poorly understood. We systematically vary training intensity and code representation format, comparing raw source text with pruned Abstract Syntax Trees at both training time and inference time, across two 8B-parameter LLMs (Qwen3-8B and Llama 3.1-8B-Instruct) fine-tuned on C/C++ data from the NIST Juliet Test Suite (v1.3) and evaluated on Java (OWASP Benchmark v1.2) and Python (BenchmarkPython v0.1).
Cross-language false positive rate (FPR) reflects the joint effect of training-time and inference-time representation, not either alone. Text fine-tuning drives FPR upward monotonically (Qwen3-8B: 0.763 zero-shot, 0.866 pilot, 1.000 full-scale) while F1 remains stable (0.637-0.688), masking the collapse. We argue surface-cue memorisation is the primary mechanism: text fine-tuning encodes C/C++-specific API names and syntactic idioms as vulnerability triggers that fire indiscriminately on target-language code. A cross-representation probe, applying text-trained weights to AST-encoded input without retraining, isolates this: Qwen3-8B FPR drops from 0.866 to 0.583, and 37.2% of false positives revert to true negatives under AST input alone. Direct AST fine-tuning does not preserve the benefit (FPR at least 0.970), as flat linearisation introduces structural surface cues of its own. The pattern replicates across both model families. On BenchmarkPython the AST probe yields FPR=0.554, within 2.9 percentage points of the Java result, despite maximal surface-syntax differences, substantially weakening a domain-shift explanation. These findings motivate a pre-deployment consistency gate, running alerts through both text and AST paths, as a retraining-free filter for false-positive-sensitive settings, at the cost of reduced recall.
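The consistency gate described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `pruned_ast`, `consistency_gate`, and `fpr_and_f1` are hypothetical names, the two classifiers are placeholder callables standing in for the text path and the AST path of a fine-tuned model, and the pruning rule (keep node-type names, drop identifiers and literals) is an assumption chosen only to remove surface cues. The metric helper also shows why a stable F1 can hide an FPR of 1.0: on a balanced set, flagging everything still gives F1 of about 0.667.

```python
import ast
from typing import Callable, Sequence, Tuple

# Hypothetical classifier type: encoded program in, True if flagged as vulnerable.
Classifier = Callable[[str], bool]


def pruned_ast(source: str) -> str:
    """Linearise Python source as a sequence of AST node-type names.

    Assumed pruning rule for illustration: identifiers, literals and
    expression-context markers are dropped, so API names cannot act as cues.
    """
    tree = ast.parse(source)
    return " ".join(
        type(node).__name__
        for node in ast.walk(tree)
        if not isinstance(node, ast.expr_context)
    )


def consistency_gate(
    source: str, classify_text: Classifier, classify_ast: Classifier
) -> bool:
    """Raise an alert only if both the text path and the AST path flag the code."""
    if not classify_text(source):
        return False
    return classify_ast(pruned_ast(source))


def fpr_and_f1(
    preds: Sequence[bool], labels: Sequence[bool]
) -> Tuple[float, float]:
    """Compute false positive rate and F1 from boolean predictions and labels."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    tn = sum(1 for p, y in zip(preds, labels) if not p and not y)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return fpr, f1


if __name__ == "__main__":
    # Toy stand-ins: the text path flags everything (a collapsed detector),
    # the AST path flags only code that contains a call node.
    always_flag: Classifier = lambda text: True
    has_call: Classifier = lambda encoded: "Call" in encoded.split()

    print(consistency_gate("x = 1", always_flag, has_call))
    print(consistency_gate("run(x)", always_flag, has_call))
    print(fpr_and_f1([True, True, True, True], [True, True, False, False]))
```

Because the gate takes the conjunction of the two paths, it can only remove alerts, never add them. This is the source of the recall cost noted in the abstract: a true positive that the AST path misses is suppressed along with the false positives.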