Discovering 100+ Compiler Defects in 72 Hours via LLM-Driven Semantic Logic Recomposition

📅 2026-01-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing compiler fuzzing approaches struggle to preserve the semantic logic needed to trigger deep-seated bugs, which limits program diversity and defect discovery. To address this, this work proposes FeatureFuzz, a semantics-driven compiler fuzzer that distills semantic patterns from historical bug reports into reusable feature units (each pairing a natural language description with a code example) and uses large language models to combine these features and instantiate them as programs. Evaluated on GCC and LLVM, FeatureFuzz uncovered 106 bugs in a 72-hour campaign (76 confirmed by compiler developers) and, in 24-hour campaigns, 2.78× more unique crashes than the best baseline tool.

📝 Abstract

Compilers constitute the foundational root-of-trust in software supply chains; however, their immense complexity inevitably conceals critical defects. Recent research has attempted to leverage historical bugs, either to design new mutation operators or to fine-tune models, in order to increase program diversity for compiler fuzzing. We observe, however, that bugs manifest primarily based on the semantics of input programs rather than their syntax. Unfortunately, current approaches, whether relying on syntactic mutation or general Large Language Model (LLM) fine-tuning, struggle to preserve the specific semantics found in the logic of bug-triggering programs. Consequently, these critical semantic triggers are often lost, which limits the diversity of generated programs. To explicitly reuse such semantics, we propose FeatureFuzz, a compiler fuzzer that combines features to generate programs. We define a feature as a decoupled primitive that encapsulates a natural language description of a bug-prone invariant, such as an out-of-bounds array access, alongside a concrete code witness of its realization. FeatureFuzz operates via a three-stage workflow: it first extracts features from historical bug reports, then synthesizes coherent groups of features, and finally instantiates these groups into valid programs for compiler fuzzing. We evaluated FeatureFuzz on GCC and LLVM. Over 24-hour campaigns, FeatureFuzz uncovered 167 unique crashes, which is 2.78x more than the second-best fuzzer. Furthermore, through a 72-hour fuzzing campaign, FeatureFuzz identified 106 bugs in GCC and LLVM, 76 of which have already been confirmed by compiler developers, validating the approach's ability to stress-test modern compilers effectively.
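To make the feature definition and the three-stage workflow concrete, the following is a minimal Python sketch, not the authors' implementation. The names `Feature`, `extract_features`, `synthesize_groups`, `instantiate`, and `fake_llm`, the report fields `summary` and `snippet`, and the prompt wording are all hypothetical. The abstract does not say how coherent groups are chosen, so the sketch simply enumerates pairs, and a fixed stub stands in for the language model so the code runs offline.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Iterable


@dataclass(frozen=True)
class Feature:
    """A decoupled primitive: a description plus a concrete code witness."""
    description: str  # natural language description of the bug-prone pattern
    witness: str      # small code snippet that realizes the pattern


def extract_features(bug_reports: Iterable[dict]) -> list[Feature]:
    """Stage 1 (stub): derive one feature per historical bug report.

    Assumption: each report already carries hypothetical 'summary' and
    'snippet' fields; a real system would need to extract these.
    """
    return [Feature(r["summary"], r["snippet"]) for r in bug_reports]


def synthesize_groups(features: list[Feature], size: int = 2) -> list[tuple[Feature, ...]]:
    """Stage 2 (stub): form feature groups.

    Assumption: every combination is kept, because the coherence criterion
    is not described in the abstract.
    """
    return list(combinations(features, size))


def instantiate(group: tuple[Feature, ...], llm: Callable[[str], str]) -> str:
    """Stage 3: build a prompt from a feature group and ask the model for a program."""
    parts = [f"- {f.description}\n  example: {f.witness}" for f in group]
    prompt = "Write one valid C program that combines these features:\n" + "\n".join(parts)
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Placeholder model: ignores the prompt and returns a fixed program."""
    return "int main(void) { return 0; }"


# Illustrative inputs only; these are not real bug reports.
reports = [
    {"summary": "out-of-bounds array access", "snippet": "int a[2]; a[2] = 1;"},
    {"summary": "signed overflow in a loop bound", "snippet": "for (int i = 1; i > 0; i *= 2) {}"},
    {"summary": "deeply nested conditional expression", "snippet": "x = a ? (b ? 1 : 2) : (c ? 3 : 4);"},
]

features = extract_features(reports)
groups = synthesize_groups(features)
programs = [instantiate(g, fake_llm) for g in groups]
```

In an actual campaign each generated program would then be passed to the compiler under test and crashes recorded; that step is omitted here because it depends on the local toolchain.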

Problem

Research questions and friction points this paper is trying to address.

compiler defects

semantic logic

fuzzing

program diversity

bug-triggering semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic logic recomposition

feature-based fuzzing

compiler testing