🤖 AI Summary
Existing LLM-driven compiler fuzzing tools suffer from low test program validity (47.0%–49.4%), primarily due to frequent syntactic, semantic, and runtime errors, which severely limits deep testing of optimization passes and backend components. To address this, we propose ReFuzzer, a feedback-driven closed-loop framework that uses a local large language model and integrates compiler diagnostics and runtime validation to automatically localize errors, perform precise repairs, and iteratively regenerate test programs. Our approach boosts test program validity to 96.6%–97.3%, with an average per-program processing time of only 2.9–3.5 seconds on a dual-GPU machine. It also increases code coverage, with absolute improvements of up to 9.2% for critical optimization components such as vectorization. ReFuzzer integrates into LLVM/Clang's black-box, grey-box, and white-box fuzzing pipelines without requiring modifications to the compiler infrastructure.
📝 Abstract
Existing LLM-based compiler fuzzers often produce syntactically or semantically invalid test programs, limiting their effectiveness in exercising compiler optimizations and backend components. We introduce ReFuzzer, a framework for refining LLM-generated test programs by systematically detecting and correcting compilation and runtime violations (e.g. division by zero or array out-of-bounds accesses). ReFuzzer employs a feedback loop with a local LLM to validate and filter erroneous programs before execution, improving fuzzing effectiveness beyond crash detection and enabling the generation of diverse yet valid test programs.
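The feedback loop described above can be illustrated with a minimal Python sketch. This is not ReFuzzer's implementation: the names `CheckResult`, `refine`, `toy_check`, `toy_repair`, and the `max_rounds` budget are all hypothetical. In a real setup, the check step would presumably invoke the compiler and a runtime validator, and the repair step would prompt the local LLM with the collected diagnostics; here both are replaced by toy string-based stand-ins so the loop itself can be run in isolation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    """Outcome of validating one test program (hypothetical type)."""
    ok: bool
    diagnostics: str = ""


def refine(
    program: str,
    check: Callable[[str], CheckResult],
    repair: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Repair `program` until `check` passes or the round budget is spent.

    Returns the valid program, or None if it is still invalid after
    `max_rounds` repair attempts (i.e. the program is filtered out).
    """
    result = check(program)
    for _ in range(max_rounds):
        if result.ok:
            return program
        # Feed the diagnostics back to the repair step, then re-validate.
        program = repair(program, result.diagnostics)
        result = check(program)
    return program if result.ok else None


# Toy stand-ins (assumptions for illustration only): a real checker would
# compile and run the program; a real repairer would query a local LLM.
def toy_check(program: str) -> CheckResult:
    if "/ 0" in program:
        return CheckResult(False, "error: division by zero")
    return CheckResult(True)


def toy_repair(program: str, diagnostics: str) -> str:
    if "division by zero" in diagnostics:
        return program.replace("/ 0", "/ 1")
    return program


if __name__ == "__main__":
    print(refine("int x = 4 / 0;", toy_check, toy_repair))
```

Passing `check` and `repair` as callables keeps the loop independent of any particular compiler or model, and returning `None` for programs that never validate corresponds to the filtering step mentioned in the abstract.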
We evaluated ReFuzzer's effectiveness across black-, grey-, and white-box fuzzing approaches targeting LLVM/Clang. ReFuzzer improved test program validity from 47.0-49.4% to 96.6-97.3%, with an average processing time of 2.9-3.5 s per test program on a dual-GPU machine. Further, refuzzing significantly increased code coverage in critical optimization and IR generation components. For example, vectorization coverage improved in absolute terms by 9.2%, 2.3%, and 7.1% in black-, grey-, and white-box fuzzing, respectively.