ReFuzzer: Feedback-Driven Approach to Enhance Validity of LLM-Generated Test Programs

📅 2025-08-05

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing LLM-driven compiler fuzzers produce many invalid test programs (baseline validity of 47.0%–49.4%) because of frequent syntactic, semantic, and runtime errors, which limits how deeply they can test optimization passes and backend components. To address this, we propose ReFuzzer: a feedback-driven closed-loop framework that uses a local large language model together with compiler diagnostics and runtime validation to detect errors, repair them, and filter out programs that remain invalid. The approach raises test program validity to 96.6%–97.3%, with an average processing time of 2.9–3.5 seconds per program on a dual-GPU machine. It also increases code coverage in optimization and IR generation components; for vectorization, the absolute coverage improvement reaches 9.2%. ReFuzzer was evaluated with black-box, grey-box, and white-box fuzzing approaches targeting LLVM/Clang.


📝 Abstract

Existing LLM-based compiler fuzzers often produce syntactically or semantically invalid test programs, limiting their effectiveness in exercising compiler optimizations and backend components. We introduce ReFuzzer, a framework for refining LLM-generated test programs by systematically detecting and correcting compilation and runtime violations (e.g., division by zero or array out-of-bounds accesses). ReFuzzer employs a feedback loop with a local LLM to validate and filter erroneous programs before execution, improving fuzzing effectiveness beyond crash detection and enabling the generation of diverse yet valid test programs. We evaluated ReFuzzer's effectiveness across black-, grey-, and white-box fuzzing approaches targeting LLVM/Clang. ReFuzzer improved test programs' validity from 47.0–49.4% to 96.6–97.3%, with an average processing time of 2.9–3.5 s per test program on a dual-GPU machine. Further, refuzzing significantly increased code coverage in critical optimization and IR generation components. For example, vectorization coverage had an absolute improvement of 9.2%, 2.3%, and 7.1% in black-, grey-, and white-box fuzzing, respectively, enhancing testing effectiveness.
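The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not ReFuzzer's implementation: the names `CheckResult`, `refine`, `clang_syntax_check`, `toy_check`, and `toy_repair` are hypothetical, the round budget is an assumption, and the repair step (a local LLM in the paper) is represented by a plain callable. The `clang_syntax_check` helper assumes `clang` is on the PATH and only covers the compilation side of validation.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool
    diagnostics: str = ""


def clang_syntax_check(program: str) -> CheckResult:
    """Assumed validator: parse C source from stdin with clang (must be on PATH)."""
    proc = subprocess.run(
        ["clang", "-x", "c", "-fsyntax-only", "-"],
        input=program, capture_output=True, text=True,
    )
    return CheckResult(proc.returncode == 0, proc.stderr)


def refine(program: str,
           check: Callable[[str], CheckResult],
           repair: Callable[[str, str], str],
           max_rounds: int = 3) -> Optional[str]:
    """Check, repair using the diagnostics, and re-check.

    Returns a program that passes `check`, or None when the repair budget
    (an assumed parameter, not taken from the paper) is exhausted, in which
    case the caller filters the program out.
    """
    for round_no in range(max_rounds + 1):
        result = check(program)
        if result.ok:
            return program
        if round_no == max_rounds:
            break  # budget spent: do not repair without a follow-up check
        program = repair(program, result.diagnostics)
    return None


# Toy stand-ins so the loop can be exercised without a compiler or an LLM.
def toy_check(src: str) -> CheckResult:
    if "/ 0" in src:
        return CheckResult(False, "division by zero")
    return CheckResult(True)


def toy_repair(src: str, diagnostics: str) -> str:
    # A real repair step would prompt the LLM with `src` and `diagnostics`.
    return src.replace("/ 0", "/ 1")


fixed = refine("int x = 4 / 0;", toy_check, toy_repair)
```

Passing the checker and the repairer as callables keeps the loop independent of any particular compiler invocation or model, so `clang_syntax_check` could be swapped in for `toy_check` where Clang is available.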

Problem

Research questions and friction points this paper is trying to address.

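LLM-generated test programs are often syntactically or semantically invalid

Invalid programs contain compilation and runtime violations, such as division by zero or out-of-bounds array accesses

Low validity limits how well fuzzers exercise compiler optimization and backend components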

Innovation

Methods, ideas, or system contributions that make the work stand out.

Feedback loop with local LLM for validation

Systematic detection and correction of violations

Enhances validity and diversity of test programs
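Detecting runtime violations, as listed above, requires running each compiled test program and classifying how it terminated. The following is a rough sketch of one way to do that, not the paper's method: `classify_exit` and `run_binary` are hypothetical helpers, the 5-second timeout is an arbitrary assumption, and the negative-return-code convention for signals applies to POSIX systems only.

```python
import signal
import subprocess


def classify_exit(returncode: int) -> str:
    """Map a subprocess return code to a coarse outcome label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            return "signal:" + signal.Signals(-returncode).name
        except ValueError:
            return f"signal:{-returncode}"
    return f"exit:{returncode}"


def run_binary(path: str, timeout_s: float = 5.0) -> str:
    """Run a compiled test program and label the outcome (assumed harness)."""
    try:
        proc = subprocess.run([path], capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return classify_exit(proc.returncode)
```

A label such as `signal:SIGFPE` (typical of integer division by zero on x86) or `signal:SIGSEGV` could then be passed back to the repair step as diagnostic text, alongside any compiler messages.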
