Selective Code Generation for Functional Guarantees

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the pervasive hallucination and functional unreliability of code generation models. The authors propose a dynamic-analysis-driven selective code generation framework. Methodologically, they introduce FuzzEval, an evaluation paradigm that pairs dynamic code execution with automated unit test generation, and combine it with selective prediction (abstention learning) under a false discovery rate (FDR) constraint to build a generator that abstains on uncertain outputs while controlling the hallucination rate. The key contribution is the first realization of dynamic-analysis-guided selective generation with a formal controllability guarantee on functional correctness. Experiments on both open- and closed-source models show substantial hallucination reduction while maintaining high selection efficiency and precise functional correctness assessment, establishing a new paradigm for trustworthy code generation.
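The selective mechanism described above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: `Generation`, the confidence score, and the hallucination oracle are all hypothetical stand-ins (in the paper, correctness is judged by generated unit tests).

```python
from dataclasses import dataclass

@dataclass
class Generation:
    code: str
    confidence: float  # e.g., a model score mapped to [0, 1] (illustrative)

def select(generations, threshold):
    """Accept a generation if its confidence clears the threshold; otherwise abstain (None)."""
    return [g if g.confidence >= threshold else None for g in generations]

def empirical_fdr(accepted, is_hallucination):
    """False discovery rate: the fraction of non-abstained outputs that are hallucinated."""
    kept = [g for g in accepted if g is not None]
    if not kept:
        return 0.0  # abstaining on everything trivially controls the FDR
    return sum(is_hallucination(g) for g in kept) / len(kept)
```

Raising the threshold shrinks the accepted set and (on average) lowers the FDR, at the cost of selection efficiency; the paper's contribution is calibrating this trade-off with a guarantee.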

📝 Abstract
Large language models (LLMs) show human-level performance, and their specialized descendants, code generation models, play core roles in solving complex tasks, including mathematical reasoning and software development. On the downside, hallucination mainly hinders the applicability of LLMs to systems requiring higher safety standards, drawing the attention of the AI community. However, the hallucination of code generation models is rarely considered. One critical bottleneck is that identifying whether generated code has the intended functionality is intricate, owing to code's unnatural form, different from natural language. A handful of unit tests have been used to address this issue, but scaling up their number is extremely expensive. We address this core bottleneck by automatically generating unit tests with dynamic code analysis tools, which leverage the *executable nature* of code. Given unit tests generated from true code to measure the functional correctness of generated code, we propose to learn a *selective code generator*, which abstains from answering when unsure, to control the rate of code hallucination among non-abstaining answers in terms of a false discovery rate. This learning algorithm provides a controllability guarantee, lending trustworthiness to code generation. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this evaluation paradigm *FuzzEval*. We demonstrate the efficacy of our selective code generator over open and closed code generators, showing the clear benefit of leveraging generated unit tests, the controllability of code hallucination, and reasonable selection efficiency.
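The abstract's core idea, deriving unit tests from the true code by executing it on fuzzed inputs, can be sketched minimally. All names here are illustrative (the paper uses dynamic code analysis tools, not this toy fuzzer): random inputs are run through the reference solution to record input/output pairs, which then serve as tests for a candidate generation.

```python
import random

def generate_unit_tests(reference_fn, input_sampler, n=100, seed=0):
    """Fuzz random inputs through the reference (true) code to record test pairs."""
    rng = random.Random(seed)
    tests = []
    for _ in range(n):
        x = input_sampler(rng)
        tests.append((x, reference_fn(x)))
    return tests

def functionally_correct(candidate_fn, tests):
    """A candidate is functionally correct only if it matches the reference on every test."""
    try:
        return all(candidate_fn(x) == y for x, y in tests)
    except Exception:
        return False  # crashing on a test input counts as incorrect
```

For example, with `reference_fn=sorted` and a sampler producing short integer lists, a correct candidate passes all recorded pairs, while a reversed-sort candidate is rejected; this is how executability substitutes for hand-written unit tests.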
Problem

Research questions and friction points this paper is trying to address.

Addressing code hallucination in LLM-based code generation
Automating unit test generation for functional correctness
Ensuring controllability and trustworthiness in code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically generates unit tests using dynamic code analysis
Learns selective code generator to control hallucination rate
Uses FuzzEval for precise code evaluation in both learning and evaluation
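The second bullet, controlling the hallucination rate, hinges on picking the abstention threshold. A minimal sketch of threshold calibration on a held-out set, assuming per-example confidence scores and unit-test pass/fail labels (this omits the finite-sample correction behind the paper's formal guarantee):

```python
def calibrate_threshold(cal_scores, cal_correct, alpha=0.1):
    """Lowest score threshold whose empirical FDR on the calibration set is <= alpha.

    cal_scores: confidence score per calibration example.
    cal_correct: whether that example's generated code passed its unit tests.
    Returns float('inf') (abstain on everything) if no threshold satisfies alpha.
    """
    pairs = sorted(zip(cal_scores, cal_correct), reverse=True)
    best = float("inf")
    wrong = 0
    for i, (score, ok) in enumerate(pairs, start=1):
        wrong += not ok  # hallucinations among the top-i accepted examples
        if wrong / i <= alpha:
            best = score  # a larger acceptance set still meets the FDR target
    return best
```

Loosening `alpha` lowers the threshold and accepts more generations (better selection efficiency); tightening it forces more abstention, which is exactly the controllability trade-off the bullets describe.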