Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework

📅 2025-08-05
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently hallucinate and show weak self-reflection in code generation, especially when their inputs rest on erroneous premises. Method: We propose FPBench, the first evaluation framework explicitly targeting faulty premises, built around three systematically constructed, controllable types of erroneous premise. It integrates multi-dimensional metrics (code correctness, error-correction capability, and cognitive consistency) and combines automated and human evaluation across 15 state-of-the-art LLMs. Contribution/Results: Our study uncovers, for the first time, a “reasoning–verification–correction” tripartite cognitive dissociation in LLMs under faulty premises; identifies diminishing returns on resource investment beyond a marginal-efficiency inflection point; and demonstrates empirically that most models lack proactive verification mechanisms. FPBench establishes a reproducible evaluation paradigm, delivers theoretical insights into premise-aware reasoning, and offers concrete pathways toward more reliable, human-centered code generation.
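The summary above names three metric dimensions but no public API, so the following is a minimal sketch under assumptions: PremiseType, FPBenchItem, EvalResult, and aggregate are invented names, and the three premise categories shown are illustrative rather than the paper's own taxonomy. It only shows how per-item judgments along the three dimensions (code correctness, error correction, cognitive consistency) could be recorded and aggregated per model.

```python
from dataclasses import dataclass
from enum import Enum

class PremiseType(Enum):
    """Illustrative faulty-premise categories; the paper defines its own
    three systematically constructed, controllable types."""
    CONTRADICTS_FACT = "contradicts_fact"      # premise conflicts with ground truth
    SELF_CONTRADICTORY = "self_contradictory"  # premise is internally inconsistent
    DANGLING_REFERENCE = "dangling_reference"  # premise points at missing context

@dataclass
class FPBenchItem:
    task_id: str
    prompt: str                   # code-generation request containing the faulty premise
    premise_type: PremiseType
    reference_tests: list[str]    # unit tests encoding the *correct* behavior

@dataclass
class EvalResult:
    """One judgment per metric dimension named in the summary."""
    code_correct: bool            # generated code passes the reference tests
    error_corrected: bool         # model flagged and/or repaired the faulty premise
    cognitively_consistent: bool  # stated reasoning and final code agree

def aggregate(results: list[EvalResult]) -> dict[str, float]:
    """Per-model aggregate rates over a benchmark run."""
    n = max(len(results), 1)
    return {
        "correctness": sum(r.code_correct for r in results) / n,
        "error_correction": sum(r.error_corrected for r in results) / n,
        "cognitive_consistency": sum(r.cognitively_consistent for r in results) / n,
    }
```

In practice, code_correct would come from executing the reference tests, while error_corrected and cognitively_consistent would combine automated checks with the human evaluation the summary mentions.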

📝 Abstract
With the advancement of code generation capabilities in large language models (LLMs), their reliance on input premises has intensified. When users provide inputs containing faulty premises, the probability of code-generation hallucinations rises significantly, exposing deficiencies in the models' self-scrutiny capabilities. This paper proposes Faulty Premises Bench (FPBench), the first code-generation evaluation framework targeting faulty premises. By systematically constructing three categories of faulty premises and integrating multi-dimensional evaluation metrics, it conducts an in-depth assessment of 15 representative LLMs. The key findings are as follows: (1) Most models reason poorly and generate suboptimal code under faulty premises, relying heavily on explicit prompts for error detection and showing limited self-scrutiny; (2) faulty premises trigger a point of diminishing returns on resource investment, so blindly increasing output length fails to improve quality; (3) each of the three types of faulty premises activates a distinct defect pattern, revealing a triple dissociation in the cognitive mechanisms of code-generation models. This study not only highlights the urgent need for LLMs to proactively verify premises during code generation but also, through the proposed FPBench framework and its multi-dimensional evaluation system, provides a theoretical foundation and a practical pathway for developing reliable, human-centric code generation models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM code generation under faulty premises
Assessing self-scrutiny deficiencies in LLMs with FPBench
Identifying defect patterns triggered by premise errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

FPBench evaluates LLMs under controlled faulty premises (see the toy premise-injection sketch after this list)
Multi-dimensional metrics assess code generation
Identifies distinct defect patterns in models
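As referenced in the first item above, here is a toy sketch of controllable premise injection. The paper constructs its three faulty-premise categories systematically; the template strings and category keys below are invented stand-ins, not FPBench data, and serve only to show how a clean task prompt can be perturbed along one controlled axis.

```python
# Invented templates for illustration; not examples from the FPBench dataset.
FAULTY_PREMISE_TEMPLATES = {
    "contradicts_fact": "Note: Python's sorted() returns results in descending order by default.\n",
    "self_contradictory": "The input list is guaranteed non-empty, and it may also be empty.\n",
    "dangling_reference": "Handle the edge cases exactly as specified above.\n",  # nothing was specified
}

def inject_faulty_premise(clean_prompt: str, kind: str) -> str:
    """Prepend one controlled faulty premise to an otherwise-sound task prompt."""
    return FAULTY_PREMISE_TEMPLATES[kind] + clean_prompt

if __name__ == "__main__":
    task = "Write a function that returns the three smallest numbers in a list."
    print(inject_faulty_premise(task, "contradicts_fact"))
```

Holding the clean prompt fixed and varying only the injected premise is what lets any drop in accuracy be attributed to the premise type itself.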
👥 Authors
Jialin Li
School of Artificial Intelligence, Jilin University
Jinzhe Li
Fudan University & Shanghai AI Lab
Gengxu Li
School of Artificial Intelligence, Jilin University
Yi Chang
School of Artificial Intelligence, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China; International Center of Future Science, Jilin University
Yuan Wu
School of Artificial Intelligence, Jilin University