Reducing Hallucinations in LLM-Generated Code via Semantic Triangulation

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation methods, such as majority voting and test-based validation, exhibit high consensus-failure rates and unreliable abstention when the probability of sampling a correct solution is low, when multiple semantically non-equivalent valid solutions exist, or when no correct solution is present in the sample. This paper introduces Semantic Triangulation, a framework that verifies cross-variant consistency via semantics-preserving, verifiable program transformations. It achieves, for the first time, genuine semantic consensus on problems with multiple non-equivalent valid solutions, and it reliably identifies correct solutions with sampling probabilities as low as 0.14. The method integrates semantics-preserving transformations, sampling-consistency checks, generated-test validation, and majority voting. Evaluated on LiveCodeBench and CodeElo with GPT-4o and DeepSeek-V3, it improves code reliability by 21% over high-confidence selection (probability threshold 0.5) and achieves state-of-the-art performance on multi-solution tasks.
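The summary's central idea, checking that solutions to a problem and to a transformed variant agree under a known exact mapping, can be sketched with a toy problem. Everything below (the task, the transformation, and all names) is an illustrative assumption, not the paper's implementation or benchmark tasks:

```python
# Toy illustration of cross-variant consistency checking.
# Original problem P: sum of a list. Transformed problem P': sum of the
# list with every element doubled. The exact, verifiable mapping between
# solutions is P'(xs) == 2 * P(xs) for every input xs.

TEST_INPUTS = [[], [1], [3, -2, 7], list(range(10))]

def candidate_p(xs):        # a sampled solution for P
    return sum(xs)

def candidate_p_prime(xs):  # a sampled solution for the transformed P'
    return sum(2 * x for x in xs)

def consistent(sol, sol_prime, inputs=TEST_INPUTS):
    """Accept a pair of solutions only if it satisfies the known mapping."""
    return all(sol_prime(xs) == 2 * sol(xs) for xs in inputs)

print(consistent(candidate_p, candidate_p_prime))   # True: the pair triangulates
buggy = lambda xs: sum(xs[1:])                      # drops the first element
print(consistent(buggy, candidate_p_prime))         # False: inconsistency exposes it
```

A spurious solution would have to make correlated mistakes on both the original and the transformed problem to slip through, which is what makes the check stricter than voting on a single problem.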

📝 Abstract
When generating code from natural language prompts, an LLM samples programs from a probability distribution, many of which might be incorrect. Sample consensus techniques, such as majority voting or validation against generated tests or specifications, aim to identify a correct program in the sample or abstain if none is valid. However, existing methods often fail to select a correct solution when its sampling probability is low, or when the problem permits multiple valid but non-equivalent solutions. Additionally, they often fail to abstain when no correct solution is present in the sample. To overcome these limitations, we introduce semantic triangulation, which transforms a programming problem in a way that non-trivially alters its semantics while preserving an exact, verifiable mapping between solutions before and after transformation. We theoretically establish that verifying consistency across such problem transformations increases confidence that generated programs reflect accurate generalization rather than spurious statistical correlations, enabling more reliable sample consensus and abstention. On the LiveCodeBench and CodeElo benchmarks, using GPT-4o and DeepSeek-V3, semantic triangulation increases the reliability of generated code by 21% compared to a method that selects only high-confidence solutions with a probability threshold of 0.5, while pinpointing correct solutions at sampling probabilities as low as 0.14. It is also the only approach to consistently form a true consensus on tasks with multiple valid but non-equivalent solutions.
Problem

Research questions and friction points this paper is trying to address.

LLMs sample programs from a probability distribution, so many generated candidates are incorrect
Existing consensus methods fail when the correct solution has low sampling probability or when multiple valid but non-equivalent solutions exist
Current techniques cannot reliably abstain when the sample contains no correct solution
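The multi-solution failure mode can be seen in a short sketch: when a task admits several non-equivalent correct outputs, even correct samples split the vote. The task and the sampled outputs below are assumed for illustration:

```python
from collections import Counter

# Assumed toy task: "return the index of any element equal to the maximum".
# On input [5, 1, 5] both index 0 and index 2 are correct answers, so
# correct samples disagree with each other and plain majority voting
# never reaches a true consensus.

samples = [0, 2, 0, 2, 1]           # outputs of five sampled programs
answer, votes = Counter(samples).most_common(1)[0]
has_majority = votes > len(samples) // 2
print(answer, votes, has_majority)  # 0 2 False: no strict majority forms
```

Four of the five samples are correct here, yet output-level voting cannot tell that indices 0 and 2 are both valid, which is exactly the gap a solution-level mapping is meant to close.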
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic triangulation transforms a problem non-trivially while preserving an exact, verifiable mapping between solutions
Verifying consistency across transformed variants increases confidence that a solution generalizes correctly
Enables reliable consensus and abstention even when correct solutions have low sampling probability
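Putting the listed ingredients together, a selection-or-abstention loop might look like the following sketch. The function names, the mapping, and the demo task are assumptions for illustration, not the authors' code:

```python
from collections import Counter

def select_or_abstain(solutions, variant_solutions, tests, mapping, probes):
    """Return one triangulated solution, or None to abstain.

    solutions          candidate programs for the original problem
    variant_solutions  candidate programs for the transformed problem
    tests              (input, expected_output) pairs from generated tests
    mapping            the exact solution mapping induced by the transformation
    probes             inputs used for cross-variant consistency checks
    """
    survivors = []
    for sol in solutions:
        if not all(sol(x) == y for x, y in tests):       # generated-test validation
            continue
        if any(all(mapping(sol(x)) == vs(x) for x in probes)
               for vs in variant_solutions):             # cross-variant consistency
            survivors.append(sol)
    if not survivors:
        return None                                      # abstain
    outputs = lambda s: tuple(s(x) for x in probes)
    best, _ = Counter(outputs(s) for s in survivors).most_common(1)[0]
    return next(s for s in survivors if outputs(s) == best)  # majority vote

# Demo on an assumed toy problem: sum of a list, with the variant problem
# "sum of the list with doubled elements" (mapping: y -> 2 * y).
good = lambda xs: sum(xs)
buggy = lambda xs: sum(xs[1:])
variant = [lambda xs: sum(2 * x for x in xs)]
picked = select_or_abstain([buggy, good], variant,
                           tests=[([1, 2, 3], 6)],
                           mapping=lambda y: 2 * y,
                           probes=[[], [1], [4, 5]])
print(picked([4, 5]))  # 9: the buggy candidate was filtered out
```

The order of the filters is a design choice in this sketch: cheap generated tests prune first, the stricter cross-variant check runs second, and voting only arbitrates among candidates that already survived both.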
Yihan Dai
Peking University, China
Sijie Liang
Beijing Forestry University, China
Haotian Xu
Peking University, China
Peichu Xie
Independent, China
Sergey Mechtaev
Peking University
Program Repair, Program Analysis