🤖 AI Summary
This work addresses a limitation of existing uncertainty estimation methods for code generation: most are carried over from natural language generation and overlook the structural and executable nature of code, which makes error detection unreliable. The paper introduces a three-axis uncertainty estimation framework tailored to code generation, built on three orthogonal dimensions grounded in code-specific characteristics: lexical uncertainty (measured via Top-K token entropy), algorithmic uncertainty (assessed through pseudo-code consistency), and functional uncertainty (evaluated by behavioral consistency). Experiments across five code large language models show that combining the three axes raises average AUROC from 0.696 for the strongest NL-derived baseline to 0.776. Notably, on Qwen3-14B, single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper.
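For readers unfamiliar with the metric, AUROC in this setting can be read as the probability that a randomly chosen incorrect program receives a higher uncertainty score than a randomly chosen correct one. The sketch below is a generic pairwise computation of that quantity, not the paper's evaluation code; the function name `auroc` and the input format (one score and one correctness flag per generated program) are assumptions made for illustration.

```python
from typing import Sequence


def auroc(uncertainty: Sequence[float], is_incorrect: Sequence[bool]) -> float:
    """Pairwise AUROC: chance that an incorrect sample outranks a correct one.

    `uncertainty` holds one score per generated program (higher = less trusted).
    `is_incorrect` marks which programs actually failed (hypothetical labels).
    """
    wrong = [u for u, bad in zip(uncertainty, is_incorrect) if bad]
    right = [u for u, bad in zip(uncertainty, is_incorrect) if not bad]
    if not wrong or not right:
        raise ValueError("AUROC needs at least one correct and one incorrect sample")

    wins = 0.0
    for w in wrong:
        for r in right:
            if w > r:
                wins += 1.0   # incorrect program is ranked as more uncertain
            elif w == r:
                wins += 0.5   # ties count as half a win
    return wins / (len(wrong) * len(right))
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported move from 0.696 to 0.776 means the ensemble orders incorrect programs above correct ones more often than the baseline does.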
📝 Abstract
Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
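As a rough illustration of the lexical axis, the sketch below computes a Top-K token entropy from the top-K log-probabilities that many inference APIs return for each generated token. It is a minimal sketch under stated assumptions, not the paper's implementation: renormalizing over the top-K candidates and averaging over positions are choices made here for illustration, and the names `top_k_entropy` and `sequence_uncertainty` are hypothetical. The abstract does not specify the aggregation, and a different one (for example the maximum over positions) may be closer to the token-fragility argument.

```python
import math
from typing import Sequence


def top_k_entropy(top_logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the renormalized top-K distribution at one position."""
    # Shift by the max log-probability for numerical stability.
    m = max(top_logprobs)
    weights = [math.exp(lp - m) for lp in top_logprobs]
    total = sum(weights)
    # Assumption: renormalize so the K candidates sum to 1.
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_uncertainty(per_token_top_logprobs: Sequence[Sequence[float]]) -> float:
    """Assumed aggregation: mean per-token Top-K entropy over the generated program."""
    entropies = [top_k_entropy(lps) for lps in per_token_top_logprobs]
    return sum(entropies) / len(entropies)


# Usage: two positions, one confident and one split evenly between two candidates.
example = [
    [0.0, -100.0],                    # nearly all mass on one token
    [math.log(0.5), math.log(0.5)],   # 50/50 between two tokens
]
score = sequence_uncertainty(example)  # roughly ln(2) / 2
```

Because the log-probabilities come from the same forward pass that produced the program, a signal of this kind needs no additional sampling, which is consistent with the single-pass cost advantage described above.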