Correctness Assessment of Code Generated by Large Language Models Using Internal Representations

📅 2025-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) for code generation lack real-time, interpretable correctness assessment. Method: This paper proposes OPENIA, a white-box (open-box) framework that uncovers strong correlations between the intermediate-layer activations of code LLMs (e.g., DeepSeek-Coder, CodeLlama, MagicCoder) and the correctness of the code they generate, enabling correctness prediction from internal states during generation rather than after it. OPENIA extracts internal representations across models and layers, analyzes them empirically, and builds predictors evaluated for consistency across diverse benchmarks. Results: OPENIA achieves up to a 2× improvement in accuracy on standalone code generation tasks and up to a 46% F1-score gain in repository-specific settings, outperforming zero-shot and classification-based baselines. By moving beyond conventional post-hoc black-box validation, OPENIA establishes an interpretable, adaptive foundation for trustworthy code generation.

📝 Abstract
Ensuring the correctness of code generated by Large Language Models (LLMs) presents a significant challenge in AI-driven software development. Existing approaches predominantly rely on black-box (closed-box) approaches that evaluate correctness post-generation, failing to utilize the rich insights embedded in the LLMs' internal states during code generation. In this paper, we introduce OPENIA, a novel white-box (open-box) framework that leverages these internal representations to assess the correctness of LLM-generated code. OPENIA systematically analyzes the intermediate states of representative open-source LLMs specialized for code, including DeepSeek-Coder, CodeLlama, and MagicCoder, across diverse code generation benchmarks. Our empirical analysis reveals that these internal representations encode latent information, which strongly correlates with the correctness of the generated code. Building on these insights, OPENIA uses a white-box/open-box approach to make informed predictions about code correctness, offering significant advantages in adaptability and robustness over traditional classification-based methods and zero-shot approaches. Experimental results demonstrate that OPENIA consistently outperforms baseline models, achieving higher accuracy, precision, recall, and F1-Scores with up to a 2X improvement in standalone code generation and a 46% enhancement in repository-specific scenarios. By unlocking the potential of in-process signals, OPENIA paves the way for more proactive and efficient quality assurance mechanisms in LLM-assisted code generation.
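The open-box idea described in the abstract — probing a model's internal representations to predict whether its generated code is correct — can be illustrated with a simple linear probe. The snippet below is a minimal sketch, not OPENIA itself: the synthetic Gaussian vectors stand in for real hidden states (e.g., last-token activations from an intermediate transformer layer), and the binary labels stand in for test-execution results; all names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # width of the (stand-in) hidden-state vectors
n = 400  # samples per class

# Synthetic stand-ins for internal representations: activations of
# "correct" generations cluster around mu1, "incorrect" around mu0.
mu0, mu1 = np.zeros(d), np.full(d, 1.5)
X = np.vstack([rng.normal(mu0, 1.0, (n, d)), rng.normal(mu1, 1.0, (n, d))])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = generated code passed tests

# Train a logistic-regression probe on the activations by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)   # clip logits to avoid overflow in exp
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid: predicted P(correct)
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"probe accuracy: {acc:.2f}")
```

In the paper's setting, the features would come from the LLM's hidden states captured during generation and the labels from executing the generated code against tests; once trained, such a probe can flag likely-incorrect code without running it.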
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Code Accuracy
Internal Model Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

White-box Approach
Large Language Models
Code Accuracy Evaluation
Tuan-Dung Bui
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
Thanh Trong Vu
VNU
Software Engineering · Automated Software Engineering
Thu-Trang Nguyen
VNU University of Engineering and Technology
Automated Software Engineering · Program Analysis · Code Generation · AI
Son Nguyen
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
Hieu Dinh Vo
VNU
Software Architecture · Program Analysis