🤖 AI Summary
Large language models (LLMs) show limited understanding of compiler-level semantics-preserving program transformations, such as copy propagation and constant folding, that are central to reliable code reasoning. Method: the authors propose a compiler-based empirical evaluation framework for semantic-equivalence judgment, using toolchains such as LLVM to generate equivalent program pairs automatically. Results: failure rates are high, 41% without context and 29% even with a simple generic context, exposing blind spots in how LLMs model code semantics. To address this, the authors advocate integrating code-optimization tools into training, so that compiler-generated semantic-equivalence pairs reinforce model robustness. The work offers a rigorous methodology for quantitatively assessing code understanding and a scalable, tool-integrated path toward more trustworthy code AI.
📝 Abstract
We present an empirical evaluation of Large Language Models (LLMs) on code understanding under non-trivial, semantics-preserving program transformations such as copy propagation and constant folding. Our findings show that LLMs fail to judge semantic equivalence in approximately 41% of cases when no context is provided, and in 29% of cases even when given a simple generic context. To improve accuracy, we advocate integrating LLMs with code-optimization tools to enhance training and enable more robust program understanding.
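To make the two transformations concrete, here is a minimal illustration (our own sketch, not taken from the paper) of a semantically equivalent program pair of the kind an LLM would be asked to judge. The second function is what a compiler would produce from the first after copy propagation and constant folding:

```python
def original(x: int) -> int:
    y = x          # copy: y merely aliases x
    a = 2 * 3      # constant expression, foldable at compile time
    return y + a

def transformed(x: int) -> int:
    # copy propagation replaced y with x;
    # constant folding replaced 2 * 3 with 6
    return x + 6

# A simple equivalence probe over sample inputs: both versions
# must agree everywhere for the transformation to be semantics-preserving.
assert all(original(v) == transformed(v) for v in range(-100, 101))
```

Judging that `original` and `transformed` always return the same value is exactly the equivalence question the evaluation poses; the paper's finding is that LLMs get such judgments wrong in a substantial fraction of cases.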