EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of dedicated benchmarks for assessing large language models' (LLMs) deep understanding of program semantics, particularly their ability to reason about code equivalence. The authors introduce EquiBench, a specialized benchmark for this task: 2,400 program pairs across four programming languages, covering six categories of equivalent and non-equivalent transformations. Rather than relying on syntax-based similarity, EquiBench generates challenging instances through static program analysis, compiler scheduling, and superoptimization, establishing a multilingual, multi-category binary classification evaluation. Across 17 state-of-the-art LLMs, the top-performing model (OpenAI o3-mini) reaches 78.0% overall accuracy, well above the 50% random baseline, yet drops to 62.3% and 68.8% on the two most challenging categories, exposing clear limitations in deep semantic reasoning. EquiBench thus provides a rigorous standard and diagnostic tool for evaluating code reasoning capabilities.

📝 Abstract
Equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs, underpins a broad range of applications, including software refactoring, testing, and optimization. We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models (LLMs). We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. These pairs are systematically generated through program analysis, compiler scheduling, and superoptimization, covering nontrivial structural transformations that demand deep semantic reasoning beyond simple syntactic variations. Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%. In the most challenging categories, the best accuracies are 62.3% and 68.8%, only modestly above the 50% random baseline for binary classification, indicating significant room for improvement in current models' code reasoning capabilities.
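To make the task concrete, below is an illustrative example (not drawn from the EquiBench dataset) of a structurally different but semantically equivalent program pair, together with a naive sampling-based check. Note that sampling can refute equivalence by finding a counterexample, but can never prove it for all inputs, which is why the task demands genuine semantic reasoning:

```python
def sum_to_n_loop(n: int) -> int:
    # Accumulate 1 + 2 + ... + n with a loop.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_to_n_formula(n: int) -> int:
    # Closed-form Gauss formula: n * (n + 1) / 2.
    return n * (n + 1) // 2

def differing_input(f, g, inputs):
    # Return an input where f and g disagree, or None if none is found.
    # Agreement on sampled inputs is only evidence, not a proof, of equivalence.
    for x in inputs:
        if f(x) != g(x):
            return x
    return None

# No counterexample among the sampled inputs, consistent with equivalence.
assert differing_input(sum_to_n_loop, sum_to_n_formula, range(0, 1000)) is None
```

An LLM judging this pair must recognize that the loop and the closed-form expression compute the same function for every input, a semantic fact invisible to syntactic comparison.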
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' code reasoning via equivalence checking
Evaluating performance across diverse program transformations
Identifying gaps in LLMs' semantic reasoning abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framing equivalence checking as an evaluation task for LLM code reasoning
EquiBench: 2,400 program pairs spanning four languages and six equivalence categories
Systematic pair generation via program analysis, compiler scheduling, and superoptimization