CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code reasoning benchmarks predominantly rely on synthetic or educational data and focus on coarse-grained I/O prediction, failing to assess large language models' fine-grained semantic understanding in realistic software engineering (SE) scenarios. Method: We introduce the first fine-grained code semantic reasoning benchmark derived from real-world open-source projects (Python, C, Java), generating high-quality reasoning tasks from dynamic execution traces. Our framework includes a reproducible tracing infrastructure and an automated ground-truth construction toolkit. The methodology integrates test-driven trace collection, cross-language AST analysis, chain-of-thought prompting, and in-context learning evaluation. Contribution/Results: Experiments reveal significant performance bottlenecks for state-of-the-art code LLMs on these tasks; chain-of-thought and in-context learning yield only marginal improvements, exposing fundamental limitations in their underlying semantic modeling capabilities.

📝 Abstract
Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks concerned with the software engineering of real-world code. We collected Python, C, and Java software projects from real-world repositories. We executed tests from these repositories, collected their execution traces, and constructed a ground-truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations on state-of-the-art LLMs. Our results show a clear performance gap for the models on fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limits models' code reasoning capabilities. Besides the dataset, benchmark, and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post-training. Our code and data are located at https://codesense-bench.github.io/.
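The trace-collection step described in the abstract can be illustrated with a minimal Python sketch using the standard `sys.settrace` hook. This is only an illustration of the idea, not CodeSense's actual multi-language, test-driven infrastructure; the function names here are assumptions.

```python
import sys

def collect_trace(func, *args):
    """Record (line number, local-variable snapshot) pairs while func runs.

    Minimal sketch of dynamic trace collection; the real framework
    handles whole test suites across Python, C, and Java.
    """
    trace = []

    def tracer(frame, event, arg):
        # Snapshot locals on each line event inside the traced function.
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace

def clamp(x, lo, hi):
    if x < lo:
        x = lo
    if x > hi:
        x = hi
    return x

result, trace = collect_trace(clamp, 15, 0, 10)
# result is 10; each trace entry snapshots locals at one executed line
```

Snapshots like these give the concrete intermediate values (e.g., `x` becoming 10 before the return) from which fine-grained reasoning questions and their ground-truth answers can be derived.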
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on fine-grained code reasoning in real-world SE tasks
Bridging the gap between synthetic and real-world code datasets
Assessing LLM limitations in understanding fine-grained code semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world code projects for fine-grained reasoning
Execution traces for ground truth dataset
Execution tracing framework for future benchmarks
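The ground-truth construction the bullets describe can be sketched as turning one trace snapshot into a value-prediction task. The task schema and helper name below are assumptions for illustration, not the paper's exact format; the trace here is hand-written, standing in for the output of the tracing step.

```python
def make_value_prediction_task(source, trace, var, lineno):
    """Build a (question, ground-truth answer) pair from a trace.

    Illustrative sketch: `trace` is a list of (lineno, locals) pairs,
    where each snapshot records locals after that line executed.
    """
    answer = next(snap[var] for ln, snap in trace
                  if ln == lineno and var in snap)
    question = (
        f"After line {lineno} executes, what is the value of `{var}`?"
        f"\n\n{source}"
    )
    return {"question": question, "answer": answer}

# Hand-written trace for a 3-line snippet; in practice this comes
# from the execution-tracing framework.
source = "x = 15\ny = x * 2\nz = x + y"
trace = [
    (1, {"x": 15}),
    (2, {"x": 15, "y": 30}),
    (3, {"x": 15, "y": 30, "z": 45}),
]
task = make_value_prediction_task(source, trace, "y", 2)
# task["answer"] is 30
```

Because the answer is read off an actual execution rather than written by hand, the ground truth stays correct even for code whose behavior is hard to predict statically.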