Turning the Tide: Repository-based Code Reflection

📅 2025-07-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing code evaluation benchmarks (e.g., HumanEval) focus on single-file code generation and fail to assess program understanding and modification capabilities in realistic, multi-file repository settings. Method: We introduce LiveRepoReflection, the first reflective, repository-level benchmark for real-world codebases, comprising 1,888 high-difficulty, multi-file test cases across six programming languages, rigorously curated to prevent data contamination. We propose a two-turn dialogue-based training paradigm to construct the instruction-tuning dataset RepoReflection-Instruct and present RepoReflectionCoder, a model integrating code generation with error-driven repair to enable dynamic contextual reasoning and multilingual comprehension. Contribution/Results: We release an open-source leaderboard featuring 40+ models, significantly advancing the evaluation of large language models' capabilities in error localization, reflective reasoning, and iterative repair within authentic software development workflows, thereby bridging intelligent coding systems with industrial engineering practice.

📝 Abstract
Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, thereby improving development efficiency and accessibility. While benchmarks such as HumanEval and LiveCodeBench evaluate code generation and real-world relevance, prior work overlooks the scenario of modifying code within repositories. Given the remaining challenges of improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across six programming languages to ensure diversity, correctness, and high difficulty. Further, we create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, used to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. Our leaderboard evaluates over 40 LLMs, reflecting model performance on repository-based code reflection.
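To make the repository-level evaluation setting concrete, here is a minimal sketch (not the paper's actual harness) of how such a test case might be scored: the model proposes edits to files in a multi-file repository, and the repository's own tests decide pass/fail. The file names, the single-test `run_tests` stand-in, and the dict-based repo representation are all illustrative assumptions.

```python
# Illustrative sketch of repository-level evaluation: a repo is a mapping
# of file paths to sources, a model submits edits, and the repo's tests
# determine success. All names here are assumptions for illustration.

def apply_edits(repo, edits):
    """Return a copy of the repo (path -> source) with model edits applied."""
    patched = dict(repo)
    patched.update(edits)
    return patched

def run_tests(repo):
    """Toy stand-in for executing the repo's test suite."""
    namespace = {}
    exec(repo["utils.py"], namespace)   # load the (possibly edited) module
    return namespace["add"](2, 3) == 5  # the repo's single "unit test"

repo = {"utils.py": "def add(a, b):\n    return a - b"}     # buggy repo
model_edits = {"utils.py": "def add(a, b):\n    return a + b"}

print(run_tests(repo))                            # buggy repo fails
print(run_tests(apply_edits(repo, model_edits)))  # patched repo passes
```

A real harness would run language-specific test runners in a sandbox rather than `exec`, but the pass/fail contract over a multi-file snapshot is the same idea.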
Problem

Research questions and friction points this paper is trying to address.

Evaluating code reflection in multi-file repositories
Addressing data contamination in dynamic benchmarks
Enhancing code understanding and generation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing LiveRepoReflection benchmark for multi-file repositories
Creating RepoReflection-Instruct dataset for instruction-tuning
Training RepoReflectionCoder via two-turn dialogue process
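The two-turn dialogue process above can be sketched as a chat-format training record: turn one asks the model to generate code given repository context, and turn two feeds back test errors and asks for an error-driven repair. The field names, prompt wording, and helper function below are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of a two-turn dialogue instruction-tuning example:
# generation turn followed by an error-driven repair turn. Schema and
# prompt text are assumptions for illustration only.

def build_two_turn_example(repo_context, task, first_attempt, error_log, fixed_code):
    """Assemble a chat-style record: initial generation, then repair."""
    return [
        {"role": "user",
         "content": f"Repository context:\n{repo_context}\n\nTask: {task}"},
        {"role": "assistant", "content": first_attempt},   # turn 1: generation
        {"role": "user",
         "content": f"Running the repository tests failed:\n{error_log}\n"
                    "Please reflect on the error and fix the code."},
        {"role": "assistant", "content": fixed_code},      # turn 2: repair
    ]

example = build_two_turn_example(
    repo_context="utils/math.py: def add(a, b): ...",
    task="Implement add so tests/test_math.py passes.",
    first_attempt="def add(a, b):\n    return a - b",
    error_log="AssertionError: add(2, 3) == 5, got -1",
    fixed_code="def add(a, b):\n    return a + b",
)
print(len(example))  # four messages: two user turns, two assistant turns
```

Training on both assistant turns would teach the model not only to generate code but also to localize errors from feedback and repair its own output, matching the reflective behavior the benchmark measures.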