🤖 AI Summary
Existing benchmarks predominantly rely on synthetic data, failing to directly assess large language models' (LLMs) ability to edit code according to user instructions in realistic development settings. Method: We introduce EDIT-Bench, an instruction-based code editing benchmark grounded in authentic development scenarios, comprising 545 real-world programming tasks spanning multiple natural and programming languages. It incorporates context-dependent problems requiring models to jointly reason over the full code context, highlighted regions, and cursor position, enabling systematic evaluation of IDE-level instruction execution. Data are derived from real user instructions and code contexts, curated via multilingual parsing, contextual modeling, and rigorous human validation. Contribution/Results: Evaluation across 40 state-of-the-art LLMs reveals that only five achieve over 60% accuracy; instruction type and contextual information significantly affect performance, inducing up to 11% accuracy variance and highlighting critical bottlenecks in current models' real-world code editing capabilities.
📝 Abstract
Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EDIT-Bench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EDIT-Bench comprises 545 problems spanning multiple natural and programming languages and a diverse set of real-world use cases, ranging from resolving errors to adding features. EDIT-Bench introduces context-dependent problems that require the model to understand the code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EDIT-Bench is a challenging set of problems on which only 5 models score over 60%. We find that model performance varies across different categories of user instructions. Further, varying levels of contextual information greatly affect the task success rate, with performance varying by up to 11%, indicating the importance of evaluating with realistic context.
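To make the context-dependent setting concrete, here is a minimal sketch of how one such problem instance might be represented and rendered into a model prompt. All field and function names (`EditProblem`, `build_prompt`, the `<select>` marker) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class EditProblem:
    """Hypothetical schema for a context-dependent editing problem.

    A real EDIT-Bench instance may differ; this only illustrates the three
    kinds of context the abstract describes: full code, selection, cursor.
    """
    instruction: str                               # natural-language user instruction
    code_context: str                              # full file contents around the edit
    highlighted_region: Optional[Tuple[int, int]]  # (start, end) offsets of the selection
    cursor_position: int                           # character offset of the cursor

def build_prompt(p: EditProblem) -> str:
    """Assemble a model input that marks the user's selection inline."""
    code = p.code_context
    if p.highlighted_region is not None:
        s, e = p.highlighted_region
        code = code[:s] + "<select>" + code[s:e] + "</select>" + code[e:]
    return (
        f"Instruction: {p.instruction}\n"
        f"Cursor at offset {p.cursor_position}\n\n"
        f"Code:\n{code}"
    )

p = EditProblem(
    instruction="Rename the variable x to total",
    code_context="x = 0\nfor v in vals:\n    x += v\n",
    highlighted_region=(0, 1),
    cursor_position=0,
)
print(build_prompt(p))
```

Varying which of these fields is supplied (selection only, cursor only, full context) is one way the "levels of contextual information" described in the abstract can be operationalized.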