🤖 AI Summary
Existing code retrieval benchmarks mainly match natural-language queries to isolated code snippets, which does not reflect what coding agents need: locating relevant files and functions within a concrete repository state while filtering out similar in-repository distractors. This work introduces CORE-Bench, a repository-level code retrieval benchmark for agentic coding that contains over 180K queries and 106K broader-context relevance labels across three levels: code understanding, issue-to-edit localization, and broader context retrieval. The benchmark is built from curated code-search tasks and SWE-bench-series instances and is evaluated with representative embedding models. Experiments show a sharp performance drop when moving from traditional code search to this setting, while simple supervised fine-tuning of existing embedding models significantly improves results, indicating substantial room for further progress in repository-scale code retrieval.
📝 Abstract
Code retrieval is becoming central to coding agents, but agentic coding requires more than matching a natural-language query to an isolated snippet. Given a user request, a coding agent needs to navigate a concrete repository state, locate relevant files and functions, gather supporting context, and filter similar in-repository distractors. Existing code retrieval benchmarks mainly evaluate docstring-to-function or snippet-level matching, thereby missing this requirement-driven repository search problem. To address this gap, we introduce CORE-Bench, a comprehensive benchmark for code retrieval in the era of agentic coding. CORE-Bench evaluates code retrieval ability at three levels: code understanding, issue-to-edit localization, and broader context retrieval. Built from curated code-search tasks and SWE-bench-series instances, CORE-Bench contains over 180K queries and 106K broader-context relevance labels. Experiments with representative embedding models show a sharp drop from traditional code search to code retrieval in agentic coding settings. Simple supervised fine-tuning of existing embedding models significantly improves performance in this setting, suggesting substantial room for further progress.
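To make the repository-level setting concrete, the following is a minimal sketch of how one query might be scored against candidate functions from a single repository, including near-duplicate distractors. Everything here is an illustrative assumption rather than the benchmark's actual protocol: the bag-of-words `embed` function stands in for a real embedding model, the candidate identifiers and texts are made up, and recall@k is used only as a common retrieval metric, since the abstract does not name the metrics CORE-Bench reports.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, candidates):
    """Return candidate ids sorted by descending similarity to the query."""
    q = embed(query)
    return sorted(
        candidates,
        key=lambda cid: cosine(q, embed(candidates[cid])),
        reverse=True,
    )


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant ids that appear in the top-k ranked results."""
    if not relevant:
        return 0.0
    hits = sum(1 for cid in ranked[:k] if cid in relevant)
    return hits / len(relevant)


# Hypothetical repository snapshot: ids and texts are invented for illustration.
candidates = {
    "config/loader.py::load_config": "load config file from disk and parse yaml",
    "config/loader.py::load_defaults": "load default settings",
    "web/render.py::render_page": "render html template for page",
}
relevant = {"config/loader.py::load_config"}

ranked = rank("parse yaml config file", candidates)
score = recall_at_k(ranked, relevant, k=1)
```

A real evaluation would replace `embed` with the embedding model under test and average the metric over all queries; the sketch only shows the shape of the per-query computation.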