SWE-Explore: Benchmarking How Coding Agents Explore Repositories

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for coding agents score tasks as resolved or unresolved and do not separately evaluate repository exploration skills such as context retrieval, code localization, and bug diagnosis. This work proposes SWE-Explore, a benchmark that isolates repository exploration as an independently assessable task: given a repository and an issue, the explorer must return a ranked list of relevant code regions within a fixed line budget. Line-level ground truth is distilled from independent agent trajectories that successfully solved the same issue, and exploration is evaluated along three dimensions: coverage, ranking quality, and context efficiency. Across 848 issues, 10 programming languages, and 203 open-source repositories, agentic explorers form a clear tier above classical retrieval methods; file-level localization is already strong for modern methods, while line-level coverage and efficient ranking remain the key differentiators among state-of-the-art explorers.

📝 Abstract

Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.
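The task format described above can be illustrated with a small sketch. The abstract does not give the benchmark's metric formulas or data schema, so everything here is an assumption made for illustration: the `Region` type, the truncation rule in `apply_budget`, and the three scores (gold-line recall for coverage, reciprocal rank of the first relevant region for ranking, and the fraction of returned lines that are gold for context efficiency) are hypothetical stand-ins, not SWE-Explore's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous, inclusive line range in one file (hypothetical schema)."""
    path: str
    start: int
    end: int

    def lines(self) -> set[tuple[str, int]]:
        return {(self.path, n) for n in range(self.start, self.end + 1)}


def apply_budget(ranked: list[Region], budget: int) -> list[Region]:
    """Keep regions in rank order; truncate the region that crosses the budget.

    Assumption: overlapping regions are charged once per occurrence.
    """
    kept: list[Region] = []
    used = 0
    for region in ranked:
        remaining = budget - used
        if remaining <= 0:
            break
        size = region.end - region.start + 1
        end = region.end if size <= remaining else region.start + remaining - 1
        kept.append(Region(region.path, region.start, end))
        used += end - region.start + 1
    return kept


def score(ranked: list[Region], gold: list[Region], budget: int) -> dict[str, float]:
    """Illustrative coverage, ranking, and context-efficiency scores."""
    kept = apply_budget(ranked, budget)
    gold_lines = set().union(*(g.lines() for g in gold))
    returned = set().union(*(r.lines() for r in kept))
    hits = returned & gold_lines

    # Reciprocal rank of the first kept region that touches any gold line.
    reciprocal_rank = 0.0
    for rank, region in enumerate(kept, start=1):
        if region.lines() & gold_lines:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "coverage": len(hits) / len(gold_lines) if gold_lines else 0.0,
        "ranking": reciprocal_rank,
        "efficiency": len(hits) / len(returned) if returned else 0.0,
    }


# Toy instance: the gold region is lines 10-19 of a.py, with a 15-line budget.
gold = [Region("a.py", 10, 19)]
ranked = [Region("b.py", 1, 5), Region("a.py", 10, 14), Region("a.py", 15, 30)]
result = score(ranked, gold, budget=15)
print(result)
```

In this toy instance the explorer covers every gold line, but it spends a third of its budget on an irrelevant file and ranks that file first, so the coverage score alone would hide differences that the ranking and efficiency scores expose.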

Problem

Research questions and friction points this paper is trying to address.

repository exploration

coding agents

benchmarking

code localization

context retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

repository exploration

code localization

line-level coverage