A Benchmark for Localizing Code and Non-Code Issues in Software Projects

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks (e.g., SWE-Bench, LocBench) focus solely on pull requests and code locations, neglecting non-code artifacts—such as commit histories, issue comments, configuration files, and documentation—that are critical for fine-grained problem localization in real-world software maintenance. To address this gap, we propose MULocBench, the first diverse, code–non-code integrated benchmark for problem localization, comprising 1,100 real-world issues across open-source projects and supporting both file-level and function-level evaluation. Its construction systematically aggregates heterogeneous, multi-source evidence. We comprehensively evaluate five LLM prompting strategies and state-of-the-art localization methods. Results reveal severe limitations: the best-performing method achieves only 39.2% Acc@5 and 38.7% F1 at the file level—highlighting a critical capability gap in practical settings. MULocBench thus establishes a more challenging and realistic evaluation platform to advance future research in automated software maintenance.

📝 Abstract
Accurate project localization (e.g., files and functions) for issue resolution is a critical first step in software maintenance. However, existing benchmarks for issue localization, such as SWE-Bench and LocBench, are limited: they focus predominantly on pull-request issues and code locations, ignoring other evidence and non-code files such as commits, comments, configurations, and documentation. To address this gap, we introduce MULocBench, a comprehensive dataset of 1,100 issues from 46 popular GitHub Python projects. Compared with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types, providing a more realistic testbed for evaluation. Using this benchmark, we assess the performance of state-of-the-art localization methods and five LLM-based prompting strategies. Our results reveal significant limitations in current techniques: even at the file level, performance metrics (Acc@5, F1) remain below 40%. This underscores the challenge of generalizing to realistic, multi-faceted issue resolution. To enable future research on project localization for issue resolution, we publicly release MULocBench at https://huggingface.co/datasets/somethingone/MULocBench.
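The abstract reports file-level Acc@5 and F1. The paper does not spell out its exact metric definitions here, so the sketch below shows one standard way such metrics are computed for set-valued localization: Acc@k counts a hit if any gold file appears among the top-k predictions, and F1 compares the predicted file set against the gold file set. The file paths are hypothetical examples, not from the dataset.

```python
def acc_at_k(predicted, gold, k=5):
    """Hit (1.0) if any gold-standard file appears in the top-k ranked predictions."""
    return 1.0 if set(predicted[:k]) & set(gold) else 0.0

def file_f1(predicted, gold):
    """F1 between the predicted file set and the gold file set."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)          # true positives: correctly predicted files
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Hypothetical issue mixing code and non-code locations:
preds = ["src/app.py", "docs/README.md", "setup.cfg"]
gold = ["docs/README.md", ".github/workflows/ci.yml"]
print(acc_at_k(preds, gold, k=5))          # 1.0 (one gold file is in the top 5)
print(round(file_f1(preds, gold), 2))      # 0.4 (precision 1/3, recall 1/2)
```

Under these definitions, a method can score well on Acc@5 while its F1 stays low, since F1 also penalizes over-prediction and missed locations.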
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack diversity in issue types and file locations
Current localization methods score below 40% on file-level metrics (Acc@5, F1)
Prior work overlooks non-code artifacts, leaving a gap in localizing both code and non-code issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced MULocBench, a diverse dataset of 1,100 real-world issues for problem localization
Evaluated state-of-the-art localization methods and five LLM prompting strategies
Showed that even file-level localization performance remains below 40%