An Empirical Study on LLM-based Agents for Automated Bug Fixing

📅 2024-11-15

🏛️ arXiv.org

📈 Citations: 10

✨ Influential: 1

🤖 AI Summary

Problem: Existing LLM-based and non-LLM automatic program repair (APR) systems lack systematic, standardized performance evaluation, particularly rigorous cross-system benchmarking of state-of-the-art approaches. Method: We conduct the first comprehensive evaluation of seven leading open- and closed-source APR systems, including both agent-based and non-agent baselines, on SWE-bench Lite, analyzing them along three dimensions: solution coverage, fault localization accuracy, and necessity of dynamic reproduction. We introduce a novel framework integrating environment-aware interactive debugging, iterative validation, and fine-grained (file- and line-level) fault localization. Results: Empirical analysis reveals that 23% of defects are resolvable only via dynamic reproduction; fault localization accuracy varies significantly across systems (up to a 41 percentage-point gap); and agent performance is fundamentally constrained by suboptimal synergy between base model capabilities and workflow design. Our findings provide quantifiable diagnostic insights along two axes, environmental interaction and procedural robustness, as actionable guidance for APR system optimization.
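To make the file- and line-level fault localization comparison concrete, here is a minimal Python sketch. It is not the paper's evaluation code: the matching rule (a hit means the generated patch touches a file, or a line, that the reference patch also edits), the `tolerance` parameter, and all function names are assumptions chosen for illustration.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Hypothetical representation: a patch maps each edited file path
# to the set of line numbers it modifies.
Patch = Dict[str, Set[int]]


def file_level_hit(generated: Patch, gold: Patch) -> bool:
    """True if the generated patch edits at least one file the gold patch edits."""
    return bool(generated.keys() & gold.keys())


def line_level_hit(generated: Patch, gold: Patch, tolerance: int = 0) -> bool:
    """True if, in some shared file, a generated edit lies within
    `tolerance` lines of a gold edit (assumed matching rule)."""
    for path in generated.keys() & gold.keys():
        for gold_line in gold[path]:
            if any(abs(gold_line - line) <= tolerance for line in generated[path]):
                return True
    return False


def hit_rate(pairs: Iterable[Tuple[Patch, Patch]],
             check: Callable[[Patch, Patch], bool]) -> float:
    """Fraction of (generated, gold) pairs for which `check` reports a hit."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(check(gen, gold) for gen, gold in pairs) / len(pairs)
```

Under these assumptions, comparing `hit_rate(pairs, file_level_hit)` with `hit_rate(pairs, line_level_hit)` for each system shows how much accuracy is lost when moving from finding the right file to finding the right lines.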

📝 Abstract

Large language models (LLMs) and LLM-based agents have been applied to fix bugs automatically, demonstrating the ability to address software defects through development environment interaction, iterative validation, and code modification. However, systematic analysis of these agent and non-agent systems remains limited, particularly regarding performance variations among the top-performing ones. In this paper, we examine seven proprietary and open-source systems on the SWE-bench Lite benchmark for automated bug fixing. We first assess each system's overall performance, noting instances solvable by all or none of these systems, and explore why some instances are uniquely solved by specific system types. We also compare fault localization accuracy at the file and line levels and evaluate bug reproduction capabilities, identifying instances solvable only through dynamic reproduction. Through this analysis, we conclude that further optimization is needed in both the LLM itself and the design of the agentic flow to improve the effectiveness of agents in bug fixing.
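The "solvable by all, by none, or uniquely by one system" breakdown described in the abstract can be sketched with plain set operations. This is an illustrative sketch only: the function name, the input layout (a mapping from system name to the set of resolved instance IDs), and the example IDs are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple


def partition_instances(
    resolved: Dict[str, Set[str]],
    all_instances: Set[str],
) -> Tuple[Set[str], Set[str], Dict[str, Set[str]]]:
    """Split benchmark instances by which systems resolved them.

    Returns (solved_by_all, solved_by_none, uniquely_solved), where
    uniquely_solved maps each system to the instances only it resolved.
    """
    solved_by_all = set.intersection(*resolved.values()) if resolved else set()
    solved_by_any = set().union(*resolved.values())
    solved_by_none = all_instances - solved_by_any

    uniquely_solved: Dict[str, Set[str]] = {}
    for name, solved in resolved.items():
        # Everything resolved by at least one *other* system.
        others = set().union(*(s for n, s in resolved.items() if n != name))
        uniquely_solved[name] = solved - others
    return solved_by_all, solved_by_none, uniquely_solved


# Hypothetical example data: three systems, five instances.
example_resolved = {
    "agent_a": {"i1", "i2", "i3"},
    "agent_b": {"i1", "i2"},
    "pipeline_c": {"i1", "i4"},
}
example_instances = {"i1", "i2", "i3", "i4", "i5"}
by_all, by_none, unique = partition_instances(example_resolved, example_instances)
```

The uniquely solved sets are the natural starting point for the qualitative question the abstract raises: why a given instance is resolved by only one type of system.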

Problem

Research questions and friction points this paper is trying to address.

Systematically analyzing performance variations among top LLM-based bug-fixing agents

Evaluating seven repair systems on the SWE-bench Lite benchmark for automated bug fixing

Identifying optimization needs in LLM capabilities and Agentic flow design

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based agents automate bug fixing

Agents interact with development environments iteratively

The study analyzes performance variations across repair systems

🔎 Similar Papers

A Systematic Literature Review on Large Language Models for Automated Program Repair