🤖 AI Summary
Current software patching agents trade off patching performance, stability, and cost: agent-based planning methods achieve high patching performance but are expensive and unstable, whereas human-based planning methods are efficient and stable yet held back by key workflow limitations. To address this, the authors propose PatchPilot, an agentic patcher evaluated on SWE-Bench and built around a human-based planning workflow with five components: reproduction, localization, generation, validation, and refinement, where the refinement component is unique to PatchPilot. The framework combines large language models with non-ML tools and introduces customized designs for each component. Experiments show that PatchPilot outperforms existing open-source methods on SWE-Bench at a cost of less than $1 per instance and with higher stability. An ablation study validates the key designs in each component.
📝 Abstract
Recent research has built various patching agents that combine large language models (LLMs) with non-ML tools and achieve promising results on the state-of-the-art (SOTA) software patching benchmark, SWE-Bench. Depending on how the patching workflow is determined, existing patching agents can be categorized as agent-based planning methods, which rely on LLMs for planning, and human-based planning methods, which follow a pre-defined workflow. At a high level, agent-based planning methods achieve high patching performance but at a high cost and with limited stability. Human-based planning methods, on the other hand, are more stable and efficient but have key workflow limitations that compromise their patching performance. In this paper, we propose PatchPilot, an agentic patcher that strikes a balance between patching efficacy, stability, and cost-efficiency. PatchPilot uses a novel human-based planning workflow with five components: reproduction, localization, generation, validation, and refinement (where refinement is unique to PatchPilot). We introduce novel, customized designs for each component to optimize its effectiveness and efficiency. Through extensive experiments on the SWE-Bench benchmarks, PatchPilot outperforms existing open-source methods while maintaining low cost (less than $1 per instance) and ensuring higher stability. We also conduct a detailed ablation study to validate the key designs in each component.
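To make the idea of a pre-defined (human-based planning) workflow concrete, the following is a minimal sketch of how five fixed stages might be chained. It is not PatchPilot's implementation: the `PatchState` fields, the `run_fixed_workflow` function, the `max_refinements` budget, and the choice to re-validate after each refinement round are all assumptions made for illustration, and the stage bodies are stubs standing in for LLM and tool calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PatchState:
    """Hypothetical record passed between stages (not from the paper)."""
    issue: str
    repro_test: Optional[str] = None
    locations: List[str] = field(default_factory=list)
    patch: Optional[str] = None
    passed: bool = False
    feedback: str = ""
    rounds: int = 0


Stage = Callable[[PatchState], PatchState]


def run_fixed_workflow(state: PatchState, reproduce: Stage, localize: Stage,
                       generate: Stage, validate: Stage, refine: Stage,
                       max_refinements: int = 2) -> PatchState:
    """Run the stages in a fixed order; no LLM decides what comes next."""
    state = reproduce(state)   # build a test that reproduces the issue
    state = localize(state)    # find the code likely responsible
    state = generate(state)    # draft a candidate patch
    state = validate(state)    # check the candidate against the tests
    # Assumed control flow: refine a failing patch, then validate again.
    while not state.passed and state.rounds < max_refinements:
        state = refine(state)
        state.rounds += 1
        state = validate(state)
    return state


def demo_stages():
    """Stub stages: the first draft fails and one refinement fixes it."""
    def reproduce(s): s.repro_test = "test_issue"; return s
    def localize(s): s.locations = ["pkg/module.py"]; return s
    def generate(s): s.patch = "draft"; return s
    def validate(s):
        s.passed = s.patch == "refined"
        s.feedback = "" if s.passed else "reproduction test still fails"
        return s
    def refine(s): s.patch = "refined"; return s
    return reproduce, localize, generate, validate, refine


if __name__ == "__main__":
    result = run_fixed_workflow(PatchState(issue="demo"), *demo_stages())
    print(result.passed, result.rounds)
```

Because the stage order is hard-coded, each run makes a bounded number of stage calls, which is one plausible reason a workflow of this kind can be cheaper and more stable than letting an LLM plan each step.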