AI Summary
This study investigates why many fix-related pull requests (PRs) submitted by AI coding agents are never merged. Using the AIDEV POP dataset, the authors quantitatively analyze 8,106 fix-related PRs authored by five widely used AI coding agents, then manually examine a sample of 326 closed-but-unmerged PRs (approximately 100 person-hours of effort) to build a structured catalog of 12 failure reasons. The findings show that test failures and issues already resolved by other PRs are the most common reasons for non-merging, whereas build or deployment failures are comparatively rare. The work exposes key limitations of current AI coding agents in real-world software maintenance and points to directions for improving the agents and for more effective human-AI collaboration in software development.
Abstract
Autonomous coding agents (e.g., OpenAI Codex, Devin, GitHub Copilot) are increasingly used to generate fix-related pull requests (PRs) in real-world software repositories. However, their practical effectiveness depends on whether these contributions are accepted and merged by project maintainers. In this paper, we present an empirical study of fix-related PRs involving AI agents, examining their integration outcomes, their latency, and the factors that hinder successful merging. We first analyze 8,106 fix-related PRs authored by five widely used AI coding agents from the AIDEV POP dataset to quantify the proportions of PRs that are merged, closed without merging, or remain open. We then conduct a manual qualitative analysis of a statistically significant sample of 326 closed-but-unmerged PRs, spending approximately 100 person-hours to construct a structured catalog of 12 failure reasons. Our results indicate that test case failures and prior resolution of the same issues by other PRs are the most common causes of non-integration, whereas build or deployment failures are comparatively rare. Overall, our findings expose key limitations of current AI coding agents in real-world settings and highlight directions for their further improvement and for more effective human-AI collaboration in software maintenance.