🤖 AI Summary
This study addresses the lack of systematic code-quality evaluation for merged pull requests (PRs) generated by AI coding agents in real-world projects. Analyzing 1,210 merged AI-generated Python bug-fix PRs, the authors perform differential static analysis with SonarQube and normalize findings along multiple dimensions, including code churn, to assess newly introduced issues across five AI agents. Their findings reveal that successful merging does not guarantee code quality: code smells dominate and are often high-severity, while bugs, though less frequent, are often critical. After normalizing by PR size, disparities in issue density across agents diminish markedly, suggesting that issue prevalence stems primarily from PR scale rather than inherent agent capability. This work underscores significant quality risks in AI-generated code and calls for standardized quality-assurance mechanisms.
📝 Abstract
The increasing adoption of AI coding agents has led to a growing number of agent-generated pull requests (PRs) merged with little or no human intervention. Although such PRs promise productivity gains, their post-merge code quality remains underexplored, as prior work has largely relied on benchmarks and controlled tasks rather than large-scale post-merge analyses. To address this gap, we analyze 1,210 merged agent-generated bug-fix PRs from Python repositories in the AIDev dataset. Using SonarQube, we perform a differential analysis between base and merged commits to identify code quality issues newly introduced by PR changes. We examine issue frequency, density, severity, and rule-level prevalence across five agents. Our results show that apparent differences in raw issue counts across agents largely disappear after normalizing by code churn, indicating that higher issue counts are primarily driven by larger PRs. Across all agents, code smells dominate, particularly at critical and major severities, while bugs are less frequent but often severe. Overall, our findings show that merge success does not reliably reflect post-merge code quality, highlighting the need for systematic quality checks for agent-generated bug-fix PRs.
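The core of the methodology — identifying issues newly introduced by a PR and normalizing their count by code churn — can be sketched as follows. This is an illustrative reconstruction, not the authors' actual pipeline: the issue-key format, the `new_issues` and `issue_density` helpers, and the per-KLOC normalization choice are all assumptions for the example.

```python
# Hedged sketch of the paper's differential-analysis idea (hypothetical
# helpers; not the authors' code). A "new" issue is one that SonarQube
# reports at the merged commit but not at the base commit; density then
# normalizes the new-issue count by code churn (changed lines).

def new_issues(base_issues: set[str], merged_issues: set[str]) -> set[str]:
    """Issues present at the merged commit but absent at the base commit."""
    return merged_issues - base_issues

def issue_density(n_new_issues: int, churn_loc: int) -> float:
    """New issues per 1,000 changed lines (additions + deletions)."""
    if churn_loc == 0:
        return 0.0
    return 1000 * n_new_issues / churn_loc

# Example with made-up issue keys (rule id @ file : line):
base = {"python:S1481@utils.py:10"}
merged = {
    "python:S1481@utils.py:10",   # pre-existing, not attributed to the PR
    "python:S5754@app.py:42",     # newly introduced by the PR
    "python:S107@app.py:7",       # newly introduced by the PR
}
introduced = new_issues(base, merged)
print(len(introduced))                            # → 2
print(issue_density(len(introduced), churn_loc=250))  # → 8.0 (per KLOC churned)
```

Normalizing this way is what lets the study conclude that agents producing larger PRs are not necessarily producing worse code per line changed.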