🤖 AI Summary
This study addresses the lack of systematic empirical research on the quality and security implications of refactoring changes submitted by AI agents in real-world software projects. It presents the first quantitative analysis of AI-generated Python refactoring pull requests from the AIDev dataset, evaluating changes across maintainability, code quality, and security dimensions using PyQu, Pylint, and Bandit. The work further maps 24 recurring refactoring operations to the lint and security findings they most commonly affect. On average, a quality attribute improved in 22.5% of the studied changes, most often usability, yet 24.17% of modified files introduced new Pylint issues and 4.7% introduced new Bandit security findings. Although 73.5% of pull requests were merged, indicating high developer acceptance, the results underscore the need for stronger quality and security gating in AI-assisted development workflows.
📝 Abstract
As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change.
Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues, predominantly convention-level violations such as long lines, while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.
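The before/after comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `diff_findings` and the sample data are hypothetical, and the sketch assumes each finding is a dictionary with a `symbol` key, similar in shape to the records Pylint emits with `--output-format=json`. Findings are compared by rule rather than by line number, since refactoring usually shifts lines.

```python
from collections import Counter


def diff_findings(before, after):
    """Compare two lists of findings for one file (hypothetical helper).

    Each finding is assumed to be a dict with a "symbol" key naming the rule.
    Returns (introduced, removed) as Counters keyed by rule symbol.
    """
    before_counts = Counter(f["symbol"] for f in before)
    after_counts = Counter(f["symbol"] for f in after)
    # Counter subtraction keeps only positive counts, so each result
    # holds just the net increase or net decrease per rule.
    introduced = after_counts - before_counts
    removed = before_counts - after_counts
    return introduced, removed


# Illustrative data only, not taken from the study.
before = [
    {"symbol": "unused-import", "line": 1},
    {"symbol": "line-too-long", "line": 10},
]
after = [
    {"symbol": "line-too-long", "line": 12},
    {"symbol": "line-too-long", "line": 30},
]

introduced, removed = diff_findings(before, after)
print(dict(introduced))  # {'line-too-long': 1}
print(dict(removed))     # {'unused-import': 1}
```

In this toy case a single change both removes an existing issue and introduces a new convention-level one, which is the kind of mixed outcome the abstract reports for merged PRs.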