🤖 AI Summary
Prior work offers little empirical evidence on how AI coding agents compare with human developers on real-world performance optimization tasks. Method: We conduct the first empirical study of this question using the AIDev dataset, analyzing 324 AI agent pull requests (PRs) and 83 human PRs via chi-square tests, PR metadata mining, code change pattern classification, and manual annotation. Contribution/Results: (1) AI-authored PRs are significantly less likely than human-authored PRs to include explicit performance validation (45.7% vs. 63.6%, *p* = 0.007), pointing to a verification gap; (2) their optimization strategies largely match human patterns. This work provides an empirical baseline for AI agents in performance optimization, identifies the “verification gap” as a key obstacle to trust, and discusses limitations and opportunities for building more verifiable AI code optimizers.
📝 Abstract
Performance optimization is a critical yet challenging aspect of software development, often requiring a deep understanding of system behavior, algorithmic tradeoffs, and careful code modifications. Although recent advances in AI coding agents have accelerated code generation and bug fixing, little is known about how these agents perform on real-world performance optimization tasks. We present the first empirical study comparing agent- and human-authored performance optimization commits, analyzing 324 agent-generated and 83 human-authored PRs from the AIDev dataset across adoption, maintainability, optimization patterns, and validation practices. We find that AI-authored performance PRs are less likely to include explicit performance validation than human-authored PRs (45.7% vs. 63.6%, $p=0.007$). In addition, AI-authored PRs largely use the same optimization patterns as humans. We further discuss limitations and opportunities for advancing agentic code optimization.
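The validation-rate comparison above is a test of independence on a 2x2 table (PR author type by whether the PR includes explicit performance validation). The following is a minimal sketch of such a test, not the authors' analysis code: the function name `chi_square_2x2` and the counts in the usage lines are hypothetical, the raw counts behind the reported percentages are not given here, and whether the study applied a continuity correction is not stated, so none is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p_value). No continuity correction is applied (an assumption).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one observation")
    # Shortcut form of sum((observed - expected)^2 / expected) for a 2x2 table.
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for illustration only (NOT the study's data):
# 40 of 100 agent PRs and 30 of 50 human PRs include performance validation.
chi2, p = chi_square_2x2(40, 60, 30, 20)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

With small expected cell counts, an exact test such as Fisher's is usually preferred over the chi-square approximation.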