🤖 AI Summary
This study addresses emerging software supply chain security risks introduced by AI coding agents during dependency updates. Through a large-scale empirical analysis of 117,062 dependency changes drawn from agent- and human-authored pull requests, the work shows that AI agents more frequently select dependency versions containing known vulnerabilities, and that the resulting vulnerabilities are significantly harder to remediate. AI contributions produced a net increase of 98 vulnerabilities, whereas human contributions produced a net reduction of 1,316. Moreover, 36.8% of AI-introduced vulnerabilities require a major-version upgrade to remediate, substantially higher than the 12.9% observed for human-introduced vulnerabilities. To mitigate these risks, the paper proposes integrating vulnerability screening and registry-aware protection mechanisms at the pull request stage to strengthen the security of AI-assisted software development.
📝 Abstract
AI coding agents increasingly modify real software repositories and make dependency decisions, including adding, removing, or updating third-party packages. These choices can materially affect security posture and maintenance burden, yet repository-level evaluations largely emphasize test passing and executability without explicitly scoring whether systems (i) reuse existing dependencies, (ii) avoid unnecessary additions, or (iii) select versions that satisfy security and policy constraints. We propose DepDec-Bench, a benchmark for evaluating dependency decision-making beyond functional correctness. To ground DepDec-Bench in real-world behavior, we conduct a preliminary study of 117,062 dependency changes from agent- and human-authored pull requests across seven ecosystems. We show that coding agents frequently make dependency decisions whose security consequences remain invisible to test-focused evaluation: agents select versions already known to be vulnerable at PR time (2.46% of changes) and exhibit a net-negative security impact overall (net impact -98 vs. +1,316 for humans). These observations inform DepDec-Bench's task families and metrics, which evaluate safe version selection, reuse discipline, and restraint against dependency bloat alongside test passing.
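The PR-stage vulnerability screening proposed above could take the form of a pre-merge check against a public advisory database. Below is a minimal sketch using the OSV.dev query API; the function names and the simple pass/fail gating policy are illustrative assumptions, not the paper's actual mechanism.

```python
import json
from urllib import request

# Public OSV.dev endpoint for querying advisories by package + version.
OSV_URL = "https://api.osv.dev/v1/query"

def build_osv_query(name: str, version: str, ecosystem: str) -> dict:
    """Build the JSON body for OSV's /v1/query endpoint."""
    return {"package": {"name": name, "ecosystem": ecosystem},
            "version": version}

def has_known_vulns(osv_response: dict) -> bool:
    """OSV returns {"vulns": [...]} when advisories match, {} otherwise."""
    return bool(osv_response.get("vulns"))

def screen_dependency(name: str, version: str, ecosystem: str = "PyPI") -> bool:
    """Return True if the pinned version has known advisories.

    This performs a network call; a CI gate could fail the PR when it
    returns True, forcing the agent (or author) to pick a patched version.
    """
    body = json.dumps(build_osv_query(name, version, ecosystem)).encode()
    req = request.Request(
        OSV_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=10) as resp:
        return has_known_vulns(json.load(resp))
```

A registry-aware variant could additionally resolve the latest non-vulnerable release in the same major line before suggesting an upgrade, addressing the finding that AI-introduced vulnerabilities disproportionately require major-version jumps to fix.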