🤖 AI Summary
This work addresses the problem that severe data noise—such as incompleteness, redundancy, and semantic distortion—in pull request (PR) descriptions limits the performance of AI-based PR summarization models. We present the first systematic quantification of how data cleaning impacts PR description generation quality. To this end, we propose four interpretable, PR-text-aware heuristic cleaning rules grounded in semantic understanding, and apply them to over 169,000 raw PR instances. Extensive experiments are conducted on state-of-the-art models including BART, T5, PRSummarizer, and iTAPE. Results show that cleaning yields average ROUGE-F1 improvements of 8.5%–8.7%. Human evaluation further confirms significant gains in readability and relevance of generated summaries. Our contribution is twofold: (1) a novel semantic-aware cleaning methodology tailored to PR text characteristics, and (2) a reproducible, practice-oriented framework for constructing high-quality training data in collaborative software development settings.
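The summary above does not spell out the four cleaning rules, so the sketch below is purely illustrative: a hypothetical `is_noisy_pr` filter showing the *kind* of interpretable, PR-text-aware heuristics the work describes (template residue, too-short text, unfilled checklists, bot boilerplate). None of these rules are taken from the paper itself.

```python
import re

def is_noisy_pr(description: str, min_words: int = 5) -> bool:
    """Flag a PR description as likely noise.

    Hypothetical rules for illustration only; the paper's actual
    four heuristics are not reproduced here.
    """
    text = description or ""
    # Rule A: strip HTML comments left over from PR templates.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL).strip()
    # Rule B: too short to meaningfully describe a change.
    if len(text.split()) < min_words:
        return True
    # Rule C: unfilled template checklists dominate the text.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    boxes = sum(1 for ln in lines
                if ln.lstrip().startswith(("- [ ]", "- [x]")))
    if lines and boxes / len(lines) > 0.5:
        return True
    # Rule D: auto-generated bot boilerplate.
    if re.search(r"\b(dependabot|auto-generated)\b", text, flags=re.I):
        return True
    return False
```

Applied over a raw corpus, a filter of this shape partitions instances into "clean" training data and discarded noise, which is the pipeline structure the summary describes.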
📝 Abstract
Pull Requests (PRs) are central to collaborative coding, summarizing code changes for reviewers. However, many PR descriptions are incomplete, uninformative, or contain out-of-context content, compromising developer workflows and hindering AI-based generation models trained on commit messages and original descriptions as "ground truth." This study examines the prevalence of "noisy" PRs and evaluates their impact on state-of-the-art description generation models. To do so, we propose four cleaning heuristics to filter noise from an initial dataset of 169K+ PRs drawn from 513 GitHub repositories. We train four models (BART, T5, PRSummarizer, and iTAPE) on both raw and cleaned datasets. Performance is measured via ROUGE-1, ROUGE-2, and ROUGE-L metrics, alongside a manual evaluation to assess description quality improvements from a human perspective. Cleaning the dataset yields significant gains: average F1 improvements of 8.6% (ROUGE-1), 8.7% (ROUGE-2), and 8.5% (ROUGE-L). Manual assessment confirms higher readability and relevance in descriptions generated by the best-performing model, BART, when trained on cleaned data. Dataset refinement markedly enhances PR description generation, offering a foundation for more accurate AI-driven tools and guidelines to assist developers in crafting high-quality PR descriptions.
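The reported gains are ROUGE F1 scores, which measure n-gram overlap between a generated description and the reference one. As a minimal sketch of what ROUGE-1 F1 computes (real evaluations typically use a maintained library with stemming, e.g. the `rouge-score` package, rather than this toy version):

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Minimal ROUGE-1 F1: clipped unigram overlap between a
    reference PR description and a generated one. Illustrative
    sketch only; no stemming or tokenization beyond whitespace."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: reference "add null check to parser" vs. generated
# "add null check" gives precision 1.0, recall 0.6, F1 0.75.
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 scheme over bigrams and longest common subsequences, respectively.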