🤖 AI Summary
This work addresses the problem that severe data noise—such as incompleteness, redundancy, and semantic distortion—in pull request (PR) descriptions limits the performance of AI-based PR summarization models. We present the first systematic quantification of how data cleaning impacts PR description generation quality. To this end, we propose four interpretable, PR-text-aware heuristic cleaning rules grounded in semantic understanding, and apply them to over 169,000 raw PR instances. Extensive experiments are conducted on state-of-the-art models including BART, T5, PRSummarizer, and iTAPE. Results show that cleaning yields average ROUGE-F1 improvements of 8.5%–8.7%. Human evaluation further confirms significant gains in readability and relevance of generated summaries. Our contribution is twofold: (1) a novel semantic-aware cleaning methodology tailored to PR text characteristics, and (2) a reproducible, practice-oriented framework for constructing high-quality training data in collaborative software development settings.
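The summary above does not spell out the four cleaning rules, so the sketch below is purely illustrative: a hypothetical `is_noisy_pr` filter showing the *kind* of interpretable, PR-text-aware heuristics the work describes (template residue, too-short text, unfilled checklists, bot boilerplate). None of these rules are taken from the paper itself.

```python
import re

def is_noisy_pr(description: str, min_words: int = 5) -> bool:
    """Flag a PR description as likely noise.

    Hypothetical rules for illustration only; the paper's actual
    four heuristics are not reproduced here.
    """
    text = description or ""
    # Rule A: strip HTML comments left over from PR templates.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL).strip()
    # Rule B: too short to meaningfully describe a change.
    if len(text.split()) < min_words:
        return True
    # Rule C: unfilled template checklists dominate the text.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    boxes = sum(1 for ln in lines
                if ln.lstrip().startswith(("- [ ]", "- [x]")))
    if lines and boxes / len(lines) > 0.5:
        return True
    # Rule D: auto-generated bot boilerplate.
    if re.search(r"\b(dependabot|auto-generated)\b", text, flags=re.I):
        return True
    return False
```

Applied over a raw corpus, a filter of this shape partitions instances into "clean" training data and discarded noise, which is the pipeline structure the summary describes.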
📝 Abstract
Pull Requests (PRs) are central to collaborative coding, summarizing code changes for reviewers. However, many PR descriptions are incomplete, uninformative, or contain out-of-context content, compromising developer workflows and hindering AI-based generation models trained on commit messages and original descriptions as "ground truth." This study examines the prevalence of "noisy" PRs and evaluates their impact on state-of-the-art description generation models. To do so, we propose four cleaning heuristics to filter noise from an initial dataset of 169K+ PRs drawn from 513 GitHub repositories. We train four models (BART, T5, PRSummarizer, and iTAPE) on both raw and cleaned datasets. Performance is measured via ROUGE-1, ROUGE-2, and ROUGE-L metrics, alongside a manual evaluation to assess description quality improvements from a human perspective. Cleaning the dataset yields significant gains: average F1 improvements of 8.6% (ROUGE-1), 8.7% (ROUGE-2), and 8.5% (ROUGE-L). Manual assessment confirms higher readability and relevance in descriptions generated by the best-performing model, BART, when trained on cleaned data. Dataset refinement markedly enhances PR description generation, offering a foundation for more accurate AI-driven tools and guidelines to assist developers in crafting high-quality PR descriptions.
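The reported gains are ROUGE F1 scores, which measure n-gram overlap between a generated description and the reference one. As a minimal sketch of what ROUGE-1 F1 computes (real evaluations typically use a maintained library with stemming, e.g. the `rouge-score` package, rather than this toy version):

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Minimal ROUGE-1 F1: clipped unigram overlap between a
    reference PR description and a generated one. Illustrative
    sketch only; no stemming or tokenization beyond whitespace."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: reference "add null check to parser" vs. generated
# "add null check" gives precision 1.0, recall 0.6, F1 0.75.
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 scheme over bigrams and longest common subsequences, respectively.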