🤖 AI Summary
This study addresses the problem of predicting the practical utility of code review (CR) comments in order to quantify their real-world value in collaborative software development. We propose a dual-path modeling framework that combines text feature engineering—covering terminology, voice analysis, and code snippet identification—with a feature-agnostic approach based on a TF-IDF-weighted bag-of-words (BoW) representation. To our knowledge, this is the first systematic comparison of a commercial large language model (LLM), specifically GPT-4o, against lightweight methods for CR utility prediction. We design cross-project transferable linguistic features that reveal consistent utility discrimination patterns across datasets, and we validate model generalizability via multi-source data fusion and transfer learning. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance in industry settings. While GPT-4o attains high accuracy, the TF-IDF BoW variant delivers comparable effectiveness with greater efficiency and easier deployment.
📝 Abstract
Context: In collaborative software development, the peer code review process is beneficial only when reviewers provide useful comments. Objective: This paper investigates the usefulness of Code Review Comments (CR comments) through textual feature-based and featureless approaches. Method: We select three available datasets from both open-source and commercial projects. Additionally, we introduce new features from software and non-software domains. Moreover, we experiment with the presence of jargon, voice, and code in CR comments, and we classify the usefulness of CR comments through featurization, bag-of-words, and transfer learning techniques. Results: Our models outperform the baseline, achieving state-of-the-art performance. Furthermore, the results demonstrate that either a large commercial LLM, GPT-4o, or a simple, non-commercial featureless approach, bag-of-words with TF-IDF, is most effective for predicting the usefulness of CR comments. Conclusion: The significant improvement in predicting usefulness solely from CR comments should spur further research on this task. Our analyses highlight the similarities and differences across domains, projects, datasets, models, and features for predicting the usefulness of CR comments.
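The featureless baseline described above—a TF-IDF-weighted bag-of-words representation fed to a lightweight classifier—can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy comments, labels, and choice of logistic regression are assumptions for the sake of example.

```python
# Minimal sketch of a TF-IDF bag-of-words baseline for classifying
# CR comments as useful (1) vs. not useful (0). Data and classifier
# choice are illustrative assumptions, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical CR comments with hand-assigned usefulness labels.
comments = [
    "Please extract this logic into a helper method to avoid duplication.",
    "Looks good to me.",
    "This loop can throw a NullPointerException when the list is empty.",
    "Nice work!",
    "Consider using a constant instead of this magic number.",
    "+1",
]
labels = [1, 0, 1, 0, 1, 0]

# TF-IDF down-weights ubiquitous tokens; word bigrams capture short
# phrases such as "looks good" that signal low-information comments.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
model.fit(comments, labels)

# Predict the usefulness of an unseen comment.
pred = model.predict(["Maybe rename this variable for clarity?"])[0]
print(pred)
```

Such a pipeline trains in milliseconds on a CPU, which is what makes the featureless BoW variant attractive for deployment compared with calling a commercial LLM per comment.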