Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

career value

189K/year

🤖 AI Summary

Fine-grained classification of code review comments suffers from high annotation costs and poor performance on infrequent categories. Method: This work pioneers the application of large language models (LLMs) to zero-shot and few-shot classification of 17 review comment types, leveraging prompt engineering and category semantic enhancement to mitigate long-tail distribution challenges without requiring extensive labeled data. Contribution/Results: Our approach surpasses the state-of-the-art deep learning models in overall accuracy and achieves an average +12.3% F1-score improvement across five critical low-frequency categories—marking the first demonstration of balanced performance between high- and low-frequency classes. By significantly reducing reliance on manual annotation, this work establishes a scalable, low-resource paradigm for fine-grained software engineering text analysis.

Technology Category

Application Category

📝 Abstract

Code review is a crucial practice in software development. As code review nowadays is lightweight, various issues can be identified, and sometimes, they can be trivial. Research has investigated automated approaches to classify review comments to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train the models effectively. To address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs to classify 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach using a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, which the state-of-the-art approach struggles with due to low training examples. Rather than relying solely on a specific small training data distribution, our results show that LLMs provide balanced performance across high- and low-frequency categories. These results suggest that the LLMs could offer a scalable solution for code review analytics to improve the effectiveness of the code review process.

Problem

Research questions and friction points this paper is trying to address.

Classifying code review comments using Large Language Models

Overcoming limitations of supervised learning in comment classification

Improving accuracy in low-frequency comment categories with LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs for code review comment classification

Outperforming deep learning with limited training data

Balanced performance across high- and low-frequency categories

🔎 Similar Papers

Leveraging Encoder-only Large Language Models for Mobile App Review Feature Extraction