Conventional Commit Classification using Large Language Models and Prompt Engineering

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of existing approaches to conventional commit classification: they typically rely on large annotated datasets to train task-specific models, which makes them costly to maintain and limits generalization. It systematically evaluates a training-free alternative, prompting open-source large language models (Mistral-7B-Instruct, LLaMA-3-8B, and DeepSeek-R1-32B) with zero-shot, few-shot, and chain-of-thought strategies. Experiments are conducted on a balanced dataset of 3,200 commits curated from the InfluxDB repository. Few-shot prompting yields the best accuracy, and DeepSeek-R1-32B is the strongest model overall, which suggests that model scale matters for this task. Chain-of-thought prompting provides no additional gain. The work shows both the promise and the limitations of training-free LLM approaches to conventional commit classification.
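The evaluation setup described above (a class-balanced sample scored by accuracy) can be sketched as follows. This is a minimal illustration, not the authors' curation pipeline: the helper names `balanced_sample` and `accuracy`, the `(diff, label)` tuple layout, and the per-class count are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import List, Optional, Sequence, Tuple

Commit = Tuple[str, str]  # (diff text, conventional commit type); assumed layout


def balanced_sample(commits: Sequence[Commit], per_class: int, seed: int = 0) -> List[Commit]:
    """Draw the same number of commits for every label (hypothetical helper)."""
    by_label = defaultdict(list)
    for diff, label in commits:
        by_label[label].append((diff, label))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample: List[Commit] = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < per_class:
            raise ValueError(f"label {label!r} has only {len(items)} commits")
        sample.extend(rng.sample(items, per_class))
    rng.shuffle(sample)
    return sample


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of exact label matches; an unparseable prediction (None) counts as wrong."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels differ in length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Balancing the classes means plain accuracy is not dominated by whichever commit type is most frequent in the repository history.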

📝 Abstract

Conventional commits provide a structured format for writing commit messages, which improves readability, eases software maintenance, and enables automation tools such as changelog generators and semantic versioning systems. Existing approaches to conventional commit classification typically rely on ML/DL models trained on large labeled datasets. In this paper, we investigate a training-free alternative that leverages large language models (LLMs) through prompt engineering. Rather than building a task-specific classifier, we evaluate three prompting strategies (zero-shot, few-shot, and chain-of-thought) across three open-source LLMs of varying scale: Mistral-7B-Instruct, LLaMA-3-8B, and DeepSeek-R1-32B. Classification is performed directly on code diffs extracted from a balanced dataset of 3,200 commits mined from the InfluxDB repository, without any model fine-tuning. Our results show that few-shot prompting consistently achieves the highest accuracy, while chain-of-thought prompting does not yield additional gains for this classification task. Among the evaluated models, DeepSeek-R1-32B achieves the strongest overall performance, suggesting that model scale plays a meaningful role in conventional commit classification. These findings provide practical guidance for researchers and practitioners seeking to automate commit classification without the overhead of curating and maintaining labeled training data.
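To make the three prompting strategies concrete, the sketch below builds one prompt of each kind for a code diff and extracts a label from a model response. The prompt wording, the `LABELS` list (borrowed from the commonly used Angular-style commit types), and all function names are assumptions for illustration; the abstract does not give the paper's actual prompts or label set.

```python
import re
from typing import List, Optional, Tuple

# Assumed label set; the paper's exact set is not stated in the abstract.
LABELS = ["feat", "fix", "docs", "style", "refactor", "perf", "test", "build", "ci", "chore"]
_ALLOWED = ", ".join(LABELS)


def zero_shot_prompt(diff: str) -> str:
    """Task description only, no examples."""
    return (
        "Classify the following code diff into one conventional commit type.\n"
        f"Allowed types: {_ALLOWED}.\n"
        "Answer with the type only.\n\n"
        f"Diff:\n{diff}\nType:"
    )


def few_shot_prompt(diff: str, examples: List[Tuple[str, str]]) -> str:
    """Task description plus labeled (diff, type) demonstrations."""
    shots = "\n\n".join(f"Diff:\n{d}\nType: {t}" for d, t in examples)
    return (
        "Classify each code diff into one conventional commit type.\n"
        f"Allowed types: {_ALLOWED}.\n\n"
        f"{shots}\n\nDiff:\n{diff}\nType:"
    )


def cot_prompt(diff: str) -> str:
    """Ask for step-by-step reasoning before the final answer line."""
    return (
        "Classify the following code diff into one conventional commit type.\n"
        f"Allowed types: {_ALLOWED}.\n"
        "Think step by step about what the change does, then finish with a "
        "line of the form 'Type: <type>'.\n\n"
        f"Diff:\n{diff}\n"
    )


def parse_label(response: str) -> Optional[str]:
    """Return the last valid 'Type: x' answer, else the first bare label word, else None."""
    text = response.lower()
    for candidate in reversed(re.findall(r"type:\s*([a-z]+)", text)):
        if candidate in LABELS:
            return candidate
    for word in re.findall(r"[a-z]+", text):
        if word in LABELS:
            return word
    return None
```

Taking the last `Type:` match rather than the first is a deliberate choice here: chain-of-thought responses may mention several candidate types while reasoning, and the final line is the one intended as the answer.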

Problem

Research questions and friction points this paper is trying to address.

Conventional Commits

Commit Classification

Large Language Models

Prompt Engineering

Training-free

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Engineering

Training-free

Conventional Commit Classification