Conventional Commit Classification using Large Language Models and Prompt Engineering
Pith reviewed 2026-05-08 19:35 UTC · model grok-4.3
The pith
Few-shot prompting with the 32B DeepSeek-R1 model achieves the highest accuracy on a balanced set of 3,200 conventional commits mined from InfluxDB; chain-of-thought prompting adds no benefit, and accuracy improves with model scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our results show that few-shot prompting consistently achieves the highest accuracy, while chain-of-thought prompting does not yield additional gains for this classification task. Among the evaluated models, DeepSeek-R1-32B achieves the strongest overall performance, suggesting that model scale plays a meaningful role in conventional commit classification.
Load-bearing premise
That the balanced 3,200-commit dataset mined from a single repository (InfluxDB) is representative of conventional commits in general, and that the classifications can be trusted as ground truth even though no details on human validation or inter-rater agreement are reported.
Original abstract
Conventional commits provide a structured format for writing commit messages, which improves readability and software maintenance and enables automation tools such as changelog generators and semantic versioning systems. Existing approaches to conventional commit classification typically rely on ML/DL models trained on large labeled datasets. In this paper, we investigate a training-free alternative that leverages large language models (LLMs) through prompt engineering. Rather than building a task-specific classifier, we evaluate three prompting strategies (zero-shot, few-shot, and chain-of-thought) across three open-source LLMs of varying scale: Mistral-7B-Instruct, LLaMA-3-8B, and DeepSeek-R1-32B. Classification is performed directly on code diffs extracted from a balanced dataset of 3,200 commits mined from the InfluxDB repository, without any model fine-tuning. Our results show that few-shot prompting consistently achieves the highest accuracy, while chain-of-thought prompting does not yield additional gains for this classification task. Among the evaluated models, DeepSeek-R1-32B achieves the strongest overall performance, suggesting that model scale plays a meaningful role in conventional commit classification. These findings provide practical guidance for researchers and practitioners seeking to automate commit classification without the overhead of curating and maintaining labeled training data.
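The three strategies the abstract names can be made concrete as prompt templates. A minimal sketch, assuming a plain instruction-following interface; the label set, example diffs, and wording are illustrative assumptions, not the paper's actual prompts:

```python
# Illustrative templates for the three prompting strategies the paper compares.
# The exact prompts are not published; all wording here is assumed.
COMMIT_TYPES = ["feat", "fix", "docs", "refactor", "test", "chore"]

# Hypothetical few-shot demonstrations: (diff excerpt, gold label).
EXAMPLES = [
    ("- return nil\n+ return ErrNilQuery", "fix"),
    ("+ flags.String(\"retention\", \"30d\", \"default retention policy\")", "feat"),
]

def zero_shot(diff: str) -> str:
    # Bare instruction plus the diff: no demonstrations, no reasoning request.
    return (f"Classify the following code diff into one of {COMMIT_TYPES}. "
            f"Answer with the label only.\n\nDiff:\n{diff}\nLabel:")

def few_shot(diff: str) -> str:
    # Prepend labeled demonstrations before the query diff.
    shots = "\n\n".join(f"Diff:\n{d}\nLabel: {label}" for d, label in EXAMPLES)
    return (f"Classify each code diff into one of {COMMIT_TYPES}.\n\n"
            f"{shots}\n\nDiff:\n{diff}\nLabel:")

def chain_of_thought(diff: str) -> str:
    # Ask for step-by-step reasoning before the final label.
    return (f"Classify the following code diff into one of {COMMIT_TYPES}. "
            f"Think step by step about what the change does, then finish "
            f"with a line of the form 'Label: <type>'.\n\nDiff:\n{diff}")
```

Under the paper's findings, the `few_shot`-style template wins and the `chain_of_thought`-style one adds nothing for this task, so the extra reasoning tokens are pure cost here.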
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can perform text classification tasks through carefully designed prompts, without fine-tuning or additional training.
Reference graph
Works this paper leans on
- [1] Y. Tian et al., "What makes a good commit message?" in Proceedings of the 44th International Conference on Software Engineering (ICSE), 2022, pp. 2389–2401.
- [2] Conventional Commits, "Conventional Commits 1.0.0," https://www.conventionalcommits.org/, 2014.
- [3] Q. Zeng, Y. Zhang, Z. Qiu, and H. Liu, "A first look at conventional commits classification," in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 2025, pp. 2277–2289.
- [4] M. U. Sarwar, S. Zafar, and M. Z. Malik, "Multi-label classification of commit messages using transfer learning," in 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE, 2020, pp. 291–296.
- [5] S. Jiang, A. Armaly, and C. McMillan, "Automatically generating commit messages from diffs using neural machine translation," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2017, pp. 135–146.
- [6] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, "Large language models for software engineering: A systematic literature review," ACM Transactions on Software Engineering and Methodology, 2023.
- [7] T. Brown et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [8] J. Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- [9] E. B. Swanson, "The dimensions of maintenance," in Proceedings of the 2nd International Conference on Software Engineering. IEEE Computer Society Press, 1976, pp. 492–497.
- [10] A. Hindle, D. M. German, M. W. Godfrey, and R. C. Holt, "Automatic classification of large changes into maintenance categories," in 2009 IEEE 17th International Conference on Program Comprehension (ICPC). IEEE, 2009, pp. 30–39.
- [11] S. Levin and A. Yehudai, "Boosting automatic commit classification into maintenance activities by combining different features," in Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE). ACM, 2017, pp. 1–10.
- [12] L. Ghadhab, I. Jenhani, M. W. Mkaouer, and M. Ben Messaoud, "Augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model," Information and Software Technology, vol. 135, p. 106566, 2021.
- [13] Y. Sazid, S. Kuri, K. S. Ahmed, and A. Satter, "Commit classification into maintenance activities using in-context learning capabilities of large language models," in Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2024). SCITEPRESS, 2024, pp. 506–512.
- [14] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, "A prompt pattern catalog to enhance prompt engineering with ChatGPT," arXiv preprint arXiv:2302.11382, 2023.
- [15] A. Q. Jiang et al., "Mistral 7B," 2023.
- [16] Meta AI, "The Llama 3 herd of models," 2024.
- [17] DeepSeek AI, "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," 2025.