Conventional Commit Classification using Large Language Models and Prompt Engineering
Pith reviewed 2026-05-08 19:35 UTC · model grok-4.3
The pith
Few-shot prompting with the 32B DeepSeek-R1 model achieves the highest accuracy on a balanced set of 3,200 conventional commits mined from InfluxDB; chain-of-thought prompting adds no benefit, and accuracy improves with model scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our results show that few-shot prompting consistently achieves the highest accuracy, while chain-of-thought prompting does not yield additional gains for this classification task. Among the evaluated models, DeepSeek-R1-32B achieves the strongest overall performance, suggesting that model scale plays a meaningful role in conventional commit classification.
Load-bearing premise
That the balanced 3,200-commit dataset mined from a single repository (InfluxDB) is representative of conventional commits in general, and that the classifications can be trusted as ground truth even though no details on human validation or inter-rater agreement are reported.
Original abstract
Conventional commits provide a structured format for writing commit messages, which improves readability and software maintenance and enables automation tools such as changelog generators and semantic versioning systems. Existing approaches to conventional commit classification typically rely on ML/DL models trained on large labeled datasets. In this paper, we investigate a training-free alternative that leverages large language models (LLMs) through prompt engineering. Rather than building a task-specific classifier, we evaluate three prompting strategies (zero-shot, few-shot, and chain-of-thought) across three open-source LLMs of varying scale: Mistral-7B-Instruct, LLaMA-3-8B, and DeepSeek-R1-32B. Classification is performed directly on code diffs extracted from a balanced dataset of 3,200 commits mined from the InfluxDB repository, without any model fine-tuning. Our results show that few-shot prompting consistently achieves the highest accuracy, while chain-of-thought prompting does not yield additional gains for this classification task. Among the evaluated models, DeepSeek-R1-32B achieves the strongest overall performance, suggesting that model scale plays a meaningful role in conventional commit classification. These findings provide practical guidance for researchers and practitioners seeking to automate commit classification without the overhead of curating and maintaining labeled training data.
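The three strategies the abstract names can be made concrete as prompt templates. A minimal sketch, assuming a plain instruction-following interface; the label set, example diffs, and wording are illustrative assumptions, not the paper's actual prompts:

```python
# Illustrative templates for the three prompting strategies the paper compares.
# The exact prompts are not published; all wording here is assumed.
COMMIT_TYPES = ["feat", "fix", "docs", "refactor", "test", "chore"]

# Hypothetical few-shot demonstrations: (diff excerpt, gold label).
EXAMPLES = [
    ("- return nil\n+ return ErrNilQuery", "fix"),
    ("+ flags.String(\"retention\", \"30d\", \"default retention policy\")", "feat"),
]

def zero_shot(diff: str) -> str:
    # Bare instruction plus the diff: no demonstrations, no reasoning request.
    return (f"Classify the following code diff into one of {COMMIT_TYPES}. "
            f"Answer with the label only.\n\nDiff:\n{diff}\nLabel:")

def few_shot(diff: str) -> str:
    # Prepend labeled demonstrations before the query diff.
    shots = "\n\n".join(f"Diff:\n{d}\nLabel: {label}" for d, label in EXAMPLES)
    return (f"Classify each code diff into one of {COMMIT_TYPES}.\n\n"
            f"{shots}\n\nDiff:\n{diff}\nLabel:")

def chain_of_thought(diff: str) -> str:
    # Ask for step-by-step reasoning before the final label.
    return (f"Classify the following code diff into one of {COMMIT_TYPES}. "
            f"Think step by step about what the change does, then finish "
            f"with a line of the form 'Label: <type>'.\n\nDiff:\n{diff}")
```

Under the paper's findings, the `few_shot`-style template wins and the `chain_of_thought`-style one adds nothing for this task, so the extra reasoning tokens are pure cost here.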
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can perform text classification tasks through carefully designed prompts, without fine-tuning or additional training.
Reference graph
Works this paper leans on
- [1] Y. Tian et al., "What makes a good commit message?" in Proceedings of the 44th International Conference on Software Engineering (ICSE), 2022, pp. 2389–2401.
- [2] Conventional Commits, "Conventional Commits 1.0.0," https://www.conventionalcommits.org/, 2014.
- [3] Q. Zeng, Y. Zhang, Z. Qiu, and H. Liu, "A first look at conventional commits classification," in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 2025, pp. 2277–2289.
- [4] M. U. Sarwar, S. Zafar, and M. Z. Malik, "Multi-label classification of commit messages using transfer learning," in 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE, 2020, pp. 291–296.
- [5] S. Jiang, A. Armaly, and C. McMillan, "Automatically generating commit messages from diffs using neural machine translation," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2017, pp. 135–146.
- [6] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, "Large language models for software engineering: A systematic literature review," ACM Transactions on Software Engineering and Methodology, 2023.
- [7] T. Brown et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [8] J. Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- [9] E. B. Swanson, "The dimensions of maintenance," in Proceedings of the 2nd International Conference on Software Engineering. IEEE Computer Society Press, 1976, pp. 492–497.
- [10] A. Hindle, D. M. German, M. W. Godfrey, and R. C. Holt, "Automatic classification of large changes into maintenance categories," in 2009 IEEE 17th International Conference on Program Comprehension (ICPC). IEEE, 2009, pp. 30–39.
- [11] S. Levin and A. Yehudai, "Boosting automatic commit classification into maintenance activities by combining different features," in Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE). ACM, 2017, pp. 1–10.
- [12] L. Ghadhab, I. Jenhani, M. W. Mkaouer, and M. Ben Messaoud, "Augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model," Information and Software Technology, vol. 135, p. 106566, 2021.
- [13] Y. Sazid, S. Kuri, K. S. Ahmed, and A. Satter, "Commit classification into maintenance activities using in-context learning capabilities of large language models," in Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2024). SCITEPRESS, 2024, pp. 506–512.
- [14] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, "A prompt pattern catalog to enhance prompt engineering with ChatGPT," arXiv preprint arXiv:2302.11382, 2023.
- [15] A. Q. Jiang et al., "Mistral 7B," 2023.
- [16] Meta AI, "The Llama 3 herd of models," 2024.
- [17] DeepSeek AI, "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," 2025.