Assessing Language Models for Salient Class Identification
Pith reviewed 2026-06-26 13:27 UTC · model grok-4.3
The pith
Language models identify salient classes in commits directly from text and outperform program-analysis baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Language models prompted on raw commit text can identify salient classes in multi-class Java commits without feature engineering, graph construction, or model training. They substantially outperform the strongest reproducible state-of-the-art baseline and remain stable across commit characteristics. A 9B-parameter open-source model under few-shot prompting achieves performance comparable to that of a much larger closed-source model.
What carries the argument
Direct prompting of language models on commit messages and diffs to classify each modified class as salient or non-salient.
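As an illustration of what such direct prompting could look like, the following minimal sketch assembles a prompt for one commit under each of the three strategies and parses a line-based reply. The prompt wording, the `ModifiedClass` type, and the reply format are assumptions made for this example; they are not the prompts used in the paper.

```python
from dataclasses import dataclass


@dataclass
class ModifiedClass:
    name: str  # class name as it appears in the commit
    diff: str  # diff text for the file that holds the class


def build_prompt(message, classes, strategy="zero-shot", examples=()):
    """Assemble one classification prompt for a commit (hypothetical wording)."""
    parts = [
        "A salient class is a primarily modified class whose change may "
        "induce modifications in other classes of the same commit."
    ]
    if strategy == "few-shot":
        # Each example is a (commit text, expected answer) pair chosen by the caller.
        for commit_text, answer in examples:
            parts.append(f"Example commit:\n{commit_text}\nAnswer:\n{answer}")
    parts.append(f"Commit message:\n{message}")
    for cls in classes:
        parts.append(f"Class: {cls.name}\nDiff:\n{cls.diff}")
    if strategy == "chain-of-thought":
        parts.append("Reason step by step about which change drives the others.")
    parts.append("For every class, output one line: <class name>: salient | non-salient")
    return "\n\n".join(parts)


def parse_labels(response, classes):
    """Map each class name to True (salient); classes missing from the reply stay False."""
    labels = {cls.name: False for cls in classes}
    for line in response.splitlines():
        name, sep, verdict = line.rpartition(":")
        name = name.strip()
        if sep and name in labels:
            labels[name] = verdict.strip().lower() == "salient"
    return labels
```

The prompt string would be sent to whichever model is under test; the model call itself is left out because it depends on the provider.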
If this is right
- Reviewers receive an immediate starting point when a commit touches many classes.
- Salient-class identification no longer requires AST parsing or handcrafted features.
- A 9B open-source model suffices, lowering both monetary cost and data-privacy exposure compared with large closed models.
- Performance holds steady across varying commit sizes and message lengths.
Where Pith is reading between the lines
- The same prompting approach could be tested on non-Java languages to check language independence.
- Integration into review platforms could automatically highlight salient classes in the diff view.
- Few-shot examples drawn from the target project might further improve accuracy without retraining.
Load-bearing premise
The labels in the ApacheJavaCM dataset correctly mark which classes are the salient ones driving the commit changes.
What would settle it
Independent experts re-labeling a random sample of commits and finding systematic disagreement with the original labels on more than a small fraction of cases.
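One common way to quantify such disagreement is Cohen's kappa between the original labels and the re-labeled sample. The sketch below is a plain-Python illustration; the choice of kappa as the agreement statistic is an assumption of this example, not something the paper specifies.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Share of items on which the two label sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each source's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 would support the original labels, while a value near 0 would mean agreement is no better than chance.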
Original abstract
Code review requires reviewers to understand the core intent of code changes, which becomes difficult when a commit modifies multiple classes. In such commits, one or more primarily modified classes, referred to as salient classes, may induce modifications in other classes. Accurate identification of salient classes offers reviewers an effective entry point to navigate code changes and facilitates program comprehension. Existing state-of-the-art approaches rely on complex program-analysis procedures, including Abstract Syntax Tree (AST) parsing, class relation extraction, handcrafted feature engineering, or dependency graph construction. To this end, we study whether language models (LMs) can identify salient classes directly from commits without feature engineering, graph construction, or training. We first construct a new dataset ApacheJavaCM, derived from the ApacheCM dataset, containing 7,911 commits and 25,914 labeled classes. On this dataset, we systematically evaluate whether LMs can identify salient classes directly from commits and compare with the strongest reproducible state-of-the-art (SOTA) baseline. The evaluation covers two large language models (LLMs), GPT-5.4 and DeepSeek-V3.2, one small language model (SLM), Qwen3.5-9B, and three prompting strategies: zero-shot, few-shot, and chain-of-thought. The LMs substantially outperform the baseline while remaining stable across commit characteristics and selected LMs. We also found that, for salient class identification tasks, a 9B-parameter open-source SLM, Qwen3.5-9B, under few-shot prompting, achieves performance comparable to that of a much larger closed-source LLM, GPT-5.4. These results suggest that lightweight, locally deployable SLMs are sufficient for industrial salient class identification tasks and can reduce both cost and privacy barriers associated with relying on closed-source LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that language models can identify salient classes in multi-class commits directly from commit text without program analysis, feature engineering, or training. The authors construct ApacheJavaCM (7,911 commits, 25,914 classes derived from ApacheCM), evaluate GPT-5.4, DeepSeek-V3.2, and Qwen3.5-9B under zero-shot, few-shot, and chain-of-thought prompting against the strongest reproducible SOTA baseline, and report that LMs substantially outperform the baseline while remaining stable across commit characteristics, with the 9B SLM under few-shot prompting matching GPT-5.4 performance.
Significance. If the dataset labels prove reliable, the work would demonstrate that small open-source LMs suffice for a practical code-review assistance task, lowering cost and privacy barriers compared with large closed-source models. The multi-model, multi-prompting evaluation and stability analysis across commit characteristics would strengthen the case for deployable SE tools.
major comments (2)
- [Dataset construction] Dataset construction section: The ApacheJavaCM labels are described only as 'derived from ApacheCM', with no operational definition of salience, derivation procedure, commit selection criteria, or validation (inter-annotator agreement, comparison to original ApacheCM labels). All performance claims, including outperformance and stability, rest on these 25,914 labels being accurate, unbiased ground truth; without this evidence the evaluation results cannot be assessed.
- [Evaluation / Results] Evaluation and results sections: The abstract and main text report outperformance and stability but supply no concrete metrics, error bars, baseline reproduction details, or statistical tests. This prevents verification of the magnitude of gains and the cross-characteristic stability claim.
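To make the stability concern concrete, a per-bucket breakdown of the kind the referee asks for could be computed as in the sketch below. The bucket boundaries in `edges` and the record layout are hypothetical choices for illustration; the paper's actual commit characteristics and thresholds are not given here.

```python
from collections import defaultdict


def prf1(pairs):
    """Precision, recall, F1 for (gold, predicted) boolean pairs; salient is the positive class."""
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def f1_by_commit_size(records, edges=(3, 6)):
    """Group class-level records by how many classes their commit touches.

    records: iterable of (classes_in_commit, gold, predicted).
    edges: hypothetical bucket boundaries, not taken from the paper.
    """
    buckets = defaultdict(list)
    for size, gold, pred in records:
        bucket = sum(size > e for e in edges)  # 0 = small, 1 = medium, 2 = large
        buckets[bucket].append((gold, pred))
    return {b: prf1(pairs)[2] for b, pairs in sorted(buckets.items())}
```

Similar F1 values across buckets would support the stability claim; a sharp drop in one bucket would show where it breaks.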
minor comments (1)
- Example prompts for the three strategies (zero-shot, few-shot, CoT) would improve reproducibility; consider adding them to an appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that will improve the verifiability of the work.
- Referee: [Dataset construction] Dataset construction section: The ApacheJavaCM labels are described only as 'derived from ApacheCM', with no operational definition of salience, derivation procedure, commit selection criteria, or validation (inter-annotator agreement, comparison to original ApacheCM labels). All performance claims, including outperformance and stability, rest on these 25,914 labels being accurate, unbiased ground truth; without this evidence the evaluation results cannot be assessed.
  Authors: We agree that the current manuscript provides insufficient detail on dataset construction. In the revised version we will expand the relevant section to explicitly state the operational definition of salience from ApacheCM, the precise derivation steps used to create ApacheJavaCM (including commit selection criteria), and any validation statistics available from the source dataset. This will allow readers to assess label reliability directly.
  Revision: yes
- Referee: [Evaluation / Results] Evaluation and results sections: The abstract and main text report outperformance and stability but supply no concrete metrics, error bars, baseline reproduction details, or statistical tests. This prevents verification of the magnitude of gains and the cross-characteristic stability claim.
  Authors: We acknowledge the absence of concrete numerical results, error bars, baseline reproduction details, and statistical tests in the submitted manuscript. In revision we will add a dedicated results subsection containing the full performance metrics (precision, recall, F1), error bars or confidence intervals, explicit baseline reproduction protocol, and statistical significance tests supporting both the outperformance claims and the stability analysis across commit characteristics.
  Revision: yes
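The confidence intervals promised in the rebuttal could be obtained in several ways. The sketch below shows one option, a percentile bootstrap over commits, under the assumption that predictions are stored per commit as (gold, predicted) pairs. The function names, the resample count, and the choice of a percentile bootstrap are illustrative and not taken from the paper.

```python
import random


def f1_score(pairs):
    """F1 for (gold, predicted) boolean pairs, with salient as the positive class."""
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(commits, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling whole commits.

    commits: list of commits, each a list of (gold, predicted) pairs.
    Resampling commits rather than classes keeps classes from one commit together.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        sample = rng.choices(commits, k=len(commits))  # draw commits with replacement
        scores.append(f1_score([pair for commit in sample for pair in commit]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Running the same procedure on the baseline's predictions and comparing the two intervals gives a first, informal view of whether the reported gain is larger than sampling noise.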
Circularity Check
No circularity: empirical evaluation on external dataset labels against reproducible baseline
The paper constructs ApacheJavaCM from ApacheCM, then measures LM performance (zero/few-shot/CoT prompting on GPT-5.4, DeepSeek-V3.2, Qwen3.5-9B) directly against the 25,914 provided labels and a SOTA baseline. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation. Claims reduce to standard comparison of model outputs vs. held-out labels, with no reduction by construction to the inputs themselves. The label-accuracy assumption is a validity concern, not a circularity mechanism.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The ApacheCM dataset provides a reliable base for deriving accurate salient class labels in ApacheJavaCM.