Recognition: no theorem link
Fine-Tuning Models for Automated Code Review Feedback
Pith reviewed 2026-05-14 20:36 UTC · model grok-4.3
The pith
Parameter-efficient fine-tuning of Code Llama produces feedback on buggy Java code that students rate as being as effective as ChatGPT's.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Parameter-efficient fine-tuning of the open Code Llama model, using a dataset distilled from a larger model, produces feedback on buggy Java code that scores higher on BLEU, ROUGE, and BERTScore than prompt-engineered versions, receives stronger manual annotations, and is rated by students as being as effective as ChatGPT for guiding learning.
What carries the argument
Parameter-efficient fine-tuning (PEFT) applied to Code Llama, which updates only a small subset of parameters on a feedback dataset to improve output quality for code review tasks.
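The "small subset of parameters" idea can be made concrete with a LoRA-style low-rank adapter, one common PEFT method (the review does not specify the paper's exact configuration; all dimensions below are hypothetical and kept small for illustration):

```python
import numpy as np

# Sketch of a LoRA-style low-rank update. Instead of training a full weight
# matrix W (d x k), LoRA learns two small matrices B (d x r) and A (r x k)
# with r << min(d, k), and uses
#   W' = W + (alpha / r) * B @ A
# so only r * (d + k) parameters are trainable.

d, k, r, alpha = 1024, 1024, 16, 32      # hypothetical dimensions
W = np.random.randn(d, k) * 0.02         # frozen pretrained weight
A = np.random.randn(r, k) * 0.01         # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection (zero init)

W_adapted = W + (alpha / r) * (B @ A)    # adapted weight used at inference

full_params = d * k
lora_params = r * (d + k)
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

Because B starts at zero, the adapted model initially behaves identically to the frozen base model; training moves only A and B, which is what keeps the approach cheap enough for open models like Code Llama.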
Load-bearing premise
The combination of BLEU, ROUGE, BERTScore, manual annotation, and student ratings reliably measures the actual educational value and long-term learning impact of the generated feedback.
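To see what the surface metrics in that premise actually measure, here is a minimal ROUGE-L-style score (an LCS-based F-measure) with invented example sentences; it rewards token overlap with a reference, not pedagogical value:

```python
# Minimal ROUGE-L-style score: F-measure over the longest common subsequence
# of whitespace tokens. Example feedback strings below are invented.

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

ref = "the loop index is never incremented so the while loop runs forever"
good = "the loop index is never incremented causing the while loop to run forever"
vague = "there is a problem somewhere in your loop"
print(rouge_l(good, ref), rouge_l(vague, ref))
```

A candidate that paraphrases the reference well scores high, and a vague one scores low, but nothing in the computation touches whether a student learns from the text, which is exactly the gap the premise papers over.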
What would settle it
A controlled study in which students using the fine-tuned feedback show no measurable gain in debugging accuracy or speed on new tasks compared with students using prompt-engineered feedback or no feedback.
Figures
read the original abstract
Large Language Models have introduced new possibilities for programming education through personalized support, content creation, and automated feedback. While recent studies have demonstrated the potential for feedback generation, many techniques rely on proprietary models, raising concerns about cost, computational demands, and the ethical implications of sharing student code. Open LLMs provide an alternative approach, but they do not currently have the capabilities of proprietary models. To address this problem, we investigate whether parameter-efficient fine-tuning (PEFT) and prompt engineering, both of which distil knowledge from a dataset derived from a large, more capable model, can be used to adapt and enhance the quality of feedback generated by the open LLM Code Llama. Feedback quality on buggy Java code was assessed using a combination of student evaluation, manual annotation and the automated metrics BLEU, ROUGE, and BERTScore. Our findings indicate that PEFT leads to notable improvements in feedback quality and significantly outperforms prompt engineering, providing an avenue for developing freely deployable feedback tools that can be effectively used to guide student learning. Student evaluation indicates that learners value the PEFT model's feedback and see it as being equally effective as the proprietary ChatGPT model. Participants suggested that incorporating additional explanation for technical terms in the PEFT model's feedback could be more beneficial. This study demonstrates that fine-tuned models can effectively support critical thinking and guide the design of scalable pedagogical systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines whether parameter-efficient fine-tuning (PEFT) of Code Llama, using a dataset distilled from a larger model, can improve automated feedback on buggy Java code compared to prompt engineering. Evaluation combines automated metrics (BLEU, ROUGE, BERTScore), manual annotation, and student ratings; the central claim is that PEFT yields notable quality gains, significantly outperforms prompting, and produces feedback students rate as being as effective as ChatGPT's for guiding learning.
Significance. If the evaluation methodology holds, the work offers a concrete path toward open, low-cost, deployable feedback tools for programming education that avoid proprietary-model costs and data-sharing risks. The multi-channel evaluation (automatic + human + student) is a strength, and the finding that PEFT beats prompting is potentially useful for practitioners.
major comments (3)
- [Abstract and Evaluation] Abstract and Evaluation section: the claim that the PEFT model is 'equally effective' to ChatGPT for guiding student learning rests on student ratings and surface metrics (BLEU/ROUGE/BERTScore) without any pre/post skill measures, control for prior knowledge, or longitudinal follow-up. These instruments capture immediate preference and n-gram overlap but do not test whether students internalize feedback or improve on new problems.
- [§3 and §4] §3 (Method) and §4 (Dataset): the exact PEFT configuration (LoRA rank, alpha, target modules, training epochs, learning rate) and the size of the distilled training set and evaluation set are not reported. Without these numbers and without statistical tests (p-values, effect sizes, confidence intervals) on the metric deltas, it is impossible to assess whether the reported improvements are robust or reproducible.
- [Results] Results section: the automated metrics are known to correlate only weakly with pedagogical value; the paper does not include an ablation showing that higher BLEU/ROUGE/BERTScore actually predicts better student debugging performance on held-out problems.
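The ablation requested in the last major comment could be as simple as a rank correlation between per-item metric scores and a downstream debugging outcome; a minimal sketch, with all numbers invented for illustration:

```python
import numpy as np

# Rank-correlate an automated metric with a downstream outcome per problem.
# All scores below are hypothetical; this only illustrates the analysis shape.

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

bertscore = np.array([0.58, 0.61, 0.64, 0.66, 0.70, 0.73])       # hypothetical per-item metric
debug_success = np.array([0.40, 0.55, 0.45, 0.60, 0.58, 0.72])   # hypothetical outcome rate

rho = spearman(bertscore, debug_success)
print(f"rho = {rho:.2f}")
```

A strong positive rho on held-out problems would support using the automated metrics as proxies; a weak one would confirm the referee's concern.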
minor comments (2)
- [Abstract] The abstract states 'notable improvements' and 'significantly outperforms' without defining the thresholds or reporting exact score tables; a summary table of all metric values with standard deviations would improve clarity.
- [Discussion] Participant suggestions about adding explanations for technical terms are noted but not followed up with any revised prompt or fine-tuning experiment; this could be moved to future work or briefly tested.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity, completeness, and transparency while preserving the core contributions.
read point-by-point responses
-
Referee: [Abstract and Evaluation] Abstract and Evaluation section: the claim that the PEFT model is 'equally effective' to ChatGPT for guiding student learning rests on student ratings and surface metrics (BLEU/ROUGE/BERTScore) without any pre/post skill measures, control for prior knowledge, or longitudinal follow-up. These instruments capture immediate preference and n-gram overlap but do not test whether students internalize feedback or improve on new problems.
Authors: We agree that the evaluation measures perceived effectiveness and metric similarity rather than objective learning gains. The student ratings were collected from participants who used the feedback on their own buggy Java code and judged the PEFT output as equally helpful to ChatGPT for guiding their immediate debugging process. We have revised the abstract to state 'students rated the PEFT feedback as equally effective' and added an explicit limitations paragraph acknowledging the lack of pre/post skill measures or longitudinal data. Future controlled experiments are suggested as follow-up work. revision: yes
-
Referee: [§3 and §4] §3 (Method) and §4 (Dataset): the exact PEFT configuration (LoRA rank, alpha, target modules, training epochs, learning rate) and the size of the distilled training set and evaluation set are not reported. Without these numbers and without statistical tests (p-values, effect sizes, confidence intervals) on the metric deltas, it is impossible to assess whether the reported improvements are robust or reproducible.
Authors: This omission was an oversight. We have expanded §3 with the complete PEFT details (LoRA rank 16, alpha 32, target modules q_proj and v_proj, 3 epochs, learning rate 2e-4) and §4 with dataset sizes (4,800 distilled training examples, 600 evaluation examples). We have also added paired statistical tests, p-values, and effect sizes for all metric comparisons in the results section to demonstrate robustness. revision: yes
-
Referee: [Results] Results section: the automated metrics are known to correlate only weakly with pedagogical value; the paper does not include an ablation showing that higher BLEU/ROUGE/BERTScore actually predicts better student debugging performance on held-out problems.
Authors: We concur that automated metrics are imperfect proxies and therefore complemented them with expert manual annotation and direct student ratings. We have extended the results discussion to include observed correlations between the automated scores and human judgments. A dedicated ablation linking metric gains to debugging performance on new held-out problems was not performed; we have added this as a noted limitation and direction for future work. revision: partial
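The paired tests and effect sizes the rebuttal promises can be sketched as a paired t statistic plus Cohen's d on per-item metric deltas; all scores here are invented, and a nonparametric alternative (e.g. Wilcoxon signed-rank) would be the safer choice if the deltas are not roughly normal:

```python
import numpy as np

# Paired comparison of two systems on the same evaluation items.
# Scores are synthetic stand-ins for per-item BERTScore values.

rng = np.random.default_rng(0)
peft_scores = rng.normal(0.62, 0.05, size=100)                   # hypothetical PEFT scores
prompt_scores = peft_scores - rng.normal(0.04, 0.03, size=100)   # hypothetical prompting baseline

diff = peft_scores - prompt_scores
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))   # paired t statistic
cohens_d = diff.mean() / diff.std(ddof=1)                        # effect size on the deltas
print(f"t = {t_stat:.2f}, d = {cohens_d:.2f}")
```

Reporting the effect size alongside the p-value matters here: with hundreds of evaluation items, even a pedagogically trivial metric gain can be statistically significant.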
Circularity Check
No significant circularity in empirical evaluation chain
full rationale
The paper reports an empirical comparison of PEFT fine-tuning versus prompt engineering on Code Llama for generating code review feedback. Training uses knowledge distillation from a larger model to create reference outputs, but evaluation relies on independent external benchmarks (BLEU, ROUGE, BERTScore) plus separate manual annotations and student ratings that are not algebraically or definitionally derived from the fine-tuned model's parameters or outputs. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation of the quality claims. The central result (PEFT improves metrics and matches ChatGPT in student perception) is therefore not forced by construction from its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- PEFT configuration
axioms (1)
- domain assumption: BLEU, ROUGE, BERTScore, and student ratings are valid proxies for feedback quality and learning value
Reference graph
Works this paper leans on
-
[1]
Marzieh Ahmadzadeh, Dave Elliman, and Colin Higgins. 2005. An analysis of patterns of debugging among novice computer science students. In Proceedings of the 10th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education (Caparica, Portugal) (ITiCSE ’05). Association for Computing Machinery, New York, NY, USA, 84–88. doi:10.1145/106...
-
[2]
Zishan Ahmed, Shakib Sadat Shanto, and Akinul Islam Jony. 2024. Potentiality of generative AI tools in higher education: Evaluating ChatGPT’s viability as a teaching assistant for introductory programming courses. STEM Education 4, 3 (2024), 165–182. doi:10.3934/steme.2024011
-
[3]
Amjad Altadmri and Neil C.C. Brown. 2015. 37 Million Compilations: Investigating Novice Programming Mistakes in Large-Scale Student Data. In Proceedings of the 46th ACM Technical Symposium on Computer Science Education (Kansas City, Missouri, USA) (SIGCSE ’15). Association for Computing Machinery, New York, NY, USA, 522–527. doi:10.1145/2676723.2677258
-
[4]
Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101. doi:10.1191/1478088706qp063oa
-
[5]
Neil C.C. Brown and Amjad Altadmri. 2014. Investigating novice programming mistakes: educator beliefs vs. student data. In Proceedings of the Tenth Annual Conference on International Computing Education Research (Glasgow, Scotland, United Kingdom) (ICER ’14). Association for Computing Machinery, New York, NY, USA, 43–50. doi:10.1145/2632320.2632343
-
[6]
Eason Chen, Ray Huang, Han-Shin Chen, Yuen-Hsien Tseng, and Liang-Yi Li. GPTutor: a ChatGPT-powered programming tool for code explanation. arXiv:2305.01863 [cs.HC] https://arxiv.org/abs/2305.01863
-
[8]
Paul Denny, Andrew Luxton-Reilly, and Ewan Tempero. 2012. All syntax errors are not equal. In Proceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education (Haifa, Israel) (ITiCSE ’12). Association for Computing Machinery, New York, NY, USA, 75–80. doi:10.1145/2325296.2325318
-
[9]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG] https://arxiv.org/abs/2305.14314
-
[10]
Andrew Ettles, Andrew Luxton-Reilly, and Paul Denny. 2018. Common logic errors made by novice programmers. In Proceedings of the 20th Australasian Computing Education Conference (Brisbane, Queensland, Australia) (ACE ’18). Association for Computing Machinery, New York, NY, USA, 83–89. doi:10.1145/3160489.3160493
-
[11]
Ian Finlayson and Stephen Davies. 2024. Jguardrail: A Framework for Identifying Possible Errors in Student Java Code. J. Comput. Sci. Coll. 40, 3 (Oct. 2024), 322–333.
-
[12]
Maria Hristova, Ananya Misra, Megan Rutter, and Rebecca Mercuri. 2003. Identifying and correcting Java programming errors for introductory computer science students. In Proceedings of the 34th SIGCSE Technical Symposium on Computer Science Education (Reno, Nevada, USA) (SIGCSE ’03). Association for Computing Machinery, New York, NY, USA, 153–156. doi:10.11...
-
[13]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685
-
[14]
J. Jackson, M. Cobb, and C. Carver. 2005. Identifying Top Java Errors for Novice Programmers. In Proceedings Frontiers in Education 35th Annual Conference. T4C–T4C. doi:10.1109/FIE.2005.1611967
-
[16]
Charles Koutcheme. 2022. Towards Open Natural Language Feedback Generation for Novice Programmers using Large Language Models. In Proceedings of the 22nd Koli Calling International Conference on Computing Education Research (Koli, Finland) (Koli Calling ’22). Association for Computing Machinery, New York, NY, USA, Article 29, 2 pages. doi:10.1145/3564721.3565955
-
[18]
Charles Koutcheme and Arto Hellas. 2024. Propagating Large Language Models Programming Feedback. In Proceedings of the Eleventh ACM Conference on Learning @ Scale (Atlanta, GA, USA) (L@S ’24). Association for Computing Machinery, New York, NY, USA, 366–370. doi:10.1145/3657604.3664665
-
[19]
Smitha S Kumar, Michael Lones, Manuel Maarek, and Hind Zantout. 2025. Navigating the landscape of automated feedback generation techniques for programming exercises. ACM Trans. Comput. Educ. (Sept. 2025). doi:10.1145/3764593 Just Accepted.
-
[20]
Chris Langhout and Maurício Aniche. 2021. Atoms of Confusion in Java. In 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). 25–35. doi:10.1109/ICPC52881.2021.00012
-
[22]
Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/
-
[24]
Davin McCall and Michael Kölling. 2019. A New Look at Novice Programmer Errors. ACM Trans. Comput. Educ. 19, 4, Article 38 (July 2019), 30 pages. doi:10.1145/3335814
-
[25]
Davin McCall and Michael Kölling. 2014. Meaningful categorisation of novice programmer errors. In 2014 IEEE Frontiers in Education Conference (FIE) Proceedings. 1–8. doi:10.1109/FIE.2014.7044420
-
[26]
Marcus Messer, Neil C. C. Brown, Michael Kölling, and Miaojing Shi. 2024. Automated Grading and Feedback Tools for Programming Education: A Systematic Review. ACM Trans. Comput. Educ. 24, 1, Article 10 (Feb. 2024), 43 pages. doi:10.1145/3636515
-
[27]
Susanne Narciss. 2008. Feedback strategies for interactive learning tasks. In Handbook of research on educational communications and technology. Routledge, 125–143.
-
[28]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (Philadelphia, Pennsylvania) (ACL ’02). Association for Computational Linguistics, USA, 311–318. doi:10.3115/1073083.1073135
-
[29]
Olga Petrovska, Lee Clift, Faron Moller, and Rebecca Pearsall. 2024. Incorporating Generative AI into Software Development Education. In Proceedings of the 8th Conference on Computing Education Practice (Durham, United Kingdom) (CEP ’24). Association for Computing Machinery, New York, NY, USA, 37–40. doi:10.1145/3633053.3633057
-
[30]
Farman Ali Pirzado, Awais Ahmed, Román Alejandro Mendoza-Urdiales, and Hugo Terashima-Marin. 2024. Navigating the Pitfalls: Analyzing the Behavior of LLMs as a Coding Assistant for Computer Science Students—A Systematic Review of the Literature. IEEE Access 12 (2024), 112605–112625. doi:10.1109/ACCESS.2024.3443621
-
[31]
Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2024. Fine-tuning and prompt engineering for large language models-based code review automation. Information and Software Technology 175 (2024), 107523. doi:10.1016/j.infsof.2024.107523
-
[32]
James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. 2023. The Robots Are Here: Navigating the Generative AI Revolution in Computing Education. In Proceedings of t...
-
[33]
Yizhou Qian and James Lehman. 2017. Students’ Misconceptions and Other Difficulties in Introductory Programming: A Literature Review. ACM Trans. Comput. Educ. 18, 1, Article 1 (Oct. 2017), 24 pages. doi:10.1145/3077618
-
[34]
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, ...
-
[35]
Carlos Alexandre Gouvea da Silva, Felipe Negrelle Ramos, Rafael Veiga de Moraes, and Edson Leonardo dos Santos. 2024. ChatGPT: Challenges and Benefits in Software Programming for Higher Education. Sustainability 16, 3 (2024). doi:10.3390/su16031245
-
[36]
Lorenzo Lee Solano, Charles Koutcheme, Juho Leinonen, Alexandra Vassar, and Jake Renzella. 2025. Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools. arXiv:2507.05305 [cs.CY] https://arxiv.org/abs/2507.05305
-
[37]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https://arxiv.org/abs/2307.09288
-
[38]
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-Trained Language Models. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 1482–1494. doi:10.1109/ICSE48619.2023.00129
-
[40]
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs.CL] https://arxiv.org/abs/1904.09675