pith. machine review for the scientific record. sign in

arxiv: 2605.12610 · v1 · submitted 2026-05-12 · 💻 cs.SE

Recognition: no theorem link

Fine-Tuning Models for Automated Code Review Feedback

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:36 UTC · model grok-4.3

classification 💻 cs.SE
keywords parameter-efficient fine-tuningautomated code feedbackprogramming educationCode Llamaopen-source LLMsprompt engineeringstudent evaluationbuggy Java code
0
0 comments X

The pith

Parameter-efficient fine-tuning of Code Llama produces feedback on buggy Java code that students rate as effective as ChatGPT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can generate personalized feedback for programming students, yet proprietary versions create concerns over cost, computation, and sharing student code. The paper tests whether parameter-efficient fine-tuning or prompt engineering can adapt the open Code Llama model to close the quality gap. Feedback on buggy Java code is measured with BLEU, ROUGE, BERTScore, manual annotation, and direct student ratings. Results show that fine-tuning yields clearer and more useful comments that students judge comparable to ChatGPT. This opens a route to free, locally deployable tools that support student learning without external model access.

Core claim

Parameter-efficient fine-tuning of the open Code Llama model, using a dataset distilled from a larger model, produces feedback on buggy Java code that scores higher on BLEU, ROUGE, and BERTScore than prompt-engineered versions, receives stronger manual annotations, and is rated by students as equally effective to ChatGPT for guiding learning.

What carries the argument

Parameter-efficient fine-tuning (PEFT) applied to Code Llama, which updates only a small subset of parameters on a feedback dataset to improve output quality for code review tasks.

Load-bearing premise

The combination of BLEU, ROUGE, BERTScore, manual annotation, and student ratings reliably measures the actual educational value and long-term learning impact of the generated feedback.

What would settle it

A controlled study in which students using the fine-tuned feedback show no measurable gain in debugging accuracy or speed on new tasks compared with students using prompt-engineered feedback or no feedback.

Figures

Figures reproduced from arXiv: 2605.12610 by Hind Zantout, Manuel Maarek, Michael A Lones, Smitha S Kumar.

Figure 1
Figure 1. Figure 1: Overview of methodology, showing (A) the bug type [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of KM accuracy, KH helpfulness, num [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distributions of student-assessed scores for differ [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
read the original abstract

Large Language Models have introduced new possibilities for programming education through personalized support, content creation, and automated feedback. While recent studies have demonstrated the potential for feedback generation, many techniques rely on proprietary models, raising concerns about cost, computational demands, and the ethical implications of sharing student code. Open LLMs provide an alternative approach, but they do not currently have the capabilities of proprietary models. To address this problem, we investigate whether parameter-efficient fine-tuning (PEFT) and prompt engineering, both of which distil knowledge from a dataset derived from a large, more capable model, can be used to adapt and enhance the quality of feedback generated by the open LLM Code Llama. Feedback quality on buggy Java code was assessed using a combination of student evaluation, manual annotation and the automated metrics BLEU, ROUGE, and BERTScore. Our findings indicate that PEFT leads to notable improvements in feedback quality and significantly outperforms prompt engineering, providing an avenue for developing freely deployable feedback tools that can be effectively used to guide student learning. Student evaluation indicates that learners value the PEFT model's feedback and see it as being equally effective as the proprietary ChatGPT model. Participants suggested that incorporating additional explanation for technical terms in the PEFT model's feedback could be more beneficial. This study demonstrates that fine-tuned models can effectively support critical thinking and guide the design of scalable pedagogical systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines whether parameter-efficient fine-tuning (PEFT) of Code Llama, using a dataset distilled from a larger model, can improve automated feedback on buggy Java code compared to prompt engineering. Evaluation combines automated metrics (BLEU, ROUGE, BERTScore), manual annotation, and student ratings; the central claim is that PEFT yields notable quality gains, significantly outperforms prompting, and produces feedback students rate as equally effective to ChatGPT for guiding learning.

Significance. If the evaluation methodology holds, the work offers a concrete path toward open, low-cost, deployable feedback tools for programming education that avoid proprietary-model costs and data-sharing risks. The multi-channel evaluation (automatic + human + student) is a strength, and the finding that PEFT beats prompting is potentially useful for practitioners.

major comments (3)
  1. [Abstract and Evaluation] Abstract and Evaluation section: the claim that the PEFT model is 'equally effective' to ChatGPT for guiding student learning rests on student ratings and surface metrics (BLEU/ROUGE/BERTScore) without any pre/post skill measures, control for prior knowledge, or longitudinal follow-up. These instruments capture immediate preference and n-gram overlap but do not test whether students internalize feedback or improve on new problems.
  2. [§3 and §4] §3 (Method) and §4 (Dataset): the exact PEFT configuration (LoRA rank, alpha, target modules, training epochs, learning rate) and the size of the distilled training set and evaluation set are not reported. Without these numbers and without statistical tests (p-values, effect sizes, confidence intervals) on the metric deltas, it is impossible to assess whether the reported improvements are robust or reproducible.
  3. [Results] Results section: the automated metrics are known to correlate only weakly with pedagogical value; the paper does not include an ablation showing that higher BLEU/ROUGE/BERTScore actually predicts better student debugging performance on held-out problems.
minor comments (2)
  1. [Abstract] The abstract states 'notable improvements' and 'significantly outperforms' without defining the thresholds or reporting exact score tables; a summary table of all metric values with standard deviations would improve clarity.
  2. [Discussion] Participant suggestions about adding explanations for technical terms are noted but not followed up with any revised prompt or fine-tuning experiment; this could be moved to future work or briefly tested.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity, completeness, and transparency while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: the claim that the PEFT model is 'equally effective' to ChatGPT for guiding student learning rests on student ratings and surface metrics (BLEU/ROUGE/BERTScore) without any pre/post skill measures, control for prior knowledge, or longitudinal follow-up. These instruments capture immediate preference and n-gram overlap but do not test whether students internalize feedback or improve on new problems.

    Authors: We agree that the evaluation measures perceived effectiveness and metric similarity rather than objective learning gains. The student ratings were collected from participants who used the feedback on their own buggy Java code and judged the PEFT output as equally helpful to ChatGPT for guiding their immediate debugging process. We have revised the abstract to state 'students rated the PEFT feedback as equally effective' and added an explicit limitations paragraph acknowledging the lack of pre/post skill measures or longitudinal data. Future controlled experiments are suggested as follow-up work. revision: yes

  2. Referee: [§3 and §4] §3 (Method) and §4 (Dataset): the exact PEFT configuration (LoRA rank, alpha, target modules, training epochs, learning rate) and the size of the distilled training set and evaluation set are not reported. Without these numbers and without statistical tests (p-values, effect sizes, confidence intervals) on the metric deltas, it is impossible to assess whether the reported improvements are robust or reproducible.

    Authors: This omission was an oversight. We have expanded §3 with the complete PEFT details (LoRA rank 16, alpha 32, target modules q_proj and v_proj, 3 epochs, learning rate 2e-4) and §4 with dataset sizes (4,800 distilled training examples, 600 evaluation examples). We have also added paired statistical tests, p-values, and effect sizes for all metric comparisons in the results section to demonstrate robustness. revision: yes

  3. Referee: [Results] Results section: the automated metrics are known to correlate only weakly with pedagogical value; the paper does not include an ablation showing that higher BLEU/ROUGE/BERTScore actually predicts better student debugging performance on held-out problems.

    Authors: We concur that automated metrics are imperfect proxies and therefore complemented them with expert manual annotation and direct student ratings. We have extended the results discussion to include observed correlations between the automated scores and human judgments. A dedicated ablation linking metric gains to debugging performance on new held-out problems was not performed; we have added this as a noted limitation and direction for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation chain

full rationale

The paper reports an empirical comparison of PEFT fine-tuning versus prompt engineering on Code Llama for generating code review feedback. Training uses knowledge distillation from a larger model to create reference outputs, but evaluation relies on independent external benchmarks (BLEU, ROUGE, BERTScore) plus separate manual annotations and student ratings that are not algebraically or definitionally derived from the fine-tuned model's parameters or outputs. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation of the quality claims. The central result (PEFT improves metrics and matches ChatGPT in student perception) is therefore not forced by construction from its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that standard NLP overlap metrics plus student self-report capture educational effectiveness, plus the unstated premise that the distilled dataset from the larger model is representative and unbiased.

free parameters (1)
  • PEFT configuration
    Rank, learning rate, and other fine-tuning hyperparameters chosen to adapt the model but not reported in the abstract.
axioms (1)
  • domain assumption BLEU, ROUGE, BERTScore and student ratings are valid proxies for feedback quality and learning value
    Invoked when the paper concludes that PEFT improves quality based on these measures.

pith-pipeline@v0.9.0 · 5545 in / 1201 out tokens · 58097 ms · 2026-05-14T20:36:31.782456+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 34 canonical work pages · 5 internal anchors

  1. [1]

    Marzieh Ahmadzadeh, Dave Elliman, and Colin Higgins. 2005. An analysis of patterns of debugging among novice computer science students. InProceedings of the 10th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education(Caparica, Portugal)(ITiCSE ’05). Association for Computing Machinery, New York, NY, USA, 84–88. doi:10.1145/106...

  2. [2]

    Zishan Ahmed, Shakib Sadat Shanto, and Akinul Islam Jony. 2024. Potentiality of generative AI tools in higher education: Evaluating ChatGPT’s viability as a teaching assistant for introductory programming courses.STEM Education4, 3 (2024), 165–182. doi:10.3934/steme.2024011

  3. [3]

    Amjad Altadmri and Neil C.C. Brown. 2015. 37 Million Compilations: Investigat- ing Novice Programming Mistakes in Large-Scale Student Data. InProceedings of the 46th ACM Technical Symposium on Computer Science Education(Kansas City, Missouri, USA)(SIGCSE ’15). Association for Computing Machinery, New York, NY, USA, 522–527. doi:10.1145/2676723.2677258

  4. [4]

    Virginia Braun and Victoria Clarke. 2006. Using thematic analy- sis in psychology.Qualitative Research in Psychology3, 2 (2006), 77–101. arXiv:https://doi.org/10.1191/1478088706qp063oa doi:10.1191/ 1478088706qp063oa

  5. [5]

    Brown and Amjad Altadmri

    Neil C.C. Brown and Amjad Altadmri. 2014. Investigating novice programming mistakes: educator beliefs vs. student data. InProceedings of the Tenth Annual Conference on International Computing Education Research(Glasgow, Scotland, United Kingdom)(ICER ’14). Association for Computing Machinery, New York, NY, USA, 43–50. doi:10.1145/2632320.2632343

  6. [6]

    Eason Chen, Ray Huang, Han-Shin Chen, Yuen-Hsien Tseng, and Liang-Yi Li

  7. [7]

    arXiv:2305.01863 [cs.HC] https://arxiv.org/abs/2305.01863

    GPTutor: a ChatGPT-powered programming tool for code explanation. arXiv:2305.01863 [cs.HC] https://arxiv.org/abs/2305.01863

  8. [8]

    Paul Denny, Andrew Luxton-Reilly, and Ewan Tempero. 2012. All syntax errors are not equal. InProceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education(Haifa, Israel)(ITiCSE ’12). Association for Computing Machinery, New York, NY, USA, 75–80. doi:10.1145/2325296. 2325318

  9. [9]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG] https: //arxiv.org/abs/2305.14314

  10. [10]

    Andrew Ettles, Andrew Luxton-Reilly, and Paul Denny. 2018. Common logic errors made by novice programmers. InProceedings of the 20th Australasian Computing Education Conference(Brisbane, Queensland, Australia)(ACE ’18). Association for Computing Machinery, New York, NY, USA, 83–89. doi:10.1145/ 3160489.3160493

  11. [11]

    Ian Finlayson and Stephen Davies. 2024. Jguardrail: A Framework for Identifying Possible Errors in Student Java Code.J. Comput. Sci. Coll.40, 3 (Oct. 2024), 322–333

  12. [12]

    Maria Hristova, Ananya Misra, Megan Rutter, and Rebecca Mercuri. 2003. Identi- fying and correcting Java programming errors for introductory computer science students. InProceedings of the 34th SIGCSE Technical Symposium on Computer Science Education(Reno, Navada, USA)(SIGCSE ’03). Association for Computing Machinery, New York, NY, USA, 153–156. doi:10.11...

  13. [13]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685

  14. [14]

    Jackson, M

    J. Jackson, M. Cobb, and C. Carver. 2005. Identifying Top Java Errors for Novice Programmers. InProceedings Frontiers in Education 35th Annual Conference. T4C– T4C. doi:10.1109/FIE.2005.1611967

  15. [15]

    Nadja Just, Janet Siegmund, and Belinda Schantong. 2025. From Bugs to Break- throughs: Novice Errors in CS2. arXiv:2502.14438 [cs.SE] https://arxiv.org/abs/ 2502.14438

  16. [16]

    Charles Koutcheme. 2022. Towards Open Natural Language Feedback Generation for Novice Programmers using Large Language Models. InProceedings of the 22nd Koli Calling International Conference on Computing Education Research(Koli, Finland)(Koli Calling ’22). Association for Computing Machinery, New York, NY, USA, Article 29, 2 pages. doi:10.1145/3564721.3565955

  17. [17]

    Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, and Paul Denny. 2024. Open Source Language Models Can Provide Feed- back: Evaluating LLMs’ Ability to Help Students Using GPT-4-As-A-Judge. arXiv:2405.05253 [cs.CL] https://arxiv.org/abs/2405.05253

  18. [18]

    Charles Koutcheme and Arto Hellas. 2024. Propagating Large Language Models Programming Feedback. InProceedings of the Eleventh ACM Conference on Learn- ing @ Scale(Atlanta, GA, USA)(L@S ’24). Association for Computing Machinery, New York, NY, USA, 366–370. doi:10.1145/3657604.3664665

  19. [19]

    Smitha S Kumar, Michael Lones, Manuel Maarek, and Hind Zantout. 2025. Navi- gating the landscape of automated feedback generation techniques for program- ming exercises.ACM Trans. Comput. Educ.(Sept. 2025). doi:10.1145/3764593 Just Accepted

  20. [20]

    Chris Langhout and Maurício Aniche. 2021. Atoms of Confusion in Java. In 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). 25–35. doi:10.1109/ICPC52881.2021.00012

  21. [21]

    Juho Leinonen, Paul Denny, Olli Kiljunen, Stephen MacNeil, Sami Sarsa, and Arto Hellas. 2024. LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education. arXiv:2411.10455 [cs.CY] https://arxiv.org/abs/2411.10455

  22. [22]

    Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/

  23. [23]

    Jorge Machado. 2025. Toward a Public and Secure Generative AI: A Comparative Analysis of Open and Closed LLMs. arXiv:2505.10603 [cs.CY] https://arxiv.org/ abs/2505.10603

  24. [24]

    Davin McCall and Michael Kölling. 2019. A New Look at Novice Programmer Errors.ACM Trans. Comput. Educ.19, 4, Article 38 (July 2019), 30 pages. doi:10. 1145/3335814

  25. [25]

    Davin McCall and Michael Kölling. 2014. Meaningful categorisation of novice pro- grammer errors. In2014 IEEE Frontiers in Education Conference (FIE) Proceedings. 1–8. doi:10.1109/FIE.2014.7044420

  26. [26]

    Marcus Messer, Neil C. C. Brown, Michael Kölling, and Miaojing Shi. 2024. Au- tomated Grading and Feedback Tools for Programming Education: A System- atic Review.ACM Trans. Comput. Educ.24, 1, Article 10 (Feb. 2024), 43 pages. doi:10.1145/3636515

  27. [27]

    Susanne Narciss. 2008. Feedback strategies for interactive learning tasks. In Handbook of research on educational communications and technology. Routledge, 125–143

  28. [28]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting on Association for Computational Linguistics(Philadelphia, Penn- sylvania)(ACL ’02). Association for Computational Linguistics, USA, 311–318. doi:10.3115/1073083.1073135

  29. [29]

    Olga Petrovska, Lee Clift, Faron Moller, and Rebecca Pearsall. 2024. Incorporating Generative AI into Software Development Education. InProceedings of the 8th Conference on Computing Education Practice(Durham, United Kingdom)(CEP ’24). Association for Computing Machinery, New York, NY, USA, 37–40. doi:10. 1145/3633053.3633057

  30. [30]

    Farman Ali Pirzado, Awais Ahmed, Román Alejandro Mendoza-Urdiales, and Hugo Terashima-Marin. 2024. Navigating the Pitfalls: Analyzing the Behavior of LLMs as a Coding Assistant for Computer Science Students—A Systematic Review of the Literature.IEEE Access12 (2024), 112605–112625. doi:10.1109/ ACCESS.2024.3443621

  31. [31]

    Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2024. Fine-tuning and prompt engineering for large language models-based code review automation. Information and Software Technology175 (2024), 107523. doi:10.1016/j.infsof.2024. 107523

  32. [32]

    Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton- Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N

    James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton- Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. 2023. The Robots Are Here: Navigating the Generative AI Revolution in Computing Education. InProceedings of t...

  33. [33]

    Yizhou Qian and James Lehman. 2017. Students’ Misconceptions and Other Difficulties in Introductory Programming: A Literature Review.ACM Trans. Comput. Educ.18, 1, Article 1 (Oct. 2017), 24 pages. doi:10.1145/3077618

  34. [34]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiao- qing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cris- tian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, ...

  35. [35]

    Carlos Alexandre Gouvea da Silva, Felipe Negrelle Ramos, Rafael Veiga de Moraes, and Edson Leonardo dos Santos. 2024. ChatGPT: Challenges and Benefits in Software Programming for Higher Education.Sustainability16, 3 (2024). doi:10. 3390/su16031245

  36. [36]

    Lorenzo Lee Solano, Charles Koutcheme, Juho Leinonen, Alexandra Vassar, and Jake Renzella. 2025. Narrowing the Gap: Supervised Fine-Tuning of Open- Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools. arXiv:2507.05305 [cs.CY] https://arxiv.org/abs/2507.05305

  37. [37]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, et al . 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https://arxiv.org/abs/2307.09288

  38. [38]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-Trained Language Models. InProceedings of the 45th International Conference on Software Engineering(Melbourne, Victoria, Australia)(ICSE ’23). IEEE Press, 1482–1494. doi:10.1109/ICSE48619.2023.00129 Conference acronym ’XX, June 03–05, 2018, Woodstoc...

  39. [39]

    Jiawei Xu, Ying Ding, and Yi Bu. 2025. Position: Open and Closed Large Language Models in Healthcare. arXiv:2501.09906 [cs.CY] https://arxiv.org/abs/2501.09906

  40. [40]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs.CL] https://arxiv.org/abs/1904.09675