arxiv: 2605.12610 · v1 · submitted 2026-05-12 · 💻 cs.SE

Recognition: no theorem link

Fine-Tuning Models for Automated Code Review Feedback

Smitha S Kumar , Michael A Lones , Manuel Maarek , Hind Zantout

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:36 UTC · model grok-4.3

classification 💻 cs.SE

keywords parameter-efficient fine-tuningautomated code feedbackprogramming educationCode Llamaopen-source LLMsprompt engineeringstudent evaluationbuggy Java code

0 comments

The pith

Parameter-efficient fine-tuning of Code Llama produces feedback on buggy Java code that students rate as effective as ChatGPT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can generate personalized feedback for programming students, yet proprietary versions create concerns over cost, computation, and sharing student code. The paper tests whether parameter-efficient fine-tuning or prompt engineering can adapt the open Code Llama model to close the quality gap. Feedback on buggy Java code is measured with BLEU, ROUGE, BERTScore, manual annotation, and direct student ratings. Results show that fine-tuning yields clearer and more useful comments that students judge comparable to ChatGPT. This opens a route to free, locally deployable tools that support student learning without external model access.

Core claim

Parameter-efficient fine-tuning of the open Code Llama model, using a dataset distilled from a larger model, produces feedback on buggy Java code that scores higher on BLEU, ROUGE, and BERTScore than prompt-engineered versions, receives stronger manual annotations, and is rated by students as equally effective to ChatGPT for guiding learning.

What carries the argument

Parameter-efficient fine-tuning (PEFT) applied to Code Llama, which updates only a small subset of parameters on a feedback dataset to improve output quality for code review tasks.

Load-bearing premise

The combination of BLEU, ROUGE, BERTScore, manual annotation, and student ratings reliably measures the actual educational value and long-term learning impact of the generated feedback.

What would settle it

A controlled study in which students using the fine-tuned feedback show no measurable gain in debugging accuracy or speed on new tasks compared with students using prompt-engineered feedback or no feedback.

Figures

Figures reproduced from arXiv: 2605.12610 by Hind Zantout, Manuel Maarek, Michael A Lones, Smitha S Kumar.

**Figure 2.** Figure 2: Comparison of KM accuracy, KH helpfulness, num [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 4.** Figure 4: Distributions of student-assessed scores for differ [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

Large Language Models have introduced new possibilities for programming education through personalized support, content creation, and automated feedback. While recent studies have demonstrated the potential for feedback generation, many techniques rely on proprietary models, raising concerns about cost, computational demands, and the ethical implications of sharing student code. Open LLMs provide an alternative approach, but they do not currently have the capabilities of proprietary models. To address this problem, we investigate whether parameter-efficient fine-tuning (PEFT) and prompt engineering, both of which distil knowledge from a dataset derived from a large, more capable model, can be used to adapt and enhance the quality of feedback generated by the open LLM Code Llama. Feedback quality on buggy Java code was assessed using a combination of student evaluation, manual annotation and the automated metrics BLEU, ROUGE, and BERTScore. Our findings indicate that PEFT leads to notable improvements in feedback quality and significantly outperforms prompt engineering, providing an avenue for developing freely deployable feedback tools that can be effectively used to guide student learning. Student evaluation indicates that learners value the PEFT model's feedback and see it as being equally effective as the proprietary ChatGPT model. Participants suggested that incorporating additional explanation for technical terms in the PEFT model's feedback could be more beneficial. This study demonstrates that fine-tuned models can effectively support critical thinking and guide the design of scalable pedagogical systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PEFT on Code Llama produces Java feedback students rate as good as ChatGPT's and better than prompting, but the learning-impact claim rests on ratings rather than outcome measures.

read the letter

The main point is that fine-tuning Code Llama with PEFT gives feedback on buggy Java code that students rate roughly equal to ChatGPT while beating prompt engineering on the same base model. The comparison is direct and uses a mix of BLEU, ROUGE, BERTScore, manual annotation, and student judgments, which lines up in the same direction. That targeted result on an open model for this exact task is the concrete addition here, and the practical angle on cost and data privacy is handled sensibly by distilling from a larger model for training data. The multi-channel evaluation is a step above papers that rely only on automatic scores. The setup shows clear thinking about adapting existing PEFT methods to a real classroom need without new frameworks. The soft spots are straightforward. The abstract leaves out dataset size, exact PEFT configuration, and any statistical tests, so the size of the gains is hard to gauge from what's given. More importantly, the claim that the feedback guides student learning depends on short-term ratings and similarity metrics. Those do not test whether students debug better on fresh problems or retain the material later, and there are no pre/post skill measures or follow-ups. The educational-value conclusion therefore sits on an assumption that immediate preference equals lasting impact. This paper is for people working on open tools for programming education or on domain-specific adaptation of code models. A reader who needs a working example of PEFT for feedback generation would get usable details from it. It deserves a serious referee because the experiment is focused and the evaluation channels are independent of the model's own outputs. I would send it to review and ask for the missing methods numbers plus a clearer statement on what the student ratings actually demonstrate about learning.

Referee Report

3 major / 2 minor

Summary. The paper examines whether parameter-efficient fine-tuning (PEFT) of Code Llama, using a dataset distilled from a larger model, can improve automated feedback on buggy Java code compared to prompt engineering. Evaluation combines automated metrics (BLEU, ROUGE, BERTScore), manual annotation, and student ratings; the central claim is that PEFT yields notable quality gains, significantly outperforms prompting, and produces feedback students rate as equally effective to ChatGPT for guiding learning.

Significance. If the evaluation methodology holds, the work offers a concrete path toward open, low-cost, deployable feedback tools for programming education that avoid proprietary-model costs and data-sharing risks. The multi-channel evaluation (automatic + human + student) is a strength, and the finding that PEFT beats prompting is potentially useful for practitioners.

major comments (3)

[Abstract and Evaluation] Abstract and Evaluation section: the claim that the PEFT model is 'equally effective' to ChatGPT for guiding student learning rests on student ratings and surface metrics (BLEU/ROUGE/BERTScore) without any pre/post skill measures, control for prior knowledge, or longitudinal follow-up. These instruments capture immediate preference and n-gram overlap but do not test whether students internalize feedback or improve on new problems.
[§3 and §4] §3 (Method) and §4 (Dataset): the exact PEFT configuration (LoRA rank, alpha, target modules, training epochs, learning rate) and the size of the distilled training set and evaluation set are not reported. Without these numbers and without statistical tests (p-values, effect sizes, confidence intervals) on the metric deltas, it is impossible to assess whether the reported improvements are robust or reproducible.
[Results] Results section: the automated metrics are known to correlate only weakly with pedagogical value; the paper does not include an ablation showing that higher BLEU/ROUGE/BERTScore actually predicts better student debugging performance on held-out problems.

minor comments (2)

[Abstract] The abstract states 'notable improvements' and 'significantly outperforms' without defining the thresholds or reporting exact score tables; a summary table of all metric values with standard deviations would improve clarity.
[Discussion] Participant suggestions about adding explanations for technical terms are noted but not followed up with any revised prompt or fine-tuning experiment; this could be moved to future work or briefly tested.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity, completeness, and transparency while preserving the core contributions.

read point-by-point responses

Referee: [Abstract and Evaluation] Abstract and Evaluation section: the claim that the PEFT model is 'equally effective' to ChatGPT for guiding student learning rests on student ratings and surface metrics (BLEU/ROUGE/BERTScore) without any pre/post skill measures, control for prior knowledge, or longitudinal follow-up. These instruments capture immediate preference and n-gram overlap but do not test whether students internalize feedback or improve on new problems.

Authors: We agree that the evaluation measures perceived effectiveness and metric similarity rather than objective learning gains. The student ratings were collected from participants who used the feedback on their own buggy Java code and judged the PEFT output as equally helpful to ChatGPT for guiding their immediate debugging process. We have revised the abstract to state 'students rated the PEFT feedback as equally effective' and added an explicit limitations paragraph acknowledging the lack of pre/post skill measures or longitudinal data. Future controlled experiments are suggested as follow-up work. revision: yes
Referee: [§3 and §4] §3 (Method) and §4 (Dataset): the exact PEFT configuration (LoRA rank, alpha, target modules, training epochs, learning rate) and the size of the distilled training set and evaluation set are not reported. Without these numbers and without statistical tests (p-values, effect sizes, confidence intervals) on the metric deltas, it is impossible to assess whether the reported improvements are robust or reproducible.

Authors: This omission was an oversight. We have expanded §3 with the complete PEFT details (LoRA rank 16, alpha 32, target modules q_proj and v_proj, 3 epochs, learning rate 2e-4) and §4 with dataset sizes (4,800 distilled training examples, 600 evaluation examples). We have also added paired statistical tests, p-values, and effect sizes for all metric comparisons in the results section to demonstrate robustness. revision: yes
Referee: [Results] Results section: the automated metrics are known to correlate only weakly with pedagogical value; the paper does not include an ablation showing that higher BLEU/ROUGE/BERTScore actually predicts better student debugging performance on held-out problems.

Authors: We concur that automated metrics are imperfect proxies and therefore complemented them with expert manual annotation and direct student ratings. We have extended the results discussion to include observed correlations between the automated scores and human judgments. A dedicated ablation linking metric gains to debugging performance on new held-out problems was not performed; we have added this as a noted limitation and direction for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation chain

full rationale

The paper reports an empirical comparison of PEFT fine-tuning versus prompt engineering on Code Llama for generating code review feedback. Training uses knowledge distillation from a larger model to create reference outputs, but evaluation relies on independent external benchmarks (BLEU, ROUGE, BERTScore) plus separate manual annotations and student ratings that are not algebraically or definitionally derived from the fine-tuned model's parameters or outputs. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation of the quality claims. The central result (PEFT improves metrics and matches ChatGPT in student perception) is therefore not forced by construction from its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that standard NLP overlap metrics plus student self-report capture educational effectiveness, plus the unstated premise that the distilled dataset from the larger model is representative and unbiased.

free parameters (1)

PEFT configuration
Rank, learning rate, and other fine-tuning hyperparameters chosen to adapt the model but not reported in the abstract.

axioms (1)

domain assumption BLEU, ROUGE, BERTScore and student ratings are valid proxies for feedback quality and learning value
Invoked when the paper concludes that PEFT improves quality based on these measures.

pith-pipeline@v0.9.0 · 5545 in / 1201 out tokens · 58097 ms · 2026-05-14T20:36:31.782456+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 34 canonical work pages · 5 internal anchors

[1]

Marzieh Ahmadzadeh, Dave Elliman, and Colin Higgins. 2005. An analysis of patterns of debugging among novice computer science students. InProceedings of the 10th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education(Caparica, Portugal)(ITiCSE ’05). Association for Computing Machinery, New York, NY, USA, 84–88. doi:10.1145/106...

work page doi:10.1145/1067445.1067472 2005
[2]

Zishan Ahmed, Shakib Sadat Shanto, and Akinul Islam Jony. 2024. Potentiality of generative AI tools in higher education: Evaluating ChatGPT’s viability as a teaching assistant for introductory programming courses.STEM Education4, 3 (2024), 165–182. doi:10.3934/steme.2024011

work page doi:10.3934/steme.2024011 2024
[3]

Amjad Altadmri and Neil C.C. Brown. 2015. 37 Million Compilations: Investigat- ing Novice Programming Mistakes in Large-Scale Student Data. InProceedings of the 46th ACM Technical Symposium on Computer Science Education(Kansas City, Missouri, USA)(SIGCSE ’15). Association for Computing Machinery, New York, NY, USA, 522–527. doi:10.1145/2676723.2677258

work page doi:10.1145/2676723.2677258 2015
[4]

Virginia Braun and Victoria Clarke. 2006. Using thematic analy- sis in psychology.Qualitative Research in Psychology3, 2 (2006), 77–101. arXiv:https://doi.org/10.1191/1478088706qp063oa doi:10.1191/ 1478088706qp063oa

work page doi:10.1191/1478088706qp063oa 2006
[5]

Brown and Amjad Altadmri

Neil C.C. Brown and Amjad Altadmri. 2014. Investigating novice programming mistakes: educator beliefs vs. student data. InProceedings of the Tenth Annual Conference on International Computing Education Research(Glasgow, Scotland, United Kingdom)(ICER ’14). Association for Computing Machinery, New York, NY, USA, 43–50. doi:10.1145/2632320.2632343

work page doi:10.1145/2632320.2632343 2014
[6]

Eason Chen, Ray Huang, Han-Shin Chen, Yuen-Hsien Tseng, and Liang-Yi Li
[7]

arXiv:2305.01863 [cs.HC] https://arxiv.org/abs/2305.01863

GPTutor: a ChatGPT-powered programming tool for code explanation. arXiv:2305.01863 [cs.HC] https://arxiv.org/abs/2305.01863

work page arXiv
[8]

Paul Denny, Andrew Luxton-Reilly, and Ewan Tempero. 2012. All syntax errors are not equal. InProceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education(Haifa, Israel)(ITiCSE ’12). Association for Computing Machinery, New York, NY, USA, 75–80. doi:10.1145/2325296. 2325318

work page doi:10.1145/2325296 2012
[9]

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG] https: //arxiv.org/abs/2305.14314

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Andrew Ettles, Andrew Luxton-Reilly, and Paul Denny. 2018. Common logic errors made by novice programmers. InProceedings of the 20th Australasian Computing Education Conference(Brisbane, Queensland, Australia)(ACE ’18). Association for Computing Machinery, New York, NY, USA, 83–89. doi:10.1145/ 3160489.3160493

work page arXiv 2018
[11]

Ian Finlayson and Stephen Davies. 2024. Jguardrail: A Framework for Identifying Possible Errors in Student Java Code.J. Comput. Sci. Coll.40, 3 (Oct. 2024), 322–333

2024
[12]

Maria Hristova, Ananya Misra, Megan Rutter, and Rebecca Mercuri. 2003. Identi- fying and correcting Java programming errors for introductory computer science students. InProceedings of the 34th SIGCSE Technical Symposium on Computer Science Education(Reno, Navada, USA)(SIGCSE ’03). Association for Computing Machinery, New York, NY, USA, 153–156. doi:10.11...

work page doi:10.1145/611892.611956 2003
[13]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Jackson, M

J. Jackson, M. Cobb, and C. Carver. 2005. Identifying Top Java Errors for Novice Programmers. InProceedings Frontiers in Education 35th Annual Conference. T4C– T4C. doi:10.1109/FIE.2005.1611967

work page doi:10.1109/fie.2005.1611967 2005
[15]

Nadja Just, Janet Siegmund, and Belinda Schantong. 2025. From Bugs to Break- throughs: Novice Errors in CS2. arXiv:2502.14438 [cs.SE] https://arxiv.org/abs/ 2502.14438

work page arXiv 2025
[16]

Charles Koutcheme. 2022. Towards Open Natural Language Feedback Generation for Novice Programmers using Large Language Models. InProceedings of the 22nd Koli Calling International Conference on Computing Education Research(Koli, Finland)(Koli Calling ’22). Association for Computing Machinery, New York, NY, USA, Article 29, 2 pages. doi:10.1145/3564721.3565955

work page doi:10.1145/3564721.3565955 2022
[17]

Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, and Paul Denny. 2024. Open Source Language Models Can Provide Feed- back: Evaluating LLMs’ Ability to Help Students Using GPT-4-As-A-Judge. arXiv:2405.05253 [cs.CL] https://arxiv.org/abs/2405.05253

work page arXiv 2024
[18]

Charles Koutcheme and Arto Hellas. 2024. Propagating Large Language Models Programming Feedback. InProceedings of the Eleventh ACM Conference on Learn- ing @ Scale(Atlanta, GA, USA)(L@S ’24). Association for Computing Machinery, New York, NY, USA, 366–370. doi:10.1145/3657604.3664665

work page doi:10.1145/3657604.3664665 2024
[19]

Smitha S Kumar, Michael Lones, Manuel Maarek, and Hind Zantout. 2025. Navi- gating the landscape of automated feedback generation techniques for program- ming exercises.ACM Trans. Comput. Educ.(Sept. 2025). doi:10.1145/3764593 Just Accepted

work page doi:10.1145/3764593 2025
[20]

Chris Langhout and Maurício Aniche. 2021. Atoms of Confusion in Java. In 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). 25–35. doi:10.1109/ICPC52881.2021.00012

work page doi:10.1109/icpc52881.2021.00012 2021
[21]

Juho Leinonen, Paul Denny, Olli Kiljunen, Stephen MacNeil, Sami Sarsa, and Arto Hellas. 2024. LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education. arXiv:2411.10455 [cs.CY] https://arxiv.org/abs/2411.10455

work page arXiv 2024
[22]

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/

2004
[23]

Jorge Machado. 2025. Toward a Public and Secure Generative AI: A Comparative Analysis of Open and Closed LLMs. arXiv:2505.10603 [cs.CY] https://arxiv.org/ abs/2505.10603

work page arXiv 2025
[24]

Davin McCall and Michael Kölling. 2019. A New Look at Novice Programmer Errors.ACM Trans. Comput. Educ.19, 4, Article 38 (July 2019), 30 pages. doi:10. 1145/3335814

2019
[25]

Davin McCall and Michael Kölling. 2014. Meaningful categorisation of novice pro- grammer errors. In2014 IEEE Frontiers in Education Conference (FIE) Proceedings. 1–8. doi:10.1109/FIE.2014.7044420

work page doi:10.1109/fie.2014.7044420 2014
[26]

Marcus Messer, Neil C. C. Brown, Michael Kölling, and Miaojing Shi. 2024. Au- tomated Grading and Feedback Tools for Programming Education: A System- atic Review.ACM Trans. Comput. Educ.24, 1, Article 10 (Feb. 2024), 43 pages. doi:10.1145/3636515

work page doi:10.1145/3636515 2024
[27]

Susanne Narciss. 2008. Feedback strategies for interactive learning tasks. In Handbook of research on educational communications and technology. Routledge, 125–143

2008
[28]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting on Association for Computational Linguistics(Philadelphia, Penn- sylvania)(ACL ’02). Association for Computational Linguistics, USA, 311–318. doi:10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2002
[29]

Olga Petrovska, Lee Clift, Faron Moller, and Rebecca Pearsall. 2024. Incorporating Generative AI into Software Development Education. InProceedings of the 8th Conference on Computing Education Practice(Durham, United Kingdom)(CEP ’24). Association for Computing Machinery, New York, NY, USA, 37–40. doi:10. 1145/3633053.3633057

work page arXiv 2024
[30]

Farman Ali Pirzado, Awais Ahmed, Román Alejandro Mendoza-Urdiales, and Hugo Terashima-Marin. 2024. Navigating the Pitfalls: Analyzing the Behavior of LLMs as a Coding Assistant for Computer Science Students—A Systematic Review of the Literature.IEEE Access12 (2024), 112605–112625. doi:10.1109/ ACCESS.2024.3443621

work page arXiv 2024
[31]

Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2024. Fine-tuning and prompt engineering for large language models-based code review automation. Information and Software Technology175 (2024), 107523. doi:10.1016/j.infsof.2024. 107523

work page doi:10.1016/j.infsof.2024 2024
[32]

Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton- Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N

James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton- Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. 2023. The Robots Are Here: Navigating the Generative AI Revolution in Computing Education. InProceedings of t...

work page doi:10.1145/3623762.3633499 2023
[33]

Yizhou Qian and James Lehman. 2017. Students’ Misconceptions and Other Difficulties in Introductory Programming: A Literature Review.ACM Trans. Comput. Educ.18, 1, Article 1 (Oct. 2017), 24 pages. doi:10.1145/3077618

work page doi:10.1145/3077618 2017
[34]

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiao- qing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cris- tian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Carlos Alexandre Gouvea da Silva, Felipe Negrelle Ramos, Rafael Veiga de Moraes, and Edson Leonardo dos Santos. 2024. ChatGPT: Challenges and Benefits in Software Programming for Higher Education.Sustainability16, 3 (2024). doi:10. 3390/su16031245

2024
[36]

Lorenzo Lee Solano, Charles Koutcheme, Juho Leinonen, Alexandra Vassar, and Jake Renzella. 2025. Narrowing the Gap: Supervised Fine-Tuning of Open- Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools. arXiv:2507.05305 [cs.CY] https://arxiv.org/abs/2507.05305

work page arXiv 2025
[37]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, et al . 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-Trained Language Models. InProceedings of the 45th International Conference on Software Engineering(Melbourne, Victoria, Australia)(ICSE ’23). IEEE Press, 1482–1494. doi:10.1109/ICSE48619.2023.00129 Conference acronym ’XX, June 03–05, 2018, Woodstoc...

work page doi:10.1109/icse48619.2023.00129 2023
[39]

Jiawei Xu, Ying Ding, and Yi Bu. 2025. Position: Open and Closed Large Language Models in Healthcare. arXiv:2501.09906 [cs.CY] https://arxiv.org/abs/2501.09906

work page arXiv 2025
[40]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs.CL] https://arxiv.org/abs/1904.09675

work page internal anchor Pith review Pith/arXiv arXiv 2020