HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
Pith reviewed 2026-05-11 00:51 UTC · model grok-4.3
The pith
Current LLM-based automated program repair models lack robustness to minor syntactic variations in code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper constructs HEJ-Robust from HumanEval-Java-Bug by applying eight semantics-preserving code transformations, evaluates five fine-tuned LLMs on the resulting 1,450 instances, documents performance drops exceeding 50 percent under multiple transformations, and concludes that current LLM-based repair models are not robust to minor syntactic variations.
What carries the argument
HEJ-Robust benchmark, formed by eight semantics-preserving transformations applied to HumanEval-Java-Bug to generate varied but equivalent buggy-code instances.
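The abstract does not enumerate the eight transformations; the rebuttal names variable renaming as one of them. As a purely illustrative sketch (not the authors' implementation, which would more plausibly operate on a parse tree, e.g. via tree-sitter), a whole-word rename turns a buggy instance into a syntactically different but semantically equivalent one:

```python
import re

def rename_variable(java_src: str, old: str, new: str) -> str:
    """Rename a variable via whole-word substitution.

    A real implementation should use an AST to avoid touching string
    literals, comments, or same-named fields; this token-level version
    is only illustrative.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, java_src)

buggy = "int sum = 0; for (int i = 0; i < n; i++) sum += arr[i];"
print(rename_variable(buggy, "sum", "total"))
# The transformed program behaves identically, which is the property
# HEJ-Robust relies on when attributing failures to surface syntax.
```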
If this is right
- Models that succeed on standard benchmarks can still fail when presented with equivalent code that uses different but valid syntax.
- Practical deployment of LLM repair tools will encounter inconsistent results across codebases that follow varied formatting conventions.
- Benchmarks for automated program repair should include multiple syntactic forms to give a more accurate picture of model capability.
- Training procedures for repair LLMs may need to expose models to transformed code variants to reduce sensitivity to surface syntax.
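The last bullet above suggests training-time augmentation. Under the assumption (not stated in the paper) that training examples are (buggy, fixed) pairs of source strings, augmentation amounts to mapping each transformation over both sides of every pair; a hypothetical sketch:

```python
from typing import Callable

Transform = Callable[[str], str]

def augment(pairs: list[tuple[str, str]],
            transforms: list[Transform]) -> list[tuple[str, str]]:
    """Expand (buggy, fixed) pairs with transformed variants.

    Each transform is applied to both sides so the patch target stays
    consistent with the transformed buggy input (a renamed variable must
    be renamed in the fix too). Sketch only; not the paper's method.
    """
    out = list(pairs)
    for t in transforms:
        out.extend((t(b), t(f)) for b, f in pairs)
    return out

pairs = [("int sum=0;", "int sum=1;")]
rename = lambda s: s.replace("sum", "total")  # toy rename transform
print(len(augment(pairs, [rename])))  # 2
```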
Where Pith is reading between the lines
- The observed sensitivity could limit the usefulness of these models inside IDEs where developers routinely edit code in personal styles.
- Retraining or fine-tuning the same models on the transformed instances might recover much of the lost performance.
- Similar robustness gaps are likely to appear in related tasks such as code generation or test-case synthesis.
Load-bearing premise
The eight chosen semantics-preserving transformations sufficiently represent the syntactic variations that occur in real-world software development.
What would settle it
Measuring the same models on a fresh collection of real-world buggy methods that naturally differ in syntax from their canonical versions and checking whether the performance drops remain above 50 percent.
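How "drop" is computed matters for the 50 percent threshold. Assuming the abstract's "drops by over 50%" means a relative drop in repair success rate (the precise definition is not stated here), the check per model and transformation is a one-liner:

```python
def relative_drop(original: float, transformed: float) -> float:
    """Relative performance drop (%) after a transformation."""
    if original <= 0:
        raise ValueError("original score must be positive")
    return 100.0 * (original - transformed) / original

# Hypothetical numbers: a model fixing 40% of canonical bugs but only
# 18% of transformed ones has dropped by 55%, crossing the paper's
# >50% threshold.
print(relative_drop(40.0, 18.0))  # 55.0
```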
read the original abstract
Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HEJ-Robust, a robustness benchmark for LLM-based automated program repair constructed from HumanEval-Java-Bug by applying eight semantics-preserving code transformations to produce 1,450 instances. It evaluates five fine-tuned LLMs on the benchmark and reports performance drops exceeding 50% under several transformations, concluding that current LLM-based APR models lack robustness to minor syntactic variations.
Significance. If the transformations are empirically validated as semantics-preserving, this benchmark and the reported drops would provide concrete evidence of a practical limitation in LLM-based APR, highlighting the gap between performance on canonical benchmarks and real-world syntactic variability. The construction of a dedicated, multi-model evaluation resource is a clear strength that could support future robustness studies.
major comments (1)
- §3 (Benchmark Construction): The claim that the eight transformations are semantics-preserving is central to attributing performance drops to a lack of robustness, yet the manuscript provides no empirical validation, such as re-running the HumanEval-Java test suite on transformed instances to confirm identical failing tests or verifying that the original correct patches remain applicable. Without this check, the >50% drops could reflect altered problem difficulty rather than syntactic sensitivity.
minor comments (2)
- §4 (Evaluation): Clarify the precise success metric used for repair (e.g., pass@1, test-suite pass rate) and report whether statistical significance tests were applied to the performance drops.
- Tables/Figures: Include a per-transformation breakdown with raw counts or confidence intervals alongside the aggregate >50% claim to improve interpretability.
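If the success metric turns out to be pass@k, the natural choice is the unbiased estimator from Chen et al. (reference [7] below), where n patches are sampled per bug and c of them pass the tests; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample contains a passing patch
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 sampled patches of which 3 pass the test suite:
print(round(pass_at_k(10, 3, 1), 2))  # 0.3
```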
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. We address the single major comment below and have prepared revisions to incorporate the suggested validation.
read point-by-point responses
-
Referee: [§3 (Benchmark Construction)] The claim that the eight transformations are semantics-preserving is central to attributing performance drops to lack of robustness, yet the manuscript provides no empirical validation such as re-running the HumanEval-Java test suite on transformed instances to confirm identical failing tests or verifying that the original correct patches remain applicable. Without this check, the >50% drops could reflect altered problem difficulty rather than syntactic sensitivity.
Authors: We agree that explicit empirical validation would strengthen the central claim. The eight transformations were selected from established semantics-preserving operations documented in prior work on code refactoring and mutation (variable renaming, equivalent statement reordering, etc.). Nevertheless, the original manuscript did not report a direct check that the transformed instances produce identical failing tests and remain fixable by the original patches. In the revised manuscript we will add a dedicated validation subsection: we re-execute the HumanEval-Java test suites on all 1,450 transformed instances, confirm that the set of failing tests is unchanged, and verify that the ground-truth patches continue to pass. This will empirically rule out changes in problem difficulty and support the robustness interpretation.
Revision: yes
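Per transformed instance, the validation the authors promise reduces to a two-part predicate; a sketch of the check (the harness that actually executes the HumanEval-Java test suite is elided, and the names here are illustrative):

```python
def semantics_preserved(failing_before: set[str],
                        failing_after: set[str],
                        patch_passes_after: bool) -> bool:
    """Check the two conditions the rebuttal commits to validating:
    the transformed instance must fail exactly the same tests, and the
    ground-truth patch must still make the transformed instance pass."""
    return failing_before == failing_after and patch_passes_after

# Hypothetical instance: the failing set is unchanged and the original
# patch still passes after transformation.
print(semantics_preserved({"test_3", "test_7"}, {"test_7", "test_3"}, True))  # True
```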
Circularity Check
No circularity: direct empirical benchmark evaluation
full rationale
The paper constructs HEJ-Robust by applying eight claimed semantics-preserving transformations to HumanEval-Java-Bug instances and measures LLM repair performance drops via direct evaluation. No mathematical derivations, equations, fitted parameters, or predictions appear. No self-citations are invoked as load-bearing premises for any result. The central claim (performance drops >50%) follows from explicit measurement on the constructed instances rather than reducing to any input by definition or construction. This is a standard self-contained empirical benchmark study.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: HumanEval-Java-Bug contains representative examples of buggy Java code suitable for robustness testing.
- Domain assumption: The eight transformations preserve program semantics while introducing syntactic variation.
Forward citations
Cited by 3 Pith papers
- Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation. Many reported failures in LLM-based code translation are false negatives due to evaluation pipeline issues such as improper compilation flags, missing library links, and unconfigured runtime environments rather than i...
- Social Bias in LLM-Generated Code: Benchmark and Mitigation. LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
- Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation. A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.
Reference graph
Works this paper leans on
- [1] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang
- [2] Unified pre-training for program understanding and generation. arXiv:2103.06333 [cs.SE] https://arxiv.org/abs/2103.06333, 2021
- [3]
- [4] Max Brunsfeld and contributors. 2024. tree-sitter. https://github.com/tree-sitter/tree-sitter. Accessed: 2024-05-23
- [5] Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar Devanbu, and Baishakhi Ray. 2022. NatGen: Generative pre-training by "Naturalizing" source code. arXiv:2206.07585 [cs.SE] https://arxiv.org/abs/2206.07585
- [6] Saikat Chakraborty and Baishakhi Ray. 2021. On Multi-Modal Learning of Editing Source Code. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, Piscataway, NJ, USA, 443–455
- [7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374
- [8] Cheng Cheng and Jinqiu Yang. 2025. CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models. In 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware). IEEE, Piscataway, NJ, USA, 01–10
- [9] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, Piscataway, NJ, USA, 1469–1481
- [10] Fonds de recherche du Québec. 2024. FRQNT-NSERC NOVA Program, Grant No. 2024-NOVA-346499. https://doi.org/10.69777/346499
- [11] Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of code language models on automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, Piscataway, NJ, USA, 1430–1442
- [12] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-aware neural machine translation for automatic program repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, Piscataway, NJ, USA, 1161–1173
- [13] René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, New York, NY, USA, 437–440
- [14] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from human-written patches. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, Piscataway, NJ, USA, 802–811
- [15] Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair. IEEE Transactions on Software Engineering 38, 1 (2012), 54–72. https://doi.org/10.1109/TSE.2011.104
- [16] Fengjie Li, Jiajun Jiang, Jiajun Sun, and Hongyu Zhang. 2025. Evaluating the generalizability of LLMs in automated program repair. In 2025 IEEE/ACM 47th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, Piscataway, NJ, USA, 91–95
- [17] Junjie Li, Fazle Rabbi, Cheng Cheng, Aseem Sangalay, Yuan Tian, and Jinqiu Yang. 2026. An exploratory study on fine-tuning large language models for secure code generation. Empirical Software Engineering 31, 4 (2026), 81. https://doi.org/10.1007/s10664-026-10803-9
- [19]
- [20] Lin Ling, Fazle Rabbi, Song Wang, and Jinqiu Yang. 2025. Bias unveiled: Investigating social bias in LLM-generated code. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. AAAI Press, Washington, DC, USA, 27491–27499
- [21] Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. CoCoNuT: Combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, New York, NY, USA, 101–114
- [22] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, 311–318
- [23] Maryam Vahdat Pour, Zhuo Li, Lei Ma, and Hadi Hemmati. 2021. A search-based testing framework for deep neural networks of source code embedding. In 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, Piscataway, NJ, USA, 36–46
- [24] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis. ACM, New York, NY, USA, 24–36
- [25] Fazle Rabbi, Zishuo Ding, and Jinqiu Yang. 2025. A Multi-Language Perspective on the Robustness of LLM Code Generation. arXiv:2504.19108 [cs.SE] https://arxiv.org/abs/2504.19108
- [26] Fazle Rabbi, Lin Ling, Song Wang, and Jinqiu Yang. 2026. Social Bias in LLM-Generated Code: Benchmark and Mitigation. arXiv preprint (2026). arXiv:2605.00382 https://arxiv.org/abs/2605.00382
- [27]
- [28] Fazle Rabbi, Soumit Kanti Saha, and Jinqiu Yang. 2026. Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation. arXiv preprint (2026). arXiv:2605.02195 https://arxiv.org/abs/2605.02195
- [29] Md Rafiqul Islam Rabin, Nghi DQ Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, and Mohammad Amin Alipour. 2021. On the generalizability of Neural Program Models with respect to semantic-preserving program transformations. Information and Software Technology 135 (2021), 106552
- [30]
- [31] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a method for automatic evaluation of code synthesis. arXiv:2009.10297 [cs.SE] https://arxiv.org/abs/2009.10297
- [32] Soumit Kanti Saha, Fazle Rabbi, Song Wang, and Jinqiu Yang. 2024. Specification-Driven Code Translation Powered by Large Language Models: How Far Are We? arXiv:2412.04590 [cs.SE] https://arxiv.org/abs/2412.04590
- [33] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology (TOSEM) 28, 4 (2019), 1–29
- [34] Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, et al. 2023. ReCode: Robustness evaluation of code generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Lingu...
- [35]
- [36] Bo Yang and Jinqiu Yang. 2020. Exploring the differences between plausible and correct patches at fine-grained level. In 2020 IEEE 2nd International Workshop on Intelligent Bug Fixing (IBF). IEEE, Piscataway, NJ, USA, 1–8
- [37]
- [38]
- [39] Quanjun Zhang, Tongke Zhang, Juan Zhai, Chunrong Fang, Bowen Yu, Weisong Sun, and Zhenyu Chen. 2023. A critical review of large language model on software engineering: An example from ChatGPT and automated program repair. arXiv:2310.08879 [cs.SE] https://arxiv.org/abs/2310.08879
- [40] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided edit decoder for neural program repair. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, New York, NY, USA, 341–353
discussion (0)