HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
Pith reviewed 2026-05-11 00:51 UTC · model grok-4.3
The pith
Current LLM-based automated program repair models lack robustness to minor syntactic variations in code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper constructs HEJ-Robust from HumanEval-Java-Bug by applying eight semantics-preserving code transformations, evaluates five fine-tuned LLMs on the resulting 1,450 instances, documents performance drops exceeding 50 percent under multiple transformations, and concludes that current LLM-based repair models are not robust to minor syntactic variations.
What carries the argument
HEJ-Robust benchmark, formed by eight semantics-preserving transformations applied to HumanEval-Java-Bug to generate varied but equivalent buggy-code instances.
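The abstract does not enumerate the eight transformations; the rebuttal names variable renaming as one of them. As a purely illustrative sketch (not the authors' implementation, which would more plausibly operate on a parse tree, e.g. via tree-sitter), a whole-word rename turns a buggy instance into a syntactically different but semantically equivalent one:

```python
import re

def rename_variable(java_src: str, old: str, new: str) -> str:
    """Rename a variable via whole-word substitution.

    A real implementation should use an AST to avoid touching string
    literals, comments, or same-named fields; this token-level version
    is only illustrative.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, java_src)

buggy = "int sum = 0; for (int i = 0; i < n; i++) sum += arr[i];"
print(rename_variable(buggy, "sum", "total"))
# The transformed program behaves identically, which is the property
# HEJ-Robust relies on when attributing failures to surface syntax.
```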
If this is right
- Models that succeed on standard benchmarks can still fail when presented with equivalent code that uses different but valid syntax.
- Practical deployment of LLM repair tools will encounter inconsistent results across codebases that follow varied formatting conventions.
- Benchmarks for automated program repair should include multiple syntactic forms to give a more accurate picture of model capability.
- Training procedures for repair LLMs may need to expose models to transformed code variants to reduce sensitivity to surface syntax.
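The last bullet above suggests training-time augmentation. Under the assumption (not stated in the paper) that training examples are (buggy, fixed) pairs of source strings, augmentation amounts to mapping each transformation over both sides of every pair; a hypothetical sketch:

```python
from typing import Callable

Transform = Callable[[str], str]

def augment(pairs: list[tuple[str, str]],
            transforms: list[Transform]) -> list[tuple[str, str]]:
    """Expand (buggy, fixed) pairs with transformed variants.

    Each transform is applied to both sides so the patch target stays
    consistent with the transformed buggy input (a renamed variable must
    be renamed in the fix too). Sketch only; not the paper's method.
    """
    out = list(pairs)
    for t in transforms:
        out.extend((t(b), t(f)) for b, f in pairs)
    return out

pairs = [("int sum=0;", "int sum=1;")]
rename = lambda s: s.replace("sum", "total")  # toy rename transform
print(len(augment(pairs, [rename])))  # 2
```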
Where Pith is reading between the lines
- The observed sensitivity could limit the usefulness of these models inside IDEs where developers routinely edit code in personal styles.
- Retraining or fine-tuning the same models on the transformed instances might recover much of the lost performance.
- Similar robustness gaps are likely to appear in related tasks such as code generation or test-case synthesis.
Load-bearing premise
The eight chosen semantics-preserving transformations sufficiently represent the syntactic variations that occur in real-world software development.
What would settle it
Measuring the same models on a fresh collection of real-world buggy methods that naturally differ in syntax from their canonical versions and checking whether the performance drops remain above 50 percent.
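How "drop" is computed matters for the 50 percent threshold. Assuming the abstract's "drops by over 50%" means a relative drop in repair success rate (the precise definition is not stated here), the check per model and transformation is a one-liner:

```python
def relative_drop(original: float, transformed: float) -> float:
    """Relative performance drop (%) after a transformation."""
    if original <= 0:
        raise ValueError("original score must be positive")
    return 100.0 * (original - transformed) / original

# Hypothetical numbers: a model fixing 40% of canonical bugs but only
# 18% of transformed ones has dropped by 55%, crossing the paper's
# >50% threshold.
print(relative_drop(40.0, 18.0))  # 55.0
```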
read the original abstract
Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HEJ-Robust, a robustness benchmark for LLM-based automated program repair constructed from HumanEval-Java-Bug by applying eight semantics-preserving code transformations to produce 1,450 instances. It evaluates five fine-tuned LLMs on the benchmark and reports performance drops exceeding 50% under several transformations, concluding that current LLM-based APR models lack robustness to minor syntactic variations.
Significance. If the transformations are empirically validated as semantics-preserving, this benchmark and the reported drops would provide concrete evidence of a practical limitation in LLM-based APR, highlighting the gap between performance on canonical benchmarks and real-world syntactic variability. The construction of a dedicated, multi-model evaluation resource is a clear strength that could support future robustness studies.
major comments (1)
- §3 (Benchmark Construction): The claim that the eight transformations are semantics-preserving is central to attributing performance drops to a lack of robustness, yet the manuscript provides no empirical validation, such as re-running the HumanEval-Java test suite on transformed instances to confirm identical failing tests or verifying that the original correct patches remain applicable. Without this check, the >50% drops could reflect altered problem difficulty rather than syntactic sensitivity.
minor comments (2)
- §4 (Evaluation): Clarify the precise success metric used for repair (e.g., pass@1, test-suite pass rate) and report whether statistical significance tests were applied to the performance drops.
- Tables/Figures: Include a per-transformation breakdown with raw counts or confidence intervals alongside the aggregate >50% claim to improve interpretability.
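If the success metric turns out to be pass@k, the natural choice is the unbiased estimator from Chen et al. (reference [7] below), where n patches are sampled per bug and c of them pass the tests; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample contains a passing patch
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 sampled patches of which 3 pass the test suite:
print(round(pass_at_k(10, 3, 1), 2))  # 0.3
```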
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. We address the single major comment below and have prepared revisions to incorporate the suggested validation.
read point-by-point responses
-
Referee: [§3 (Benchmark Construction)] The claim that the eight transformations are semantics-preserving is central to attributing performance drops to lack of robustness, yet the manuscript provides no empirical validation such as re-running the HumanEval-Java test suite on transformed instances to confirm identical failing tests or verifying that the original correct patches remain applicable. Without this check, the >50% drops could reflect altered problem difficulty rather than syntactic sensitivity.
Authors: We agree that explicit empirical validation would strengthen the central claim. The eight transformations were selected from established semantics-preserving operations documented in prior work on code refactoring and mutation (variable renaming, equivalent statement reordering, etc.). Nevertheless, the original manuscript did not report a direct check that the transformed instances produce identical failing tests and remain fixable by the original patches. In the revised manuscript we will add a dedicated validation subsection: we re-execute the HumanEval-Java test suites on all 1,450 transformed instances, confirm that the set of failing tests is unchanged, and verify that the ground-truth patches continue to pass. This will empirically rule out changes in problem difficulty and support the robustness interpretation.
Revision: yes
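Per transformed instance, the validation the authors promise reduces to a two-part predicate; a sketch of the check (the harness that actually executes the HumanEval-Java test suite is elided, and the names here are illustrative):

```python
def semantics_preserved(failing_before: set[str],
                        failing_after: set[str],
                        patch_passes_after: bool) -> bool:
    """Check the two conditions the rebuttal commits to validating:
    the transformed instance must fail exactly the same tests, and the
    ground-truth patch must still make the transformed instance pass."""
    return failing_before == failing_after and patch_passes_after

# Hypothetical instance: the failing set is unchanged and the original
# patch still passes after transformation.
print(semantics_preserved({"test_3", "test_7"}, {"test_7", "test_3"}, True))  # True
```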
Circularity Check
No circularity: direct empirical benchmark evaluation
full rationale
The paper constructs HEJ-Robust by applying eight claimed semantics-preserving transformations to HumanEval-Java-Bug instances and measures LLM repair performance drops via direct evaluation. No mathematical derivations, equations, fitted parameters, or predictions appear. No self-citations are invoked as load-bearing premises for any result. The central claim (performance drops >50%) follows from explicit measurement on the constructed instances rather than reducing to any input by definition or construction. This is a standard self-contained empirical benchmark study.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: HumanEval-Java-Bug contains representative examples of buggy Java code suitable for robustness testing.
- Domain assumption: The eight transformations preserve program semantics while introducing syntactic variation.
Forward citations
Cited by 3 Pith papers
- Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation. Many reported failures in LLM-based code translation are false negatives due to evaluation pipeline issues such as improper compilation flags, missing library links, and unconfigured runtime environments rather than i...
- Social Bias in LLM-Generated Code: Benchmark and Mitigation. LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
- Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation. A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.
Reference graph
Works this paper leans on
- [1] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang
- [2] Unified pre-training for program understanding and generation. arXiv:2103.06333 [cs.SE] https://arxiv.org/abs/2103.06333, 2021
- [3]
- [4] Max Brunsfeld and contributors. 2024. tree-sitter. https://github.com/tree-sitter/tree-sitter. Accessed: 2024-05-23
- [5] Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar Devanbu, and Baishakhi Ray. 2022. NatGen: Generative pre-training by "Naturalizing" source code. arXiv:2206.07585 [cs.SE] https://arxiv.org/abs/2206.07585
- [6] Saikat Chakraborty and Baishakhi Ray. 2021. On Multi-Modal Learning of Editing Source Code. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, Piscataway, NJ, USA, 443–455
- [7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374
- [8] Cheng Cheng and Jinqiu Yang. 2025. CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models. In 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware). IEEE, Piscataway, NJ, USA, 01–10
- [9] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, Piscataway, NJ, USA, 1469–1481
- [10] Fonds de recherche du Québec. 2024. FRQNT-NSERC NOVA Program, Grant No. 2024-NOVA-346499. https://doi.org/10.69777/346499
- [11] Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of code language models on automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, Piscataway, NJ, USA, 1430–1442
- [12] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-aware neural machine translation for automatic program repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, Piscataway, NJ, USA, 1161–1173
- [13] René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, New York, NY, USA, 437–440
- [14] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from human-written patches. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, Piscataway, NJ, USA, 802–811
- [15] Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair. IEEE Transactions on Software Engineering 38, 1 (2012), 54–72. https://doi.org/10.1109/TSE.2011.104
- [16] Fengjie Li, Jiajun Jiang, Jiajun Sun, and Hongyu Zhang. 2025. Evaluating the generalizability of LLMs in automated program repair. In 2025 IEEE/ACM 47th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, Piscataway, NJ, USA, 91–95
- [17] Junjie Li, Fazle Rabbi, Cheng Cheng, Aseem Sangalay, Yuan Tian, and Jinqiu Yang. 2026. An exploratory study on fine-tuning large language models for secure code generation. Empirical Software Engineering 31, 4 (2026), 81. https://doi.org/10.1007/s10664-026-10803-9
- [19]
- [20] Lin Ling, Fazle Rabbi, Song Wang, and Jinqiu Yang. 2025. Bias unveiled: Investigating social bias in LLM-generated code. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. AAAI Press, Washington, DC, USA, 27491–27499
- [21] Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. CoCoNuT: Combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, New York, NY, USA, 101–114
- [22] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, 311–318
- [23] Maryam Vahdat Pour, Zhuo Li, Lei Ma, and Hadi Hemmati. 2021. A search-based testing framework for deep neural networks of source code embedding. In 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, Piscataway, NJ, USA, 36–46
- [24] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis. ACM, New York, NY, USA, 24–36
- [25] Fazle Rabbi, Zishuo Ding, and Jinqiu Yang. 2025. A Multi-Language Perspective on the Robustness of LLM Code Generation. arXiv:2504.19108 [cs.SE] https://arxiv.org/abs/2504.19108
- [26] Fazle Rabbi, Lin Ling, Song Wang, and Jinqiu Yang. 2026. Social Bias in LLM-Generated Code: Benchmark and Mitigation. arXiv preprint (2026). arXiv:2605.00382 https://arxiv.org/abs/2605.00382
- [27]
- [28] Fazle Rabbi, Soumit Kanti Saha, and Jinqiu Yang. 2026. Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation. arXiv preprint (2026). arXiv:2605.02195 https://arxiv.org/abs/2605.02195
- [29] Md Rafiqul Islam Rabin, Nghi DQ Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, and Mohammad Amin Alipour. 2021. On the generalizability of Neural Program Models with respect to semantic-preserving program transformations. Information and Software Technology 135 (2021), 106552
- [30]
- [31] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a method for automatic evaluation of code synthesis. arXiv:2009.10297 [cs.SE] https://arxiv.org/abs/2009.10297
- [32] Soumit Kanti Saha, Fazle Rabbi, Song Wang, and Jinqiu Yang. 2024. Specification-Driven Code Translation Powered by Large Language Models: How Far Are We? arXiv:2412.04590 [cs.SE] https://arxiv.org/abs/2412.04590
- [33] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology (TOSEM) 28, 4 (2019), 1–29
- [34] Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, et al. 2023. ReCode: Robustness evaluation of code generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Lingu...
- [35]
- [36] Bo Yang and Jinqiu Yang. 2020. Exploring the differences between plausible and correct patches at fine-grained level. In 2020 IEEE 2nd International Workshop on Intelligent Bug Fixing (IBF). IEEE, Piscataway, NJ, USA, 1–8
- [37]
- [38]
- [39] Quanjun Zhang, Tongke Zhang, Juan Zhai, Chunrong Fang, Bowen Yu, Weisong Sun, and Zhenyu Chen. 2023. A critical review of large language model on software engineering: An example from ChatGPT and automated program repair. arXiv:2310.08879 [cs.SE] https://arxiv.org/abs/2310.08879
- [40] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided edit decoder for neural program repair. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, New York, NY, USA, 341–353
discussion (0)