arxiv: 2605.02195 · v3 · submitted 2026-05-04 · 💻 cs.SE

Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

Fazle Rabbi, Soumit Kanti Saha, Jinqiu Yang

Pith reviewed 2026-05-11 01:08 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM, code translation, false failures, evaluation, benchmarks, compilation flags, runtime environments, false negatives

The pith

Many reported failures in LLM code translation result from evaluation environment issues rather than translation errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a substantial portion of failures in automated code translation by large language models come from mistakes in how the results are tested, not from the models producing wrong code. This would matter because it suggests that models are performing better than current numbers indicate, and that evaluation methods need improvement to accurately track advances. The authors reach this conclusion by studying thousands of translations in five languages from three different models and benchmarks, sorting the problems into those caused by the testing setup and those tied to specific models.

Core claim

A significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. This is demonstrated in a large-scale study across five programming languages and three benchmarks covering 6,164 translations from GPT-4o, DeepSeek-Coder, and Magicoder, with common false negatives categorized into pipeline-induced and model-dependent types.
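To make the three named causes concrete, here is a minimal sketch of how the stderr of a failed run might be triaged into them. The pattern strings, the label names, and the `triage_failure` function are illustrative assumptions: the paper does not publish matching rules, and real toolchain messages vary by compiler and version.

```python
import re

# Hypothetical stderr patterns, one per cause named in the core claim.
# These are assumptions for illustration, not the authors' rules.
PIPELINE_PATTERNS = [
    ("missing_library_link", re.compile(r"undefined reference to")),
    ("improper_compilation_flag",
     re.compile(r"-std=|requires compiler and library support")),
    ("unconfigured_runtime",
     re.compile(r"ModuleNotFoundError|ClassNotFoundException|NoClassDefFoundError")),
]


def triage_failure(stderr: str) -> str:
    """Return the first matching cause, or flag the failure for manual review."""
    for label, pattern in PIPELINE_PATTERNS:
        if pattern.search(stderr):
            return label
    # No environment signature found: the failure may be a genuine logic error.
    return "needs_manual_review"
```

A match only suggests that the environment is a candidate cause. Confirming a false failure still requires re-running the translation under a corrected configuration.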

What carries the argument

The identification and categorization of false negatives in LLM code translation evaluations into pipeline-induced failures and model-dependent behaviors.

Load-bearing premise

That translations marked as false failures will prove functionally correct when the evaluation environment is correctly configured.

What would settle it

Re-executing the translations under properly set compilation flags, linked libraries, and runtime configurations and finding that they pass the tests as expected.
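The re-execution check described here could be sketched as follows. Everything named in the block (`EvalConfig`, `relabel`, `stub_runner`, and the three labels) is hypothetical. The runner is passed in as a function so the relabelling logic stays separate from any particular compiler or test harness.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical description of one evaluation environment."""
    compile_flags: Tuple[str, ...] = ()
    link_libs: Tuple[str, ...] = ()
    env: Tuple[Tuple[str, str], ...] = ()  # (name, value) pairs


# A runner builds and tests one translation under a config.
# It returns True only when every test passes.
Runner = Callable[[str, EvalConfig], bool]


def relabel(translation_id: str, run_tests: Runner,
            default_cfg: EvalConfig, fixed_cfg: EvalConfig) -> str:
    """Compare the outcome under the default and the corrected configuration."""
    if run_tests(translation_id, default_cfg):
        return "pass"
    if run_tests(translation_id, fixed_cfg):
        # Failed only because of the environment: a false failure.
        return "false_failure"
    return "still_failing"


def stub_runner(translation_id: str, cfg: EvalConfig) -> bool:
    """Stand-in for a real compile-and-test step, used only for illustration."""
    if translation_id == "needs_libm":
        return "-lm" in cfg.link_libs
    return translation_id == "always_ok"


default_cfg = EvalConfig()
fixed_cfg = EvalConfig(link_libs=("-lm",))
print(relabel("needs_libm", stub_runner, default_cfg, fixed_cfg))
```

A `still_failing` result does not prove a translation error. It only means the chosen corrected configuration did not rescue the run, so those cases would still need inspection.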

Original abstract

Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures that affect any model from model-dependent behaviors that vary across LLMs. Our findings highlight the necessity for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that a chunk of LLM code translation failures come from eval pipeline problems like bad flags and missing libs rather than bad translations, backed by a multi-benchmark study.

The main point is that reported failures in these benchmarks often trace to setup errors in the evaluation pipeline instead of the models producing incorrect logic. They ran this across 6,164 translations from GPT-4o, DeepSeek-Coder, and Magicoder on Avatar, CodeNet, and EvalPlus, covering C, C++, Java, Python, and Go. The split they draw between pipeline-induced failures that hit any model and model-dependent ones gives a usable way to sort the cases.

Referee Report

3 major / 2 minor

Summary. The paper claims that a significant portion of failures in LLM-based code translation are false negatives caused by evaluation-induced errors, such as improper compilation flags, missing library links, and unconfigured runtime environments, rather than incorrect logic in the translations. This is supported by a large-scale empirical analysis of 6,164 translations from three LLMs (GPT-4o, DeepSeek-Coder, Magicoder) across five languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), categorizing failures into pipeline-induced and model-dependent types.

Significance. If the central claim holds, this work would be significant for the field of automated code translation by demonstrating that current evaluation practices may be inflating failure rates due to setup issues. It provides a concrete categorization of failure types and calls for transparent, configuration-aware standards, which could improve the reliability of benchmarks and accelerate genuine progress in LLM code translation capabilities. The scale of the study across multiple models and languages adds to its potential impact.

major comments (3)

[Methodology] Methodology section: The manuscript does not provide sufficient detail on the process used to confirm that reconfigured translations are functionally equivalent to the source code. For instance, it is unclear what specific tests or equivalence checks were performed after adjusting compilation flags and runtime environments to ensure no new errors were introduced, which is load-bearing for validating the false failure identification.
[Results] Results section: The paper distinguishes pipeline-induced failures from model-dependent ones but lacks a quantitative breakdown, such as percentages or counts per category across the 6,164 cases, which is necessary to substantiate the significance of the findings.
[Data collection] Data collection and selection: The criteria for selecting or excluding the 6,164 translations, including how initial failures were identified and any filtering steps applied, are not fully specified, raising questions about potential selection bias in the analysis of false negatives.
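
The per-category breakdown requested in the Results comment could be tabulated along these lines. This is only a sketch: the record fields (`language`, `label`), the label names, and the toy records are hypothetical and are not data from the paper.

```python
from collections import Counter, defaultdict


def breakdown(records, group_key):
    """Return {group: {label: (count, percent_of_group)}} for labelled failures."""
    grouped = defaultdict(Counter)
    for rec in records:
        grouped[rec[group_key]][rec["label"]] += 1

    table = {}
    for group, counts in grouped.items():
        total = sum(counts.values())
        table[group] = {
            label: (n, round(100.0 * n / total, 1))
            for label, n in counts.items()
        }
    return table


# Toy records for illustration only; these are NOT results from the study.
toy_records = [
    {"language": "C", "label": "pipeline_induced"},
    {"language": "C", "label": "pipeline_induced"},
    {"language": "C", "label": "model_dependent"},
    {"language": "C", "label": "genuine"},
    {"language": "Go", "label": "genuine"},
]
toy_table = breakdown(toy_records, "language")
```

The same function groups by benchmark or model if the records carry those fields, which is the kind of cross-tabulation the comment asks for.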

minor comments (2)

[Abstract] The abstract mentions 'a significant number' of false failures but does not provide a specific count or percentage; including this would give readers an immediate sense of scale.
[Terminology] Ensure consistent terminology between 'false failures' and 'false negatives' throughout the manuscript to prevent reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and commit to revisions that will enhance the clarity, rigor, and transparency of our empirical analysis without altering the core findings.

Referee: [Methodology] Methodology section: The manuscript does not provide sufficient detail on the process used to confirm that reconfigured translations are functionally equivalent to the source code. For instance, it is unclear what specific tests or equivalence checks were performed after adjusting compilation flags and runtime environments to ensure no new errors were introduced, which is load-bearing for validating the false failure identification.

Authors: We agree that additional detail is warranted to make the validation process fully reproducible. In the revised manuscript, we will expand the Methodology section with a new subsection titled 'Equivalence Verification Protocol.' This will explicitly describe that, after adjusting compilation flags, library links, and runtime environments, we re-ran the full benchmark-provided test suites (e.g., unit tests from Avatar, CodeNet, and EvalPlus) and verified that output matched the source code's expected behavior. For cases without comprehensive tests, we performed differential testing by executing both source and translated programs on identical inputs and comparing outputs. We also conducted manual review of a random sample of 100 reconfigured cases to confirm no semantic alterations were introduced by the configuration changes. These steps ensure the identified false failures reflect evaluation issues rather than introduced errors.

revision: yes

Referee: [Results] Results section: The paper distinguishes pipeline-induced failures from model-dependent ones but lacks a quantitative breakdown, such as percentages or counts per category across the 6,164 cases, which is necessary to substantiate the significance of the findings.

Authors: We acknowledge that while aggregate claims are made, a granular quantitative breakdown would better substantiate the scale of the issue. In the revision, we will add a new table (Table 3) in the Results section that reports exact counts and percentages for pipeline-induced versus model-dependent failures. This table will break down the 6,164 translations by language, benchmark, and model, showing: (i) total failures, (ii) number reclassified as pipeline-induced after reconfiguration, (iii) remaining model-dependent failures, and (iv) the proportion of false negatives. We will also include per-category statistics (e.g., compilation flag issues vs. library linking) to allow readers to assess the relative impact.

revision: yes

Referee: [Data collection] Data collection and selection: The criteria for selecting or excluding the 6,164 translations, including how initial failures were identified and any filtering steps applied, are not fully specified, raising questions about potential selection bias in the analysis of false negatives.

Authors: We appreciate the opportunity to clarify this process. The 6,164 translations constitute the complete set generated by applying the three LLMs to all relevant source programs in the Avatar, CodeNet, and EvalPlus benchmarks across the five languages, with no post-hoc exclusions of successful or failed cases. Initial failures were defined strictly as translations that either failed to compile or failed to pass the benchmark's provided test cases when evaluated under default (unreconfigured) settings. In the revised manuscript, we will add a new 'Data Collection and Filtering' subsection that details: the exact LLM prompts and decoding parameters used, the failure detection script logic, and confirmation that the only pre-translation filter was removal of source programs that were themselves invalid or non-compilable. This will explicitly address and rule out selection bias.

revision: yes
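
The differential-testing step mentioned in the first response (running source and translated programs on identical inputs and comparing outputs) could look roughly like the sketch below. The function names, the 10-second timeout, and the choice to compare exit code plus exact stdout are assumptions rather than the authors' protocol. The example commands use tiny Python one-liners as stand-ins for compiled programs.

```python
import subprocess
import sys


def run_program(cmd, stdin_text, timeout=10.0):
    """Run one program on one input and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout


def differential_test(source_cmd, translated_cmd, inputs):
    """Return the inputs on which the two programs disagree."""
    mismatches = []
    for stdin_text in inputs:
        expected = run_program(source_cmd, stdin_text)
        actual = run_program(translated_cmd, stdin_text)
        if expected != actual:
            mismatches.append(stdin_text)
    return mismatches


# Stand-in programs: the "source" doubles its input.
source = [sys.executable, "-c", "print(int(input()) * 2)"]
faithful = [sys.executable, "-c", "n = int(input()); print(n + n)"]
unfaithful = [sys.executable, "-c", "print(int(input()) ** 2)"]
test_inputs = ["1\n", "2\n", "3\n"]
```

Note that `unfaithful` agrees with the source on the input `2`, which is why a single matching input is weak evidence of equivalence. Exact stdout comparison is also strict: trailing-whitespace or float-formatting differences between languages would be reported as mismatches unless outputs are normalized first.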

Circularity Check

0 steps flagged

No significant circularity in empirical analysis

The paper is a purely empirical study that analyzes 6,164 existing translations from three LLMs across five languages and three benchmarks. It identifies false failures by inspecting compilation flags, library links, and runtime configurations, then categorizes them as pipeline-induced versus model-dependent. No equations, derivations, fitted parameters, or predictions appear; claims rest on direct observation and manual verification rather than any self-referential construction or load-bearing self-citation. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study and introduces no free parameters, new axioms beyond standard assumptions in software engineering research, or invented entities.

axioms (1)

domain assumption: The chosen benchmarks (Avatar, CodeNet, EvalPlus) are representative of real-world code translation tasks.
Conclusions about false failures rest on these benchmarks being suitable proxies for practical use cases.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
cs.SE 2026-05 unverdicted novelty 7.0

HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
cs.SE 2026-05 accept novelty 7.0

LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.

Social Bias in LLM-Generated Code: Benchmark and Mitigation
cs.SE 2026-05 unverdicted novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

Received 2026-02-15; accepted 2026-03-28