Recognition: no theorem link
Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair
Pith reviewed 2026-05-11 01:56 UTC · model grok-4.3
The pith
A multi-stage training framework lets LLMs translate Java to Cangjie more effectively with scarce data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A multi-stage LLM training framework, built on syntactic knowledge datasets and monolingual instruction data and followed by error repair that combines a dedicated Cangjie repair repository with compiler feedback, achieves semantic alignment and structure awareness in Java-to-Cangjie translation, yielding a 6.06% improvement in functional equivalence over state-of-the-art methods.
What carries the argument
The multi-stage training framework with iterative error repair using compiler feedback and case retrieval.
If this is right
- Each of the training stages contributes positively to the final translation performance.
- The combination of compiler feedback and error repair case retrieval effectively fixes incorrect Cangjie code.
- The method works with limited parallel data where standard fine-tuning approaches struggle.
- Functional equivalence and compilability improve compared to existing translation techniques.
Where Pith is reading between the lines
- This staged knowledge injection approach could apply to code translation involving other emerging programming languages.
- Building larger error repair repositories from real usage data might yield further gains in repair success.
- The reliance on monolingual data suggests a path for improving LLM performance on tasks with asymmetric data availability.
Load-bearing premise
The specially built syntactic knowledge datasets, monolingual instruction data, and error repair repository contain enough relevant information to teach the LLM reliable Cangjie syntax and semantics despite the absence of large parallel corpora.
What would settle it
Applying the full approach and a single-stage baseline to a fresh collection of Java programs and finding that the functional equivalence improvement falls below 3% or disappears entirely.
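That test can be read concretely, assuming the 6.06% figure means absolute percentage points of functional equivalence (the abstract does not say which); the outcome counts below are invented for illustration.

```python
def functional_equivalence(outcomes: list[bool]) -> float:
    """Fraction of translated programs that compile and pass their tests."""
    return sum(outcomes) / len(outcomes)

# invented outcomes over a fresh set of 100 Java programs
full_pipeline = [True] * 70 + [False] * 30
single_stage  = [True] * 64 + [False] * 36

gain_points = 100 * (functional_equivalence(full_pipeline)
                     - functional_equivalence(single_stage))
refuted = gain_points < 3.0  # the threshold proposed above
```

Under these invented counts the gain is 6 points and the claim survives; a fresh corpus where `gain_points` drops below 3 would count against it.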

Original abstract
With the rapid evolution of emerging programming language ecosystems, the demand for code translation to low-resource languages continues to grow. As Cangjie emerges as a new programming language, its ecosystem and development toolchains are rapidly expanding. Automated translation from popular programming languages to Cangjie is therefore valuable for practical development. However, constrained by both insufficient Cangjie knowledge and scarce parallel code corpora, general Large Language Models (LLMs) are prone to syntactic errors and semantic as well as structural misalignment in code translation. Existing approaches typically rely on fine-tuning with large-scale parallel data, but they cannot reliably improve compilability or semantic consistency for low-resource Cangjie languages. To tackle these challenges, we propose a multi-stage training framework of LLMs that employs the iterative error repair technique to translate Java code into Cangjie code. This training framework performs training on LLMs, gradually integrating knowledge and achieving semantic alignment as well as structure awareness. During the code translation, we also combine the compiler feedback and error repair case retrieval to repair the incorrect Cangjie code. We construct syntactic knowledge and monolingual instruction datasets to train the LLM. In addition, we also build a Cangjie error repair repository to support error repair in our approach. Experimental results show that, with limited parallel data, our approach improves functional equivalence by 6.06% compared to the state-of-the-art approaches. Meanwhile, ablation studies confirm that each training stage positively contributes to the final performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-stage LLM training framework for Java-to-Cangjie code translation that integrates syntactic knowledge and monolingual instruction datasets during training, followed by iterative error repair that combines compiler feedback with retrieval from a constructed Cangjie error repair repository. The central empirical claim is that this approach achieves a 6.06% improvement in functional equivalence over state-of-the-art methods when only limited parallel data is available, with ablation studies indicating positive contributions from each training stage.
Significance. If the reported gains can be isolated from differences in training data and evaluation conditions, the work would provide a practical template for bootstrapping code translation support for emerging low-resource languages. The emphasis on staged knowledge injection plus retrieval-augmented repair is a reasonable response to the scarcity of parallel corpora and language-specific knowledge.
major comments (2)
- [Abstract / Experimental results] Abstract and experimental results section: the 6.06% functional-equivalence gain is presented as evidence that the multi-stage framework outperforms SOTA under limited parallel data, yet the manuscript does not explicitly state that the cited baselines were retrained or re-evaluated on the identical limited corpus together with the syntactic-knowledge, monolingual-instruction, and error-repair datasets constructed for this work. Without that confirmation, the improvement cannot be attributed to the proposed method rather than to an uneven data regime.
- [Experimental results] Experimental setup (implied by the abstract's quantitative claim): no information is supplied on baseline implementations, data splits, number of runs, statistical significance tests, or the precise definition and measurement protocol for 'functional equivalence.' These omissions make it impossible to assess whether the reported margin is robust or reproducible.
minor comments (1)
- [Abstract] The abstract refers to 'each training stage' contributing positively but provides no quantitative deltas or intermediate metrics; a table or figure summarizing stage-wise performance would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below with clarifications and commit to revisions that strengthen the manuscript's transparency and reproducibility.
Point-by-point responses
- Referee: [Abstract / Experimental results] Abstract and experimental results section: the 6.06% functional-equivalence gain is presented as evidence that the multi-stage framework outperforms SOTA under limited parallel data, yet the manuscript does not explicitly state that the cited baselines were retrained or re-evaluated on the identical limited corpus together with the syntactic-knowledge, monolingual-instruction, and error-repair datasets constructed for this work. Without that confirmation, the improvement cannot be attributed to the proposed method rather than to an uneven data regime.
Authors: We agree that explicit confirmation is required to isolate the contribution of our method. In the experiments, the SOTA baselines were re-evaluated on the identical limited parallel corpus; the syntactic-knowledge and monolingual-instruction datasets were incorporated into baseline fine-tuning where they could be applied without altering the original methods, while the Cangjie-specific error-repair repository was used only by our approach. We will revise the abstract and experimental-results section to state this re-evaluation protocol explicitly, including how each baseline was adapted to the low-resource setting. revision: yes
- Referee: [Experimental results] Experimental setup (implied by the abstract's quantitative claim): no information is supplied on baseline implementations, data splits, number of runs, statistical significance tests, or the precise definition and measurement protocol for 'functional equivalence.' These omissions make it impossible to assess whether the reported margin is robust or reproducible.
Authors: We acknowledge the need for these details. In the revised manuscript we will add: (1) baseline implementation descriptions (models, fine-tuning hyperparameters, and any adaptations), (2) data-split ratios for the limited parallel corpus, (3) number of runs (five independent runs with reported means and standard deviations), (4) statistical significance testing (paired t-tests on functional-equivalence scores), and (5) the functional-equivalence protocol (compilation success plus passage of manually verified unit tests that check semantic equivalence). These additions will appear in a new or expanded experimental-setup subsection. revision: yes
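The commitments in points (3) and (4) — five independent runs with paired t-tests on functional-equivalence scores — can be made concrete with a stdlib-only sketch. The per-run scores below are invented, not taken from the paper.

```python
import math
from statistics import mean, stdev

def paired_t(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Paired t statistic on per-run score differences (df = n - 1), plus the
    mean gain; compare the statistic against a t-table for significance."""
    diffs = [x - y for x, y in zip(xs, ys)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / se, mean(diffs)

# invented functional-equivalence scores from five independent runs
ours     = [0.712, 0.705, 0.698, 0.721, 0.709]
baseline = [0.651, 0.648, 0.640, 0.655, 0.646]

t_stat, gain = paired_t(ours, baseline)
# a large t with df = 4 would indicate the margin is unlikely to be noise
```

Reporting the mean gain alongside the t statistic (rather than a single-run delta) is what makes the 6.06% margin auditable.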
Circularity Check
No circularity: empirical results from constructed datasets and external compiler feedback
Full rationale
The paper presents an empirical multi-stage LLM training framework for Java-to-Cangjie translation. It constructs syntactic knowledge datasets, monolingual instruction data, and a Cangjie error repair repository, then trains LLMs iteratively while using compiler feedback for error repair during inference. The central claim of a 6.06% functional equivalence improvement is an experimental outcome from ablation studies and comparisons to SOTA baselines, not a mathematical derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text. The approach depends on external elements (the compiler and the constructed resources) rather than reducing any result to its own inputs by construction, so the central claim is not circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLMs can progressively integrate syntactic, semantic, and structural knowledge through multi-stage training on constructed datasets
- domain assumption Compiler error messages combined with retrieval of prior repair cases can reliably correct syntactic and semantic errors in generated Cangjie code