Large Language Models for Multilingual Code Intelligence: A Survey
Pith reviewed 2026-05-08 02:41 UTC · model grok-4.3
The pith
Large language models must overcome bias toward high-resource languages like Python to support reliable code intelligence in polyglot systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current research on large language models for code remains heavily biased toward high-resource languages such as Python, with noticeably weaker performance on languages like Rust and OCaml. Because real-world systems are inherently polyglot, the survey centers on two key tasks: multilingual code generation from shared natural-language requirements and multilingual code translation that preserves semantics across languages. It reviews representative methods, benchmarks, and evaluation metrics while highlighting challenges and opportunities for trustworthy cross-language generalization.
What carries the argument
The two focal tasks: multilingual code generation from shared natural-language requirements, and semantics-preserving multilingual code translation. Together they organize the review of methods, benchmarks, and metrics for cross-language capabilities.
If this is right
- Better multilingual generation would let developers write one natural-language specification and receive correct implementations in several languages.
- Reliable semantic-preserving translation would reduce the cost and risk of porting existing codebases between languages.
- Metrics that directly measure semantic equivalence across languages would give clearer signals for model improvement than current proxies.
- Overcoming generalization gaps would make AI coding assistants practical for the mixed-language projects that dominate industry codebases.
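The semantic-equivalence signal mentioned above can be approximated by differential testing: run the original and the candidate cross-language output on shared inputs and compare results. A minimal sketch, with both implementations as illustrative placeholders (not drawn from the survey) and written in Python for compactness, even though in practice the two sides would be in different languages:

```python
# Hedged sketch: approximating a "semantic-preserving translation" check
# via differential testing. Both implementations below are illustrative
# placeholders standing in for a source function and its model-produced
# translation; neither comes from the surveyed methods.

def source_impl(xs):
    """Reference implementation: sum of squares of a list."""
    return sum(x * x for x in xs)

def translated_impl(xs):
    """Candidate 'translation' whose behavior we want to validate."""
    total = 0
    for x in xs:
        total += x ** 2
    return total

def differential_test(f, g, inputs):
    """Return the inputs on which f and g disagree. An empty result is
    only evidence of equivalence on the tested inputs, not a proof."""
    return [i for i in inputs if f(i) != g(i)]

if __name__ == "__main__":
    cases = [[], [1, 2, 3], [-4, 5], list(range(100))]
    print("counterexamples:", differential_test(source_impl, translated_impl, cases))
```

Execution-based benchmarks in this space generalize the same idea: equivalence is operationalized as agreement on a finite test suite, which is why test coverage limits what such metrics can certify.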
Where Pith is reading between the lines
- Training strategies that treat programming languages more symmetrically with natural language could reduce the current resource imbalance.
- Insights from this code-focused survey may transfer to improving multilingual capabilities in other structured output domains such as formal specifications.
- Deployment testing on actual mixed-language repositories would be needed to confirm whether the reviewed methods scale beyond isolated benchmarks.
Load-bearing premise
The representative methods, benchmarks, and metrics chosen for the survey adequately capture the current state of the field and the core difficulties of maintaining semantics across programming languages.
What would settle it
An evaluation on a benchmark covering low-resource languages such as OCaml or Rust, in which an off-the-shelf model matched its Python performance without any cross-language training or adaptation, would undermine the survey's claim that focused multilingual research is required.
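The test described above reduces to comparing per-language pass rates for one model on a shared problem set. A minimal sketch, where the result data is made up for illustration and not taken from any benchmark in the survey:

```python
# Hedged sketch: per-language pass rates for one model on shared problems.
# The sample outcomes below are invented illustrative data, not results
# from any benchmark reviewed in the survey.

from collections import defaultdict

def pass_rate_by_language(results):
    """results: iterable of (language, passed) pairs -> {language: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for lang, passed in results:
        totals[lang] += 1
        passes[lang] += int(passed)
    return {lang: passes[lang] / totals[lang] for lang in totals}

# A gap like the one below (hypothetical numbers) is the kind of signal
# that would, or would not, undermine the survey's central claim.
fake_results = [
    ("python", True), ("python", True), ("python", False),
    ("ocaml", True), ("ocaml", False), ("ocaml", False),
]
print(pass_rate_by_language(fake_results))
```

In practice the pairs would come from executing generated solutions against hidden test suites, as in pass@k-style evaluation; the comparison itself is just this aggregation.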
Original abstract
Large language models have transformed AI-assisted software engineering, but current research remains biased toward high-resource languages such as Python, with weaker performance in languages like Rust and OCaml. Since real-world systems are inherently polyglot, robust multilingual code intelligence is crucial. This survey focuses on two key tasks: multilingual code generation from shared natural-language requirements, and multilingual code translation that preserves semantics across languages. It reviews representative methods, benchmarks, and evaluation metrics, and highlights challenges and opportunities for trustworthy cross-language generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey on large language models for multilingual code intelligence. It notes the bias of current research toward high-resource languages such as Python and weaker performance on languages like Rust and OCaml. The survey centers on two tasks—multilingual code generation from shared natural-language requirements and semantics-preserving code translation across languages—while reviewing representative methods, benchmarks, and metrics and discussing challenges and opportunities for trustworthy cross-language generalization.
Significance. A well-executed survey in this area would be useful for directing research on polyglot code intelligence, given that real-world software systems are inherently multilingual. By synthesizing methods, benchmarks, and metrics for the two focal tasks, the paper could help identify gaps in cross-language semantic preservation and generalization.
Minor comments (2)
- [Abstract] The claim that the survey reviews "representative methods, benchmarks, and evaluation metrics" would be strengthened by an explicit statement of selection criteria, search strategy, or inclusion thresholds (e.g., publication venues, time window, or minimum citation count).
- [Abstract] The abstract frames the two tasks clearly but does not indicate the approximate number of papers or systems covered; adding this would help readers gauge the survey's breadth without consulting the full reference list.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our survey and for recommending minor revision. We are encouraged that the work is viewed as a useful synthesis for directing research on polyglot code intelligence, and we will incorporate minor revisions in the updated manuscript.
Circularity Check
No significant circularity: the survey reviews external literature without internal derivations.
full rationale
This is a literature survey paper whose scope is to review representative methods, benchmarks, and metrics for multilingual code generation and translation tasks from existing work. No equations, fitted parameters, predictions, or first-principles derivations are present in the abstract or stated claims. The central content consists of citations to external papers, with no self-citation chains used to justify uniqueness theorems or ansatzes that reduce to the survey's own inputs. The selection of reviewed items is presented as representative rather than derived from any internal model, satisfying the condition for a self-contained survey with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: real-world systems are inherently polyglot.
Reference graph
Works this paper leans on
- [1] Junjie Hu, Cheng Wen, Jialun Cao, Yikun Hu, Dugang Liu, Zhi Ma, Zhiwu Xu, and Shengchao Qin. When large language models meet formal theorem proving: A survey. In International Conference on Knowledge Science, Engineering and Management. Springer, 2026.
- [2] Cheng Wen, Yuandao Cai, Hua Zheng, Bin Yu, Dugang Liu, Zhiwu Xu, Kuanishbay Sadatdiynov, and Shengchao Qin. A survey on static code analysis with large language models. In International Conference on Knowledge Science, Engineering and Management. Springer, 2026.
- [3] BigCode Project. Big code models leaderboard, 2026. https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
- [4] Qiwei Peng, Yekun Chai, and Xuhong Li. HumanEval-XL: A multilingual code generation benchmark for cross-lingual natural language generalization. arXiv preprint arXiv:2402.16694, 2024.
- [5] Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, et al. McEval: Massively multilingual code evaluation. arXiv preprint arXiv:2406.07436, 2024.
- [6] Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K Reddy. XLCoST: A benchmark dataset for cross-lingual code intelligence. arXiv preprint arXiv:2206.08474, 2022.
- [7] Yanli Wang, Yanlin Wang, et al. RepoTransBench: A real-world benchmark for repository-level code translation. arXiv preprint arXiv:2412.17744, 2024.
- [8] Momoko Shiraishi, Yinzhi Cao, and Takahiro Shinagawa. SmartC2Rust: Iterative, feedback-driven C-to-Rust translation via large language models for safety and equivalence. 2026.
- [9] Aamer Aljagthami, Mohammed Banabila, Musab Alshehri, Mohammed Kabini, and Mohammad D Alahmadi. Evaluating large language models for code translation: Effects of prompt language and prompt design. arXiv preprint arXiv:2509.12973, 2025.
- [10] Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, and Shing-Chi Cheung. Isolating language-coding from problem-solving: Benchmarking LLMs with PseudoEval. arXiv preprint arXiv:2502.19149, 2025.
- [11] Md Nishat Raihan, Antonios Anastasopoulos, and Marcos Zampieri. mHumanEval: A multilingual benchmark to evaluate large language models for code generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11432–11461, 2025.
- [12] Chi-en Amy Tai, Pengyu Nie, Lukasz Golab, and Alexander Wong. NL in the middle: Code translation with LLMs and intermediate representations. arXiv preprint arXiv:2507.08627, 2025.
- [13] Nishath Rajiv Ranasinghe, Shawn M Jones, Michal Kucer, Ayan Biswas, Daniel O'Malley, Alexander Most, Selma Liliane Wanna, and Ajay Sreekumar. LLM-assisted translation of legacy Fortran codes to C++: A cross-platform study. In Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities, pages 58–69, 2025.
- [14] Marcos Macedo, Yuan Tian, Pengyu Nie, Filipe R Cogo, and Bram Adams. InterTrans: Leveraging transitive intermediate translations to enhance LLM-based code translation. arXiv preprint arXiv:2411.01063, 2024.
- [15] Alessio Buscemi. A comparative study of code generation using ChatGPT 3.5 across 10 programming languages. arXiv preprint arXiv:2308.04477, 2023.
- [16] Alessandro Giagnorio, Alberto Martin-Lopez, and Gabriele Bavota. Enhancing code generation for low-resource languages: No silver bullet. arXiv preprint arXiv:2501.19085, 2025.
- [17] Indraneil Paul, Goran Glavaš, and Iryna Gurevych. IRCoder: Intermediate representations make language models robust multilingual code generators. arXiv preprint arXiv:2403.03894, 2024.
- [18] Yuxiang Wei, Zhe Wang, Jiawei Liu, et al. Magicoder: Empowering code generation with OSS-Instruct. arXiv preprint arXiv:2312.02120, 2023.
- [19] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. OctoPack: Instruction tuning code large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
- [20] Micheline Bénédicte Moumoula, Serge Lionel Nikiema, Albérick Euraste Djire, Abdoul Kader Kabore, Jacques Klein, and Tegawendé F Bissyande. Beyond language barriers: Multi-agent coordination for multi-language code generation. arXiv preprint arXiv:2509.19918, 2025.
- [21] Tomer Bitan, Tal Kadosh, Erel Kaplan, Shira Meiri, Le Chen, Peter Morales, Niranjan Hasabnis, and Gal Oren. UniPar: A unified LLM-based framework for parallel and accelerated code translation in HPC. In 2025 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9. IEEE, 2025.
- [22] Ali Reza Ibrahimzada, Brandon Paulsen, Reyhaneh Jabbarvand, Joey Dodds, and Daniel Kroening. MatchFixAgent: Language-agnostic autonomous repository-level code translation validation and repair. arXiv preprint arXiv:2509.16187, 2025.
- [23] Chaofan Wang, Tingrui Yu, Chen Xie, Jie Wang, Dong Chen, Wenrui Zhang, Yuling Shi, Xiaodong Gu, and Beijun Shen. EvoC2Rust: A skeleton-guided framework for project-level C-to-Rust translation. arXiv preprint arXiv:2508.04295, 2025.
- [24] Yicheng Tao, Yao Qin, and Yepang Liu. Retrieval-augmented code generation: A survey with focus on repository-level approaches. arXiv preprint arXiv:2510.04905, 2025.
- [25] Manish Bhattarai, Miguel Cordova, Minh Vu, Javier Santos, Ismael Boureima, and Dan O'Malley. ARCS: Agentic retrieval-augmented code synthesis with iterative refinement. arXiv preprint arXiv:2504.20434, 2025.
- [26] Manish Bhattarai, Javier E Santos, Shawn Jones, Ayan Biswas, Boian Alexandrov, and Daniel O'Malley. Enhancing code translation in language models with few-shot learning via retrieval-augmented generation. In 2024 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8. IEEE, 2024.
- [27] Zhi Ma, Cheng Wen, Bin Yu, and Jie Su. Integrating ensemble learning and large language models for efficient formal verification of IP-based aerospace systems. Information Fusion, 125:103466, 2026.
- [28] Zhi Ma, Xiao Liang, Cheng Wen, Rui Chen, Bin Gu, Shengchao Qin, Cong Tian, and Mengfei Yang. Automated LTL specification generation from industrial aerospace requirements. In Proceedings of the 27th International Symposium on Formal Methods (FM), 2026.
- [29] Zhi Ma, Cheng Wen, Zhexin Su, Xiao Liang, Cong Tian, Shengchao Qin, and Mengfei Yang. Bridging natural language and formal specification: Automated translation of software requirements to LTL via hierarchical semantics decomposition using LLMs. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering, 2025.
- [30] Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. xCodeEval: An execution-based large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.
- [31] Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. CodeTransOcean: A comprehensive multilingual benchmark for code translation. arXiv preprint arXiv:2310.04951, 2023.
- [32] Mingsheng Jiao, Tingrui Yu, Xuan Li, Guanjie Qiu, Xiaodong Gu, and Beijun Shen. On the evaluation of neural code translation: Taxonomy and benchmark. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1529–1541. IEEE, 2023.
- [33] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
- [34] Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655, 2021.
- [35] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Discussion (0)