Large Language Models for Multilingual Code Intelligence: A Survey
Pith reviewed 2026-05-08 02:41 UTC · model grok-4.3
The pith
Large language models must overcome bias toward high-resource languages like Python to support reliable code intelligence in polyglot systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current research on large language models for code remains heavily biased toward high-resource languages such as Python, with noticeably weaker performance on languages like Rust and OCaml. Because real-world systems are inherently polyglot, the survey centers on two key tasks: multilingual code generation from shared natural-language requirements and multilingual code translation that preserves semantics across languages. It reviews representative methods, benchmarks, and evaluation metrics while highlighting challenges and opportunities for trustworthy cross-language generalization.
What carries the argument
The two focal tasks: multilingual code generation from shared natural-language requirements, and semantics-preserving multilingual code translation. Together they organize the review of methods, benchmarks, and metrics for cross-language capabilities.
If this is right
- Better multilingual generation would let developers write one natural-language specification and receive correct implementations in several languages.
- Reliable semantic-preserving translation would reduce the cost and risk of porting existing codebases between languages.
- Metrics that directly measure semantic equivalence across languages would give clearer signals for model improvement than current proxies.
- Overcoming generalization gaps would make AI coding assistants practical for the mixed-language projects that dominate industry codebases.
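The semantic-equivalence signal mentioned above can be approximated by differential testing: run the original and the candidate cross-language output on shared inputs and compare results. A minimal sketch, with both implementations as illustrative placeholders (not drawn from the survey) and written in Python for compactness, even though in practice the two sides would be in different languages:

```python
# Hedged sketch: approximating a "semantic-preserving translation" check
# via differential testing. Both implementations below are illustrative
# placeholders standing in for a source function and its model-produced
# translation; neither comes from the surveyed methods.

def source_impl(xs):
    """Reference implementation: sum of squares of a list."""
    return sum(x * x for x in xs)

def translated_impl(xs):
    """Candidate 'translation' whose behavior we want to validate."""
    total = 0
    for x in xs:
        total += x ** 2
    return total

def differential_test(f, g, inputs):
    """Return the inputs on which f and g disagree. An empty result is
    only evidence of equivalence on the tested inputs, not a proof."""
    return [i for i in inputs if f(i) != g(i)]

if __name__ == "__main__":
    cases = [[], [1, 2, 3], [-4, 5], list(range(100))]
    print("counterexamples:", differential_test(source_impl, translated_impl, cases))
```

Execution-based benchmarks in this space generalize the same idea: equivalence is operationalized as agreement on a finite test suite, which is why test coverage limits what such metrics can certify.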
Where Pith is reading between the lines
- Training strategies that treat programming languages more symmetrically with natural language could reduce the current resource imbalance.
- Insights from this code-focused survey may transfer to improving multilingual capabilities in other structured output domains such as formal specifications.
- Deployment testing on actual mixed-language repositories would be needed to confirm whether the reviewed methods scale beyond isolated benchmarks.
Load-bearing premise
The representative methods, benchmarks, and metrics chosen for the survey adequately capture the current state of the field and the core difficulties of maintaining semantics across programming languages.
What would settle it
An evaluation on a benchmark covering low-resource languages such as OCaml or Rust, in which an off-the-shelf model matched its Python performance without any cross-language training or adaptation, would undermine the survey's claim that focused multilingual research is required.
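The test described above reduces to comparing per-language pass rates for one model on a shared problem set. A minimal sketch, where the result data is made up for illustration and not taken from any benchmark in the survey:

```python
# Hedged sketch: per-language pass rates for one model on shared problems.
# The sample outcomes below are invented illustrative data, not results
# from any benchmark reviewed in the survey.

from collections import defaultdict

def pass_rate_by_language(results):
    """results: iterable of (language, passed) pairs -> {language: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for lang, passed in results:
        totals[lang] += 1
        passes[lang] += int(passed)
    return {lang: passes[lang] / totals[lang] for lang in totals}

# A gap like the one below (hypothetical numbers) is the kind of signal
# that would, or would not, undermine the survey's central claim.
fake_results = [
    ("python", True), ("python", True), ("python", False),
    ("ocaml", True), ("ocaml", False), ("ocaml", False),
]
print(pass_rate_by_language(fake_results))
```

In practice the pairs would come from executing generated solutions against hidden test suites, as in pass@k-style evaluation; the comparison itself is just this aggregation.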
Original abstract
Large language models have transformed AI-assisted software engineering, but current research remains biased toward high-resource languages such as Python, with weaker performance in languages like Rust and OCaml. Since real-world systems are inherently polyglot, robust multilingual code intelligence is crucial. This survey focuses on two key tasks: multilingual code generation from shared natural-language requirements, and multilingual code translation that preserves semantics across languages. It reviews representative methods, benchmarks, and evaluation metrics, and highlights challenges and opportunities for trustworthy cross-language generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey on large language models for multilingual code intelligence. It notes the bias of current research toward high-resource languages such as Python and weaker performance on languages like Rust and OCaml. The survey centers on two tasks—multilingual code generation from shared natural-language requirements and semantics-preserving code translation across languages—while reviewing representative methods, benchmarks, and metrics and discussing challenges and opportunities for trustworthy cross-language generalization.
Significance. A well-executed survey in this area would be useful for directing research on polyglot code intelligence, given that real-world software systems are inherently multilingual. By synthesizing methods, benchmarks, and metrics for the two focal tasks, the paper could help identify gaps in cross-language semantic preservation and generalization.
Minor comments (2)
- [Abstract] The claim that the survey reviews "representative methods, benchmarks, and evaluation metrics" would be strengthened by an explicit statement of selection criteria, search strategy, or inclusion thresholds (e.g., publication venues, time window, or minimum citation count).
- [Abstract] The abstract frames the two tasks clearly but does not indicate the approximate number of papers or systems covered; adding this would help readers gauge the survey's breadth without consulting the full reference list.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our survey and for recommending minor revision. We are encouraged that the work is viewed as a useful synthesis for directing research on polyglot code intelligence, and we will incorporate minor revisions in the updated manuscript.
Circularity Check
No significant circularity: the survey reviews external literature without internal derivations.
full rationale
This is a literature survey paper whose scope is to review representative methods, benchmarks, and metrics for multilingual code generation and translation tasks from existing work. No equations, fitted parameters, predictions, or first-principles derivations are present in the abstract or stated claims. The central content consists of citations to external papers, with no self-citation chains used to justify uniqueness theorems or ansatzes that reduce to the survey's own inputs. The selection of reviewed items is presented as representative rather than derived from any internal model, satisfying the condition for a self-contained survey with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: real-world systems are inherently polyglot.
Reference graph
Works this paper leans on
- [1] Junjie Hu, Cheng Wen, Jialun Cao, Yikun Hu, Dugang Liu, Zhi Ma, Zhiwu Xu, and Shengchao Qin. When large language models meet formal theorem proving: A survey. In International Conference on Knowledge Science, Engineering and Management. Springer, 2026.
- [2] Cheng Wen, Yuandao Cai, Hua Zheng, Bin Yu, Dugang Liu, Zhiwu Xu, Kuanishbay Sadatdiynov, and Shengchao Qin. A survey on static code analysis with large language models. In International Conference on Knowledge Science, Engineering and Management. Springer, 2026.
- [3] BigCode Project. Big code models leaderboard, 2026. https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
- [4] Qiwei Peng, Yekun Chai, and Xuhong Li. HumanEval-XL: A multilingual code generation benchmark for cross-lingual natural language generalization. arXiv preprint arXiv:2402.16694, 2024.
- [5] Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, et al. McEval: Massively multilingual code evaluation. arXiv preprint arXiv:2406.07436, 2024.
- [6] Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K Reddy. XLCoST: A benchmark dataset for cross-lingual code intelligence. arXiv preprint arXiv:2206.08474, 2022.
- [7] Yanli Wang, Yanlin Wang, et al. RepoTransBench: A real-world benchmark for repository-level code translation. arXiv preprint arXiv:2412.17744, 2024.
- [8] Momoko Shiraishi, Yinzhi Cao, and Takahiro Shinagawa. SmartC2Rust: Iterative, feedback-driven C-to-Rust translation via large language models for safety and equivalence. 2026.
- [9] Aamer Aljagthami, Mohammed Banabila, Musab Alshehri, Mohammed Kabini, and Mohammad D Alahmadi. Evaluating large language models for code translation: Effects of prompt language and prompt design. arXiv preprint arXiv:2509.12973, 2025.
- [10] Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, and Shing-Chi Cheung. Isolating language-coding from problem-solving: Benchmarking LLMs with PseudoEval. arXiv preprint arXiv:2502.19149, 2025.
- [11] Md Nishat Raihan, Antonios Anastasopoulos, and Marcos Zampieri. mHumanEval: A multilingual benchmark to evaluate large language models for code generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11432–11461, 2025.
- [12] Chi-en Amy Tai, Pengyu Nie, Lukasz Golab, and Alexander Wong. NL in the middle: Code translation with LLMs and intermediate representations. arXiv preprint arXiv:2507.08627, 2025.
- [13] Nishath Rajiv Ranasinghe, Shawn M Jones, Michal Kucer, Ayan Biswas, Daniel O'Malley, Alexander Most, Selma Liliane Wanna, and Ajay Sreekumar. LLM-assisted translation of legacy Fortran codes to C++: A cross-platform study. In Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities, pages 58–69, 2025.
- [14] Marcos Macedo, Yuan Tian, Pengyu Nie, Filipe R Cogo, and Bram Adams. InterTrans: Leveraging transitive intermediate translations to enhance LLM-based code translation. arXiv preprint arXiv:2411.01063, 2024.
- [15] Alessio Buscemi. A comparative study of code generation using ChatGPT 3.5 across 10 programming languages. arXiv preprint arXiv:2308.04477, 2023.
- [16] Alessandro Giagnorio, Alberto Martin-Lopez, and Gabriele Bavota. Enhancing code generation for low-resource languages: No silver bullet. arXiv preprint arXiv:2501.19085, 2025.
- [17] Indraneil Paul, Goran Glavaš, and Iryna Gurevych. IRCoder: Intermediate representations make language models robust multilingual code generators. arXiv preprint arXiv:2403.03894, 2024.
- [18] Yuxiang Wei, Zhe Wang, Jiawei Liu, et al. Magicoder: Empowering code generation with OSS-Instruct. arXiv preprint arXiv:2312.02120, 2023.
- [19] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. OctoPack: Instruction tuning code large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
- [20] Micheline Bénédicte Moumoula, Serge Lionel Nikiema, Albérick Euraste Djire, Abdoul Kader Kabore, Jacques Klein, and Tegawendé F Bissyande. Beyond language barriers: Multi-agent coordination for multi-language code generation. arXiv preprint arXiv:2509.19918, 2025.
- [21] Tomer Bitan, Tal Kadosh, Erel Kaplan, Shira Meiri, Le Chen, Peter Morales, Niranjan Hasabnis, and Gal Oren. UniPar: A unified LLM-based framework for parallel and accelerated code translation in HPC. In 2025 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9. IEEE, 2025.
- [22] Ali Reza Ibrahimzada, Brandon Paulsen, Reyhaneh Jabbarvand, Joey Dodds, and Daniel Kroening. MatchFixAgent: Language-agnostic autonomous repository-level code translation validation and repair. arXiv preprint arXiv:2509.16187, 2025.
- [23] Chaofan Wang, Tingrui Yu, Chen Xie, Jie Wang, Dong Chen, Wenrui Zhang, Yuling Shi, Xiaodong Gu, and Beijun Shen. EvoC2Rust: A skeleton-guided framework for project-level C-to-Rust translation. arXiv preprint arXiv:2508.04295, 2025.
- [24] Yicheng Tao, Yao Qin, and Yepang Liu. Retrieval-augmented code generation: A survey with focus on repository-level approaches. arXiv preprint arXiv:2510.04905, 2025.
- [25] Manish Bhattarai, Miguel Cordova, Minh Vu, Javier Santos, Ismael Boureima, and Dan O'Malley. ARCS: Agentic retrieval-augmented code synthesis with iterative refinement. arXiv preprint arXiv:2504.20434, 2025.
- [26] Manish Bhattarai, Javier E Santos, Shawn Jones, Ayan Biswas, Boian Alexandrov, and Daniel O'Malley. Enhancing code translation in language models with few-shot learning via retrieval-augmented generation. In 2024 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8. IEEE, 2024.
- [27] Zhi Ma, Cheng Wen, Bin Yu, and Jie Su. Integrating ensemble learning and large language models for efficient formal verification of IP-based aerospace systems. Information Fusion, 125:103466, 2026.
- [28] Zhi Ma, Xiao Liang, Cheng Wen, Rui Chen, Bin Gu, Shengchao Qin, Cong Tian, and Mengfei Yang. Automated LTL specification generation from industrial aerospace requirements. In Proceedings of the 27th International Symposium on Formal Methods (FM), 2026.
- [29] Zhi Ma, Cheng Wen, Zhexin Su, Xiao Liang, Cong Tian, Shengchao Qin, and Mengfei Yang. Bridging natural language and formal specification: Automated translation of software requirements to LTL via hierarchical semantics decomposition using LLMs. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering, 2025.
- [30] Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. xCodeEval: An execution-based large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.
- [31] Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. CodeTransOcean: A comprehensive multilingual benchmark for code translation. arXiv preprint arXiv:2310.04951, 2023.
- [32] Mingsheng Jiao, Tingrui Yu, Xuan Li, Guanjie Qiu, Xiaodong Gu, and Beijun Shen. On the evaluation of neural code translation: Taxonomy and benchmark. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1529–1541. IEEE, 2023.
- [33] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
- [34] Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655, 2021.
- [35] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Discussion (0)