pith. machine review for the scientific record.

arxiv: 2604.20835 · v2 · submitted 2026-04-22 · 💻 cs.CL

Recognition: unknown

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords zero-shot cross-PL transfer · code RL · Parallel-SFT · parallel programs · functionality-centric latent space · programming language generalization · supervised fine-tuning

The pith

Incorporating parallel programs into supervised fine-tuning creates a more transferable initialization for reinforcement learning across programming languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern language models perform well on common programming languages but struggle with less common ones due to limited training data. The paper argues that most programming skills are universal across languages, yet reinforcement learning applied in one language does not automatically improve others. To fix this, the authors introduce Parallel-SFT, which includes functionally equivalent code snippets from multiple languages in the supervised fine-tuning phase. This step produces a model with representations focused on code functionality instead of specific syntax. Subsequent reinforcement learning on this model then transfers more effectively to programming languages never seen during training.

Core claim

The central discovery is that Parallel-SFT, by incorporating parallel programs into the supervised fine-tuning data mixture, leads to a functionality-centric latent space where equivalent programs in different languages cluster more closely. When reinforcement learning is then applied to this initialized model, it achieves better zero-shot generalization to unseen programming languages compared to models initialized with standard supervised fine-tuning.
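
One way to quantify "cluster more closely" is sketched below, under the assumption that each program is reduced to a fixed-size vector (e.g., a mean-pooled hidden state from the model under analysis); the paper's actual representation analysis may differ.

```python
import itertools
import numpy as np

def clustering_gap(embeddings: dict) -> float:
    """Mean cosine similarity of same-problem, cross-language pairs minus
    different-problem pairs.

    `embeddings` maps (problem_id, language) to a fixed-size vector, e.g. a
    mean-pooled hidden state; how the paper extracts program representations
    is an assumption here.
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    same, different = [], []
    for (p1, l1), (p2, l2) in itertools.combinations(embeddings, 2):
        if l1 == l2:
            continue  # only compare representations across programming languages
        sim = cosine(embeddings[(p1, l1)], embeddings[(p2, l2)])
        (same if p1 == p2 else different).append(sim)
    return float(np.mean(same)) - float(np.mean(different))
```

A larger gap for the Parallel-SFT checkpoint than for the standard-SFT checkpoint would be consistent with the claimed functionality-centric latent space.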

What carries the argument

Parallel-SFT, a supervised fine-tuning approach that mixes parallel programs—functionally equivalent implementations across programming languages—into the training data to promote transferable representations.
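
The abstract leaves open how parallel programs are packaged into training examples. The sketch below assumes one plausible packaging, one prompt/completion pair per (problem, language), keeping only problems solved in at least two languages; the field names and prompt template are hypothetical, not the paper's schema.

```python
from collections import defaultdict

def build_parallel_sft_mixture(solutions: list[dict], min_languages: int = 2) -> list[dict]:
    """Group functionally equivalent solutions by problem and emit SFT examples.

    `solutions` entries look like {"problem_id", "language", "prompt", "code"};
    these field names are illustrative, not the paper's schema.
    """
    by_problem = defaultdict(list)
    for s in solutions:
        by_problem[s["problem_id"]].append(s)

    mixture = []
    for problem_id, group in by_problem.items():
        if len({s["language"] for s in group}) < min_languages:
            continue  # not a parallel group; could instead go to a monolingual pool
        for s in group:
            mixture.append({
                "problem_id": problem_id,
                "language": s["language"],
                "prompt": f"Solve the following problem in {s['language']}.\n\n{s['prompt']}",
                "completion": s["code"],
            })
    return mixture
```

Standard SFT (next-token loss on the completions) would then run on this mixture, with RL for code generation in a single source PL applied to the resulting checkpoint.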

If this is right

  • RL after Parallel-SFT improves performance on target programming languages where standard RL does not.
  • The internal representations become more organized around program functionality rather than language syntax.
  • This initialization enables effective zero-shot cross-language transfer for code generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique could extend to improving transfer in other domains where skills are shared across different formats or modalities.
  • Increasing the diversity of programming languages in the parallel data might further enhance the clustering effect.
  • The approach may help models handle low-resource languages more efficiently without requiring large amounts of monolingual data.

Load-bearing premise

The assumption that adding parallel programs to the supervised fine-tuning mixture will produce an initialization with a functionality-centric latent space that supports effective reinforcement learning transfer across languages.

What would settle it

A direct comparison showing that reinforcement learning on a Parallel-SFT-initialized model performs no better on unseen programming languages than reinforcement learning on a standard SFT-initialized model would falsify the claimed benefit.
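
A minimal sketch of that comparison, assuming per-problem unit-test outcomes on each unseen language are available for both initializations; the names are illustrative.

```python
from statistics import mean

def pass_at_1(outcomes: list[bool]) -> float:
    """Fraction of problems whose single greedy sample passes all unit tests."""
    return mean(outcomes) if outcomes else 0.0

def transfer_delta(rl_on_parallel_sft: dict, rl_on_standard_sft: dict) -> dict:
    """Per-language pass@1 gap between the two RL runs on unseen PLs.

    Both arguments map an unseen programming language to a list of per-problem
    pass/fail outcomes. Deltas at or below zero across languages would falsify
    the claimed benefit of the Parallel-SFT initialization.
    """
    return {
        lang: pass_at_1(rl_on_parallel_sft[lang]) - pass_at_1(rl_on_standard_sft[lang])
        for lang in rl_on_parallel_sft
        if lang in rl_on_standard_sft
    }
```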

read the original abstract

Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, the performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" -- functionally equivalent code implemented in multiple PLs -- into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize to contribute to the improved transferability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that RL for code generation in a source PL fails to transfer (and can degrade) performance on target PLs for Llama-3.1. It proposes Parallel-SFT, which augments the SFT mixture with functionally equivalent parallel programs across PLs, as a better initialization. Subsequent RL on the Parallel-SFT model is reported to yield improved zero-shot generalization to unseen PLs, supported by latent-space analysis showing tighter clustering of functionally equivalent programs and a more functionality-centric representation.

Significance. If the results are robust, the work provides a concrete, data-mixture-based technique for improving cross-PL transfer in code models, which is valuable given severe data imbalance across languages. The latent-space analysis offers mechanistic insight into why certain initializations enable RL transfer, and the focus on zero-shot RL transfer is timely. Credit is due for framing a clear empirical task and for attempting to link representation geometry to downstream transfer.

major comments (1)
  1. §4 (Experiments) and §5 (Analysis): the central claim that functional parallelism (rather than multi-PL data volume or diversity) drives the improved RL transfer and tighter clustering requires a controlled ablation. The manuscript must compare Parallel-SFT against a matched mixture that preserves total tokens, PL coverage, and example count but replaces parallel triples with independent, non-equivalent snippets in the same languages. Without this, the observed gains could be explained by increased multi-lingual exposure alone.
minor comments (2)
  1. Abstract: include at least one key quantitative result (e.g., absolute or relative improvement on a held-out PL benchmark) and basic experimental details (dataset sizes, number of PLs, statistical significance) so readers can immediately assess the magnitude of the effect.
  2. §3 (Method): specify how functional equivalence of the parallel programs is verified (e.g., test-case execution, manual review) and the provenance of the parallel data; this is load-bearing for reproducibility and for interpreting the clustering results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the need for a stronger control to isolate the role of functional parallelism. We address the major comment below.

read point-by-point responses
  1. Referee: §4 (Experiments) and §5 (Analysis): the central claim that functional parallelism (rather than multi-PL data volume or diversity) drives the improved RL transfer and tighter clustering requires a controlled ablation. The manuscript must compare Parallel-SFT against a matched mixture that preserves total tokens, PL coverage, and example count but replaces parallel triples with independent, non-equivalent snippets in the same languages. Without this, the observed gains could be explained by increased multi-lingual exposure alone.

    Authors: We agree that this controlled ablation is necessary to substantiate the claim that functional equivalence (rather than multi-lingual data volume alone) drives the gains. In the revised manuscript we will add the requested comparison: a matched mixture preserving total tokens, PL coverage, and example count, but using independent non-equivalent snippets instead of parallel triples. This will allow us to directly test whether the observed improvements in RL transfer and latent-space clustering are attributable to functional parallelism.
    revision: yes
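
A sketch of how the requested matched control could be assembled, offered as an editorial illustration rather than the authors' procedure; token-length matching is omitted and all field names are assumptions.

```python
import random
from collections import defaultdict

def build_matched_control(parallel_mixture: list[dict], mono_pool: list[dict], seed: int = 0) -> list[dict]:
    """Swap each parallel example for an independent snippet in the same language.

    Preserves the per-language example count of the Parallel-SFT mixture while
    drawing each control example from a distinct problem, so no two control
    examples are functionally equivalent.
    """
    rng = random.Random(seed)
    pool_by_lang = defaultdict(list)
    for ex in mono_pool:  # entries: {"problem_id", "language", "prompt", "completion"}
        pool_by_lang[ex["language"]].append(ex)

    control, used = [], set()
    for ex in parallel_mixture:  # only ex["language"] is read here
        candidates = [c for c in pool_by_lang[ex["language"]] if c["problem_id"] not in used]
        if not candidates:
            raise ValueError(f"monolingual pool exhausted for {ex['language']}")
        pick = rng.choice(candidates)
        used.add(pick["problem_id"])
        control.append(pick)
    return control
```

SFT on this control mixture, followed by the same RL recipe, would isolate the contribution of functional equivalence from multi-PL exposure alone.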

Circularity Check

0 steps flagged

No circularity: empirical proposal with independent experimental validation

full rationale

The paper proposes Parallel-SFT as an SFT data mixture strategy based on the hypothesis that functionally equivalent parallel programs across PLs produce a more transferable initialization for subsequent RL. This is tested via direct experiments measuring zero-shot transfer performance to unseen PLs after RL, plus representation clustering analysis. No mathematical derivations, equations, or fitted parameters exist that reduce any prediction to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim remains externally falsifiable through replication, ablations on data diversity, and held-out language evaluation, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that programming skills are largely language-independent and that parallel programs can induce a functionality-centric representation space; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: most programming skills are universal across PLs
    Explicitly stated in the abstract as the principle enabling expected transfer from source to target languages.

pith-pipeline@v0.9.0 · 5541 in / 1215 out tokens · 33086 ms · 2026-05-10T00:39:44.288293+00:00 · methodology

