pith. machine review for the scientific record.

arxiv: 2604.20835 · v2 · submitted 2026-04-22 · 💻 cs.CL

Recognition: unknown

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords zero-shot cross-PL transfer · code RL · Parallel-SFT · parallel programs · functionality-centric latent space · programming language generalization · supervised fine-tuning

The pith

Incorporating parallel programs into supervised fine-tuning creates a more transferable initialization for reinforcement learning across programming languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern language models perform well on common programming languages but struggle with less common ones due to limited training data. The paper argues that most programming skills are universal across languages, yet reinforcement learning applied in one language does not automatically improve others. To fix this, the authors introduce Parallel-SFT, which includes functionally equivalent code snippets from multiple languages in the supervised fine-tuning phase. This step produces a model with representations focused on code functionality instead of specific syntax. Subsequent reinforcement learning on this model then transfers more effectively to programming languages never seen during training.

Core claim

The central discovery is that Parallel-SFT, by incorporating parallel programs into the supervised fine-tuning data mixture, leads to a functionality-centric latent space where equivalent programs in different languages cluster more closely. When reinforcement learning is then applied to this initialized model, it achieves better zero-shot generalization to unseen programming languages compared to models initialized with standard supervised fine-tuning.
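
One way to quantify "cluster more closely" is sketched below, under the assumption that each program is reduced to a fixed-size vector (e.g., a mean-pooled hidden state from the model under analysis); the paper's actual representation analysis may differ.

```python
import itertools
import numpy as np

def clustering_gap(embeddings: dict) -> float:
    """Mean cosine similarity of same-problem, cross-language pairs minus
    different-problem pairs.

    `embeddings` maps (problem_id, language) to a fixed-size vector, e.g. a
    mean-pooled hidden state; how the paper extracts program representations
    is an assumption here.
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    same, different = [], []
    for (p1, l1), (p2, l2) in itertools.combinations(embeddings, 2):
        if l1 == l2:
            continue  # only compare representations across programming languages
        sim = cosine(embeddings[(p1, l1)], embeddings[(p2, l2)])
        (same if p1 == p2 else different).append(sim)
    return float(np.mean(same)) - float(np.mean(different))
```

A larger gap for the Parallel-SFT checkpoint than for the standard-SFT checkpoint would be consistent with the claimed functionality-centric latent space.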

What carries the argument

Parallel-SFT, a supervised fine-tuning approach that mixes parallel programs—functionally equivalent implementations across programming languages—into the training data to promote transferable representations.
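
The abstract leaves open how parallel programs are packaged into training examples. The sketch below assumes one plausible packaging, one prompt/completion pair per (problem, language), keeping only problems solved in at least two languages; the field names and prompt template are hypothetical, not the paper's schema.

```python
from collections import defaultdict

def build_parallel_sft_mixture(solutions: list[dict], min_languages: int = 2) -> list[dict]:
    """Group functionally equivalent solutions by problem and emit SFT examples.

    `solutions` entries look like {"problem_id", "language", "prompt", "code"};
    these field names are illustrative, not the paper's schema.
    """
    by_problem = defaultdict(list)
    for s in solutions:
        by_problem[s["problem_id"]].append(s)

    mixture = []
    for problem_id, group in by_problem.items():
        if len({s["language"] for s in group}) < min_languages:
            continue  # not a parallel group; could instead go to a monolingual pool
        for s in group:
            mixture.append({
                "problem_id": problem_id,
                "language": s["language"],
                "prompt": f"Solve the following problem in {s['language']}.\n\n{s['prompt']}",
                "completion": s["code"],
            })
    return mixture
```

Standard SFT (next-token loss on the completions) would then run on this mixture, with RL for code generation in a single source PL applied to the resulting checkpoint.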

If this is right

  • RL after Parallel-SFT improves performance on target programming languages where standard RL does not.
  • The internal representations become more organized around program functionality rather than language syntax.
  • This initialization enables effective zero-shot cross-language transfer for code generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique could extend to improving transfer in other domains where skills are shared across different formats or modalities.
  • Increasing the diversity of programming languages in the parallel data might further enhance the clustering effect.
  • The approach may help models handle low-resource languages more efficiently without requiring large amounts of monolingual data.

Load-bearing premise

The assumption that adding parallel programs to the supervised fine-tuning mixture will produce an initialization with a functionality-centric latent space that supports effective reinforcement learning transfer across languages.

What would settle it

A direct comparison showing that reinforcement learning on a Parallel-SFT-initialized model performs no better on unseen programming languages than reinforcement learning on a standard SFT-initialized model would falsify the claimed benefit.
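
A minimal sketch of that comparison, assuming per-problem unit-test outcomes on each unseen language are available for both initializations; the names are illustrative.

```python
from statistics import mean

def pass_at_1(outcomes: list[bool]) -> float:
    """Fraction of problems whose single greedy sample passes all unit tests."""
    return mean(outcomes) if outcomes else 0.0

def transfer_delta(rl_on_parallel_sft: dict, rl_on_standard_sft: dict) -> dict:
    """Per-language pass@1 gap between the two RL runs on unseen PLs.

    Both arguments map an unseen programming language to a list of per-problem
    pass/fail outcomes. Deltas at or below zero across languages would falsify
    the claimed benefit of the Parallel-SFT initialization.
    """
    return {
        lang: pass_at_1(rl_on_parallel_sft[lang]) - pass_at_1(rl_on_standard_sft[lang])
        for lang in rl_on_parallel_sft
        if lang in rl_on_standard_sft
    }
```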

read the original abstract

Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, the performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" -- functionally equivalent code implemented in multiple PLs -- into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize to contribute to the improved transferability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that RL for code generation in a source PL fails to transfer (and can degrade) performance on target PLs for Llama-3.1. It proposes Parallel-SFT, which augments the SFT mixture with functionally equivalent parallel programs across PLs, as a better initialization. Subsequent RL on the Parallel-SFT model is reported to yield improved zero-shot generalization to unseen PLs, supported by latent-space analysis showing tighter clustering of functionally equivalent programs and a more functionality-centric representation.

Significance. If the results are robust, the work provides a concrete, data-mixture-based technique for improving cross-PL transfer in code models, which is valuable given severe data imbalance across languages. The latent-space analysis offers mechanistic insight into why certain initializations enable RL transfer, and the focus on zero-shot RL transfer is timely. Credit is due for framing a clear empirical task and for attempting to link representation geometry to downstream transfer.

major comments (1)
  1. §4 (Experiments) and §5 (Analysis): the central claim that functional parallelism (rather than multi-PL data volume or diversity) drives the improved RL transfer and tighter clustering requires a controlled ablation. The manuscript must compare Parallel-SFT against a matched mixture that preserves total tokens, PL coverage, and example count but replaces parallel triples with independent, non-equivalent snippets in the same languages. Without this, the observed gains could be explained by increased multi-lingual exposure alone.
minor comments (2)
  1. Abstract: include at least one key quantitative result (e.g., absolute or relative improvement on a held-out PL benchmark) and basic experimental details (dataset sizes, number of PLs, statistical significance) so readers can immediately assess the magnitude of the effect.
  2. §3 (Method): specify how functional equivalence of the parallel programs is verified (e.g., test-case execution, manual review) and the provenance of the parallel data; this is load-bearing for reproducibility and for interpreting the clustering results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the need for a stronger control to isolate the role of functional parallelism. We address the major comment below.

read point-by-point responses
  1. Referee: §4 (Experiments) and §5 (Analysis): the central claim that functional parallelism (rather than multi-PL data volume or diversity) drives the improved RL transfer and tighter clustering requires a controlled ablation. The manuscript must compare Parallel-SFT against a matched mixture that preserves total tokens, PL coverage, and example count but replaces parallel triples with independent, non-equivalent snippets in the same languages. Without this, the observed gains could be explained by increased multi-lingual exposure alone.

    Authors: We agree that this controlled ablation is necessary to substantiate the claim that functional equivalence (rather than multi-lingual data volume alone) drives the gains. In the revised manuscript we will add the requested comparison: a matched mixture preserving total tokens, PL coverage, and example count, but using independent non-equivalent snippets instead of parallel triples. This will allow us to directly test whether the observed improvements in RL transfer and latent-space clustering are attributable to functional parallelism.
    revision: yes
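
A sketch of how the requested matched control could be assembled, offered as an editorial illustration rather than the authors' procedure; token-length matching is omitted and all field names are assumptions.

```python
import random
from collections import defaultdict

def build_matched_control(parallel_mixture: list[dict], mono_pool: list[dict], seed: int = 0) -> list[dict]:
    """Swap each parallel example for an independent snippet in the same language.

    Preserves the per-language example count of the Parallel-SFT mixture while
    drawing each control example from a distinct problem, so no two control
    examples are functionally equivalent.
    """
    rng = random.Random(seed)
    pool_by_lang = defaultdict(list)
    for ex in mono_pool:  # entries: {"problem_id", "language", "prompt", "completion"}
        pool_by_lang[ex["language"]].append(ex)

    control, used = [], set()
    for ex in parallel_mixture:  # only ex["language"] is read here
        candidates = [c for c in pool_by_lang[ex["language"]] if c["problem_id"] not in used]
        if not candidates:
            raise ValueError(f"monolingual pool exhausted for {ex['language']}")
        pick = rng.choice(candidates)
        used.add(pick["problem_id"])
        control.append(pick)
    return control
```

SFT on this control mixture, followed by the same RL recipe, would isolate the contribution of functional equivalence from multi-PL exposure alone.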

Circularity Check

0 steps flagged

No circularity: empirical proposal with independent experimental validation

full rationale

The paper proposes Parallel-SFT as an SFT data mixture strategy based on the hypothesis that functionally equivalent parallel programs across PLs produce a more transferable initialization for subsequent RL. This is tested via direct experiments measuring zero-shot transfer performance to unseen PLs after RL, plus representation clustering analysis. No mathematical derivations, equations, or fitted parameters exist that reduce any prediction to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim remains externally falsifiable through replication, ablations on data diversity, and held-out language evaluation, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that programming skills are largely language-independent and that parallel programs can induce a functionality-centric representation space; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: most programming skills are universal across PLs
    Explicitly stated in the abstract as the principle enabling expected transfer from source to target languages.

pith-pipeline@v0.9.0 · 5541 in / 1215 out tokens · 33086 ms · 2026-05-10T00:39:44.288293+00:00 · methodology

