CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

Ali Basirat; Desmond Elliott; Mike Zhang

arxiv: 2605.26293 · v1 · pith:AECYV3MKnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI

Mike Zhang, Ali Basirat, Desmond Elliott

Pith reviewed 2026-06-29 21:28 UTC · model grok-4.3

keywords cross-lingual preference tuning · self-generations · contrastive tuning · multilingual LLMs · reward models · preference optimization · alignment

The pith

A reward model trained only on English preferences produces useful rankings for self-generated responses across most of 14 languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends contrastive preference tuning to multiple languages by scoring pairs of self-generated responses with a reward model trained solely on English data. This CroCo method improves model performance on structured and open-ended tasks in both monolingual and multilingual setups. It avoids catastrophic forgetting of prior supervised fine-tuning. The gains depend on using on-policy data from the model being tuned.

Core claim

CroCo transfers without language-specific preference annotation. A reward model trained on English preferences atop a multilingual base produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning.

What carries the argument

Cross-lingual contrastive preference tuning on self-generations (CroCo), which uses an English reward model to assign scores that create contrastive signals between on-policy response pairs.
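The pairing step can be sketched in a few lines. The Figure 1 caption states that 64 responses are generated per prompt per language, scored by an external reward model, and that specific quartiles are sampled to form pairs. Which quartiles the authors use is not stated here, so the defaults below (chosen from the top quartile, rejected from the bottom) are an assumption, and `build_pair` is a hypothetical helper name rather than anything from the paper.

```python
import random


def build_pair(responses, scores, chosen_q=3, rejected_q=0, seed=0):
    """Pick one (chosen, rejected) pair from reward-score quartiles.

    Quartile 0 holds the lowest-scored quarter of responses, quartile 3
    the highest. The default quartile choice is an illustrative assumption.
    """
    if len(responses) != len(scores) or len(responses) < 4:
        raise ValueError("need at least 4 responses, one score each")
    rng = random.Random(seed)
    # Sort by score only, so ties never fall back to comparing response text.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0])
    size = len(ranked) // 4

    def quartile(q):
        start = q * size
        # The last quartile absorbs any remainder when len is not a multiple of 4.
        end = len(ranked) if q == 3 else start + size
        return ranked[start:end]

    chosen_score, chosen = rng.choice(quartile(chosen_q))
    rejected_score, rejected = rng.choice(quartile(rejected_q))
    return {
        "chosen": chosen,
        "rejected": rejected,
        "margin": chosen_score - rejected_score,
    }
```

Changing `chosen_q` and `rejected_q` controls how far apart the two responses sit in the reward distribution, which is the "controlled contrastiveness" the abstract refers to.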

Load-bearing premise

A reward model trained on English preferences produces useful within-language rankings across most languages.

What would settle it

Measuring low agreement between the English reward model's scores and independent human preference judgments on self-generated responses in a held-out language would show the rankings are not useful.
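One simple way to quantify that agreement is the fraction of response pairs that the reward model and the human raters order the same way. The sketch below is illustrative only: it assumes each response has one scalar reward score and one scalar human rating, and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_agreement(rm_scores, human_scores):
    """Fraction of response pairs ranked in the same order by both sources.

    Pairs tied under either source are skipped. Returns NaN when no
    comparable pair exists.
    """
    if len(rm_scores) != len(human_scores):
        raise ValueError("score lists must have equal length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(rm_scores)), 2):
        rm_diff = rm_scores[i] - rm_scores[j]
        human_diff = human_scores[i] - human_scores[j]
        if rm_diff == 0 or human_diff == 0:
            continue  # a tie carries no ordering information
        total += 1
        if (rm_diff > 0) == (human_diff > 0):
            agree += 1
    return agree / total if total else float("nan")
```

A value near 0.5 on a held-out language would indicate rankings no better than chance, while a value near 1.0 would support the premise.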

Figures

Figures reproduced from arXiv: 2605.26293 by Ali Basirat, Desmond Elliott, Mike Zhang.

**Figure 1.** Setup. An LLM generates 64 responses per prompt per language; an external off-the-shelf RM scores these and we sample specific quartiles to construct contrastive preference pairs.

**Figure 3.** Subword-token length distribution across languages. We cap the 90th percentile at 1,616 tokens. Romance languages (French, Italian, Spanish) produce systematically longer translations than Germanic ones.

**Figure 4.** m-ArenaHard 2.1 results. Top row: Length-controlled win rates. Multilingual Paired DPO (blue) wins against the respective model in all 7 languages; against the larger Gemma3-it comparison model (red), DPO narrows the deficit visible in the base-vs-Gemma comparison (green) in 4/7 languages for EuroLLM-9B and all seven for aya-3B. Bottom row: LC win rate of multilingual Paired DPO against the base, broken do…

**Figure 5.** m-ArenaHard 2.1 results on low-resource languages with EuroLLM-9B. Length-controlled win rates: Paired DPO (blue) wins against the respective model in all four low-resource languages (left), compared to Max-R and In-lang; against the larger Gemma comparison model (right), DPO narrows the deficit for Galician and Maltese, where the other methods fail to do so. The dashed line marks parity (50%). DPO impro…

**Figure 6.** Representative samples from the EuroLLM-9B reward distribution on a benign English prompt about a drain plug. The µ − 2σ response confabulates an unrelated autonomous-vehicle context; the max-reward response is on-task and coherent.

**Figure 7.** Representative samples from the EuroLLM-9B reward distribution on a safety-relevant English prompt. All four responses refuse, but the lower-reward refusals are terser; the max-reward refusal explicitly redirects to ethical alternatives. This illustrates that the RM ranks within-category quality even when all responses are categorically appropriate.

**Figure 8.** Representative samples from the aya-3B reward distribution on an English prompt about social-media virality. The µ − 2σ response is largely incoherent; the max-reward response is structured advice on the requested topic.

**Figure 9.** Reward score distributions per language for EuroLLM-9B samples. Empirical KDEs are overlaid with Gaussian fits; the dashed vertical line marks the grand mean. Per-language means differ by at most about 0.9 points (a small fraction of the within-language spread of σ ≈ 6), supporting the use of a single English-trained RM for within-language ranking.

**Figure 10.** Reward score distributions per language for aya-3B samples. Same conventions as …

**Figure 11.** Distribution of chosen and rejected response languages in the multilingual Paired DPO dataset. Each language appears as both chosen and rejected at comparable rates, indicating that the sweetspot construction does not select English as the chosen response by default.

**Figure 12.** Offline vs. online DPO on Danish evaluation tasks (EuroLLM-9B). Average improvement over the baseline across 7 tasks; shaded regions denote standard deviation. Offline DPO converges to a higher plateau; online DPO is unstable and never exceeds +0.2 on average.

**Figure 13.** m-ArenaHard 2.1 by subcategory: EuroLLM-9B base vs. Gemma3-12B-Instruct. LC win rate broken down by prompt type (coding, creative writing, math). The EuroLLM-9B base loses across all categories and languages, with the largest deficits on math.

**Figure 14.** m-ArenaHard 2.1 by subcategory: EuroLLM-9B Paired DPO vs. Gemma3-12B-Instruct. After DPO, win rates rise across most language-subcategory cells relative to …
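The Figure 9 caption compares the gap between per-language mean rewards (about 0.9 points) with the within-language spread (σ ≈ 6), a ratio of roughly 0.9 / 6 = 0.15. A minimal sketch of that check follows, assuming reward scores are available as plain lists keyed by language code; the function name and data layout are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def mean_gap_ratio(scores_by_language):
    """Compare the spread of per-language means to within-language spread.

    Returns (gap, typical_sd, ratio), where gap is the largest difference
    between any two per-language means and typical_sd is the average
    within-language population standard deviation.
    """
    means = [mean(values) for values in scores_by_language.values()]
    spreads = [pstdev(values) for values in scores_by_language.values()]
    gap = max(means) - min(means)
    typical_sd = mean(spreads)
    return gap, typical_sd, gap / typical_sd
```

A small ratio shows only that the reward model does not shift scores wholesale by language. It does not by itself validate the ordering of responses within a language.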

Original abstract

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.
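The figure captions refer to the tuned models as Paired DPO, and the reference list includes Rafailov et al. (2023). Assuming the optimizer is the standard DPO objective from that paper, the per-pair loss can be sketched as below. The inputs are summed token log-probabilities of each response under the policy and a frozen reference model; `beta=0.1` is a placeholder, not a value reported here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    where pc/pr are policy log-probs and rc/rr are reference log-probs.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls below log 2 once the policy prefers the chosen response more strongly than the reference does, and rises above it in the opposite case.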

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CroCo shows English preference data plus self-generations can drive useful cross-lingual tuning on most of 14 languages, but only with on-policy samples.

The main takeaway is that a reward model trained on English preferences can produce usable within-language rankings for contrastive tuning when applied to self-generated responses from multilingual base models. This works across most of the 14 languages they test and improves over the untuned base on open-ended generation in all 11 cases checked.

The paper extends an existing English contrastive self-generation approach to the cross-lingual case. It reports results for two models on both structured tasks and open-ended generation, notes that the gains disappear with off-policy data, and shows the multilingual variant helps avoid forgetting from prior supervised fine-tuning.

The empirical coverage is reasonable for the claim. They test high- and low-resource languages and compare monolingual versus multilingual pairing.

The soft spots are the usual ones for an abstract-heavy report: no mention of statistical tests, exact win-rate margins, or how the English reward model’s rankings were validated inside each target language. The “matches or exceeds” language on structured tasks leaves open whether the effect is small or consistent. Without those details it is hard to judge robustness.

This is for people building multilingual alignment pipelines who want to skip per-language preference collection. A reader who needs a practical starting point for scaling preference tuning would get value from the on-policy observation and the language coverage.

It deserves a serious referee. The central empirical claim is straightforward to check and the setup is reproducible enough to be worth the time.

Referee Report

3 major / 2 minor

Summary. The paper proposes CroCo, extending English contrastive preference tuning on self-generated responses (using reward scores) to 14 high- and low-resource languages. The central claim is that an English-preference reward model atop a multilingual base produces useful within-language rankings, enabling cross-lingual transfer without language-specific preference data. Experiments on EuroLLM-9B and Aya-3B show that monolingual or multilingual pairing improves over the base on the majority of setups (matching/exceeding base in 6/7 languages for structured tasks on EuroLLM-9B and 4/7 for Aya-3B; wins on open-ended generation across 11 languages), prevents catastrophic forgetting of SFT, and requires on-policy data (off-policy reduces benefit; online optimization does not improve over offline).

Significance. If the empirical results hold under scrutiny of methods and statistics, the work would be significant for multilingual LLM alignment: it demonstrates that English preference annotations can transfer via self-generations and contrastive tuning, lowering annotation costs for low-resource languages while preserving SFT performance. The on-policy requirement is a useful negative result. The multi-model, multi-language, multi-task evaluation strengthens the claim relative to single-language English-only studies.

major comments (3)

[Abstract / Results] Abstract and results sections: the reported win rates and 'useful within-language rankings' are presented without accompanying statistical tests, confidence intervals, or details on how rankings were validated against human or held-out preferences; this makes it impossible to assess whether the cross-lingual transfer claim is supported beyond the specific setups shown.
[Methods] Methods: the on-policy requirement is stated as necessary, yet the manuscript provides no explicit verification (e.g., comparison of data generation policies or ablation on policy mismatch) that would confirm this is not an artifact of the particular self-generation procedure.
[Results] The central claim that the English RM 'produces useful within-language rankings across most languages' rests on downstream task performance; an explicit correlation or ranking-quality metric (e.g., agreement with language-specific preferences) is needed to rule out that gains arise from other factors such as continued pretraining effects.

minor comments (2)

[Abstract] The abstract is information-dense; separating results by model (EuroLLM-9B vs. Aya-3B) and task type (structured vs. open-ended) into a small table would improve readability.
[Methods] Clarify the exact definition of 'on-policy' versus 'off-policy' responses in the experimental setup, including how self-generations were sampled.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical rigor, verification of the on-policy claim, and the need for more direct evidence on ranking quality. We address each major comment below and indicate where revisions will be made.

Referee: [Abstract / Results] Abstract and results sections: the reported win rates and 'useful within-language rankings' are presented without accompanying statistical tests, confidence intervals, or details on how rankings were validated against human or held-out preferences; this makes it impossible to assess whether the cross-lingual transfer claim is supported beyond the specific setups shown.

Authors: We agree that the absence of statistical tests and confidence intervals limits interpretability. In the revised manuscript we will add bootstrap confidence intervals for all win rates reported in the abstract and results. On validation, the manuscript infers ranking utility from downstream task gains rather than direct human or held-out preference agreement per language; we will add an explicit statement in the methods clarifying this inference and a limitations paragraph noting the lack of language-specific human validation. revision: yes
Referee: [Methods] Methods: the on-policy requirement is stated as necessary, yet the manuscript provides no explicit verification (e.g., comparison of data generation policies or ablation on policy mismatch) that would confirm this is not an artifact of the particular self-generation procedure.

Authors: The current manuscript already reports that off-policy responses reduce benefits relative to on-policy self-generations. To address the request for more explicit verification we will expand the methods section with additional details on the generation procedure and include a dedicated ablation table contrasting on-policy versus off-policy data. revision: partial
Referee: [Results] The central claim that the English RM 'produces useful within-language rankings across most languages' rests on downstream task performance; an explicit correlation or ranking-quality metric (e.g., agreement with language-specific preferences) is needed to rule out that gains arise from other factors such as continued pretraining effects.

Authors: We maintain that downstream task performance is the appropriate and direct metric for the paper's claim about practical utility of the transferred rankings. The on-policy specificity already helps isolate the effect from generic continued pretraining, as off-policy data does not produce comparable gains. We will add a short discussion paragraph in the results explaining this rationale but will not introduce new ranking-quality metrics, as they would require language-specific preference data that the method is designed to avoid. revision: no
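The first response promises bootstrap confidence intervals for the reported win rates. A minimal percentile-bootstrap sketch follows, assuming per-prompt outcomes are coded as 1.0 (win), 0.0 (loss), and 0.5 (tie). The function name and defaults are illustrative. Length-controlled win rates are regression-adjusted (Dubois et al., 2025), so resampling raw outcomes like this is a simplification of what the authors would need.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a win rate.

    Returns (point_estimate, lower, upper) at confidence level 1 - alpha.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample prompts with replacement and recompute the win rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper
```

An interval that excludes 0.5 would indicate a win rate distinguishable from parity at the chosen level, which is the kind of evidence the referee asks for.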

Circularity Check

0 steps flagged

No significant circularity

The paper reports empirical results from applying contrastive preference tuning on self-generated responses across 14 languages, using an English-trained reward model on a multilingual base. No equations, derivations, or first-principles claims appear in the provided text. All findings are framed as observations on held-out tasks, with explicit notes that gains require on-policy data and that off-policy or online variants underperform. No self-citations are invoked as load-bearing uniqueness theorems, no parameters are fitted then relabeled as predictions, and no ansatzes or renamings reduce the central claim to its inputs by construction. The work is self-contained as an empirical extension.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no free parameters, invented entities, or additional axioms are described beyond the core transfer assumption.

axioms (1)

domain assumption: An English reward model produces useful within-language rankings across most languages.
This premise is required for the cross-lingual transfer claim to hold without language-specific preference data.

Reference graph

Works this paper leans on

62 extracted references · 42 canonical work pages · 11 internal anchors

[1]

Gianluca Barmina, Nathalie Carmen Hau Norman, Peter Schneider-Kamp, and Lukas Galke Poech. 2026. https://doi.org/10.63317/4kcbotaa3zgo Dala: Danish linguistic acceptability evaluation guided by real world errors. In Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 4312–4326, Palma, Mallorca, Spain. European La...
[2]

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. https:...
[3]

John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.729 RLHF can speak many languages: Unlocking multilingual preference optimization for LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13134–13156, Miami...
[4]

Wietse de Vries, Martijn Wieling, and Malvina Nissim. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.447 DUMB: A benchmark for smart evaluation of Dutch models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7221–7241, Singapore. Association for Computational Linguistics.
[5]

Lysandre Debut, Arthur Zucker, Zachary Mueller, Yih-Dar Shieh, Benjamin Bossan, and Pedro Cuenca. 2024. Fixing gradient accumulation. Hugging Face Blog, https://huggingface.co/blog/gradient_accumulation
[6]

Martin d'Hoffschmidt, Wacim Belblidia, Quentin Heinrich, Tom Brendlé, and Maxime Vidal. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.107 FQuAD: French question answering dataset. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1193–1208, Online. Association for Computational Linguistics.
[7]

DSL. 2024. https://sprogteknologi.dk/dataset/1000-talemader-evalueringsdatasaet Evalueringsdatasæt for 1000 danske talemåder og faste udtryk. Accessed: 2026-03-13
[8]

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2025. https://arxiv.org/abs/2404.04475 Length-controlled AlpacaEval: A simple way to debias automatic evaluators. Preprint, arXiv:2404.04475
[9]

Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, and Chen Xing. 2025. https://arxiv.org/abs/2507.17476 Multinrc: A challenging and native multilingual reasoning evaluation benchmark for LLMs. Preprint, arXiv:2507.17476
[10]

Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Geza Kovacs, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, Jiaming Luo, Kat Black, Ryan Mullins, Sweta Agrawal, Wenda Xu, Erin Kats, Stephane Jaskiewicz, and 2 others. 2026. https://arxiv.org/abs/2601.09012 Translategemma t...
[11]

Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, and Pang Wei Koh. 2025. https://openreview.net/forum?id=9rwtezthwo The delta learning hypothesis: Preference tuning on weak data can yield strong gains. In Second Conference on Language Modeling
[12]

Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, and Mathieu Blondel. 2024. https://arxiv.org/abs/2402.04792 Direct language model alignment from online AI feedback. Preprint, arXiv:2402.04792
[13]

Srishti Gureja, Lester James Validad Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Triandi Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, and Marzieh Fadaee. 2025. https://doi.org/10.18653/v1/2025.acl-long.3 M-RewardBench: Evaluating reward models in multilingual settings. In Proceedings of the 63rd Annual Meeting of t...
[14]

Daniel Han and Michael Han. 2024. Bugs in LLM training – gradient accumulation fix. Unsloth Blog, https://unsloth.ai/blog/gradient
[15]

Jiwoo Hong, Noah Lee, Rodrigo Martínez-Castaño, César Rodríguez, and James Thorne. 2025. https://doi.org/10.18653/v1/2025.naacl-short.8 Cross-lingual transfer of reward models in multilingual alignment. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Lang...
[16]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. https://openreview.net/forum?id=nZeVKeeFYf9 LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations
[17]

Dieuwke Hupkes and Nikolay Bogoychev. 2025. https://arxiv.org/abs/2504.10356 Multiloko: a multilingual local knowledge benchmark for LLMs spanning 31 languages. Preprint, arXiv:2504.10356
[18]

Oliver Kinch. 2024. https://huggingface.co/datasets/oliverkinch/life-in-the-uk-multiple-choice oliverkinch/life-in-the-uk-multiple-choice. Accessed: 2026-03-13
[19]

Wouter Kool, Herke van Hoof, and Max Welling. 2019. https://openreview.net/forum?id=r1lgTGL5DE Buy 4 REINFORCE samples, get a baseline for free!
[20]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. https://doi.org/10.1145/3600006.3613165 Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, page 611–626, New York, ...
[21]

Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. https://arxiv.org/abs/1910.09700 Quantifying the carbon emissions of machine learning. ArXiv preprint, abs/1910.09700
[22]

Mirko Lai, Stefano Menini, Marco Polignano, Valentina Russo, Rachele Sprugnoli, and Giulia Venturi, editors. 2023a. https://ceur-ws.org/Vol-3473/ Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), volume 3473 of CEUR Workshop Proceedings. CEUR-WS.org, Parma, Italy
[23]

Viet Lai, Chien Nguyen, Nghia Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan Rossi, and Thien Nguyen. 2023b. https://doi.org/10.18653/v1/2023.emnlp-demo.28 Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proc...
[24]

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. https://doi.org/10.18653/v1/2020.acl-main.653 MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7315–7330, Online. Association for Computational Linguistics.
[25]

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. 2025. https://proceedings.mlr.press/v267/li25h.html From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of ...
[26]

Alexis Limozin, Eduard Durech, Torsten Hoefler, Imanol Schlag, and Valentina Pyatkin. 2026. https://arxiv.org/abs/2604.23747 SFT-then-RL outperforms mixed-policy methods for LLM reasoning. Preprint, arXiv:2604.23747
[27]

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. 2024. https://arxiv.org/abs/2410.18451 Skywork-Reward: Bag of tricks for reward modeling in LLMs. Preprint, arXiv:2410.18451
[28]

Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. 2025. https://arxiv.org/abs/2507.01352 Skywork-Reward-V2: Scaling preference data curation via human-AI synergy. Preprint, arXiv:2507.01352
[29]

Yihong Liu, Raoyuan Zhao, Michael A. Hedderich, and Hinrich Schütze. 2026. https://arxiv.org/abs/2605.09548 Crosslingual on-policy self-distillation for multilingual reasoning. Preprint, arXiv:2605.09548
[30]

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2025. https://doi.org/10.1109/TASLPRO.2025.3606231 An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing, 33:3776–3786
[31]

Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, and Nathan Lambert. 2026. https://arxiv.org/abs/2506.01937 RewardBench 2: Advancing reward model evaluation. Preprint, arXiv:2506.01937
[32]

Timo Möller, Julian Risch, and Malte Pietsch. 2021. https://doi.org/10.18653/v1/2021.mrqa-1.4 GermanQuAD and GermanDPR: Improving non-English question answering and passage retrieval. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, pages 42–50, Punta Cana, Dominican Republic. Association for Computational Linguistics.
[33]

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, and 49 others. 2025. https://arxiv.org/abs/2512.13961 Olmo 3. Preprint, a...
[34]

Yu Pan, Zhongze Cai, Huaiyang Zhong, Guanting Chen, and Chonghuan Wang. 2025. https://proceedings.neurips.cc/paper_files/paper/2025/file/3f37b8fbd43303106dd141a602838ad5-Paper-Conference.pdf What matters in data for DPO? In Advances in Neural Information Processing Systems, volume 38, pages 44689–44716. Curran Associates, Inc.
[35]

Bolette Pedersen, Nathalie Sørensen, Sussi Olsen, Sanni Nimb, and Simon Gray. 2024. https://aclanthology.org/2024.lrec-main.1421/ Towards a Danish semantic reasoning benchmark - compiled from lexical-semantic resources for assessing selected language understanding capabilities of large language models. In Proceedings of the 2024 Joint International Conf...
[36]

Rhitabrat Pokharel, Yufei Tao, and Ameeta Agrawal. 2025. https://doi.org/10.18653/v1/2025.findings-ijcnlp.69 CAPO: Confidence aware preference optimization learning for multilingual preferences. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association f...
[37]

Qwen Team. 2026. https://qwen.ai/blog?id=qwen3.6-35b-a3b Qwen3.6-35B-A3B: Agentic coding power, now open to all. Accessed: 2026-05-13
[38]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, ...
[39]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. https://doi.org/10.18653/v1/D16-1264 SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
[40]

Miguel Moura Ramos, Duarte M. Alves, Hippolyte Gisserot-Boukhlef, João Alves, Pedro Henrique Martins, Patrick Fernandes, José Pombal, Nuno M. Guerreiro, Ricardo Rei, Nicolas Boizard, Amin Farajian, Mateusz Klimaszewski, José G. C. de Souza, Barry Haddow, François Yvon, Pierre Colombo, Alexandra Birch, and André F. T. Martins. 2026. https://arxiv.org/abs/2...
[41]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1145/3394486.3406703 DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, page 3505–3506, New York, NY, U...
[42]

Angelika Romanou, Negar Foroutan, Anna Sotnikova, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Zeming Chen, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, and 38 others. 2025. https://openreview.net/f...

[43]

Dan Saattrup Nielsen, Kenneth Enevoldsen, and Peter Schneider-Kamp. 2025. https://aclanthology.org/2025.nodalida-1.60/ Encoder vs decoder: Comparative analysis of encoder and decoder language models on multilingual NLU tasks. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Techn...

[44]

Alejandro R. Salamanca, Diana Abagyan, Daniel D'souza, Ammar Khairi, David Mora, Saurabh Dash, Viraat Aryabumi, Sara Rajaee, Mehrnaz Mofakhami, Ananya Sahu, Thomas Euyang, Brittawnya Prince, Madeline Smith, Hangyu Lin, Acyr Locatelli, Sara Hooker, Tom Kocmi, Aidan Gomez, Ivan Zhang, and 7 others. 2026. https://arxiv.org/abs/2603.11510 Tiny aya: Bridging s...

[45]

Shuaijie She, Wei Zou, Shujian Huang, Wenhao Zhu, Xiang Liu, Xiang Geng, and Jiajun Chen. 2024. https://doi.org/10.18653/v1/2024.acl-long.539 MAPO: Advancing multilingual reasoning through multilingual-alignment-as-preference optimization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),...

[46]

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. 2026. https://arxiv.org/abs/2601.19897 Self-distillation enables continual learning. Preprint, arXiv:2601.19897

[47]

Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. 2025. https://doi.org/10.1145/3735633 Continual learning of large language models: A comprehensive survey. ACM Comput. Surv., 58(5)

[48]

SIRI. 2026. https://danskogproever.dk/ Dansk og prøver. Website. Accessed: 2026-03-13

[49]

Dan Saattrup Smart. 2023. https://aclanthology.org/2023.nodalida-1.20/ ScandEval: A benchmark for Scandinavian natural language processing. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 185--201, Tórshavn, Faroe Islands. University of Tartu Library

[50]

Dan Saattrup Smart. 2026. https://doi.org/10.63317/2msrgsu9isrx Multiwikiqa: A reading comprehension benchmark in 300+ languages. In Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 6298--6311, Palma, Mallorca, Spain. European Language Resources Association (ELRA)

[51]

Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. 2024. https://proceedings.mlr.press/v235/tajwar24a.html Preference fine-tuning of LLMs should leverage suboptimal, on-policy data. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of...

[52]

Eva Vanmassenhove, Dimitar Shterionov, and Matthew Gwilliam. 2021. https://doi.org/10.18653/v1/2021.eacl-main.188 Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2203--2213...

[53]

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl

[54]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. https://doi.org/10.52202/079017-3018 MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Ne...

[55]

Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, and Ahmad Beirami. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.79 Reuse your rewards: Reward model transfer for zero-shot cross-lingual alignment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1332--1353, Miami, Florida, USA. Association ...

[56]

Yao Xiao, Hai Ye, Linyao Chen, Hwee Tou Ng, Lidong Bing, Xiaoli Li, and Roy Ka-Wei Lee. 2025. https://doi.org/10.18653/v1/2025.acl-long.615 Finding the sweet spot: Preference data construction for scaling preference optimization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1253...

[57]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025a. https://arxiv.org/abs/2505.09388 Qwen3 technical report

[58]

Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, and Jiajun Zhang. 2025b. https://doi.org/10.18653/v1/2025.findings-acl.1088 Implicit cross-lingual rewarding for efficient multilingual preference alignment. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21125--21147, Vienna, Austria. Association for Computational Linguistics

[59]

Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, and Jiajun Zhang. 2025c. https://openreview.net/forum?id=Kak2ZH5Itp Language imbalance driven rewarding for multilingual self-improving. In The Thirteenth International Conference on Learning Representations

[60]

Jinghui Zhang, Yuan Zhao, Siqin Zhang, Ruijing Zhao, and Siyu Bao. 2024. https://doi.org/10.18653/v1/2024.wassa-1.53 Enhancing cross-lingual emotion detection with data augmentation and token-label mapping. In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 528--533, Bangkok, Thaila...

[61]

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. 2026. https://arxiv.org/abs/2601.18734 Self-distilled reasoner: On-policy self-distillation for large language models. Preprint, arXiv:2601.18734

[62]

Dawei Zhu, Pinzhen Chen, Miaoran Zhang, Barry Haddow, Xiaoyu Shen, and Dietrich Klakow. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.24 Fine-tuning large language models to translate: Will a touch of noisy data in misaligned languages suffice? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 388--409, M...


