Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Benjamin Negrevergne; Gabriel Synnaeve; Jonas Gehring; Juliette Decugis; Kunhao Zheng; Pierre Chambon; Taco Cohen

arXiv: 2605.28751 · v1 · pith:5VIWWPPUnew · submitted 2026-05-27 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-29 14:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords code reinforcement learning · weight averaging · correctness-efficiency frontier · competitive programming · extrapolation · ensemble methods · unit test coverage

The pith

Extrapolative weight averaging between RL checkpoints extends a correctness-efficiency frontier, and ensembles that include the extrapolated models raise pass@250 on hard problems by 3.3%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains code RL checkpoints from a shared initialization using rewards tied to progressively stricter unit-test coverage. This sweep produces a frontier on hard problems where higher coverage cuts optimization failures but raises correctness failures, leaving the overall solve rate nearly flat. Linear interpolation between checkpoints traces the frontier; extrapolation beyond the endpoints generates new models that continue the trade-off. These extrapolated models act as complementary policies, and ensembles built from them raise pass@250 on hard problems by 3.3% at a fixed sample budget across model sizes and inference modes.

Core claim

Nested unit-test coverage in code RL induces a correctness-efficiency frontier in weight space. Interpolation between low- and high-coverage checkpoints recovers this frontier while extrapolation extends it beyond the trained endpoints, producing models that improve ensemble coverage without further RL training.

What carries the argument

Extrapolative weight averaging, the linear extension of model weights past the segment joining two checkpoints to produce new points on an extended frontier.
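
A minimal sketch of this operation, assuming each checkpoint is a plain dictionary mapping parameter names to lists of floats. The function name `average_weights`, the coefficient `alpha`, and the toy values are illustrative assumptions, not the paper's code. With `alpha` in [0, 1] the result lies on the segment between the two checkpoints; outside that range it extrapolates along the same line.

```python
def average_weights(theta_low, theta_high, alpha):
    """Return theta_low + alpha * (theta_high - theta_low) for every parameter."""
    if theta_low.keys() != theta_high.keys():
        raise ValueError("checkpoints must share parameter names")
    merged = {}
    for name, low in theta_low.items():
        high = theta_high[name]
        if len(low) != len(high):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [l + alpha * (h - l) for l, h in zip(low, high)]
    return merged


# Toy checkpoints with hypothetical values, standing in for real model weights.
low_cov = {"w": [0.0, 1.0], "b": [2.0]}
high_cov = {"w": [1.0, 3.0], "b": [2.0]}

midpoint = average_weights(low_cov, high_cov, 0.5)   # interpolation
beyond = average_weights(low_cov, high_cov, 1.5)     # extrapolation past high_cov
before = average_weights(low_cov, high_cov, -0.5)    # extrapolation past low_cov
```

Setting `alpha` to 0 or 1 returns the low- or high-coverage checkpoint unchanged, so one function covers both the interpolation and the extrapolation regimes.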

If this is right

Extrapolated checkpoints solve different problems than the originals, functioning as complementary policies at inference time.
The frontier and its extrapolative extension appear consistently for 7B and 32B models under pure reasoning, tool use, and agentic coding.
Ensembles that include extrapolated checkpoints raise pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget.
Both interpolation and extrapolation operate without any additional RL training steps.
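
To make "matched sample budget" concrete, here is a hedged sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k). Whether the paper uses this estimator, and how it divides the budget across ensemble members, is not stated in the abstract; the equal split, the independence assumption, and the names `pass_at_k` and `ensemble_pass_at_k` are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def ensemble_pass_at_k(counts, k):
    """Probability that at least one ensemble member solves one problem.

    counts: list of (n, c) pairs, one per checkpoint.
    Assumes the budget k is split equally and samples are independent.
    """
    m = len(counts)
    if m == 0 or k % m != 0:
        raise ValueError("budget k must divide evenly across checkpoints")
    share = k // m
    miss = 1.0
    for n, c in counts:
        miss *= 1.0 - pass_at_k(n, c, share)
    return 1.0 - miss


# Hypothetical counts: checkpoint A solves problem 1 only, B solves problem 2 only.
single_a = (pass_at_k(4, 2, 2) + pass_at_k(4, 0, 2)) / 2
ensemble = (ensemble_pass_at_k([(4, 2), (4, 0)], 2)
            + ensemble_pass_at_k([(4, 0), (4, 2)], 2)) / 2
```

In this toy case the ensemble averages 0.5 across the two problems while checkpoint A alone averages about 0.417 at the same budget of two samples, which is the complementarity effect the list above describes.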

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Weight-space linearity may let post-training averaging trade off objectives in other RL settings that pit correctness against efficiency.
Multi-checkpoint or higher-order extrapolation could further widen ensemble coverage on hard instances.
The observed structure implies the loss landscape for this task contains low-dimensional linear directions connecting different coverage regimes.

Load-bearing premise

The relationship between checkpoints in weight space is sufficiently linear that extrapolation produces valid new models on the frontier rather than unrelated failure modes.

What would settle it

Measure whether extrapolated checkpoints exhibit abrupt correctness or efficiency collapse outside the linear trend observed between the original low- and high-coverage checkpoints.
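
One way such a check could be set up, as a sketch only: fit a straight line to a failure rate measured at interpolation coefficients in [0, 1], then compare the rates observed at extrapolated coefficients against the line's prediction. The helper names and every number below are invented for illustration and are not results from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deviation_from_trend(interp, extrap):
    """Return (alpha, observed - predicted) for each extrapolated point.

    interp, extrap: lists of (alpha, failure_rate) pairs.
    """
    slope, intercept = fit_line([a for a, _ in interp], [r for _, r in interp])
    return [(a, r - (slope * a + intercept)) for a, r in extrap]


# Made-up rates: the trend holds at alpha = 1.5 and breaks at alpha = 2.0.
interp = [(0.0, 0.40), (0.5, 0.35), (1.0, 0.30)]
extrap = [(1.5, 0.25), (2.0, 0.60)]
deviations = deviation_from_trend(interp, extrap)
```

A deviation near zero means the extrapolated checkpoint continues the linear trend; a large one, as at `alpha = 2.0` in the toy data, would be the abrupt collapse described above. What threshold counts as "abrupt" is left open here.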

Original abstract

Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.
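
The nested-coverage reward in the abstract can be pictured with a small sketch. The abstract says only that low coverage uses smaller-input tests and high coverage adds progressively larger ones up to the full suite; the binary reward, the ordering by a numeric input size, the fractional `coverage` parameter, and the name `nested_reward` are assumptions made here for illustration.

```python
import math


def nested_reward(results, coverage):
    """Binary reward over a nested subset of the test suite.

    results: list of (input_size, passed) pairs, one per test.
    coverage: fraction in (0, 1] of the suite to require, smallest inputs first.
    Returns 1.0 if every test in the subset passed, else 0.0.
    """
    if not results:
        raise ValueError("test suite is empty")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(results, key=lambda t: t[0])
    n_kept = max(1, math.ceil(coverage * len(ordered)))
    return 1.0 if all(passed for _, passed in ordered[:n_kept]) else 0.0


# Hypothetical solution: correct on small inputs, fails the largest test
# (for example by exceeding the time limit).
results = [(10, True), (100, True), (1000, True), (100000, False)]
low = nested_reward(results, 0.25)    # smallest test only
mid = nested_reward(results, 0.75)    # three smallest tests
full = nested_reward(results, 1.0)    # full suite
```

Because each subset contains the previous one, a solution rewarded at a given coverage level is also rewarded at every lower level. In the toy case the slow solution earns reward at 25% and 75% coverage but not on the full suite.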

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that nested unit-test coverage in code RL creates a correctness-efficiency frontier that extrapolative weight averaging can extend, yielding a 3.3% ensemble gain, but the stats and controls are thin.

The paper's core finding is that in RL for competitive programming, checkpoints trained with increasing levels of unit-test coverage form a frontier between correctness and efficiency, and extrapolating the weights past the high-coverage end can continue that frontier at inference time.

What stands out is the use of nested coverage to induce the trade-off, and the demonstration that this holds across pure reasoning, tool use, and agentic settings, as well as 7B and 32B models. They also show that the extrapolated checkpoints solve different problems, so ensembling them gives a 3.3% lift in pass@250 on hard problems at the same budget. That's a practical result for inference scaling in code models.

The soft spots are in the experimental reporting. The abstract mentions consistent behavior but gives no error bars, no seed variation, and no detail on how the coverage levels were selected or whether the extrapolation coefficient was tuned. Without those, it's hard to know whether the 3.3% gain is robust or whether the linearity in weight space is as clean as claimed. The stress test on the linearity assumption is fair; the paper would be stronger with some analysis of the weight directions or of failure-mode shifts in the extrapolated models.

This work is aimed at people doing RL post-training for code and those interested in weight-space methods for multi-objective trade-offs. It has a clear, falsifiable claim with some cross-setting replication, so it deserves a serious referee even if revisions will be needed on the stats and controls.

I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper claims that in RL for competitive programming, training checkpoints under nested unit-test coverage induces a correctness-efficiency frontier on hard problems. Linear interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. The frontier and its extension hold across three inference settings and two model scales; ensembles using extrapolative averaging improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget.

Significance. If the results hold under the linearity assumption, the work provides evidence that weight-space extrapolation can navigate and extend Pareto fronts in code RL without additional training, enabling complementary policies and inference-time gains. The consistency across inference modes and scales is a positive feature of the experimental design.

major comments (2)

[Abstract] The 3.3% pass@250 gain is reported without error bars, without specification of how coverage levels were chosen, and without verification that the gain survives multiple-testing correction or different random seeds; this directly affects the strength of the performance claim.
[Experimental results] The central extrapolation result assumes that linear extrapolation in weight space produces models that continue the correctness-efficiency trade-off rather than introducing unrelated failure modes. The manuscript provides no quantitative value for the extrapolation coefficient and no control experiments confirming that the direction is frontier-specific or that extrapolated models exhibit the expected monotonic shift in coverage vs. optimization failures.

minor comments (1)

[Abstract] The abstract refers to 'three inference settings' (pure reasoning, tool use, agentic coding) but does not list them explicitly when first introducing the consistency claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.

Referee: [Abstract] The 3.3% pass@250 gain is reported without error bars, without specification of how coverage levels were chosen, and without verification that the gain survives multiple-testing correction or different random seeds; this directly affects the strength of the performance claim.

Authors: We agree that including error bars and additional details would strengthen the claim. In the revised manuscript, we will report error bars for the pass@250 metric based on available runs, specify the coverage levels used in the experiments (low-coverage at 25% and high-coverage at 100% of the test suite), and note that the 3.3% gain is for the primary metric without multiple-testing correction. We will also discuss the limitation regarding different random seeds and include variance estimates where possible.

revision: yes

Referee: [Experimental results] The central extrapolation result assumes that linear extrapolation in weight space produces models that continue the correctness-efficiency trade-off rather than introducing unrelated failure modes. The manuscript provides no quantitative value for the extrapolation coefficient and no control experiments confirming that the direction is frontier-specific or that extrapolated models exhibit the expected monotonic shift in coverage vs. optimization failures.

Authors: We will include the specific extrapolation coefficient in the methods section of the revised manuscript (typically set to 0.5 beyond the high-coverage endpoint). For control experiments, we will add analysis showing the monotonic shift in coverage and optimization failures for extrapolated models. However, exhaustive controls to rule out all unrelated failure modes would require additional training runs, which we will partially address by including results from a control direction (e.g., extrapolation in the opposite direction) and discussing this as a limitation.

revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's claims rest on direct experimental training of RL checkpoints under nested coverage rewards, followed by explicit weight averaging and benchmark evaluations (pass@250, LCB/hard). No equations, fitted parameters, or self-citations are invoked to define or force the reported frontiers or extrapolation results by construction; the outcomes are measured quantities that remain independently falsifiable via replication.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning study; it introduces no new mathematical axioms, no invented physical or computational entities, and no free parameters beyond those standard in RL training that are not enumerated in the abstract.

pith-pipeline@v0.9.1-grok · 5818 in / 1190 out tokens · 41515 ms · 2026-06-29T14:37:55.340779+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 30 canonical work pages · 11 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
[2]

Opencodereasoning-ii: A simple test time scaling approach via self-critique, 2025

Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning-ii: A simple test time scaling approach via self-critique, 2025. https://arxiv.org/abs/2507.09075

work page arXiv 2025
[3]

On predictability of reinforcement learning dynamics for large language models, 2026

Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guangzhong Sun, Guiquan Liu, and Junfeng Fang. On predictability of reinforcement learning dynamics for large language models, 2026. https://arxiv.org/abs/2510.00553

work page arXiv 2026
[4]

Bigo(bench) -- can llms generate code with controlled time and space complexity?, 2025

Pierre Chambon, Baptiste Roziere, Benoit Sagot, and Gabriel Synnaeve. Bigo(bench) -- can llms generate code with controlled time and space complexity?, 2025. https://arxiv.org/abs/2503.15242

work page arXiv 2025
[5]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. https://arxiv.org/abs/2505.22617

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Process supervision-guided policy optimization for code generation, 2025

Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation, 2025. https://arxiv.org/abs/2410.17621

work page arXiv 2025
[7]

Mercury: A code efficiency benchmark for code large language models

Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, and See - Kiong Ng. Mercury: A code efficiency benchmark for code large language models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Proce...

2024
[8]

Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization, 2025

Mingzhe Du, Luu Anh Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, and See kiong Ng. Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization, 2025. https://arxiv.org/abs/2505.23387

work page arXiv 2025
[9]

FAIR CodeGen team , Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabi...

work page arXiv 2025
[10]

Roy, and Michael Carbin

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , Proceedings of Machine Learning Research, pages 3259--3269. PMLR , 2020. http://proceedings.mlr.pre...

2020
[11]

Neural thickets: Diverse task experts are dense around pretrained weights, 2026

Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights, 2026. https://arxiv.org/abs/2603.12228

work page arXiv 2026
[12]

RLEF: grounding code llms in execution feedback with reinforcement learning

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. RLEF: grounding code llms in execution feedback with reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Lear...

2025
[13]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025
[14]

Measuring coding challenge competence with APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS . In Joaquin Vanschoren and Sai - Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, Neu...

2021
[15]

Effibench: Benchmarking the efficiency of automatically generated code

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie Zhang. Effibench: Benchmarking the efficiency of automatically generated code. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Pr...

2024
[16]

Editing models with task arithmetic

Gabriel Ilharco, Marco T \' u lio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. https://openreview.net/forum?id=6t0Kwf8-jrj

2023
[17]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen - Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar - Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.ne...

2025
[18]

Reveal: Self-evolving code agents via reliable self-verification, 2025

Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, and Jing Bai. Reveal: Self-evolving code agents via reliable self-verification, 2025. https://arxiv.org/abs/2506.11442

work page arXiv 2025
[19]

Scaling Test-Time Compute for Agentic Coding

Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, and Anirudh Goyal. Scaling test-time compute for agentic coding, 2026. https://arxiv.org/abs/2604.16529

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Kimi Team , Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong G...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Gonzalez, and Ion Stoica

Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test time scaling for code generation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-...

2025
[22]

C ode PRM : Execution feedback-enhanced process reward model for code generation

Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. C ode PRM : Execution feedback-enhanced process reward model for code generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 8169--8182, Vienna,...

work page doi:10.18653/v1/2025.findings-acl.428 2025
[23]

Taco: Topics in algorithmic code generation dataset, 2023

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset, 2023. https://arxiv.org/abs/2312.14852

work page arXiv 2023
[24]

doi: 10.1126/science.abq1158

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page doi:10.1126/science.abq1158 2022
[25]

Mitigating the alignment tax of RLHF

Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, and Tong Zhang. Mitigating the alignment tax of RLHF . In Yaser Al - Onaizan, Mohit Bansal, and Yun - Nung Chen, editors, Proceedings of the 2024 Conference on Empirical ...

work page doi:10.18653/v1/2024.emnlp-main.35 2024
[26]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on N...

2023
[27]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. https://arxiv.org/abs/2503.20783

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, and Parthasarathy Ranganathan. Swe-fficiency: Can language models optimize real-world repositories on real workloads?, 2025. https://arxiv.org/abs/2511.06090

work page internal anchor Pith review arXiv 2025
[29]

Self-Execution Simulation Improves Coding Models

Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, and Yossi Adi. Self-execution simulation improves coding models, 2026. https://arxiv.org/abs/2604.03253

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset

Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset, 2025. https://arxiv.org/abs/2504.16891

work page arXiv 2025
[32]

Zhang, William Hu, Christopher R \' e , and Azalia Mirhoseini

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher R \' e , and Azalia Mirhoseini. Kernelbench: Can llms write efficient GPU kernels? In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Learning, ...

2025
[33]

Qwen2.5 Technical Report

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

Alexandre Ram \' e , Guillaume Couairon, Corentin Dancette, Jean - Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural ...

2023
[35]

WARM: on the benefits of weight averaged reward models

Alexandre Ram \' e , Nino Vieillard, L \' e onard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. WARM: on the benefits of weight averaged reward models. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conferenc...

2024
[36]

Warp: On the benefits of weight averaged rewarded policies, 2024

Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, and Olivier Bachem. Warp: On the benefits of weight averaged rewarded policies, 2024. https://arxiv.org/abs/2406.16768

work page arXiv 2024
[37]

Optimizing language models for inference time objectives using reinforcement learning

Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and R \' e mi Munos. Optimizing language models for inference time objectives using reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Learning, ICML 20...

2025
[38]

Improving diversity in language models: When temperature fails, change the loss

Alexandre Verine, Florian Le Bronnec, Kunhao Zheng, Alexandre Allauzen, Yann Chevaleyre, and Benjamin N \' e grevergne. Improving diversity in language models: When temperature fails, change the loss. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second Inter...

2025
[39]

Siddhant Waghjale, Vishruth Veerendranath, Zhiruo Wang, and Daniel Fried. ECCO: can we improve model-generated code efficiency without sacrificing functional correctness? In Yaser Al - Onaizan, Mohit Bansal, and Yun - Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, Nov...

work page doi:10.18653/v1/2024.emnlp-main.859 2024
[40]

Localizing task information for improved model merging and compression

Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz - Jim \' e nez, Fran c ois Fleuret, and Pascal Frossard. Localizing task information for improved model merging and compression. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Mac...

2024
[41]

Linear Dynamics in the RLVR Training of Large Language Models

Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, and Ning Miao. Linear dynamics in the rlvr training of large language models, 2026. https://arxiv.org/abs/2601.04537

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42]

Codecontests+: High-quality test case generation for competitive programming, 2025

Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. Codecontests+: High-quality test case generation for competitive programming, 2025. https://arxiv.org/abs/2506.05817

work page arXiv 2025
[43]

Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song,...

2022
[44]

The invisible leash: Why rlvr may or may not escape its origin.arXiv preprint arXiv:2507.14843,

Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why rlvr may or may not escape its origin, 2026. https://arxiv.org/abs/2507.14843

work page arXiv 2026
[45]

Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning, 2025

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning, 2025. https://arxiv.org/abs/2509.02479

work page arXiv 2025
[46]

Adamerging: Adaptive model merging for multi-task learning

Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. https://openreview.net/forum?id=nZP6NgD3QY

2024
[47]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. https://arxiv.org/abs/2504.13837

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

Model extrapolation expedites alignment

Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, and Nanyun Peng. Model extrapolation expedites alignment. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 20...

2025
[50]

Kunhao Zheng, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Négrevergne, and Gabriel Synnaeve. What makes large language models reason in (multi-turn) code generation? In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025b. https://openreview.net/forum?...

2025

[2]

Opencodereasoning-ii: A simple test time scaling approach via self-critique

Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning-ii: A simple test time scaling approach via self-critique, 2025. https://arxiv.org/abs/2507.09075

2025

[3]

On predictability of reinforcement learning dynamics for large language models

Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guangzhong Sun, Guiquan Liu, and Junfeng Fang. On predictability of reinforcement learning dynamics for large language models, 2026. https://arxiv.org/abs/2510.00553

2026

[4]

Bigo(bench) -- can llms generate code with controlled time and space complexity?

Pierre Chambon, Baptiste Roziere, Benoit Sagot, and Gabriel Synnaeve. Bigo(bench) -- can llms generate code with controlled time and space complexity?, 2025. https://arxiv.org/abs/2503.15242

2025

[5]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. https://arxiv.org/abs/2505.22617

2025

[6]

Process supervision-guided policy optimization for code generation

Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation, 2025. https://arxiv.org/abs/2410.17621

2025

[7]

Mercury: A code efficiency benchmark for code large language models

Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, and See-Kiong Ng. Mercury: A code efficiency benchmark for code large language models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Proce...

2024

[8]

Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization

Mingzhe Du, Luu Anh Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, and See kiong Ng. Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization, 2025. https://arxiv.org/abs/2505.23387

2025

[9]

FAIR CodeGen team, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabi...

2025

[10]

Linear mode connectivity and the lottery ticket hypothesis

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pages 3259–3269. PMLR, 2020. http://proceedings.mlr.pre...

2020

[11]

Neural thickets: Diverse task experts are dense around pretrained weights

Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights, 2026. https://arxiv.org/abs/2603.12228

2026

[12]

RLEF: grounding code llms in execution feedback with reinforcement learning

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. RLEF: grounding code llms in execution feedback with reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Lear...

2025

[13]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

doi:10.1038/s41586-025-09422-z, 2025

[14]

Measuring coding challenge competence with APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, Neu...

2021

[15]

Effibench: Benchmarking the efficiency of automatically generated code

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie Zhang. Effibench: Benchmarking the efficiency of automatically generated code. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Pr...

2024

[16]

Editing models with task arithmetic

Gabriel Ilharco, Marco Túlio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. https://openreview.net/forum?id=6t0Kwf8-jrj

2023

[17]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.ne...

2025

[18]

Reveal: Self-evolving code agents via reliable self-verification

Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, and Jing Bai. Reveal: Self-evolving code agents via reliable self-verification, 2025. https://arxiv.org/abs/2506.11442

2025

[19]

Scaling Test-Time Compute for Agentic Coding

Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, and Anirudh Goyal. Scaling test-time compute for agentic coding, 2026. https://arxiv.org/abs/2604.16529

2026

[20]

Kimi Team, Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong G...

2026

[21]

S*: Test time scaling for code generation

Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test time scaling for code generation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-...

2025

[22]

CodePRM: Execution feedback-enhanced process reward model for code generation

Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. CodePRM: Execution feedback-enhanced process reward model for code generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 8169–8182, Vienna,...

doi:10.18653/v1/2025.findings-acl.428, 2025

[23]

Taco: Topics in algorithmic code generation dataset

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset, 2023. https://arxiv.org/abs/2312.14852

2023

[24]

Competition-level code generation with AlphaCode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

doi:10.1126/science.abq1158, 2022

[25]

Mitigating the alignment tax of RLHF

Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, and Tong Zhang. Mitigating the alignment tax of RLHF. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical ...

doi:10.18653/v1/2024.emnlp-main.35, 2024

[26]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on N...

2023

[27]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. https://arxiv.org/abs/2503.20783

2025

[28]

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, and Parthasarathy Ranganathan. Swe-fficiency: Can language models optimize real-world repositories on real workloads?, 2025. https://arxiv.org/abs/2511.06090

2025

[29]

Self-Execution Simulation Improves Coding Models

Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, and Yossi Adi. Self-execution simulation improves coding models, 2026. https://arxiv.org/abs/2604.03253

2026

[30]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z...

2025

[31]

Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset

Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset, 2025. https://arxiv.org/abs/2504.16891

2025

[32]

Kernelbench: Can llms write efficient GPU kernels?

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient GPU kernels? In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Learning, ...

2025

[33]

Qwen2.5 Technical Report

Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

2025

[34]

Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

Alexandre Ramé, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural ...

2023

[35]

WARM: on the benefits of weight averaged reward models

Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. WARM: on the benefits of weight averaged reward models. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conferenc...

2024

[36]

Warp: On the benefits of weight averaged rewarded policies

Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, and Olivier Bachem. Warp: On the benefits of weight averaged rewarded policies, 2024. https://arxiv.org/abs/2406.16768

2024

[37]

Optimizing language models for inference time objectives using reinforcement learning

Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Rémi Munos. Optimizing language models for inference time objectives using reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Learning, ICML 20...

2025

[38]

Improving diversity in language models: When temperature fails, change the loss

Alexandre Verine, Florian Le Bronnec, Kunhao Zheng, Alexandre Allauzen, Yann Chevaleyre, and Benjamin Négrevergne. Improving diversity in language models: When temperature fails, change the loss. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second Inter...

2025

[39]

Siddhant Waghjale, Vishruth Veerendranath, Zhiruo Wang, and Daniel Fried. ECCO: can we improve model-generated code efficiency without sacrificing functional correctness? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, Nov...

doi:10.18653/v1/2024.emnlp-main.859, 2024

[40]

Localizing task information for improved model merging and compression

Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jiménez, François Fleuret, and Pascal Frossard. Localizing task information for improved model merging and compression. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Mac...

2024

[41]

Linear Dynamics in the RLVR Training of Large Language Models

Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, and Ning Miao. Linear dynamics in the rlvr training of large language models, 2026. https://arxiv.org/abs/2601.04537

2026

[42]

Codecontests+: High-quality test case generation for competitive programming

Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. Codecontests+: High-quality test case generation for competitive programming, 2025. https://arxiv.org/abs/2506.05817

2025
