pith. sign in

arxiv: 2605.28751 · v1 · pith:5VIWWPPUnew · submitted 2026-05-27 · 💻 cs.LG · cs.AI· cs.CL

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Pith reviewed 2026-06-29 14:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords code reinforcement learningweight averagingcorrectness-efficiency frontiercompetitive programmingextrapolationensemble methodsunit test coverage
0
0 comments X

The pith

Extrapolative weight averaging between RL checkpoints extends a correctness-efficiency frontier and improves ensembles by 3.3%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains code RL checkpoints from a shared start using rewards tied to progressively stricter unit-test coverage. This sweep produces a frontier on hard problems where higher coverage cuts optimization failures but raises correctness failures, leaving overall solve rate flat. Linear interpolation between checkpoints traces the frontier; extrapolation beyond the endpoints generates new models that continue the trade-off. These extrapolated models act as complementary policies, and ensembles built from them raise pass@250 on hard problems by 3.3% at fixed sample budget across model sizes and inference modes.

Core claim

Nested unit-test coverage in code RL induces a correctness-efficiency frontier in weight space. Interpolation between low- and high-coverage checkpoints recovers this frontier while extrapolation extends it beyond the trained endpoints, producing models that improve ensemble coverage without further RL training.

What carries the argument

Extrapolative weight averaging, the linear extension of model weights past the segment joining two checkpoints to produce new points on an extended frontier.

If this is right

  • Extrapolated checkpoints solve different problems than the originals, functioning as complementary policies at inference time.
  • The frontier and its extrapolative extension appear consistently for 7B and 32B models under pure reasoning, tool use, and agentic coding.
  • Ensembles that include extrapolated checkpoints raise pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget.
  • Both interpolation and extrapolation operate without any additional RL training steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Weight-space linearity may let post-training averaging trade off objectives in other RL settings that pit correctness against efficiency.
  • Multi-checkpoint or higher-order extrapolation could further widen ensemble coverage on hard instances.
  • The observed structure implies the loss landscape for this task contains low-dimensional linear directions connecting different coverage regimes.

Load-bearing premise

The relationship between checkpoints in weight space is sufficiently linear that extrapolation produces valid new models on the frontier rather than unrelated failure modes.

What would settle it

Measure whether extrapolated checkpoints exhibit abrupt correctness or efficiency collapse outside the linear trend observed between the original low- and high-coverage checkpoints.

read the original abstract

Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that in RL for competitive programming, training checkpoints under nested unit-test coverage induces a correctness-efficiency frontier on hard problems. Linear interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. The frontier and its extension hold across three inference settings and two model scales; ensembles using extrapolative averaging improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget.

Significance. If the results hold under the linearity assumption, the work provides evidence that weight-space extrapolation can navigate and extend Pareto fronts in code RL without additional training, enabling complementary policies and inference-time gains. The consistency across inference modes and scales is a positive feature of the experimental design.

major comments (2)
  1. [Abstract] Abstract: the 3.3% pass@250 gain is reported without error bars, without specification of how coverage levels were chosen, and without verification that the gain survives multiple-testing correction or different random seeds; this directly affects the strength of the performance claim.
  2. [Experimental results] The central extrapolation result assumes that linear extrapolation in weight space produces models that continue the correctness-efficiency trade-off rather than introducing unrelated failure modes. The manuscript provides no quantitative value for the extrapolation coefficient and no control experiments confirming that the direction is frontier-specific or that extrapolated models exhibit the expected monotonic shift in coverage vs. optimization failures.
minor comments (1)
  1. [Abstract] The abstract refers to 'three inference settings' (pure reasoning, tool use, agentic coding) but does not list them explicitly when first introducing the consistency claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the 3.3% pass@250 gain is reported without error bars, without specification of how coverage levels were chosen, and without verification that the gain survives multiple-testing correction or different random seeds; this directly affects the strength of the performance claim.

    Authors: We agree that including error bars and additional details would strengthen the claim. In the revised manuscript, we will report error bars for the pass@250 metric based on available runs, specify the coverage levels used in the experiments (low-coverage at 25% and high-coverage at 100% of the test suite), and note that the 3.3% gain is for the primary metric without multiple-testing correction. We will also discuss the limitation regarding different random seeds and include variance estimates where possible. revision: yes

  2. Referee: [Experimental results] The central extrapolation result assumes that linear extrapolation in weight space produces models that continue the correctness-efficiency trade-off rather than introducing unrelated failure modes. The manuscript provides no quantitative value for the extrapolation coefficient and no control experiments confirming that the direction is frontier-specific or that extrapolated models exhibit the expected monotonic shift in coverage vs. optimization failures.

    Authors: We will include the specific extrapolation coefficient in the methods section of the revised manuscript (typically set to 0.5 beyond the high-coverage endpoint). For control experiments, we will add analysis showing the monotonic shift in coverage and optimization failures for extrapolated models. However, exhaustive controls to rule out all unrelated failure modes would require additional training runs, which we will partially address by including results from a control direction (e.g., extrapolation in the opposite direction) and discussing this as a limitation. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's claims rest on direct experimental training of RL checkpoints under nested coverage rewards, followed by explicit weight averaging and benchmark evaluations (pass@250, LCB/hard). No equations, fitted parameters, or self-citations are invoked to define or force the reported frontiers or extrapolation results by construction; the outcomes are measured quantities that remain independently falsifiable via replication.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning study; it introduces no new mathematical axioms, no invented physical or computational entities, and no free parameters beyond those standard in RL training that are not enumerated in the abstract.

pith-pipeline@v0.9.1-grok · 5818 in / 1190 out tokens · 41515 ms · 2026-06-29T14:37:55.340779+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 30 canonical work pages · 11 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Opencodereasoning-ii: A simple test time scaling approach via self-critique, 2025

    Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning-ii: A simple test time scaling approach via self-critique, 2025. https://arxiv.org/abs/2507.09075

  3. [3]

    On predictability of reinforcement learning dynamics for large language models, 2026

    Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guangzhong Sun, Guiquan Liu, and Junfeng Fang. On predictability of reinforcement learning dynamics for large language models, 2026. https://arxiv.org/abs/2510.00553

  4. [4]

    Bigo(bench) -- can llms generate code with controlled time and space complexity?, 2025

    Pierre Chambon, Baptiste Roziere, Benoit Sagot, and Gabriel Synnaeve. Bigo(bench) -- can llms generate code with controlled time and space complexity?, 2025. https://arxiv.org/abs/2503.15242

  5. [5]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. https://arxiv.org/abs/2505.22617

  6. [6]

    Process supervision-guided policy optimization for code generation, 2025

    Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation, 2025. https://arxiv.org/abs/2410.17621

  7. [7]

    Mercury: A code efficiency benchmark for code large language models

    Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, and See - Kiong Ng. Mercury: A code efficiency benchmark for code large language models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Proce...

  8. [8]

    Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization, 2025

    Mingzhe Du, Luu Anh Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, and See kiong Ng. Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization, 2025. https://arxiv.org/abs/2505.23387

  9. [9]

    FAIR CodeGen team , Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabi...

  10. [10]

    Roy, and Michael Carbin

    Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , Proceedings of Machine Learning Research, pages 3259--3269. PMLR , 2020. http://proceedings.mlr.pre...

  11. [11]

    Neural thickets: Diverse task experts are dense around pretrained weights, 2026

    Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights, 2026. https://arxiv.org/abs/2603.12228

  12. [12]

    RLEF: grounding code llms in execution feedback with reinforcement learning

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. RLEF: grounding code llms in execution feedback with reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Lear...

  13. [13]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  14. [14]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS . In Joaquin Vanschoren and Sai - Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, Neu...

  15. [15]

    Effibench: Benchmarking the efficiency of automatically generated code

    Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie Zhang. Effibench: Benchmarking the efficiency of automatically generated code. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Pr...

  16. [16]

    Editing models with task arithmetic

    Gabriel Ilharco, Marco T \' u lio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. https://openreview.net/forum?id=6t0Kwf8-jrj

  17. [17]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen - Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar - Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.ne...

  18. [18]

    Reveal: Self-evolving code agents via reliable self-verification, 2025

    Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, and Jing Bai. Reveal: Self-evolving code agents via reliable self-verification, 2025. https://arxiv.org/abs/2506.11442

  19. [19]

    Scaling Test-Time Compute for Agentic Coding

    Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, and Anirudh Goyal. Scaling test-time compute for agentic coding, 2026. https://arxiv.org/abs/2604.16529

  20. [20]

    Kimi Team , Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong G...

  21. [21]

    Gonzalez, and Ion Stoica

    Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test time scaling for code generation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-...

  22. [22]

    C ode PRM : Execution feedback-enhanced process reward model for code generation

    Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. C ode PRM : Execution feedback-enhanced process reward model for code generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 8169--8182, Vienna,...

  23. [23]

    Taco: Topics in algorithmic code generation dataset, 2023

    Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset, 2023. https://arxiv.org/abs/2312.14852

  24. [24]

    doi: 10.1126/science.abq1158

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

  25. [25]

    Mitigating the alignment tax of RLHF

    Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, and Tong Zhang. Mitigating the alignment tax of RLHF . In Yaser Al - Onaizan, Mohit Bansal, and Yun - Nung Chen, editors, Proceedings of the 2024 Conference on Empirical ...

  26. [26]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on N...

  27. [27]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. https://arxiv.org/abs/2503.20783

  28. [28]

    SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

    Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, and Parthasarathy Ranganathan. Swe-fficiency: Can language models optimize real-world repositories on real workloads?, 2025. https://arxiv.org/abs/2511.06090

  29. [29]

    Self-Execution Simulation Improves Coding Models

    Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, and Yossi Adi. Self-execution simulation improves coding models, 2026. https://arxiv.org/abs/2604.03253

  30. [30]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z...

  31. [31]

    Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset

    Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset, 2025. https://arxiv.org/abs/2504.16891

  32. [32]

    Zhang, William Hu, Christopher R \' e , and Azalia Mirhoseini

    Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher R \' e , and Azalia Mirhoseini. Kernelbench: Can llms write efficient GPU kernels? In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Learning, ...

  33. [33]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  34. [34]

    Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

    Alexandre Ram \' e , Guillaume Couairon, Corentin Dancette, Jean - Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural ...

  35. [35]

    WARM: on the benefits of weight averaged reward models

    Alexandre Ram \' e , Nino Vieillard, L \' e onard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. WARM: on the benefits of weight averaged reward models. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conferenc...

  36. [36]

    Warp: On the benefits of weight averaged rewarded policies, 2024

    Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, and Olivier Bachem. Warp: On the benefits of weight averaged rewarded policies, 2024. https://arxiv.org/abs/2406.16768

  37. [37]

    Optimizing language models for inference time objectives using reinforcement learning

    Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and R \' e mi Munos. Optimizing language models for inference time objectives using reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Learning, ICML 20...

  38. [38]

    Improving diversity in language models: When temperature fails, change the loss

    Alexandre Verine, Florian Le Bronnec, Kunhao Zheng, Alexandre Allauzen, Yann Chevaleyre, and Benjamin N \' e grevergne. Improving diversity in language models: When temperature fails, change the loss. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second Inter...

  39. [39]

    Siddhant Waghjale, Vishruth Veerendranath, Zhiruo Wang, and Daniel Fried. ECCO: can we improve model-generated code efficiency without sacrificing functional correctness? In Yaser Al - Onaizan, Mohit Bansal, and Yun - Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, Nov...

  40. [40]

    Localizing task information for improved model merging and compression

    Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz - Jim \' e nez, Fran c ois Fleuret, and Pascal Frossard. Localizing task information for improved model merging and compression. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Mac...

  41. [41]

    Linear Dynamics in the RLVR Training of Large Language Models

    Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, and Ning Miao. Linear dynamics in the rlvr training of large language models, 2026. https://arxiv.org/abs/2601.04537

  42. [42]

    Codecontests+: High-quality test case generation for competitive programming, 2025

    Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. Codecontests+: High-quality test case generation for competitive programming, 2025. https://arxiv.org/abs/2506.05817

  43. [43]

    Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt

    Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song,...

  44. [44]

    The invisible leash: Why rlvr may or may not escape its origin.arXiv preprint arXiv:2507.14843,

    Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why rlvr may or may not escape its origin, 2026. https://arxiv.org/abs/2507.14843

  45. [45]

    Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning, 2025

    Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning, 2025. https://arxiv.org/abs/2509.02479

  46. [46]

    Adamerging: Adaptive model merging for multi-task learning

    Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. https://openreview.net/forum?id=nZP6NgD3QY

  47. [47]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  48. [48]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. https://arxiv.org/abs/2504.13837

  49. [49]

    Model extrapolation expedites alignment

    Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, and Nanyun Peng. Model extrapolation expedites alignment. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 20...

  50. [50]

    Kunhao Zheng, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin N \' e Xuanjinggrevergne, and Gabriel Synnaeve. What makes large language models reason in (multi-turn) code generation? In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.net, 2025 b . https://openreview.net/forum?...