Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL
Pith reviewed 2026-06-29 14:37 UTC · model grok-4.3
The pith
Extrapolative weight averaging between RL checkpoints extends a correctness-efficiency frontier and improves ensembles by 3.3%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nested unit-test coverage in code RL induces a correctness-efficiency frontier in weight space. Interpolation between low- and high-coverage checkpoints recovers this frontier while extrapolation extends it beyond the trained endpoints, producing models that improve ensemble coverage without further RL training.
What carries the argument
Extrapolative weight averaging, the linear extension of model weights past the segment joining two checkpoints to produce new points on an extended frontier.
If this is right
- Extrapolated checkpoints solve different problems than the originals, functioning as complementary policies at inference time.
- The frontier and its extrapolative extension appear consistently for 7B and 32B models under pure reasoning, tool use, and agentic coding.
- Ensembles that include extrapolated checkpoints raise pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget.
- Both interpolation and extrapolation operate without any additional RL training steps.
Where Pith is reading between the lines
- Weight-space linearity may let post-training averaging trade off objectives in other RL settings that pit correctness against efficiency.
- Multi-checkpoint or higher-order extrapolation could further widen ensemble coverage on hard instances.
- The observed structure implies the loss landscape for this task contains low-dimensional linear directions connecting different coverage regimes.
Load-bearing premise
The relationship between checkpoints in weight space is sufficiently linear that extrapolation produces valid new models on the frontier rather than unrelated failure modes.
What would settle it
Measure whether extrapolated checkpoints exhibit abrupt correctness or efficiency collapse outside the linear trend observed between the original low- and high-coverage checkpoints.
read the original abstract
Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in RL for competitive programming, training checkpoints under nested unit-test coverage induces a correctness-efficiency frontier on hard problems. Linear interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. The frontier and its extension hold across three inference settings and two model scales; ensembles using extrapolative averaging improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget.
Significance. If the results hold under the linearity assumption, the work provides evidence that weight-space extrapolation can navigate and extend Pareto fronts in code RL without additional training, enabling complementary policies and inference-time gains. The consistency across inference modes and scales is a positive feature of the experimental design.
major comments (2)
- [Abstract] Abstract: the 3.3% pass@250 gain is reported without error bars, without specification of how coverage levels were chosen, and without verification that the gain survives multiple-testing correction or different random seeds; this directly affects the strength of the performance claim.
- [Experimental results] The central extrapolation result assumes that linear extrapolation in weight space produces models that continue the correctness-efficiency trade-off rather than introducing unrelated failure modes. The manuscript provides no quantitative value for the extrapolation coefficient and no control experiments confirming that the direction is frontier-specific or that extrapolated models exhibit the expected monotonic shift in coverage vs. optimization failures.
minor comments (1)
- [Abstract] The abstract refers to 'three inference settings' (pure reasoning, tool use, agentic coding) but does not list them explicitly when first introducing the consistency claim.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.
read point-by-point responses
-
Referee: [Abstract] Abstract: the 3.3% pass@250 gain is reported without error bars, without specification of how coverage levels were chosen, and without verification that the gain survives multiple-testing correction or different random seeds; this directly affects the strength of the performance claim.
Authors: We agree that including error bars and additional details would strengthen the claim. In the revised manuscript, we will report error bars for the pass@250 metric based on available runs, specify the coverage levels used in the experiments (low-coverage at 25% and high-coverage at 100% of the test suite), and note that the 3.3% gain is for the primary metric without multiple-testing correction. We will also discuss the limitation regarding different random seeds and include variance estimates where possible. revision: yes
-
Referee: [Experimental results] The central extrapolation result assumes that linear extrapolation in weight space produces models that continue the correctness-efficiency trade-off rather than introducing unrelated failure modes. The manuscript provides no quantitative value for the extrapolation coefficient and no control experiments confirming that the direction is frontier-specific or that extrapolated models exhibit the expected monotonic shift in coverage vs. optimization failures.
Authors: We will include the specific extrapolation coefficient in the methods section of the revised manuscript (typically set to 0.5 beyond the high-coverage endpoint). For control experiments, we will add analysis showing the monotonic shift in coverage and optimization failures for extrapolated models. However, exhaustive controls to rule out all unrelated failure modes would require additional training runs, which we will partially address by including results from a control direction (e.g., extrapolation in the opposite direction) and discussing this as a limitation. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's claims rest on direct experimental training of RL checkpoints under nested coverage rewards, followed by explicit weight averaging and benchmark evaluations (pass@250, LCB/hard). No equations, fitted parameters, or self-citations are invoked to define or force the reported frontiers or extrapolation results by construction; the outcomes are measured quantities that remain independently falsifiable via replication.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
Opencodereasoning-ii: A simple test time scaling approach via self-critique, 2025
Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning-ii: A simple test time scaling approach via self-critique, 2025. https://arxiv.org/abs/2507.09075
-
[3]
On predictability of reinforcement learning dynamics for large language models, 2026
Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guangzhong Sun, Guiquan Liu, and Junfeng Fang. On predictability of reinforcement learning dynamics for large language models, 2026. https://arxiv.org/abs/2510.00553
-
[4]
Bigo(bench) -- can llms generate code with controlled time and space complexity?, 2025
Pierre Chambon, Baptiste Roziere, Benoit Sagot, and Gabriel Synnaeve. Bigo(bench) -- can llms generate code with controlled time and space complexity?, 2025. https://arxiv.org/abs/2503.15242
-
[5]
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. https://arxiv.org/abs/2505.22617
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Process supervision-guided policy optimization for code generation, 2025
Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation, 2025. https://arxiv.org/abs/2410.17621
-
[7]
Mercury: A code efficiency benchmark for code large language models
Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, and See - Kiong Ng. Mercury: A code efficiency benchmark for code large language models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Proce...
2024
-
[8]
Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization, 2025
Mingzhe Du, Luu Anh Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, and See kiong Ng. Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization, 2025. https://arxiv.org/abs/2505.23387
-
[9]
FAIR CodeGen team , Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabi...
-
[10]
Roy, and Michael Carbin
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , Proceedings of Machine Learning Research, pages 3259--3269. PMLR , 2020. http://proceedings.mlr.pre...
2020
-
[11]
Neural thickets: Diverse task experts are dense around pretrained weights, 2026
Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights, 2026. https://arxiv.org/abs/2603.12228
-
[12]
RLEF: grounding code llms in execution feedback with reinforcement learning
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. RLEF: grounding code llms in execution feedback with reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Lear...
2025
-
[13]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...
-
[14]
Measuring coding challenge competence with APPS
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS . In Joaquin Vanschoren and Sai - Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, Neu...
2021
-
[15]
Effibench: Benchmarking the efficiency of automatically generated code
Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie Zhang. Effibench: Benchmarking the efficiency of automatically generated code. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Pr...
2024
-
[16]
Editing models with task arithmetic
Gabriel Ilharco, Marco T \' u lio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. https://openreview.net/forum?id=6t0Kwf8-jrj
2023
-
[17]
Livecodebench: Holistic and contamination free evaluation of large language models for code
Naman Jain, King Han, Alex Gu, Wen - Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar - Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.ne...
2025
-
[18]
Reveal: Self-evolving code agents via reliable self-verification, 2025
Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, and Jing Bai. Reveal: Self-evolving code agents via reliable self-verification, 2025. https://arxiv.org/abs/2506.11442
-
[19]
Scaling Test-Time Compute for Agentic Coding
Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, and Anirudh Goyal. Scaling test-time compute for agentic coding, 2026. https://arxiv.org/abs/2604.16529
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[20]
Kimi Team , Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong G...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[21]
Gonzalez, and Ion Stoica
Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test time scaling for code generation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-...
2025
-
[22]
C ode PRM : Execution feedback-enhanced process reward model for code generation
Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. C ode PRM : Execution feedback-enhanced process reward model for code generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 8169--8182, Vienna,...
-
[23]
Taco: Topics in algorithmic code generation dataset, 2023
Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset, 2023. https://arxiv.org/abs/2312.14852
-
[24]
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...
-
[25]
Mitigating the alignment tax of RLHF
Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, and Tong Zhang. Mitigating the alignment tax of RLHF . In Yaser Al - Onaizan, Mohit Bansal, and Yun - Nung Chen, editors, Proceedings of the 2024 Conference on Empirical ...
-
[26]
Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on N...
2023
-
[27]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. https://arxiv.org/abs/2503.20783
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?
Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, and Parthasarathy Ranganathan. Swe-fficiency: Can language models optimize real-world repositories on real workloads?, 2025. https://arxiv.org/abs/2511.06090
work page internal anchor Pith review arXiv 2025
-
[29]
Self-Execution Simulation Improves Coding Models
Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, and Yossi Adi. Self-execution simulation improves coding models, 2026. https://arxiv.org/abs/2604.03253
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[30]
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset, 2025. https://arxiv.org/abs/2504.16891
-
[32]
Zhang, William Hu, Christopher R \' e , and Azalia Mirhoseini
Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher R \' e , and Azalia Mirhoseini. Kernelbench: Can llms write efficient GPU kernels? In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Learning, ...
2025
-
[33]
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Alexandre Ram \' e , Guillaume Couairon, Corentin Dancette, Jean - Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural ...
2023
-
[35]
WARM: on the benefits of weight averaged reward models
Alexandre Ram \' e , Nino Vieillard, L \' e onard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. WARM: on the benefits of weight averaged reward models. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conferenc...
2024
-
[36]
Warp: On the benefits of weight averaged rewarded policies, 2024
Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, and Olivier Bachem. Warp: On the benefits of weight averaged rewarded policies, 2024. https://arxiv.org/abs/2406.16768
-
[37]
Optimizing language models for inference time objectives using reinforcement learning
Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and R \' e mi Munos. Optimizing language models for inference time objectives using reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second International Conference on Machine Learning, ICML 20...
2025
-
[38]
Improving diversity in language models: When temperature fails, change the loss
Alexandre Verine, Florian Le Bronnec, Kunhao Zheng, Alexandre Allauzen, Yann Chevaleyre, and Benjamin N \' e grevergne. Improving diversity in language models: When temperature fails, change the loss. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste - Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second Inter...
2025
-
[39]
Siddhant Waghjale, Vishruth Veerendranath, Zhiruo Wang, and Daniel Fried. ECCO: can we improve model-generated code efficiency without sacrificing functional correctness? In Yaser Al - Onaizan, Mohit Bansal, and Yun - Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, Nov...
-
[40]
Localizing task information for improved model merging and compression
Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz - Jim \' e nez, Fran c ois Fleuret, and Pascal Frossard. Localizing task information for improved model merging and compression. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Mac...
2024
-
[41]
Linear Dynamics in the RLVR Training of Large Language Models
Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, and Ning Miao. Linear dynamics in the rlvr training of large language models, 2026. https://arxiv.org/abs/2601.04537
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[42]
Codecontests+: High-quality test case generation for competitive programming, 2025
Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. Codecontests+: High-quality test case generation for competitive programming, 2025. https://arxiv.org/abs/2506.05817
-
[43]
Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt
Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song,...
2022
-
[44]
The invisible leash: Why rlvr may or may not escape its origin.arXiv preprint arXiv:2507.14843,
Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why rlvr may or may not escape its origin, 2026. https://arxiv.org/abs/2507.14843
-
[45]
Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning, 2025
Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning, 2025. https://arxiv.org/abs/2509.02479
-
[46]
Adamerging: Adaptive model merging for multi-task learning
Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. https://openreview.net/forum?id=nZP6NgD3QY
2024
-
[47]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. https://arxiv.org/abs/2504.13837
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
Model extrapolation expedites alignment
Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, and Nanyun Peng. Model extrapolation expedites alignment. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 20...
2025
-
[50]
Kunhao Zheng, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin N \' e Xuanjinggrevergne, and Gabriel Synnaeve. What makes large language models reason in (multi-turn) code generation? In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.net, 2025 b . https://openreview.net/forum?...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.