Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
Pith reviewed 2026-05-08 16:46 UTC · model grok-4.3
The pith
The power distribution is the closed-form optimizer of KL-regularized RL when sequence-level log-probabilities serve as the reward.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. From the sampling perspective, inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. Experiments show that power sampling raises self-reward, that true-reward gains depend on alignment with self-reward, and that power self-distillation can match or exceed the performance of power sampling at much lower inference cost.
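To make the identification concrete, the substitution runs as follows (a sketch in standard notation, not copied from the manuscript: π is the base model, p the policy being optimized, and β > 0 the KL temperature):

```latex
% Sketch: KL-regularized RL with the model's own log-probability as the reward.
\begin{align*}
  &\max_{p}\ \mathbb{E}_{x\sim p}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(p \,\|\, \pi\big),
  \qquad r(x) = \log \pi(x) \\
  % the generic closed-form maximizer, then the substitution r(x) = log pi(x):
  &p^{*}(x) \;\propto\; \pi(x)\,\exp\!\big(r(x)/\beta\big)
   \;=\; \pi(x)\,\exp\!\Big(\tfrac{1}{\beta}\log\pi(x)\Big)
   \;=\; \pi(x)^{\,1+1/\beta}
\end{align*}
% i.e. the power distribution with exponent 1 + 1/beta.
```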
What carries the argument
The power distribution: the target of power sampling, proportional to the model's probability raised to a power. It exactly optimizes the self-reward KL-regularized RL objective and supplies the shared target for the resulting offline distillation.
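A quick numerical sanity check of the same claim on a toy discrete space (my own sketch, not code from the paper): the power distribution should attain the maximum of the self-reward KL-regularized objective over the whole simplex.

```python
# Toy check: E_p[log pi] - beta * KL(p || pi) is maximized by p ∝ pi^(1 + 1/beta).
import numpy as np

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(6))            # toy "base model" over 6 outcomes
beta = 0.5                                 # KL temperature
power = pi ** (1 + 1 / beta)
power /= power.sum()                       # candidate optimum: the power distribution

def objective(p):
    # E_p[log pi] - beta * KL(p || pi); all entries here are strictly positive
    return float(np.sum(p * np.log(pi)) - beta * np.sum(p * np.log(p / pi)))

best = objective(power)
for _ in range(10_000):                    # random alternatives never do better
    q = rng.dirichlet(np.ones(6))
    assert objective(q) <= best + 1e-9
print("maximum attained by the power distribution:", best)
```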
If this is right
- Power sampling raises the model's self-reward.
- Improvement in downstream true reward is governed by the covariance between true reward and self-reward under the power distribution.
- Power self-distillation achieves self-reward sharpening while matching or exceeding the performance of power sampling at much lower inference cost.
- Local approximations to sequence-level power sampling cannot succeed without suffix information.
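A toy illustration of the last point, with a two-token model of my own construction rather than anything from the paper: renormalizing per-token conditionals to a power misses the suffix-dependent factor that the sequence-level power distribution requires.

```python
# Token-local power sampling (renormalize pi(token | prefix)^alpha at each step)
# versus the sequence-level power distribution P(x) ∝ pi(x)^alpha, on a toy model.
import numpy as np

alpha = 2.0
pi_x1 = np.array([0.5, 0.5])                       # P(first token)
pi_x2_given_x1 = np.array([[0.9, 0.1],             # P(second | first = 0): peaked
                           [0.5, 0.5]])            # P(second | first = 1): flat

# Exact sequence-level power distribution over the four length-2 sequences.
seq_p = (pi_x1[:, None] * pi_x2_given_x1) ** alpha
seq_p /= seq_p.sum()

# Token-local approximation: power-renormalize each conditional independently.
loc_x1 = pi_x1 ** alpha / np.sum(pi_x1 ** alpha)

print("first-token marginal, sequence-level power:", seq_p.sum(axis=1))  # ~[0.62, 0.38]
print("first-token marginal, token-local power:   ", loc_x1)             # [0.5, 0.5]
# The true first-token weight carries a factor sum_b pi(b | a)^alpha, i.e.
# information about the possible suffixes, which the local rule cannot see.
```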
Where Pith is reading between the lines
- If covariance between self-reward and true reward can be increased by prior alignment steps, the method could replace more expensive online RL loops in many training pipelines.
- The same bridge might exist for other sampling schemes if they admit closed-form RL optima, suggesting a general pattern for turning inference-time methods into offline training targets.
- Practitioners could measure self-reward/true-reward covariance on held-out data before committing to power-based training to predict whether gains will transfer.
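One way such a pre-check might look (a hedged sketch with made-up inputs; the estimator is a self-normalized importance-weighted sample covariance assuming the power family p_α(x) ∝ π(x)^α, which is my choice, not necessarily the paper's):

```python
# Estimate Cov(true reward, self-reward) under the power distribution from
# sequences sampled from the base model, via self-normalized importance weights
# proportional to pi(x)^(alpha - 1). Inputs are per-sequence log-probabilities
# and external reward scores on held-out data (made-up numbers below).
import numpy as np

def covariance_under_power(logprobs, true_rewards, alpha=2.0):
    logprobs = np.asarray(logprobs, dtype=float)
    true_rewards = np.asarray(true_rewards, dtype=float)
    logw = (alpha - 1.0) * logprobs
    w = np.exp(logw - logw.max())          # stabilized importance weights
    w /= w.sum()
    mean_r = np.sum(w * true_rewards)
    mean_s = np.sum(w * logprobs)          # self-reward = sequence log-probability
    return float(np.sum(w * (true_rewards - mean_r) * (logprobs - mean_s)))

rng = np.random.default_rng(1)
logprobs = rng.normal(-40.0, 5.0, size=256)          # stand-in held-out log-probs
true_rewards = (logprobs > -40.0).astype(float)      # stand-in external scores
print("Cov_power(true reward, self-reward):",
      covariance_under_power(logprobs, true_rewards))
```

A clearly positive estimate would suggest that sharpening toward the power distribution should transfer to the true reward; a near-zero or negative one would argue against committing to power-based training.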
Load-bearing premise
Defining the RL reward as the model's own sequence-level log-probabilities produces a meaningful and non-degenerate self-reward signal whose covariance with external true rewards can be reliably estimated and exploited.
What would settle it
A direct derivation or small-model experiment showing that the minimizer of the KL-regularized objective with self-reward differs from the power distribution, or an experiment where power self-distillation fails to match power sampling performance even when covariance between self-reward and true reward is high.
Original abstract
Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation in LLMs. It identifies the power distribution as the closed-form optimizer of KL-regularized RL when the reward is the model's own sequence-level log-probabilities, which leads to power self-distillation as an offline surrogate sharing the same target distribution. It further claims that true-reward improvements are governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks show that power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed power sampling at lower inference cost.
Significance. If the identification and covariance analysis hold, the work unifies sampling, RL, and distillation techniques, offering a theoretical basis for power sampling's effectiveness and a lower-cost offline alternative via distillation. The covariance condition provides a predictive lens for when self-reward methods succeed on true rewards. The experiments lend empirical support on reasoning tasks, and the explicit link to amortizing sampling costs into supervised training is a practical strength.
major comments (2)
- [RL perspective section] The identification that the power distribution is the closed-form optimizer follows immediately from substituting r(x) = log π(x) into the standard KL-regularized solution p*(x) ∝ π(x) exp(r(x)/β), but the manuscript should clarify whether β is fixed or general and whether additional assumptions (e.g., on the support or normalization) are required for the equivalence to hold exactly.
- [Covariance and true-reward analysis] The claim that improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution requires an explicit derivation or reference to the relevant result (e.g., a policy-gradient expansion or expectation under the power distribution). Without this step, the governance relation remains informal and load-bearing for the self-distillation analysis.
minor comments (3)
- [Abstract] The abstract statement that inexpensive local approximations cannot reproduce sequence-level power without suffix information would be strengthened by a short illustrative example or pointer to the corresponding derivation in the main text.
- [Experiments section] In the experiments, provide details on how covariance between true and self-reward was estimated (e.g., sample size, estimator, controls for confounding), along with error bars or statistical tests supporting the dependence claim.
- [Notation and presentation] Ensure consistent notation for the power exponent (or β) across equations, text, and figures; consider adding a table summarizing the three perspectives (sampling, RL, distillation) for clarity.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to incorporate the requested clarifications and derivations.
Point-by-point responses
- Referee: [RL perspective section] The identification that the power distribution is the closed-form optimizer follows immediately from substituting r(x) = log π(x) into the standard KL-regularized solution p*(x) ∝ π(x) exp(r(x)/β), but the manuscript should clarify whether β is fixed or general and whether additional assumptions (e.g., on the support or normalization) are required for the equivalence to hold exactly.
Authors: We agree that the identification follows directly from the substitution. In the revised manuscript, we will clarify that β is a general positive temperature parameter (yielding the power exponent 1 + 1/β) and state that the equivalence holds exactly under the standard assumptions of the KL-regularized RL objective: the distributions share the same support and are properly normalized. No additional assumptions are required. revision: yes
- Referee: [Covariance and true-reward analysis] The claim that improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution requires an explicit derivation or reference to the relevant result (e.g., a policy-gradient expansion or expectation under the power distribution). Without this step, the governance relation remains informal and load-bearing for the self-distillation analysis.
Authors: We acknowledge that the covariance relation was presented conceptually. We will add an explicit derivation in the revised manuscript, computing the expected true-reward improvement directly under the power distribution (or via the corresponding policy-gradient expansion) to show that it is governed by the covariance between the true reward and the self-reward (sequence-level log-probability). This will make the claim rigorous and better support the self-distillation analysis. revision: yes
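For orientation, one plausible shape of that derivation, assuming the power family is parameterized as p_α(x) ∝ π(x)^α (a sketch only; the authors' actual derivation may differ):

```latex
% Differentiate the expected true reward R along the power family
% p_alpha(x) = pi(x)^alpha / Z(alpha); the self-reward is log pi(x).
\begin{align*}
  \frac{\partial}{\partial \alpha}\,\mathbb{E}_{p_\alpha}\!\big[R(x)\big]
  &= \mathbb{E}_{p_\alpha}\!\Big[R(x)\,\tfrac{\partial}{\partial \alpha}\log p_\alpha(x)\Big] \\
  &= \mathbb{E}_{p_\alpha}\!\Big[R(x)\,\big(\log\pi(x) - \mathbb{E}_{p_\alpha}[\log\pi]\big)\Big]
   = \mathrm{Cov}_{p_\alpha}\!\big(R(x),\,\log\pi(x)\big)
\end{align*}
% Raising the exponent increases expected true reward exactly when true reward
% and self-reward are positively correlated under the power distribution.
```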
Circularity Check
No significant circularity; core identification is algebraic substitution
full rationale
The paper's key step is the observation that substituting the self-reward r(x) = log π(x) into the standard closed-form solution of KL-regularized RL immediately produces the power distribution p*(x) ∝ π(x)^{1 + 1/β}. This is a direct algebraic identity rather than a derived prediction or fitted result. The links to power self-distillation (offline training on samples from the same target) and the covariance governing true-reward gains follow from standard policy-gradient identities applied to this equivalence. No load-bearing step relies on self-citation, ansatz smuggling, or renaming; the argument is self-contained as a mathematical bridge between three existing techniques and does not reduce any claimed result to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- power exponent (equivalently the KL temperature β, via exponent 1 + 1/β)
axioms (2)
- standard math: KL-regularized RL admits a closed-form solution when the reward equals the sequence log-probability
- domain assumption: Sequence-level power cannot be recovered from local token approximations without suffix information