Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
Pith reviewed 2026-05-08 16:46 UTC · model grok-4.3
The pith
The power distribution is the closed-form optimizer of KL-regularized RL when sequence-level log-probabilities serve as the reward.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. From the sampling perspective, inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. Experiments show that power sampling raises self-reward, that true-reward gains depend on alignment with self-reward, and that power self-distillation can match or exceed the performance of power sampling at much lower inference cost.
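To make the identification concrete, the substitution runs as follows (a sketch in standard notation, not copied from the manuscript: π is the base model, p the policy being optimized, and β > 0 the KL temperature):

```latex
% Sketch: KL-regularized RL with the model's own log-probability as the reward.
\begin{align*}
  &\max_{p}\ \mathbb{E}_{x\sim p}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(p \,\|\, \pi\big),
  \qquad r(x) = \log \pi(x) \\
  % the generic closed-form maximizer, then the substitution r(x) = log pi(x):
  &p^{*}(x) \;\propto\; \pi(x)\,\exp\!\big(r(x)/\beta\big)
   \;=\; \pi(x)\,\exp\!\Big(\tfrac{1}{\beta}\log\pi(x)\Big)
   \;=\; \pi(x)^{\,1+1/\beta}
\end{align*}
% i.e. the power distribution with exponent 1 + 1/beta.
```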
What carries the argument
The power distribution: the target of power sampling, proportional to the model's probability raised to a power. It exactly optimizes the self-reward KL-regularized RL objective and supplies the shared target for the resulting offline distillation.
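A quick numerical sanity check of the same claim on a toy discrete space (my own sketch, not code from the paper): the power distribution should attain the maximum of the self-reward KL-regularized objective over the whole simplex.

```python
# Toy check: E_p[log pi] - beta * KL(p || pi) is maximized by p ∝ pi^(1 + 1/beta).
import numpy as np

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(6))            # toy "base model" over 6 outcomes
beta = 0.5                                 # KL temperature
power = pi ** (1 + 1 / beta)
power /= power.sum()                       # candidate optimum: the power distribution

def objective(p):
    # E_p[log pi] - beta * KL(p || pi); all entries here are strictly positive
    return float(np.sum(p * np.log(pi)) - beta * np.sum(p * np.log(p / pi)))

best = objective(power)
for _ in range(10_000):                    # random alternatives never do better
    q = rng.dirichlet(np.ones(6))
    assert objective(q) <= best + 1e-9
print("maximum attained by the power distribution:", best)
```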
If this is right
- Power sampling raises the model's self-reward.
- Improvement in downstream true reward is governed by the covariance between true reward and self-reward under the power distribution.
- Power self-distillation achieves self-reward sharpening while matching or exceeding the performance of power sampling at much lower inference cost.
- Local approximations to sequence-level power sampling cannot succeed without suffix information.
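A toy illustration of the last point, with a two-token model of my own construction rather than anything from the paper: renormalizing per-token conditionals to a power misses the suffix-dependent factor that the sequence-level power distribution requires.

```python
# Token-local power sampling (renormalize pi(token | prefix)^alpha at each step)
# versus the sequence-level power distribution P(x) ∝ pi(x)^alpha, on a toy model.
import numpy as np

alpha = 2.0
pi_x1 = np.array([0.5, 0.5])                       # P(first token)
pi_x2_given_x1 = np.array([[0.9, 0.1],             # P(second | first = 0): peaked
                           [0.5, 0.5]])            # P(second | first = 1): flat

# Exact sequence-level power distribution over the four length-2 sequences.
seq_p = (pi_x1[:, None] * pi_x2_given_x1) ** alpha
seq_p /= seq_p.sum()

# Token-local approximation: power-renormalize each conditional independently.
loc_x1 = pi_x1 ** alpha / np.sum(pi_x1 ** alpha)

print("first-token marginal, sequence-level power:", seq_p.sum(axis=1))  # ~[0.62, 0.38]
print("first-token marginal, token-local power:   ", loc_x1)             # [0.5, 0.5]
# The true first-token weight carries a factor sum_b pi(b | a)^alpha, i.e.
# information about the possible suffixes, which the local rule cannot see.
```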
Where Pith is reading between the lines
- If covariance between self-reward and true reward can be increased by prior alignment steps, the method could replace more expensive online RL loops in many training pipelines.
- The same bridge might exist for other sampling schemes if they admit closed-form RL optima, suggesting a general pattern for turning inference-time methods into offline training targets.
- Practitioners could measure self-reward/true-reward covariance on held-out data before committing to power-based training to predict whether gains will transfer.
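One way such a pre-check might look (a hedged sketch with made-up inputs; the estimator is a self-normalized importance-weighted sample covariance assuming the power family p_α(x) ∝ π(x)^α, which is my choice, not necessarily the paper's):

```python
# Estimate Cov(true reward, self-reward) under the power distribution from
# sequences sampled from the base model, via self-normalized importance weights
# proportional to pi(x)^(alpha - 1). Inputs are per-sequence log-probabilities
# and external reward scores on held-out data (made-up numbers below).
import numpy as np

def covariance_under_power(logprobs, true_rewards, alpha=2.0):
    logprobs = np.asarray(logprobs, dtype=float)
    true_rewards = np.asarray(true_rewards, dtype=float)
    logw = (alpha - 1.0) * logprobs
    w = np.exp(logw - logw.max())          # stabilized importance weights
    w /= w.sum()
    mean_r = np.sum(w * true_rewards)
    mean_s = np.sum(w * logprobs)          # self-reward = sequence log-probability
    return float(np.sum(w * (true_rewards - mean_r) * (logprobs - mean_s)))

rng = np.random.default_rng(1)
logprobs = rng.normal(-40.0, 5.0, size=256)          # stand-in held-out log-probs
true_rewards = (logprobs > -40.0).astype(float)      # stand-in external scores
print("Cov_power(true reward, self-reward):",
      covariance_under_power(logprobs, true_rewards))
```

A clearly positive estimate would suggest that sharpening toward the power distribution should transfer to the true reward; a near-zero or negative one would argue against committing to power-based training.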
Load-bearing premise
Defining the RL reward as the model's own sequence-level log-probabilities produces a meaningful and non-degenerate self-reward signal whose covariance with external true rewards can be reliably estimated and exploited.
What would settle it
A direct derivation or small-model experiment showing that the minimizer of the KL-regularized objective with self-reward differs from the power distribution, or an experiment where power self-distillation fails to match power sampling performance even when covariance between self-reward and true reward is high.
Original abstract
Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation in LLMs. It identifies the power distribution as the closed-form optimizer of KL-regularized RL when the reward is the model's own sequence-level log-probabilities, which leads to power self-distillation as an offline surrogate sharing the same target distribution. It further claims that true-reward improvements are governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks show that power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed power sampling at lower inference cost.
Significance. If the identification and covariance analysis hold, the work unifies sampling, RL, and distillation techniques, offering a theoretical basis for power sampling's effectiveness and a lower-cost offline alternative via distillation. The covariance condition provides a predictive lens for when self-reward methods succeed on true rewards. The experiments lend empirical support on reasoning tasks, and the explicit link to amortizing sampling costs into supervised training is a practical strength.
major comments (2)
- [RL perspective section] The identification that the power distribution is the closed-form optimizer follows immediately from substituting r(x) = log π(x) into the standard KL-regularized solution p*(x) ∝ π(x) exp(r(x)/β), but the manuscript should clarify whether β is fixed or general and whether additional assumptions (e.g., on the support or normalization) are required for the equivalence to hold exactly.
- [Covariance and true-reward analysis] The claim that improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution requires an explicit derivation or reference to the relevant result (e.g., a policy-gradient expansion or expectation under the power distribution). Without this step, the governance relation remains informal and load-bearing for the self-distillation analysis.
minor comments (3)
- [Abstract] The abstract statement that inexpensive local approximations cannot reproduce sequence-level power without suffix information would be strengthened by a short illustrative example or pointer to the corresponding derivation in the main text.
- [Experiments section] In the experiments, provide details on how covariance between true and self-reward was estimated (e.g., sample size, estimator, controls for confounding), along with error bars or statistical tests supporting the dependence claim.
- [Notation and presentation] Ensure consistent notation for the power exponent (or β) across equations, text, and figures; consider adding a table summarizing the three perspectives (sampling, RL, distillation) for clarity.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to incorporate the requested clarifications and derivations.
Point-by-point responses
- Referee: [RL perspective section] The identification that the power distribution is the closed-form optimizer follows immediately from substituting r(x) = log π(x) into the standard KL-regularized solution p*(x) ∝ π(x) exp(r(x)/β), but the manuscript should clarify whether β is fixed or general and whether additional assumptions (e.g., on the support or normalization) are required for the equivalence to hold exactly.
Authors: We agree that the identification follows directly from the substitution. In the revised manuscript, we will clarify that β is a general positive temperature parameter (yielding the power exponent 1 + 1/β) and state that the equivalence holds exactly under the standard assumptions of the KL-regularized RL objective: the distributions share the same support and are properly normalized. No additional assumptions are required. revision: yes
- Referee: [Covariance and true-reward analysis] The claim that improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution requires an explicit derivation or reference to the relevant result (e.g., a policy-gradient expansion or expectation under the power distribution). Without this step, the governance relation remains informal and load-bearing for the self-distillation analysis.
Authors: We acknowledge that the covariance relation was presented conceptually. We will add an explicit derivation in the revised manuscript, computing the expected true-reward improvement directly under the power distribution (or via the corresponding policy-gradient expansion) to show that it is governed by the covariance between the true reward and the self-reward (sequence-level log-probability). This will make the claim rigorous and better support the self-distillation analysis. revision: yes
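For orientation, one plausible shape of that derivation, assuming the power family is parameterized as p_α(x) ∝ π(x)^α (a sketch only; the authors' actual derivation may differ):

```latex
% Differentiate the expected true reward R along the power family
% p_alpha(x) = pi(x)^alpha / Z(alpha); the self-reward is log pi(x).
\begin{align*}
  \frac{\partial}{\partial \alpha}\,\mathbb{E}_{p_\alpha}\!\big[R(x)\big]
  &= \mathbb{E}_{p_\alpha}\!\Big[R(x)\,\tfrac{\partial}{\partial \alpha}\log p_\alpha(x)\Big] \\
  &= \mathbb{E}_{p_\alpha}\!\Big[R(x)\,\big(\log\pi(x) - \mathbb{E}_{p_\alpha}[\log\pi]\big)\Big]
   = \mathrm{Cov}_{p_\alpha}\!\big(R(x),\,\log\pi(x)\big)
\end{align*}
% Raising the exponent increases expected true reward exactly when true reward
% and self-reward are positively correlated under the power distribution.
```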
Circularity Check
No significant circularity; core identification is algebraic substitution
full rationale
The paper's key step is the observation that substituting the self-reward r(x) = log π(x) into the standard closed-form solution of KL-regularized RL immediately produces the power distribution p*(x) ∝ π(x)^{1 + 1/β}. This is a direct algebraic identity rather than a derived prediction or fitted result. The links to power self-distillation (offline training on samples from the same target) and the covariance governing true-reward gains follow from standard policy-gradient identities applied to this equivalence. No load-bearing step relies on self-citation, ansatz smuggling, or renaming; the argument is self-contained as a mathematical bridge between three existing techniques and does not reduce any claimed result to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- power exponent (equivalently the KL temperature β, via exponent 1 + 1/β)
axioms (2)
- standard math: KL-regularized RL admits a closed-form solution when the reward equals the sequence log-probability
- domain assumption: Sequence-level power cannot be recovered from local token approximations without suffix information