pith. machine review for the scientific record. sign in

arxiv: 2509.20265 · v3 · submitted 2025-09-24 · 💻 cs.LG · cs.CL

Failure Modes of Maximum Entropy RLHF

Pith reviewed 2026-05-18 13:57 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords reinforcement learning from human feedbackmaximum entropy RLSimPOoveroptimizationreward hackingpreference optimizationonline RLHFKL divergence
0
0 comments X

The pith

Maximum entropy RL applied to online RLHF produces overoptimization and unstable KL dynamics, even when entropy regularization is used.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Simple Preference Optimization can be derived directly from a maximum entropy reinforcement learning objective. It then tests whether this formulation delivers similar benefits when moved from offline preference optimization into online RLHF training. Experiments across model scales reveal frequent overoptimization, reward hacking, and erratic KL behavior that entropy regularization does not reliably suppress. These results matter because they indicate that reference-free methods successful in offline settings encounter distinct stability problems once online sampling and iterative optimization are introduced.

Core claim

SimPO is equivalent to maximum entropy RL. When the same objective is used for online RLHF, training exhibits overoptimization and unstable KL dynamics across scales. Entropy regularization fails to prevent reward hacking and, in the reported runs, correlates with the start of overoptimization rather than protecting against it. Configurations that remain stable do not owe their stability to the entropy term. The paper contrasts these outcomes with KL-constrained methods that maintain more reliable behavior and discusses why the offline success of SimPO does not transfer to the online regime.

What carries the argument

Derivation of SimPO as a maximum-entropy RL objective followed by its empirical deployment in online RLHF to expose failure modes.

If this is right

  • KL-constrained methods keep training stable while entropy regularization does not.
  • Overoptimization appears even at conservative learning rates in some configurations.
  • Entropy regularization is not the source of stability in runs that remain stable.
  • Reference-free methods encounter separate difficulties in online versus offline preference learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid objectives that combine entropy with an explicit reference-model KL term may be needed for stable online use of reference-free methods.
  • The difference between offline and online regimes may stem from how on-policy sampling amplifies reward-model errors without a fixed reference distribution.
  • A direct test would be to add a reference model to the maximum-entropy objective and measure whether the reported instabilities decrease.

Load-bearing premise

The tested model scales, learning rates, and preference datasets are representative enough that the observed overoptimization can be attributed primarily to entropy regularization rather than other implementation choices.

What would settle it

Re-run the online RLHF experiments with the maximum-entropy objective at the reported scales and learning rates while monitoring whether KL divergence stays bounded and reward hacking remains absent.

Figures

Figures reproduced from arXiv: 2509.20265 by Bar{\i}\c{s} Akg\"un, \"Omer Veysel \c{C}a\u{g}atan.

Figure 1
Figure 1. Figure 1: RLHF reward and entropy bonus during training for Pythia 1B with different entropy [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: RLHF reward and entropy bonus during training for Pythia 2.8B with different entropy [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: KL divergence metrics and win rates for KL-Constrained and Maximum Entropy regu [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Batch average of log( πref(yw|x) πref(yl|x) ) during DPO training. This perspective suggests that offline meth￾ods might potentially reduce reliance on ref￾erence models by introducing target margins that could serve a similar function to refer￾ence contributions. To explore this possibility, we visualize the reference log probability mar￾gins log  πref(yw|x) πref(yl|x)  during DPO training with Pythia 1… view at source ↗
Figure 5
Figure 5. Figure 5: KL divergence evolution during training for 1B and 2.8B parameter models using different [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Win rate progression across training checkpoints for different values of the gamma hyper [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: SimPO training metrics across different gamma values. Comparison of key training dy [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: SimPO training metrics across different gamma values. Comparison of key training dy [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: DPO training metrics with β = 0.05. Comparison of key training dynamics including loss, gradients, logits, and reward metrics, using Pythia-1B with learning rate 1 × 10−6 . 22 [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
read the original abstract

In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL frequently exhibits overoptimization and unstable KL dynamics across model scales, with overoptimization persisting even at conservative learning rates for some configurations. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to reliably prevent reward hacking and, in our experiments, correlates with the onset of overoptimization rather than guarding against it. Even in configurations where training remains stable, entropy regularization is not the stabilizing factor. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online versus offline preference learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper derives Simple Preference Optimization (SimPO) as an instance of Maximum Entropy Reinforcement Learning and then empirically investigates MaxEnt RL in online RLHF. It reports that MaxEnt RL exhibits overoptimization and unstable KL dynamics across model scales, that entropy regularization fails to prevent reward hacking (unlike KL-constrained baselines), and that these issues persist even at conservative learning rates; the authors discuss why SimPO succeeds offline while MaxEnt RL struggles online.

Significance. If the empirical claims are substantiated with rigorous controls and metrics, the work would usefully document failure modes of entropy regularization in online preference optimization and clarify the offline-online gap for reference-free methods, informing safer RLHF design.

major comments (3)
  1. [§4] §4 (Experiments): The abstract and results assert 'frequent overoptimization' and 'unstable KL dynamics' yet supply no quantitative metrics (e.g., reward scores, KL values with error bars), baseline comparisons, or statistical tests. Without these, the strength of the central empirical claim cannot be assessed.
  2. [§4.2] §4.2 and §5: The attribution of overoptimization and KL instability primarily to entropy regularization (rather than reward-model quality, online sampling, learning-rate schedules, or base-model initialization) requires explicit ablations that hold all other pipeline elements fixed while varying only the regularizer. The current controls do not isolate this causal factor.
  3. [§3] §3 (Derivation): The claim that SimPO is exactly MaxEnt RL should specify whether the equivalence is exact or approximate and how the entropy coefficient maps to the SimPO loss; any definitional overlap with the fitted regularization term should be clarified to avoid circularity in the instability analysis.
minor comments (2)
  1. [Abstract] Abstract: Adding one sentence on the specific model scales, preference datasets, and learning-rate ranges used would help readers gauge the generality of the reported failure modes.
  2. [Notation] Notation: The manuscript should consistently distinguish the entropy coefficient from any implicit KL terms when comparing MaxEnt RL to KL-constrained baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered each comment and revised the paper to strengthen the empirical evidence and clarify the theoretical aspects. Below we provide point-by-point responses.

read point-by-point responses
  1. Referee: [§4] The abstract and results assert 'frequent overoptimization' and 'unstable KL dynamics' yet supply no quantitative metrics (e.g., reward scores, KL values with error bars), baseline comparisons, or statistical tests. Without these, the strength of the central empirical claim cannot be assessed.

    Authors: We agree that quantitative metrics and statistical support would strengthen the presentation. In the revised manuscript we have added tables reporting mean reward scores and KL values with standard deviations across multiple random seeds, direct side-by-side comparisons against KL-constrained baselines, and p-values for the reported differences. These appear in Section 4 and the appendix. revision: yes

  2. Referee: [§4.2] The attribution of overoptimization and KL instability primarily to entropy regularization (rather than reward-model quality, online sampling, learning-rate schedules, or base-model initialization) requires explicit ablations that hold all other pipeline elements fixed while varying only the regularizer. The current controls do not isolate this causal factor.

    Authors: We appreciate the call for tighter isolation. The revised version includes new ablation experiments that keep the reward model, sampling distribution, learning-rate schedule, and base-model initialization fixed while varying only the choice of regularizer (entropy versus KL penalty). The additional results continue to associate the observed instabilities with entropy regularization in the online regime. revision: yes

  3. Referee: [§3] The claim that SimPO is exactly MaxEnt RL should specify whether the equivalence is exact or approximate and how the entropy coefficient maps to the SimPO loss; any definitional overlap with the fitted regularization term should be clarified to avoid circularity in the instability analysis.

    Authors: We thank the referee for this precision request. Section 3 presents an exact equivalence between SimPO and MaxEnt RL under the standard assumptions of the framework; the entropy coefficient maps directly to the SimPO beta hyper-parameter via the closed-form derivation we now display explicitly. We have also clarified that the regularization term in SimPO is the entropy objective itself rather than a separately fitted component, removing any potential circularity when analyzing online instability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is a mathematical equivalence and experiments are independent empirical observations

full rationale

The paper's core derivation shows SimPO can be obtained from MaxEnt RL objectives via standard RLHF reformulation steps, which constitutes an equivalence rather than a self-referential loop or fitted parameter renamed as prediction. The subsequent empirical investigation of overoptimization and KL instability in online settings relies on direct training runs across model scales and learning rates, without reducing to quantities defined by the same regularization term or self-citation chains. No load-bearing uniqueness theorems, ansatz smuggling, or renaming of known results appear; the claims about entropy regularization failing to stabilize training are presented as observed outcomes, not tautological consequences of the initial derivation. The work is therefore self-contained against external benchmarks for both the theoretical step and the reported failure modes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard RLHF modeling assumptions and the representativeness of the chosen experimental configurations; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Maximum Entropy RL is a suitable framework for modeling reference-free preference optimization in both offline and online regimes
    Invoked when the paper derives SimPO and then applies the same principle to online RLHF.

pith-pipeline@v0.9.0 · 5703 in / 1261 out tokens · 47953 ms · 2026-05-18T13:57:30.320437+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 20 internal anchors

  1. [1]

    Maximum a Posteriori Policy Optimisation

    Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018

  2. [2]

    The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning, 2025. URL https://arxiv.org/abs/2505.15134

  3. [3]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024. URL https://arxiv.org/abs/2402.14740

  4. [4]

    Understanding the impact of entropy on policy optimization

    Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International conference on machine learning, pp.\ 151--160. PMLR, 2019

  5. [5]

    A practical analysis of human alignment with *po, 2025

    Kian Ahrabian, Xihui Lin, Barun Patra, Vishrav Chaudhary, Alon Benhaim, Jay Pujara, and Xia Song. A practical analysis of human alignment with *po, 2025. URL https://arxiv.org/abs/2407.15229

  6. [6]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016. URL https://arxiv.org/abs/1606.06565

  7. [7]

    A general theoretical paradigm to understand learning from human preferences, 2023

    Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences, 2023. URL https://arxiv.org/abs/2310.12036

  8. [8]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  9. [9]

    H., Vanessa Wilfred, John Willes, Denise Wong, Wei Xu, Rongwu Xu, Yi Zeng, HongJiang Zhang, and Djordje Žikelić

    Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, Max Tegmark, Lan Xue, Ya-Qin Zhang, Stephen Casper, Wan Sie Lee, Sören Mindermann, Vanessa Wilfred, Vidhisha Balachandran, Fazl Barez, Michael Belinsky, Imane Bello, Malo Bourgon, Mark Brakel, Siméon Campos, Duncan Cass-Beggs, Jiahao Chen, Rumman Chowdhury, Kuan Chua Seah, Jeff Clune, Junt...

  10. [10]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373

  11. [11]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/2334029

  12. [12]

    Deep reinforcement learning from human preferences

    Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2023. URL https://arxiv.org/abs/1706.03741

  13. [13]

    UltraFeedback: Boosting Language Models with Scaled AI Feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with scaled ai feedback, 2024. URL https://arxiv.org/abs/2310.01377

  14. [14]

    RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment, 2023. URL https://arxiv.org/abs/2304.06767

  15. [15]

    Maximum entropy exploration in contextual bandits with neural networks and energy based models

    Adam Elwood, Marco Leonardi, Ashraf Mohamed, and Alessandro Rozza. Maximum entropy exploration in contextual bandits with neural networks and energy based models. Entropy, 25 0 (2): 0 188, January 2023. ISSN 1099-4300. doi:10.3390/e25020188. URL http://dx.doi.org/10.3390/e25020188

  16. [16]

    Maximum entropy rl (provably) solves some robust rl problems

    Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257, 2021

  17. [17]

    Taming the Noise in Reinforcement Learning via Soft Updates

    Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015

  18. [18]

    Scaling laws for reward model overoptimization, 2022

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization, 2022. URL https://arxiv.org/abs/2210.10760

  19. [19]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

  20. [20]

    Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory

    Peter D Gr \"u nwald and A Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. 2004

  21. [21]

    Sathiya Keerthi

    Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Siyu Zhu, Parag Agrawal, Natesh Pillai, and S. Sathiya Keerthi. Alphapo: Reward shape matters for llm alignment, 2025. URL https://arxiv.org/abs/2501.03884

  22. [22]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.\ 1861--1870, Stockholmsmässan, Stockholm Sweden, 10--15 Jul 2018. PMLR

  23. [23]

    Inverse reward design, 2020

    Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Inverse reward design, 2020. URL https://arxiv.org/abs/1711.02827

  24. [24]

    A max-min entropy framework for reinforcement learning

    Seungyul Han and Youngchul Sung. A max-min entropy framework for reinforcement learning. Advances in Neural Information Processing Systems, 34: 0 25732--25745, 2021

  25. [25]

    Provably efficient maximum entropy exploration

    Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp.\ 2681--2691. PMLR, 2019

  26. [26]

    ORPO: Monolithic Preference Optimization without Reference Model

    Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model, 2024. URL https://arxiv.org/abs/2403.07691

  27. [27]

    Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J

    Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J. Foster. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi-squared preference optimization, 2025. URL https://arxiv.org/abs/2407.13399

  28. [28]

    A2c is a special case of ppo, 2022

    Shengyi Huang, Anssi Kanervisto, Antonin Raffin, Weixun Wang, Santiago Ontañón, and Rousslan Fernand Julien Dossa. A2c is a special case of ppo, 2022. URL https://arxiv.org/abs/2205.09123

  29. [29]

    The n+ implementation details of rlhf with ppo: A case study on tl;dr summarization, 2024

    Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, and Lewis Tunstall. The n+ implementation details of rlhf with ppo: A case study on tl;dr summarization, 2024. URL https://arxiv.org/abs/2403.17031

  30. [30]

    An adaptive entropy-regularization framework for multi-agent reinforcement learning

    Woojun Kim and Youngchul Sung. An adaptive entropy-regularization framework for multi-agent reinforcement learning. In International Conference on Machine Learning, pp.\ 16829--16852. PMLR, 2023

  31. [31]

    Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

    Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018

  32. [32]

    Amopo: Adaptive multi-objective preference optimization without reward models and reference models, 2025

    Qi Liu, Jingqing Ruan, Hao Li, Haodong Zhao, Desheng Wang, Jiansong Chen, Wan Guanglu, Xunliang Cai, Zhi Zheng, and Tong Xu. Amopo: Adaptive multi-objective preference optimization without reward models and reference models, 2025. URL https://arxiv.org/abs/2506.07165

  33. [33]

    Understanding reference policies in direct preference optimization, 2024

    Yixin Liu, Pengfei Liu, and Arman Cohan. Understanding reference policies in direct preference optimization, 2024. URL https://arxiv.org/abs/2407.13709

  34. [34]

    Leveraging exploration in off-policy algorithms via normalizing flows

    Bogdan Mazoure, Thang Doan, Audrey Durand, Joelle Pineau, and R Devon Hjelm. Leveraging exploration in off-policy algorithms via normalizing flows. In Conference on Robot Learning, pp.\ 430--444. PMLR, 2020

  35. [35]

    Simpo: Simple preference optimization with a reference-free reward, 2024

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward, 2024. URL https://arxiv.org/abs/2405.14734

  36. [36]

    Combining policy gradient and Q-learning

    Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and q-learning. arXiv preprint arXiv:1611.01626, 2016

  37. [37]

    OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alex...

  38. [38]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  39. [39]

    The effects of reward misspecification: Mapping and mitigating misaligned models, 2022

    Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models, 2022. URL https://arxiv.org/abs/2201.03544

  40. [40]

    R. L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24 0 (2): 0 193--202, 1975. ISSN 00359254, 14679876. URL http://www.jstor.org/stable/2346567

  41. [41]

    Scaling laws for reward model overoptimization in direct alignment algorithms, 2024 a

    Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, and Scott Niekum. Scaling laws for reward model overoptimization in direct alignment algorithms, 2024 a . URL https://arxiv.org/abs/2406.02900

  42. [42]

    From r to q^* : Your language model is secretly a q-function, 2024 b

    Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to q^* : Your language model is secretly a q-function, 2024 b . URL https://arxiv.org/abs/2404.12358

  43. [43]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024 c . URL https://arxiv.org/abs/2305.18290

  44. [44]

    On stochastic optimal control and reinforcement learning by approximate inference

    Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. 2013

  45. [45]

    Artificial Intelligence and the Problem of Control, pp.\ 19--24

    Stuart Russell. Artificial Intelligence and the Problem of Control, pp.\ 19--24. 01 2022. ISBN 978-3-030-86143-8. doi:10.1007/978-3-030-86144-5_3

  46. [46]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

  47. [47]

    Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking, 2025. URL https://arxiv.org/abs/2209.13085

  48. [48]

    Learning to summarize from human feedback

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022. URL https://arxiv.org/abs/2009.01325

  49. [49]

    Robust preference optimization via dynamic target margins, 2025

    Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins, 2025. URL https://arxiv.org/abs/2506.03690

  50. [50]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

  51. [51]

    Robot trajectory optimization using approximate inference

    Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp.\ 1049--1056, 2009

  52. [52]

    Trl: Transformer reinforcement learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

  53. [53]

    Truly proximal policy optimization, 2020

    Yuhui Wang, Hao He, Chao Wen, and Xiaoyang Tan. Truly proximal policy optimization, 2020. URL https://arxiv.org/abs/1903.07940

  54. [54]

    Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin

    Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf, 2024. URL https://arxiv.org/abs/2405.21046

  55. [55]

    Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation, 2024 a

    Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation, 2024 a . URL https://arxiv.org/abs/2401.08417

  56. [56]

    Some things are more cringe than others: Iterative preference optimization with the pairwise cringe loss, 2024 b

    Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. Some things are more cringe than others: Iterative preference optimization with the pairwise cringe loss, 2024 b . URL https://arxiv.org/abs/2312.16682

  57. [57]

    Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Sungwoong Kim, and Chang D. Yoo. Confpo: Exploiting policy model confidence for critical token selection in preference optimization, 2025. URL https://arxiv.org/abs/2506.08712

  58. [58]

    Rrhf: Rank responses to align language models with human feedback without tears, 2023

    Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears, 2023. URL https://arxiv.org/abs/2304.05302

  59. [59]

    When maximum entropy misleads policy optimization, 2025

    Ruipeng Zhang, Ya-Chien Chang, and Sicun Gao. When maximum entropy misleads policy optimization, 2025. URL https://arxiv.org/abs/2506.05615

  60. [60]

    Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood calibration with human feedback, 2023. URL https://arxiv.org/abs/2305.10425

  61. [61]

    Modeling purposeful adaptive behavior with the principle of maximum causal entropy

    Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010

  62. [62]

    Maximum entropy inverse reinforcement learning

    Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pp.\ 1433--1438. Chicago, IL, USA, 2008

  63. [63]

    Maximum causal entropy correlated equilibria for markov games

    Brian D Ziebart, Drew Bagnell, and Anind K Dey. Maximum causal entropy correlated equilibria for markov games. In Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010

  64. [64]

    Fine-Tuning Language Models from Human Preferences

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2020. URL https://arxiv.org/abs/1909.08593

  65. [65]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  66. [66]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  67. [67]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  68. [68]

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...