arxiv: 2509.20265 · v3 · submitted 2025-09-24 · 💻 cs.LG · cs.CL

Failure Modes of Maximum Entropy RLHF

Ömer Veysel Çağatan, Barış Akgün

Pith reviewed 2026-05-18 13:57 UTC · model grok-4.3

keywords: reinforcement learning from human feedback · maximum entropy RL · SimPO · overoptimization · reward hacking · preference optimization · online RLHF · KL divergence

The pith

Maximum entropy RL applied to online RLHF frequently produces overoptimization and unstable KL dynamics, and its entropy regularization does not reliably prevent either.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Simple Preference Optimization can be derived directly from a maximum entropy reinforcement learning objective. It then tests whether this formulation delivers similar benefits when moved from offline preference optimization into online RLHF training. Experiments across model scales reveal frequent overoptimization, reward hacking, and erratic KL behavior that entropy regularization does not reliably suppress. These results matter because they indicate that reference-free methods successful in offline settings encounter distinct stability problems once online sampling and iterative optimization are introduced.

Core claim

SimPO is equivalent to maximum entropy RL. When the same objective is used for online RLHF, training exhibits overoptimization and unstable KL dynamics across scales. Entropy regularization fails to prevent reward hacking and, in the reported runs, correlates with the start of overoptimization rather than protecting against it. Configurations that remain stable do not owe their stability to the entropy term. The paper contrasts these outcomes with KL-constrained methods that maintain more reliable behavior and discusses why the offline success of SimPO does not transfer to the online regime.
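A sketch of how a reference-free preference likelihood can fall out of a maximum-entropy objective, assuming a constant temperature α for simplicity. The partition function Z(x), the sigmoid σ, and the Bradley-Terry preference model are standard RLHF notation assumed here; the paper's own derivation uses a length-normalized temperature and may differ in detail.

```latex
\begin{align}
\pi^{*} &= \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot|x)}\big[r(x,y)\big] + \alpha\, H\big[\pi(\cdot|x)\big] \\
\pi^{*}(y|x) &= \frac{\exp\big(r(x,y)/\alpha\big)}{Z(x)}, \qquad Z(x) = \sum_{y} \exp\big(r(x,y)/\alpha\big) \\
r(x,y) &= \alpha \log \pi^{*}(y|x) + \alpha \log Z(x) \\
p(y_w \succ y_l \mid x) &= \sigma\big(r(x,y_w) - r(x,y_l)\big)
  = \sigma\big(\alpha \log \pi^{*}(y_w|x) - \alpha \log \pi^{*}(y_l|x)\big)
\end{align}
```

With constant α the log Z(x) terms cancel, so the preference likelihood depends only on policy log-probabilities and no reference model appears. SimPO's published loss additionally divides each log-probability by the response length and subtracts a target margin γ inside the sigmoid.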

What carries the argument

Derivation of SimPO as a maximum-entropy RL objective followed by its empirical deployment in online RLHF to expose failure modes.

If this is right

KL-constrained methods keep training stable while entropy regularization does not.
Overoptimization appears even at conservative learning rates in some configurations.
Entropy regularization is not the source of stability in runs that remain stable.
Reference-free methods encounter separate difficulties in online versus offline preference learning.
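The contrast between the two regularizers can be sketched for a single sampled response. This is an illustration under stated assumptions, not the paper's training code: the function names are hypothetical, both penalties use single-sample Monte Carlo estimates built from token log-probabilities, and the numbers are placeholders rather than reported results.

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta):
    """Reward minus beta times a single-sample estimate of KL(pi || pi_ref)."""
    # Summing log pi(y_t) - log pi_ref(y_t) over tokens gives a one-sample
    # estimate of the sequence-level KL for a response drawn from the policy.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


def entropy_shaped_reward(reward, policy_logprobs, alpha):
    """Reward plus alpha times a single-sample estimate of the policy entropy."""
    # -log pi(y|x) for a sampled y is a one-sample estimate of H[pi(.|x)].
    entropy_estimate = -sum(policy_logprobs)
    return reward + alpha * entropy_estimate


# Placeholder token log-probabilities for one sampled response (not from the paper).
policy_lp = [-0.5, -1.0, -0.25]
ref_lp = [-1.0, -1.5, -0.75]

kl_r = kl_shaped_reward(2.0, policy_lp, ref_lp, beta=0.1)
ent_r = entropy_shaped_reward(2.0, policy_lp, alpha=0.1)
```

The entropy-shaped reward never consults `ref_lp`, so nothing in it ties the policy to its starting distribution; only the KL-shaped reward shrinks as the policy moves away from the reference.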

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid objectives that combine entropy with an explicit reference-model KL term may be needed for stable online use of reference-free methods.
The difference between offline and online regimes may stem from how on-policy sampling amplifies reward-model errors without a fixed reference distribution.
A direct test would be to add a reference model to the maximum-entropy objective and measure whether the reported instabilities decrease.
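One way to write the hybrid objective suggested above, together with a standard identity relating the KL penalty to entropy. The name J_hybrid and the separate coefficients α and β are assumptions introduced for illustration; they are not taken from the paper.

```latex
\begin{align}
J_{\text{hybrid}}(\pi) &= \mathbb{E}_{y \sim \pi(\cdot|x)}\big[r(x,y)\big]
  + \alpha\, H\big[\pi(\cdot|x)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot|x)\,\|\,\pi_{\text{ref}}(\cdot|x)\big) \\
-\beta\, \mathrm{KL}\big(\pi(\cdot|x)\,\|\,\pi_{\text{ref}}(\cdot|x)\big) &= \beta\, H\big[\pi(\cdot|x)\big]
  + \beta\, \mathbb{E}_{y \sim \pi(\cdot|x)}\big[\log \pi_{\text{ref}}(y|x)\big]
\end{align}
```

Setting β = 0 recovers the maximum-entropy objective and setting α = 0 recovers the KL-constrained one. The second line shows that a KL penalty already contains an entropy bonus plus a term that pulls samples toward the reference distribution; the pure maximum-entropy objective lacks that second term. Whether this accounts for the reported instabilities remains a conjecture, not a result.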

Load-bearing premise

The tested model scales, learning rates, and preference datasets are representative enough that the observed overoptimization can be attributed primarily to entropy regularization rather than other implementation choices.

What would settle it

Re-run the online RLHF experiments with the maximum-entropy objective at the reported scales and learning rates while monitoring whether KL divergence stays bounded and reward hacking remains absent.
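A minimal sketch of what such monitoring could look like on logged training metrics. The function names, the KL threshold, and the rule "proxy reward rises while win rate drops below its running best" are hypothetical heuristics chosen for illustration; they are not the paper's criteria, and the logged values are placeholders.

```python
def kl_is_bounded(kl_history, max_kl):
    """Return True if every logged KL value stays at or below max_kl."""
    return all(kl <= max_kl for kl in kl_history)


def overoptimization_step(proxy_rewards, win_rates):
    """Return the first checkpoint index where the proxy reward rises while the
    win rate falls below its running best, or None if that never happens."""
    best_win_rate = win_rates[0]
    for i in range(1, len(proxy_rewards)):
        reward_rising = proxy_rewards[i] > proxy_rewards[i - 1]
        if reward_rising and win_rates[i] < best_win_rate:
            return i
        best_win_rate = max(best_win_rate, win_rates[i])
    return None


# Placeholder logs for four checkpoints (illustrative only, not reported results).
kl_log = [0.5, 1.0, 4.0, 20.0]
reward_log = [1.0, 1.5, 2.0, 3.0]
win_rate_log = [0.50, 0.58, 0.61, 0.45]

bounded = kl_is_bounded(kl_log, max_kl=10.0)
first_bad = overoptimization_step(reward_log, win_rate_log)
```

Tracking a held-out win rate alongside the proxy reward matters because the proxy reward alone keeps improving during reward hacking.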

Figures

Figures reproduced from arXiv: 2509.20265 by Barış Akgün, Ömer Veysel Çağatan.

**Figure 2.** RLHF reward and entropy bonus during training for Pythia 2.8B with different entropy…

**Figure 3.** KL divergence metrics and win rates for KL-Constrained and Maximum Entropy regu…

**Figure 4.** Batch average of log(π_ref(y_w|x) / π_ref(y_l|x)) during DPO training. This perspective suggests that offline methods might potentially reduce reliance on reference models by introducing target margins that could serve a similar function to reference contributions. To explore this possibility, we visualize the reference log probability margins log(π_ref(y_w|x) / π_ref(y_l|x)) during DPO training with Pythia 1…

**Figure 5.** KL divergence evolution during training for 1B and 2.8B parameter models using different…

**Figure 6.** Win rate progression across training checkpoints for different values of the gamma hyper…

**Figure 7.** SimPO training metrics across different gamma values. Comparison of key training dy…

**Figure 8.** SimPO training metrics across different gamma values. Comparison of key training dy…

**Figure 9.** DPO training metrics with β = 0.05. Comparison of key training dynamics including loss, gradients, logits, and reward metrics, using Pythia-1B with learning rate 1 × 10⁻⁶.
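The quantity plotted in Figure 4, the batch average of log(π_ref(y_w|x) / π_ref(y_l|x)), is a difference of sequence log-probabilities under the frozen reference model. A minimal sketch follows, assuming those sequence log-probabilities have already been computed by summing token log-probabilities; the function name and the sample values are hypothetical.

```python
def reference_margin(ref_logp_chosen, ref_logp_rejected):
    """Batch average of log pi_ref(y_w|x) - log pi_ref(y_l|x)."""
    if len(ref_logp_chosen) != len(ref_logp_rejected):
        raise ValueError("chosen and rejected batches must have the same size")
    # The log of a ratio is a difference of logs, taken pair by pair.
    margins = [w - l for w, l in zip(ref_logp_chosen, ref_logp_rejected)]
    return sum(margins) / len(margins)


# Placeholder sequence log-probabilities for a batch of three preference pairs.
chosen = [-12.0, -8.5, -20.0]
rejected = [-14.0, -8.0, -23.5]
avg_margin = reference_margin(chosen, rejected)
```

A positive average means the reference model already prefers the chosen responses; a negative one means it prefers the rejected ones.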

Original abstract

In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL frequently exhibits overoptimization and unstable KL dynamics across model scales, with overoptimization persisting even at conservative learning rates for some configurations. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to reliably prevent reward hacking and, in our experiments, correlates with the onset of overoptimization rather than guarding against it. Even in configurations where training remains stable, entropy regularization is not the stabilizing factor. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online versus offline preference learning.
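For readers unfamiliar with the reference-free loss the abstract refers to, here is a minimal sketch of the SimPO loss for one preference pair, written with the standard library only. The function name and argument layout are assumptions for illustration; `beta` scales the length-normalized log-probabilities and `gamma` is the target margin.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected, beta, gamma):
    """SimPO loss for one preference pair: -log sigmoid(margin)."""
    # Length-normalized implicit rewards: (beta / |y|) * log pi(y|x).
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Placeholder values: sequence log-probabilities and token counts for one pair.
example_loss = simpo_loss(-10.0, 10, -20.0, 10, beta=2.0, gamma=2.0)
```

No reference-model log-probability enters the computation, which is what makes the method reference-free.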

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper derives Simple Preference Optimization (SimPO) as an instance of Maximum Entropy Reinforcement Learning and then empirically investigates MaxEnt RL in online RLHF. It reports that MaxEnt RL exhibits overoptimization and unstable KL dynamics across model scales, that entropy regularization fails to prevent reward hacking (unlike KL-constrained baselines), and that these issues persist even at conservative learning rates; the authors discuss why SimPO succeeds offline while MaxEnt RL struggles online.

Significance. If the empirical claims are substantiated with rigorous controls and metrics, the work would usefully document failure modes of entropy regularization in online preference optimization and clarify the offline-online gap for reference-free methods, informing safer RLHF design.

major comments (3)

[§4] §4 (Experiments): The abstract and results assert 'frequent overoptimization' and 'unstable KL dynamics' yet supply no quantitative metrics (e.g., reward scores, KL values with error bars), baseline comparisons, or statistical tests. Without these, the strength of the central empirical claim cannot be assessed.
[§4.2] §4.2 and §5: The attribution of overoptimization and KL instability primarily to entropy regularization (rather than reward-model quality, online sampling, learning-rate schedules, or base-model initialization) requires explicit ablations that hold all other pipeline elements fixed while varying only the regularizer. The current controls do not isolate this causal factor.
[§3] §3 (Derivation): The claim that SimPO is exactly MaxEnt RL should specify whether the equivalence is exact or approximate and how the entropy coefficient maps to the SimPO loss; any definitional overlap with the fitted regularization term should be clarified to avoid circularity in the instability analysis.

minor comments (2)

[Abstract] Adding one sentence on the specific model scales, preference datasets, and learning-rate ranges used would help readers gauge the generality of the reported failure modes.
[Notation] The manuscript should consistently distinguish the entropy coefficient from any implicit KL terms when comparing MaxEnt RL to KL-constrained baselines.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered each comment and revised the paper to strengthen the empirical evidence and clarify the theoretical aspects. Below we provide point-by-point responses.

Referee: [§4] The abstract and results assert 'frequent overoptimization' and 'unstable KL dynamics' yet supply no quantitative metrics (e.g., reward scores, KL values with error bars), baseline comparisons, or statistical tests. Without these, the strength of the central empirical claim cannot be assessed.

Authors: We agree that quantitative metrics and statistical support would strengthen the presentation. In the revised manuscript we have added tables reporting mean reward scores and KL values with standard deviations across multiple random seeds, direct side-by-side comparisons against KL-constrained baselines, and p-values for the reported differences. These appear in Section 4 and the appendix.
Revision: yes

Referee: [§4.2] The attribution of overoptimization and KL instability primarily to entropy regularization (rather than reward-model quality, online sampling, learning-rate schedules, or base-model initialization) requires explicit ablations that hold all other pipeline elements fixed while varying only the regularizer. The current controls do not isolate this causal factor.

Authors: We appreciate the call for tighter isolation. The revised version includes new ablation experiments that keep the reward model, sampling distribution, learning-rate schedule, and base-model initialization fixed while varying only the choice of regularizer (entropy versus KL penalty). The additional results continue to associate the observed instabilities with entropy regularization in the online regime.
Revision: yes

Referee: [§3] The claim that SimPO is exactly MaxEnt RL should specify whether the equivalence is exact or approximate and how the entropy coefficient maps to the SimPO loss; any definitional overlap with the fitted regularization term should be clarified to avoid circularity in the instability analysis.

Authors: We thank the referee for this precision request. Section 3 presents an exact equivalence between SimPO and MaxEnt RL under the standard assumptions of the framework; the entropy coefficient maps directly to the SimPO beta hyper-parameter via the closed-form derivation we now display explicitly. We have also clarified that the regularization term in SimPO is the entropy objective itself rather than a separately fitted component, removing any potential circularity when analyzing online instability.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is a mathematical equivalence and experiments are independent empirical observations

The paper's core derivation shows SimPO can be obtained from MaxEnt RL objectives via standard RLHF reformulation steps, which constitutes an equivalence rather than a self-referential loop or fitted parameter renamed as prediction. The subsequent empirical investigation of overoptimization and KL instability in online settings relies on direct training runs across model scales and learning rates, without reducing to quantities defined by the same regularization term or self-citation chains. No load-bearing uniqueness theorems, ansatz smuggling, or renaming of known results appear; the claims about entropy regularization failing to stabilize training are presented as observed outcomes, not tautological consequences of the initial derivation. The work is therefore self-contained against external benchmarks for both the theoretical step and the reported failure modes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard RLHF modeling assumptions and the representativeness of the chosen experimental configurations; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Maximum Entropy RL is a suitable framework for modeling reference-free preference optimization in both offline and online regimes.
Invoked when the paper derives SimPO and then applies the same principle to online RLHF.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We show that SimPO can be derived as Maximum Entropy Reinforcement Learning with length-normalized temperature... max_π E[r(x,y)] + α H[π(y|x)]"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: "entropy regularization fails to reliably prevent reward hacking and... correlates with the onset of overoptimization"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 20 internal anchors

[1]

Maximum a Posteriori Policy Optimisation

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning, 2025. URL https://arxiv.org/abs/2505.15134

work page internal anchor Pith review arXiv 2025
[3]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024. URL https://arxiv.org/abs/2402.14740

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Understanding the impact of entropy on policy optimization

Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International conference on machine learning, pp.\ 151--160. PMLR, 2019

work page 2019
[5]

A practical analysis of human alignment with *po, 2025

Kian Ahrabian, Xihui Lin, Barun Patra, Vishrav Chaudhary, Alon Benhaim, Jay Pujara, and Xia Song. A practical analysis of human alignment with *po, 2025. URL https://arxiv.org/abs/2407.15229

work page arXiv 2025
[6]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016. URL https://arxiv.org/abs/1606.06565

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

A general theoretical paradigm to understand learning from human preferences, 2023

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences, 2023. URL https://arxiv.org/abs/2310.12036

work page arXiv 2023
[8]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

H., Vanessa Wilfred, John Willes, Denise Wong, Wei Xu, Rongwu Xu, Yi Zeng, HongJiang Zhang, and Djordje Žikelić

Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, Max Tegmark, Lan Xue, Ya-Qin Zhang, Stephen Casper, Wan Sie Lee, Sören Mindermann, Vanessa Wilfred, Vidhisha Balachandran, Fazl Barez, Michael Belinsky, Imane Bello, Malo Bourgon, Mark Brakel, Siméon Campos, Duncan Cass-Beggs, Jiahao Chen, Rumman Chowdhury, Kuan Chua Seah, Jeff Clune, Junt...

work page arXiv 2025
[10]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/2334029

work page arXiv 1952
[12]

Deep reinforcement learning from human preferences

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2023. URL https://arxiv.org/abs/1706.03741

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with scaled ai feedback, 2024. URL https://arxiv.org/abs/2310.01377

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment, 2023. URL https://arxiv.org/abs/2304.06767

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Maximum entropy exploration in contextual bandits with neural networks and energy based models

Adam Elwood, Marco Leonardi, Ashraf Mohamed, and Alessandro Rozza. Maximum entropy exploration in contextual bandits with neural networks and energy based models. Entropy, 25 0 (2): 0 188, January 2023. ISSN 1099-4300. doi:10.3390/e25020188. URL http://dx.doi.org/10.3390/e25020188

work page doi:10.3390/e25020188 2023
[16]

Maximum entropy rl (provably) solves some robust rl problems

Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257, 2021

work page arXiv 2021
[17]

Taming the Noise in Reinforcement Learning via Soft Updates

Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[18]

Scaling laws for reward model overoptimization, 2022

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization, 2022. URL https://arxiv.org/abs/2210.10760

work page arXiv 2022
[19]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory

Peter D Gr \"u nwald and A Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. 2004

work page 2004
[21]

Sathiya Keerthi

Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Siyu Zhu, Parag Agrawal, Natesh Pillai, and S. Sathiya Keerthi. Alphapo: Reward shape matters for llm alignment, 2025. URL https://arxiv.org/abs/2501.03884

work page arXiv 2025
[22]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.\ 1861--1870, Stockholmsmässan, Stockholm Sweden, 10--15 Jul 2018. PMLR

work page 2018
[23]

Inverse reward design, 2020

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Inverse reward design, 2020. URL https://arxiv.org/abs/1711.02827

work page arXiv 2020
[24]

A max-min entropy framework for reinforcement learning

Seungyul Han and Youngchul Sung. A max-min entropy framework for reinforcement learning. Advances in Neural Information Processing Systems, 34: 0 25732--25745, 2021

work page 2021
[25]

Provably efficient maximum entropy exploration

Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp.\ 2681--2691. PMLR, 2019

work page 2019
[26]

ORPO: Monolithic Preference Optimization without Reference Model

Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model, 2024. URL https://arxiv.org/abs/2403.07691

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J

Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J. Foster. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi-squared preference optimization, 2025. URL https://arxiv.org/abs/2407.13399

work page arXiv 2025
[28]

A2c is a special case of ppo, 2022

Shengyi Huang, Anssi Kanervisto, Antonin Raffin, Weixun Wang, Santiago Ontañón, and Rousslan Fernand Julien Dossa. A2c is a special case of ppo, 2022. URL https://arxiv.org/abs/2205.09123

work page arXiv 2022
[29]

The n+ implementation details of rlhf with ppo: A case study on tl;dr summarization, 2024

Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, and Lewis Tunstall. The n+ implementation details of rlhf with ppo: A case study on tl;dr summarization, 2024. URL https://arxiv.org/abs/2403.17031

work page arXiv 2024
[30]

An adaptive entropy-regularization framework for multi-agent reinforcement learning

Woojun Kim and Youngchul Sung. An adaptive entropy-regularization framework for multi-agent reinforcement learning. In International Conference on Machine Learning, pp.\ 16829--16852. PMLR, 2023

work page 2023
[31]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

Amopo: Adaptive multi-objective preference optimization without reward models and reference models, 2025

Qi Liu, Jingqing Ruan, Hao Li, Haodong Zhao, Desheng Wang, Jiansong Chen, Wan Guanglu, Xunliang Cai, Zhi Zheng, and Tong Xu. Amopo: Adaptive multi-objective preference optimization without reward models and reference models, 2025. URL https://arxiv.org/abs/2506.07165

work page arXiv 2025
[33]

Understanding reference policies in direct preference optimization, 2024

Yixin Liu, Pengfei Liu, and Arman Cohan. Understanding reference policies in direct preference optimization, 2024. URL https://arxiv.org/abs/2407.13709

work page arXiv 2024
[34]

Leveraging exploration in off-policy algorithms via normalizing flows

Bogdan Mazoure, Thang Doan, Audrey Durand, Joelle Pineau, and R Devon Hjelm. Leveraging exploration in off-policy algorithms via normalizing flows. In Conference on Robot Learning, pp.\ 430--444. PMLR, 2020

work page 2020
[35]

Simpo: Simple preference optimization with a reference-free reward, 2024

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward, 2024. URL https://arxiv.org/abs/2405.14734

work page arXiv 2024
[36]

Combining policy gradient and Q-learning

Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and q-learning. arXiv preprint arXiv:1611.01626, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[37]

OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alex...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

The effects of reward misspecification: Mapping and mitigating misaligned models, 2022

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models, 2022. URL https://arxiv.org/abs/2201.03544

work page arXiv 2022
[40]

R. L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24 0 (2): 0 193--202, 1975. ISSN 00359254, 14679876. URL http://www.jstor.org/stable/2346567

work page arXiv 1975
[41]

Scaling laws for reward model overoptimization in direct alignment algorithms, 2024 a

Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, and Scott Niekum. Scaling laws for reward model overoptimization in direct alignment algorithms, 2024 a . URL https://arxiv.org/abs/2406.02900

work page arXiv 2024
[42]

From r to q^* : Your language model is secretly a q-function, 2024 b

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to q^* : Your language model is secretly a q-function, 2024 b . URL https://arxiv.org/abs/2404.12358

work page arXiv 2024
[43]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024 c . URL https://arxiv.org/abs/2305.18290

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

On stochastic optimal control and reinforcement learning by approximate inference

Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. 2013

work page 2013
[45]

Artificial Intelligence and the Problem of Control, pp.\ 19--24

Stuart Russell. Artificial Intelligence and the Problem of Control, pp.\ 19--24. 01 2022. ISBN 978-3-030-86143-8. doi:10.1007/978-3-030-86144-5_3

work page doi:10.1007/978-3-030-86144-5_3 2022
[46]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[47]

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking, 2025. URL https://arxiv.org/abs/2209.13085

work page arXiv 2025
[48]

Learning to summarize from human feedback

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022. URL https://arxiv.org/abs/2009.01325

work page internal anchor Pith review Pith/arXiv arXiv 2022
[49]

Robust preference optimization via dynamic target margins, 2025

Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins, 2025. URL https://arxiv.org/abs/2506.03690

work page arXiv 2025
[50]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

work page 2018
[51]

Robot trajectory optimization using approximate inference

Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp.\ 1049--1056, 2009

work page 2009
[52]

Trl: Transformer reinforcement learning

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

work page 2020
[53]

Truly proximal policy optimization, 2020

Yuhui Wang, Hao He, Chao Wen, and Xiaoyang Tan. Truly proximal policy optimization, 2020. URL https://arxiv.org/abs/1903.07940

work page arXiv 2020
[54]

Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin

Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf, 2024. URL https://arxiv.org/abs/2405.21046

work page arXiv 2024
[55]

Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation

Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation, 2024a. URL https://arxiv.org/abs/2401.08417

work page arXiv 2024
[56]

Some things are more cringe than others: Iterative preference optimization with the pairwise cringe loss

Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. Some things are more cringe than others: Iterative preference optimization with the pairwise cringe loss, 2024b. URL https://arxiv.org/abs/2312.16682

work page arXiv 2024
[57]

Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Sungwoong Kim, and Chang D. Yoo. Confpo: Exploiting policy model confidence for critical token selection in preference optimization, 2025. URL https://arxiv.org/abs/2506.08712

work page arXiv 2025
[58]

Rrhf: Rank responses to align language models with human feedback without tears

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears, 2023. URL https://arxiv.org/abs/2304.05302

work page arXiv 2023
[59]

When maximum entropy misleads policy optimization

Ruipeng Zhang, Ya-Chien Chang, and Sicun Gao. When maximum entropy misleads policy optimization, 2025. URL https://arxiv.org/abs/2506.05615

work page arXiv 2025
[60]

Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood calibration with human feedback, 2023. URL https://arxiv.org/abs/2305.10425

work page arXiv 2023
[61]

Modeling purposeful adaptive behavior with the principle of maximum causal entropy

Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010

work page 2010
[62]

Maximum entropy inverse reinforcement learning

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008

work page 2008
[63]

Maximum causal entropy correlated equilibria for markov games

Brian D Ziebart, Drew Bagnell, and Anind K Dey. Maximum causal entropy correlated equilibria for markov games. In Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010

work page 2010
[64]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2020. URL https://arxiv.org/abs/1909.08593

work page internal anchor Pith review Pith/arXiv arXiv 2020