Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Pith reviewed 2026-05-13 01:37 UTC · model grok-4.3
The pith
Simpler REINFORCE-style optimization outperforms PPO and RL-free methods like DPO for LLM alignment with human feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed 'RL-free' methods such as DPO and RAFT. Our work suggests that careful adaptation to LLMs alignment characteristics enables benefiting from online RL optimization at low cost.
What carries the argument
REINFORCE-style optimization variants for policy updates in RLHF, which rely on basic reward-weighted gradient estimates without PPO's clipping mechanisms or auxiliary value networks.
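To make that machinery concrete, here is a minimal sketch of the two loss functions in question, written in PyTorch; the tensor names and the scalar baseline are illustrative assumptions, not the paper's implementation.

    import torch

    def reinforce_loss(logprobs, rewards, baseline=0.0):
        # Vanilla REINFORCE: reward-weighted log-likelihood.
        # logprobs: (batch,) log-probabilities of the sampled completions
        # rewards:  (batch,) scalar scores from the reward model
        advantages = rewards - baseline  # optional baseline for variance reduction
        return -(advantages.detach() * logprobs).mean()

    def ppo_clipped_loss(logprobs, old_logprobs, advantages, eps=0.2):
        # PPO adds an importance ratio against the behavior policy and clips it;
        # in practice it also trains an auxiliary value network for the advantages.
        ratio = torch.exp(logprobs - old_logprobs)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        return -torch.min(ratio * advantages, clipped * advantages).mean()

The paper's position, as summarized above, is that in RLHF the extra machinery in the second function is unnecessary: the first estimator, without clipping or a learned critic, suffices.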
If this is right
- LLM alignment can be performed with substantially lower computational expense.
- Training becomes less sensitive to hyperparameters, making tuning more straightforward.
- Online RL methods can match or exceed the results of RL-free alternatives in preference optimization.
- Many advanced features of modern RL algorithms provide no benefit in this specific setting.
Where Pith is reading between the lines
- These findings could encourage wider adoption of reinforcement learning in LLM training by lowering the barrier to entry.
- Practitioners might prioritize data collection and reward modeling over selecting complex optimizers.
- Similar simplifications could be explored in other domains where PPO is applied by default, such as robotics or game playing.
- Re-evaluating basic methods with modern large models may reveal overlooked efficiencies across machine learning.
Load-bearing premise
The superior results of the REINFORCE variants are attributable to their algorithmic simplicity and not to unequal experimental conditions, hyperparameter tuning efforts, or implementation specifics across the compared methods.
What would settle it
A replication study that applies the same level of hyperparameter optimization and identical data and compute resources to PPO, the REINFORCE variants, DPO, and RAFT, then measures whether the performance gap persists.
Original abstract
AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to LLMs alignment characteristics enables benefiting from online RL optimization at low cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that in RLHF for LLMs, many components of PPO are unnecessary given the domain characteristics. It shows that simpler REINFORCE-style variants outperform both PPO and RL-free methods such as DPO and RAFT in experiments, while incurring lower computational cost, and concludes that careful adaptation enables effective online RL optimization for alignment at low cost.
Significance. If the empirical results prove robust, the work would be significant for RLHF practice: it challenges the default use of PPO, demonstrates that basic online RL methods can be preferable when adapted to LLM traits, and offers a lower-cost path to alignment that preserves performance. The emphasis on simplicity and the direct comparisons to recent RL-free baselines provide a useful counterpoint to increasing method complexity in the field.
major comments (2)
- [Experiments] The central claim that REINFORCE-style variants outperform PPO/DPO/RAFT due to algorithmic simplicity depends critically on the fairness of the comparisons. The manuscript does not report the hyperparameter search budgets, number of trials, or tuning effort allocated to each baseline; given that RLHF performance is known to be highly sensitive to KL coefficients, learning rates, and sampling strategies, this leaves open the possibility that the reported gains arise from uneven implementation details rather than from the removal of PPO components.
- [§4] Without ablations that isolate the effect of each removed PPO component (e.g., clipping, value function, advantage normalization) while holding all other factors fixed, it is difficult to attribute performance differences specifically to the 'back to basics' REINFORCE formulation rather than to other unstated implementation choices.
minor comments (2)
- [Abstract/Introduction] The abstract and introduction would benefit from a concise table summarizing the key differences between the proposed REINFORCE variants and PPO (e.g., presence/absence of clipping, value head, etc.).
- [Figures] Figures comparing methods should include error bars or statistical significance markers to allow readers to assess the reliability of the reported outperformance.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing that greater transparency and additional analysis will strengthen the paper, and we commit to incorporating the requested details and experiments in the revision.
Point-by-point responses
Referee: [Experiments] The central claim that REINFORCE-style variants outperform PPO/DPO/RAFT due to algorithmic simplicity depends critically on the fairness of the comparisons. The manuscript does not report the hyperparameter search budgets, number of trials, or tuning effort allocated to each baseline; given that RLHF performance is known to be highly sensitive to KL coefficients, learning rates, and sampling strategies, this leaves open the possibility that the reported gains arise from uneven implementation details rather than from the removal of PPO components.
Authors: We agree that explicit reporting of hyperparameter search budgets and tuning effort is necessary to support the fairness of the comparisons. In the original experiments we followed the hyperparameter ranges and implementation details reported in the source papers for PPO, DPO, and RAFT, performing grid searches of comparable scope over the most sensitive parameters (KL coefficient, learning rate, and sampling temperature). To eliminate any ambiguity, we will add a dedicated subsection (and appendix table) that documents the exact search ranges, number of trials, and total compute allocated to each baseline. This addition will make the experimental protocol fully reproducible and allow readers to assess whether the observed advantages are attributable to the algorithmic simplifications.
Revision: yes
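For illustration, an equal-budget sweep of the kind described above might be organized as follows; the grid values and names are hypothetical assumptions, not taken from the paper.

    from itertools import product

    # Hypothetical search grid over the most sensitive parameters (values illustrative).
    search_grid = {
        "kl_coefficient":       [0.01, 0.05, 0.1, 0.2],
        "learning_rate":        [1e-6, 5e-6, 1e-5],
        "sampling_temperature": [0.7, 1.0],
    }

    # Equal-budget tuning: every method (PPO, REINFORCE variants, DPO, RAFT)
    # is swept over the same grid, so no baseline receives extra trials.
    trials = list(product(*search_grid.values()))
    print(len(trials), "trials per method")  # 4 * 3 * 2 = 24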
Referee: [§4] Without ablations that isolate the effect of each removed PPO component (e.g., clipping, value function, advantage normalization) while holding all other factors fixed, it is difficult to attribute performance differences specifically to the 'back to basics' REINFORCE formulation rather than to other unstated implementation choices.
Authors: We concur that component-wise ablations would provide clearer causal attribution. The current manuscript presents end-to-end comparisons of the full REINFORCE-style method against full PPO, but does not isolate the contribution of individual PPO elements such as the clipping ratio, learned value function, or advantage normalization. In the revised version we will add a new ablation study that successively removes each of these components while keeping all other implementation choices (optimizer, batch size, KL penalty schedule, etc.) fixed. The results will be reported alongside the main tables, directly addressing the referee's concern and strengthening the claim that the removed components are not required for effective alignment in the LLM setting.
Revision: yes
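Such an ablation could be expressed as explicit feature flags toggled one at a time; the names below are assumed for illustration, not the authors' code.

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class PPOComponents:
        # Each flag enables one PPO component; everything else stays fixed.
        clipping: bool = True        # clipped importance ratio
        value_function: bool = True  # learned critic for advantage estimation
        advantage_norm: bool = True  # per-batch advantage normalization

    full_ppo = PPOComponents()
    ablations = [
        replace(full_ppo, clipping=False),
        replace(full_ppo, value_function=False),
        replace(full_ppo, advantage_norm=False),
        PPOComponents(False, False, False),  # all removed: plain REINFORCE
    ]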
Circularity Check
No circularity in empirical performance claims
Full rationale
The paper advances its central claim through direct experimental comparisons of REINFORCE-style variants against PPO, DPO, and RAFT on LLM alignment tasks. No derivation chain reduces a prediction or first-principles result to its own inputs by construction; the work contains no fitted-parameter predictions, self-definitional equations, or load-bearing self-citations that would force the outcome. The reported outperformance is presented as an empirical observation grounded in external benchmarks rather than as a mathematically closed loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: RLHF can be formulated as a standard reinforcement learning problem with human preferences as the reward signal.
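Written out, this assumption is the standard KL-regularized RLHF objective (a common formulation in the alignment literature, not quoted from the paper):

    \max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
      \big[ r_\phi(x, y) \big]
      \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

where r_phi is the learned reward model, pi_ref is the supervised fine-tuned reference policy, and beta controls drift from the reference; REINFORCE-style and PPO-style methods differ only in how they estimate the policy gradient of this objective.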
Forward citations
Cited by 31 Pith papers
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
  A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
- Mem-W: Latent Memory-Native GUI Agents
  Mem-W embeds historical trajectories and working memory as compact latent tokens into GUI agents' continuous context via a trajectory-to-latent compressor, yielding up to +30 point gains on navigation benchmarks.
- CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
  CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
- The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
  The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
- Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
  RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
  uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
- EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
  EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
- Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
  DReST training makes RL agents and LLMs neutral to trajectory lengths and useful at goals, generalizing to halve shutdown influence probability in out-of-distribution tests.
- PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
  PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
- Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
  METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
- Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
  SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
- Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
  RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
- $S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
  S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
- When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
  Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...
- Ethics Testing: Proactive Identification of Generative AI System Harms
  Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.
- Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
  DReST-trained RL agents and LLMs achieve higher usefulness and neutrality to trajectory lengths, halving the probability of delaying shutdown in out-of-distribution tests.
- Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
  DReST-trained deep RL agents and fine-tuned LLMs generalize to higher usefulness and neutrality on unseen test contexts, with reported gains of 11-18% over baselines and near-maximum scores for the LLM.
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
- ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning
  ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.
- Target Policy Optimization
  TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.
- Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
  Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
  VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
- VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
  VAPO achieves 60.4 on AIME 2024 with Qwen 32B, outperforming prior methods by over 10 points through targeted fixes for value bias, sequence length variation, and sparse rewards.
- Muon is Scalable for LLM Training
  Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
- Process Reinforcement through Implicit Rewards
  PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
- On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
  RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
- Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
  A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
- Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
  Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.