Offline Regularised Reinforcement Learning for Large Language Models Alignment

Aliaksei Severyn; Bernardo Avila Pires; Bilal Piot; Daniele Calandriello; Daniel Guo; Eugene Tarassov; Gil Shamir; Jonathan Mallinson; Lior Shani; Lucas Spangher

arxiv: 2405.19107 · v1 · pith:3QOXZ4BQnew · submitted 2024-05-29 · 💻 cs.LG · cs.AI

Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot

keywords prompt, datasets, element, feedback, human, language, models, optimisation

The dominant framework for aligning large language models (LLMs), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. On the other hand, single-trajectory datasets, where each element is a triplet composed of a prompt, a response and human feedback, are naturally more abundant. The canonical element of such datasets is, for instance, an LLM's response to a user's prompt followed by the user's feedback, such as a thumbs-up/down. Consequently, in this work, we propose DRO, or Direct Reward Optimisation, as a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically, using T5 encoder-decoder language models, and show DRO's performance over selected baselines such as Kahneman-Tversky Optimization (KTO). Thus, we confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.
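To make the contrast between the two dataset shapes concrete, here is a minimal Python sketch. The class names, field names, and the function `squared_residual_loss` are hypothetical and chosen only for illustration. The abstract does not spell out the objective, so the loss below is one plausible reading rather than the authors' implementation: it assumes the standard KL-regularised identity r(x, y) = V(x) + tau * log(pi(y|x) / pi_ref(y|x)) and penalises the squared residual of that identity, with `tau` as an assumed regularisation strength.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PreferencePair:
    """Quadruplet: a prompt, two responses, and which one the human preferred."""
    prompt: str
    response_a: str
    response_b: str
    a_preferred: bool


@dataclass(frozen=True)
class SingleTrajectory:
    """Triplet: a prompt, one response, and scalar feedback (e.g. thumbs-up = 1.0)."""
    prompt: str
    response: str
    reward: float


def squared_residual_loss(
    rewards: Sequence[float],
    values: Sequence[float],
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    tau: float = 1.0,
) -> float:
    """Mean of 0.5 * (r - V(x) - tau * (log pi(y|x) - log pi_ref(y|x)))**2.

    Assumed form only: the source says "mean-squared objective" and nothing more.
    """
    n = len(rewards)
    if n == 0 or not (len(values) == len(policy_logps) == len(ref_logps) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for r, v, lp, lr in zip(rewards, values, policy_logps, ref_logps):
        residual = r - v - tau * (lp - lr)  # zero when the assumed identity holds
        total += 0.5 * residual ** 2
    return total / n


# Hypothetical batch of thumbs-up/down feedback mapped to rewards 1.0 / 0.0.
batch = [
    SingleTrajectory("Summarise this email.", "Here is a summary...", 1.0),
    SingleTrajectory("Translate to French.", "I cannot do that.", 0.0),
]
loss = squared_residual_loss(
    rewards=[ex.reward for ex in batch],
    values=[0.5, 0.5],           # placeholder value estimates V(x)
    policy_logps=[-1.0, -2.0],   # placeholder log pi(y|x)
    ref_logps=[-1.0, -2.0],      # placeholder log pi_ref(y|x)
)
```

In this sketch each training example needs only one response and one scalar reward, which is why no pairing of responses is required; in practice the log-probabilities and value estimates would come from trainable models rather than the placeholder numbers used here.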

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
cs.LG 2026-05 unverdicted novelty 7.0

The paper establishes the first $\tilde{O}(\epsilon^{-1})$ upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...
rePIRL: Learn PRM with Inverse RL for LLM Reasoning
cs.LG 2026-02 unverdicted novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and co...
Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction
cs.LG 2025-12 unverdicted novelty 6.0

Autoregressive language models are equivalent to energy-based models through a bijection that corresponds to the soft Bellman equation, explaining their lookahead capabilities despite next-token training.
UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types
cs.LG 2024-08 unverdicted novelty 6.0

UNA unifies binary, pairwise, and score-based feedback for LLM alignment via a generalized implicit reward function shown optimal by the log sum inequality.
Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning
cs.LG 2026-06 unverdicted novelty 5.0

Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.
Trust Region On-Policy Distillation
cs.LG 2026-05 unverdicted novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.