Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
Abstract
Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.
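To make the abstract's two mechanisms concrete, here is a minimal PyTorch sketch of the TD target it describes: a KL-control penalty that keeps the learned policy close to a pretrained prior, and a dropout-based lower bound on the target Q-values. This is not the authors' code; QNet, dropout_lower_bound, and kl_penalized_td_target are illustrative names, the min-over-dropout-passes bound is one plausible reading of the paper's uncertainty-based lower bound, and the per-transition KL penalty simplifies the paper's formulation.

```python
# Minimal sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-network with dropout, so stochastic forward passes give
    Monte Carlo uncertainty estimates over Q-values."""
    def __init__(self, state_dim, n_actions, hidden=256, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def dropout_lower_bound(target_q, next_s, n_samples=10):
    """Lower-bound target Q-values by taking the elementwise minimum
    over several stochastic dropout passes (one plausible reading of
    the paper's dropout-based lower bound; mean-minus-std would be an
    alternative)."""
    target_q.train()  # keep dropout active at evaluation time
    with torch.no_grad():
        samples = torch.stack([target_q(next_s) for _ in range(n_samples)])
    return samples.min(dim=0).values  # (batch, n_actions)

def kl_penalized_td_target(reward, next_s, log_pi, log_prior,
                           target_q, kl_weight=0.1, gamma=0.99):
    """TD target with a KL-control penalty: the reward is shifted by
    -kl_weight * (log_pi - log_prior), penalizing divergence of the
    learned policy from the pretrained prior."""
    q_next = dropout_lower_bound(target_q, next_s)   # pessimistic Q
    v_next = q_next.max(dim=-1).values               # greedy bootstrap
    shaped_r = reward - kl_weight * (log_pi - log_prior)
    return shaped_r + gamma * v_next
```

Taking the minimum over stochastic passes is deliberately pessimistic: without online exploration to correct overestimation, the target should err low rather than high.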
Forward citations
Cited by 10 Pith papers
- D4RL: Datasets for Deep Data-Driven Reinforcement Learning
  D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
- Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
  DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample the tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
- Aligning Flow Map Policies with Optimal Q-Guidance
  Flow map policies enable fast one-step inference for flow-based RL, and FMQ provides a closed-form optimal Q-guided target for offline-to-online adaptation under trust-region constraints, achieving state-of-the-art performance.
- Red Teaming Language Models with Language Models
  One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.
- Fine-Tuning Language Models from Human Preferences
  Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
- Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
  In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic p...
- Threshold-Guided Optimization for Visual Generative Models
  A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
- AWAC: Accelerating Online Reinforcement Learning with Offline Datasets
  AWAC combines offline data with online RL via advantage-weighted actor-critic updates (see the sketch after this list) to enable faster acquisition of robotic skills such as dexterous manipulation.
- Behavior Regularized Offline Reinforcement Learning
  Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
- Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
  Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
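Since the AWAC entry above names a concrete update rule, here is a hedged sketch of an advantage-weighted actor loss in that spirit. The function name, baseline choice, and clipping constant are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an advantage-weighted actor update in the spirit
# of AWAC (illustrative names; not the authors' code).
import torch

def awac_actor_loss(log_probs, q_values, values, beta=1.0):
    """Weight the log-likelihood of dataset actions by
    exp(advantage / beta), so the policy imitates dataset actions in
    proportion to how much better than average the critic rates them.
    log_probs: log pi(a|s) for batch actions; q_values: Q(s, a);
    values: a baseline V(s), e.g. an estimate of E_pi[Q(s, .)]."""
    advantages = q_values - values
    weights = torch.exp(advantages / beta).clamp(max=20.0)  # clip for stability
    return -(weights.detach() * log_probs).mean()
```

Detaching the weights keeps gradients flowing only through the policy's log-probabilities, which is what makes the update a weighted form of behavior cloning rather than a critic update.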