hub

Behavior Regularized Offline Reinforcement Learning

Yifan Wu, George Tucker, Ofir Nachum · 2019 · cs.LG · arXiv 1911.11361

23 Pith papers cite this work. Polarity classification is still indexing.

23 Pith papers citing it

open full Pith review browse 23 citing papers arXiv PDF

abstract

In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. However in many real-world applications, access to the environment is limited to a fixed offline dataset of logged experience. In such settings, standard RL algorithms have been shown to diverge or otherwise yield poor performance. Accordingly, recent work has suggested a number of remedies to these issues. In this work, we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks. Surprisingly, we find that many of the technical complexities introduced in recent methods are unnecessary to achieve strong performance. Additional ablations provide insights into which design choices matter most in the offline RL setting.

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Offline Reinforcement Learning with Implicit Q-Learning

cs.LG · 2021-10-12 · unverdicted · novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

cs.LG · 2020-04-15 · accept · novelty 8.0

D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.

Aligning Flow Map Policies with Optimal Q-Guidance

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

Zero-shot Imitation Learning by Latent Topology Mapping

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

astro-ph.IM · 2026-05-07 · unverdicted · novelty 7.0

AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline RL benchmarks.

Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

Fast Rates in $\alpha$-Potential Games via Regularized Mirror Descent

cs.GT · 2026-04-30 · unverdicted · novelty 7.0

OPMD achieves the first fast Õ(1/n) rate for offline Nash equilibrium learning in α-potential games via a new reference-anchored coverage framework.

Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

cs.RO · 2026-05-12 · unverdicted · novelty 6.0

TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.

Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.

On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Offline KL-regularized MABs require sample complexity scaling as O(η S A C^π*/ε) for large regularization and Ω(S A C^π*/ε²) for small regularization, with matching lower bounds across the full range.

An adaptive variance estimator for relative sparsity

stat.ME · 2026-05-04 · unverdicted · novelty 6.0

A new adaptive variance estimator for relative sparsity coefficients is introduced that fully utilizes the prior asymptotic normality theorem and incorporates variable selection effects.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

KL regularization enables pessimism-free offline learning in general-sum games by recovering regularized Nash equilibria at rate O(1/n) via GANE and converging to coarse correlated equilibria at O(1/sqrt(n) + 1/T) via GAMD.

Learning from Demonstration with Failure Awareness for Safe Robot Navigation

cs.RO · 2026-04-25 · unverdicted · novelty 6.0

A framework decouples failure data for value estimation and success data for policy learning in offline RL to reduce collisions in robot navigation while maintaining success rates.

Hyperfastrl: Hypernetwork-based reinforcement learning for unified control of parametric chaotic PDEs

cs.CE · 2026-04-07 · unverdicted · novelty 6.0

Hypernetworks map a forcing parameter directly to policy weights in an RL framework, enabling unified stabilization of the Kuramoto-Sivashinsky equation across regimes with KAN architectures showing strongest extrapolation.

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

cs.LG · 2023-04-20 · conditional · novelty 6.0

IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

cs.RO · 2021-08-06 · accept · novelty 6.0

A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

cs.LG · 2020-05-04 · unverdicted · novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

citing papers explorer

Showing 23 of 23 citing papers.

Offline Reinforcement Learning with Implicit Q-Learning cs.LG · 2021-10-12 · unverdicted · none · ref 13 · internal anchor
IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
D4RL: Datasets for Deep Data-Driven Reinforcement Learning cs.LG · 2020-04-15 · accept · none · ref 22 · internal anchor
D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning cs.LG · 2026-05-13 · unverdicted · none · ref 13 · internal anchor
TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.
Aligning Flow Map Policies with Optimal Q-Guidance cs.LG · 2026-05-12 · unverdicted · none · ref 45 · internal anchor
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Zero-shot Imitation Learning by Latent Topology Mapping cs.LG · 2026-05-08 · unverdicted · none · ref 56 · internal anchor
ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification astro-ph.IM · 2026-05-07 · unverdicted · none · ref 73 · internal anchor
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning cs.LG · 2026-05-06 · unverdicted · none · ref 4 · internal anchor
DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline RL benchmarks.
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning cs.LG · 2026-05-03 · unverdicted · none · ref 24 · internal anchor
FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.
Fast Rates in $\alpha$-Potential Games via Regularized Mirror Descent cs.GT · 2026-04-30 · unverdicted · none · ref 35 · internal anchor
OPMD achieves the first fast Õ(1/n) rate for offline Nash equilibrium learning in α-potential games via a new reference-anchored coverage framework.
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy cs.LG · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning cs.RO · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking cs.AI · 2026-05-11 · unverdicted · none · ref 11 · internal anchor
RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.
Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow cs.LG · 2026-05-08 · unverdicted · none · ref 64 · internal anchor
DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.
On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization cs.LG · 2026-05-04 · unverdicted · none · ref 56 · internal anchor
Offline KL-regularized MABs require sample complexity scaling as O(η S A C^π*/ε) for large regularization and Ω(S A C^π*/ε²) for small regularization, with matching lower bounds across the full range.
An adaptive variance estimator for relative sparsity stat.ME · 2026-05-04 · unverdicted · none · ref 101 · internal anchor
A new adaptive variance estimator for relative sparsity coefficients is introduced that fully utilizes the prior asymptotic normality theorem and incorporates variable selection effects.
AdamO: A Collapse-Suppressed Optimizer for Offline RL cs.LG · 2026-05-03 · unverdicted · none · ref 58 · internal anchor
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 204 · internal anchor
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
Pessimism-Free Offline Learning in General-Sum Games via KL Regularization cs.LG · 2026-04-30 · unverdicted · none · ref 26 · internal anchor
KL regularization enables pessimism-free offline learning in general-sum games by recovering regularized Nash equilibria at rate O(1/n) via GANE and converging to coarse correlated equilibria at O(1/sqrt(n) + 1/T) via GAMD.
Learning from Demonstration with Failure Awareness for Safe Robot Navigation cs.RO · 2026-04-25 · unverdicted · none · ref 16 · internal anchor
A framework decouples failure data for value estimation and success data for policy learning in offline RL to reduce collisions in robot navigation while maintaining success rates.
Hyperfastrl: Hypernetwork-based reinforcement learning for unified control of parametric chaotic PDEs cs.CE · 2026-04-07 · unverdicted · none · ref 28 · internal anchor
Hypernetworks map a forcing parameter directly to policy weights in an RL framework, enabling unified stabilization of the Kuramoto-Sivashinsky equation across regimes with KAN architectures showing strongest extrapolation.
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies cs.LG · 2023-04-20 · conditional · none · ref 48 · internal anchor
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation cs.RO · 2021-08-06 · accept · none · ref 37 · internal anchor
A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems cs.LG · 2020-05-04 · unverdicted · none · ref 173 · internal anchor
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Behavior Regularized Offline Reinforcement Learning

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer