Behavior Regularized Offline Reinforcement Learning

Yifan Wu, George Tucker, Ofir Nachum · 2019 · cs.LG · arXiv 1911.11361

Canonical reference. 100% of citing Pith papers cite this work as background.

44 Pith papers citing it

abstract

In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. However, in many real-world applications, access to the environment is limited to a fixed offline dataset of logged experience. In such settings, standard RL algorithms have been shown to diverge or otherwise yield poor performance. Accordingly, recent work has suggested a number of remedies to these issues. In this work, we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks. Surprisingly, we find that many of the technical complexities introduced in recent methods are unnecessary to achieve strong performance. Additional ablations provide insights into which design choices matter most in the offline RL setting.

citation-role summary

background 10

citation-polarity summary

background 10

representative citing papers

Offline Reinforcement Learning with Implicit Q-Learning

cs.LG · 2021-10-12 · unverdicted · novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

cs.LG · 2020-04-15 · accept · novelty 8.0

D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction

cs.LG · 2026-06-02 · conditional · novelty 7.0

A three-stage diagnostic on edX data shows offline selectors (BC, DQN, CQL) fail to reach oracle performance due to local representational ambiguity rather than learner mismatch or label shift.

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

FAV aligns few-step generative models by amortizing SVGD updates from reward-tilted sampling into generator parameters via fixed-point regression, requiring only sample access, and shows outperformance on robotics tasks plus scaling on image generators.

Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fine-tuning.

Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.

Aligning Flow Map Policies with Optimal Q-Guidance

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

Zero-shot Imitation Learning by Latent Topology Mapping

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

astro-ph.IM · 2026-05-07 · unverdicted · novelty 7.0

AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming prior methods on offline RL benchmarks.

Fast Rates in $\alpha$-Potential Games via Regularized Mirror Descent

cs.GT · 2026-04-30 · unverdicted · novelty 7.0 · 2 refs

Proposes the OPMD algorithm, which achieves accelerated O(1/n) rates for offline Nash equilibrium learning in α-potential games via reference-anchored data coverage.

Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

cs.LG · 2026-04-30 · unverdicted · novelty 7.0 · 2 refs

KL regularization enables pessimism-free offline learning in general-sum games, recovering regularized Nash equilibria at accelerated rate O(1/n) via GANE and converging to coarse correlated equilibria at standard rate O(1/sqrt(n)+1/T) via GAMD.

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

cs.LG · 2022-08-12 · unverdicted · novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.

Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience

cs.RO · 2026-06-25 · unverdicted · novelty 6.0

SCORE constrains sim RL to the support of a real-data policy via flow steering, raising average success on eight dexterous tasks from 37.8% to 89.9%.

Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

Argues for shifting to diagnosis-driven tension management of offline priors in online RL, supported by a framework on prior roles, experiments showing help-or-hurt reversals, and cross-domain evidence.

Reversal Q-Learning

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Reversal Q-Learning (RQL) proposes reversing flows for virtual trajectories and bias-variance reduction in an expanded MDP to train flow policies, reporting best average performance on 50 simulated robotic tasks versus prior flow-based offline RL methods.

Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

BFQ enables single-step noise-to-action mapping in offline RL by dividing flow-path displacements into bootstrappable short-range components learned from marginal velocity.

Counterfactual Transport Flows for Offline Conservative Trajectory Refinement

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Counterfactual transport flows enable conservative, instance-specific trajectory refinement in offline RL by constructing local preference pairs in latent space from offline data and learning refinement directions controlled by a strength parameter.

UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

UNIQ uses split conformal prediction on a multi-expectile ensemble to produce state-adaptive expectiles on top of IQL, yielding consistent gains on D4RL MuJoCo tasks at near-IQL memory cost.

Moment Matching Q-Learning

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

MoMa QL uses MMD moment matching to enforce distribution-level convergence of conditional score functions in flow-based RL policies for improved sampling efficiency.

SPAR: Support-Preserving Action Rectification

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

cs.RO · 2026-05-12 · unverdicted · novelty 6.0

TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.

citing papers explorer

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

cs.AI · 2026-05-11 · unverdicted · none · ref 11 · 2 links · internal anchor

RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.
