IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
hub Canonical reference
Behavior Regularized Offline Reinforcement Learning
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. However in many real-world applications, access to the environment is limited to a fixed offline dataset of logged experience. In such settings, standard RL algorithms have been shown to diverge or otherwise yield poor performance. Accordingly, recent work has suggested a number of remedies to these issues. In this work, we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks. Surprisingly, we find that many of the technical complexities introduced in recent methods are unnecessary to achieve strong performance. Additional ablations provide insights into which design choices matter most in the offline RL setting.
hub tools
citation-role summary
citation-polarity summary
roles
background 10polarities
background 10representative citing papers
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.
A three-stage diagnostic on edX data shows offline selectors (BC, DQN, CQL) fail to reach oracle performance due to local representational ambiguity rather than learner mismatch or label shift.
FAV aligns few-step generative models by amortizing SVGD updates from reward-tilted sampling into generator parameters via fixed-point regression, requiring only sample access, and shows outperformance on robotics tasks plus scaling on image generators.
CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fine-tuning.
TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline RL benchmarks.
Proposes OPMD algorithm achieving accelerated O(1/n) rates for offline Nash equilibrium learning in alpha-potential games via reference-anchored data coverage.
KL regularization enables pessimism-free offline learning in general-sum games, recovering regularized Nash equilibria at accelerated rate O(1/n) via GANE and converging to coarse correlated equilibria at standard rate O(1/sqrt(n)+1/T) via GAMD.
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
SCORE constrains sim RL to the support of a real-data policy via flow steering, raising average success on eight dexterous tasks from 37.8% to 89.9%.
Argues for shifting to diagnosis-driven tension management of offline priors in online RL, supported by a framework on prior roles, experiments showing help-or-hurt reversals, and cross-domain evidence.
Reversal Q-Learning (RQL) proposes reversing flows for virtual trajectories and bias-variance reduction in an expanded MDP to train flow policies, reporting best average performance on 50 simulated robotic tasks versus prior flow-based offline RL methods.
BFQ enables single-step noise-to-action mapping in offline RL by dividing flow-path displacements into bootstrappable short-range components learned from marginal velocity.
Counterfactual transport flows enable conservative, instance-specific trajectory refinement in offline RL by constructing local preference pairs in latent space from offline data and learning refinement directions controlled by a strength parameter.
UNIQ uses split conformal prediction on a multi-expectile ensemble to produce state-adaptive expectiles on top of IQL, yielding consistent gains on D4RL MuJoCo tasks at near-IQL memory cost.
MoMa QL uses MMD moment matching to enforce distribution-level convergence of conditional score functions in flow-based RL policies for improved sampling efficiency.
SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.
TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
citing papers explorer
-
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.