Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine · 2019 · cs.LG · arXiv 1910.00177

Mixed citation behavior across 95 citing Pith papers. The most common role is background (65% of classified citations).

abstract

In this paper, we aim to develop a simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines. Our goal is an algorithm that utilizes only simple and convergent maximum likelihood loss functions, while also being able to leverage off-policy data. Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. The method is simple and general, can accommodate continuous and discrete actions, and can be implemented in just a few lines of code on top of standard supervised learning methods. We provide a theoretical motivation for AWR and analyze its properties when incorporating off-policy data from experience replay. We evaluate AWR on a suite of standard OpenAI Gym benchmark tasks, and show that it achieves competitive performance compared to a number of well-established state-of-the-art RL algorithms. AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions. Furthermore, we demonstrate our algorithm on challenging continuous control tasks with highly complex simulated characters.
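
The two supervised steps described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear value function, the linear fixed-variance Gaussian policy (for which weighted maximum likelihood reduces to weighted least squares), the toy data, and all function and parameter names (`fit_value`, `awr_weights`, `fit_policy`, `beta`, `max_weight`) are hypothetical choices made here. The exponentiated-advantage weight `exp(advantage / beta)` with clipping follows the weighting described in the AWR paper; the abstract itself only says "weighted target actions".

```python
import numpy as np

def _features(states):
    # Append a bias column so both regressions can learn an offset.
    return np.hstack([states, np.ones((len(states), 1))])

def fit_value(states, returns, l2=1e-3):
    """Step 1: ridge regression of observed returns onto state features."""
    X = _features(states)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ returns)

def predict_value(w, states):
    return _features(states) @ w

def awr_weights(returns, values, beta=1.0, max_weight=20.0):
    """Exponentiated advantages, clipped for numerical stability (assumed values)."""
    advantages = returns - values
    return np.minimum(np.exp(advantages / beta), max_weight)

def fit_policy(states, actions, weights, l2=1e-3):
    """Step 2: weighted least squares of actions onto state features."""
    X = _features(states)
    Xw = X * weights[:, None]
    A = X.T @ Xw + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, Xw.T @ actions)

# Toy static dataset (an assumption for illustration): the behavior policy
# ignores the state, while the return is highest when action == 2 * state.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(2000, 1))
actions = rng.normal(0.0, 1.0, size=(2000, 1))
returns = -(actions[:, 0] - 2.0 * states[:, 0]) ** 2

w_value = fit_value(states, returns)
weights = awr_weights(returns, predict_value(w_value, states))

theta_awr = fit_policy(states, actions, weights)
theta_bc = fit_policy(states, actions, np.ones(len(states)))  # unweighted baseline

print("behavior-cloning slope:", theta_bc[0, 0])
print("advantage-weighted slope:", theta_awr[0, 0])
```

In this toy setting the unweighted fit simply reproduces the state-independent behavior policy, while the advantage-weighted fit shifts the policy's slope toward the higher-return actions. A smaller `beta` would sharpen the weights and move the fit further from the data-collecting policy.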

citation-role summary

background 11 · method 6

citation-polarity summary

background 11 · use method 5 · unclear 1

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Offline Reinforcement Learning with Implicit Q-Learning

cs.LG · 2021-10-12 · unverdicted · novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

cs.LG · 2020-04-15 · accept · novelty 8.0

D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

Dual Advantage Fields

cs.LG · 2026-06-02 · conditional · novelty 7.0

Dual Advantage Fields converts bilinear dual value models into local advantage scores via learned action-effect models, equaling the goal-conditioned Bellman advantage under realizability and improving aggregate metrics on OGBench locomotion, manipulation, and puzzle tasks.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

FAV aligns few-step generative models by amortizing SVGD updates from reward-tilted sampling into generator parameters via fixed-point regression, requiring only sample access, and shows outperformance on robotics tasks plus scaling on image generators.

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

cs.CV · 2026-05-20 · conditional · novelty 7.0

RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fine-tuning.

Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Switching successor measures extend classical successor measures to enable hierarchical zero-shot RL via the FB π-Switch algorithm that extracts subgoal-selection and control policies from forward-backward representations.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

AB-SID-iVAR enables Gaussian process active learning for self-induced Boltzmann distributions by closed-form approximation of the target, with high-probability error vanishing guarantees and empirical gains on PES and drug discovery tasks.

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

cs.CV · 2026-03-01 · unverdicted · novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

Test-time Offline Reinforcement Learning on Goal-related Experience

cs.LG · 2025-07-24 · unverdicted · novelty 7.0

GC-TTT adapts goal-conditioned policies at test time by fine-tuning on self-supervised selected goal-related offline data, yielding performance gains in loco-navigation and manipulation tasks.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

KTO: Model Alignment as Prospect Theoretic Optimization

cs.LG · 2024-02-02 · conditional · novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

cs.LG · 2023-05-29 · accept · novelty 7.0

DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

cs.RO · 2022-09-30 · unverdicted · novelty 7.0

VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

cs.LG · 2022-08-12 · unverdicted · novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.

Controllable Sim Agents with Behavior Latents

cs.RO · 2026-07-02 · unverdicted · novelty 6.0

CNeVA combines variational behavior latents with rectified-flow generators and soft eligibility to deliver controllable yet realistic traffic simulation on Waymo data.

citing papers explorer

Showing 19 of 19 citing papers after filters.

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

cs.RO · 2022-09-30 · unverdicted · none · ref 23 · internal anchor

VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

Controllable Sim Agents with Behavior Latents

cs.RO · 2026-07-02 · unverdicted · none · ref 45 · internal anchor

CNeVA combines variational behavior latents with rectified-flow generators and soft eligibility to deliver controllable yet realistic traffic simulation on Waymo data.

Freeform Preference Learning for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · none · ref 50 · internal anchor

Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.

STEAM: Self-Supervised Temporal Ensemble Advantage Modeling for Real-World Robot Learning

cs.RO · 2026-06-29 · unverdicted · none · ref 12 · internal anchor

STEAM learns advantages from expert trajectories via self-supervised temporal ensemble modeling to improve policy learning on real robot tasks like bimanual folding and pick-and-place.

FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation

cs.RO · 2026-06-21 · unverdicted · none · ref 27 · internal anchor

FlowDPG distills critic gradients into flow matching velocity fields to enable BPTT-free DDPG-style policy improvement and reports 92% success on a real-world dual-arm AirPods assembly task.

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

cs.RO · 2026-06-09 · unverdicted · none · ref 35 · internal anchor

SARM2 presents RM, a multi-task stage-aware reward model achieving 80% lower value-estimation MSE, which when used in SPIRAL boosts manipulation task success from ~50% to near-perfect on several benchmarks.

Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

cs.RO · 2026-05-01 · unverdicted · none · ref 56 · 2 links · internal anchor

LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.

Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

cs.RO · 2026-04-14 · unverdicted · none · ref 44 · internal anchor

WHOLE-MoMa improves whole-body mobile manipulation by applying offline RL with Q-chunking to demonstrations from randomized sub-optimal controllers, outperforming baselines and transferring to real robots without teleoperation or real-world training data.

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

cs.RO · 2026-04-11 · unverdicted · none · ref 35 · internal anchor

MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

cs.RO · 2026-03-02 · unverdicted · none · ref 145 · internal anchor

Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

cs.RO · 2026-02-26 · unverdicted · none · ref 41 · internal anchor

The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.

RISE: Self-Improving Robot Policy with Compositional World Model

cs.RO · 2026-02-11 · unverdicted · none · ref 72 · internal anchor

RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

COLSON: Controllable Learning-Based Social Navigation via Diffusion-Based Reinforcement Learning

cs.RO · 2025-03-18 · unverdicted · none · ref 33 · internal anchor

COLSON applies diffusion models to reinforcement learning for social robot navigation and adds controllability mechanisms that enable zero-shot adaptation to unseen static obstacles and altered objectives.

Diffusion Policy Policy Optimization

cs.RO · 2024-09-01 · unverdicted · none · ref 67 · internal anchor

DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.

FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation

cs.RO · 2026-06-24 · unverdicted · none · ref 54 · internal anchor

FORCE is a 3-stage RL fine-tuning method for VLA models that stabilizes Q-function via on-policy warm-up and filters high-value actions for updates, claiming 79% success rate gains and 32.5% faster training without human intervention.

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

cs.RO · 2026-06-10 · unverdicted · none · ref 28 · internal anchor

NEXT estimates external torques from short free-motion data without hardware sensors and FIRST improves imitation learning by upsampling contact phases, yielding over 17% better task progress on long-horizon manipulation tasks.

PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation

cs.RO · 2026-06-07 · unverdicted · none · ref 8 · internal anchor

PACT is a self-evolving post-training framework that projects diffusion policies onto constraint-feasible regions via reverse-KL distillation and a tightening curriculum, reporting 31% fewer safety violations and 30.7% higher task success on embodied manipulation benchmarks.

Robot Self-Improvement via Human-Video Dynamics Models

cs.RO · 2026-06-19 · unverdicted · none · ref 61 · internal anchor

Human-video dynamics models enable cross-embodiment robot self-improvement via training-free Dynamics-Guided Action Correction, raising success rates from 40% to 81% on seven real-world tasks.

Learning to Fold: prizewinning solution at LeHome Challenge 2026 (1st place online, 2nd offline)

cs.RO · 2026-06-25 · unverdicted · none · ref 5 · internal anchor

A competition entry for bimanual garment folding won 1st in simulation and 2nd in reality by making a VLA policy predict its own value quantities to drive advantage estimation, failure detection, and action selection.
