pith. machine review for the scientific record.

arxiv: 1910.00177 · v3 · submitted 2019-10-01 · 💻 cs.LG · stat.ML

Recognition: no theorem link

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine

Pith reviewed 2026-05-11 13:18 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords advantage-weighted regression · off-policy reinforcement learning · supervised learning · experience replay · static datasets · continuous control · value function regression · policy regression

The pith

Advantage-weighted regression turns two supervised learning steps into a simple and effective off-policy RL algorithm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops advantage-weighted regression to perform off-policy reinforcement learning using only standard supervised learning techniques. One step regresses a value function onto targets from replay data; the second regresses the policy onto actions weighted by their estimated advantages. This design lets the method draw on off-policy experience without on-policy sampling or complex constraints. Experiments on OpenAI Gym tasks show performance that matches established state-of-the-art algorithms, and the approach learns stronger policies than most off-policy methods when given only a fixed static dataset.

Core claim

Advantage-weighted regression consists of regressing onto target values for a value function and then regressing onto weighted target actions for the policy. The weighting uses advantages computed from the value estimates, allowing the policy to be improved from off-policy data in experience replay. The method uses convergent maximum-likelihood losses, supports both continuous and discrete actions, and requires only a few lines of code atop standard supervised learners. It achieves competitive results against well-established RL algorithms on benchmark tasks and outperforms most off-policy baselines when trained on purely static datasets without further environment interaction.

What carries the argument

Advantage-weighted regression: a two-step supervised process that first fits a value function and then regresses the policy onto actions selected in proportion to their advantage estimates from replay data.
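The two-step structure can be sketched with linear function approximators on synthetic replay data. This is an illustration, not the paper's code: the feature dimension, the temperature β, and the weight-clipping threshold below are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic replay buffer: state features, actions, and return targets.
states = rng.normal(size=(256, 4))
actions = rng.normal(size=(256, 1))
returns = states @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.1, size=256)

# Step 1: supervised regression of a linear value function onto the returns.
v_w, *_ = np.linalg.lstsq(states, returns, rcond=None)
values = states @ v_w

# Step 2: supervised regression of the policy onto actions, each sample
# weighted by exp(advantage / beta); clipping keeps the weights finite.
beta = 1.0
advantages = returns - values
weights = np.exp(np.clip(advantages / beta, None, 20.0))

# Weighted least squares, via sqrt-weight scaling of both sides.
scale = np.sqrt(weights)[:, None]
pi_w, *_ = np.linalg.lstsq(states * scale, actions * scale, rcond=None)
```

Both steps are plain regressions on replay data; no sampling from the current policy is needed, which is exactly what makes the update off-policy.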

If this is right

  • The algorithm can acquire effective policies from purely static datasets with no additional environment interactions.
  • It applies directly to both continuous and discrete action spaces.
  • Implementation requires only standard supervised learning routines plus advantage weighting.
  • Theoretical analysis shows that off-policy data from experience replay can be incorporated without breaking convergence of the regression steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method's reliance on replay data suggests it may integrate easily with existing supervised-learning pipelines that already collect large offline datasets.
  • Performance on static data implies potential use in domains where new interactions are expensive or unsafe.
  • The two-step structure could be combined with modern function approximators to scale to higher-dimensional control problems without redesigning the core update rule.

Load-bearing premise

Advantage estimates derived from replay data remain sufficiently accurate and unbiased to guide policy regression toward improvement without on-policy sampling or additional constraints.

What would settle it

A controlled test in which advantage estimates computed from replay data are deliberately biased or high-variance, followed by observation that the resulting policy fails to improve or underperforms on-policy baselines.
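One way to prototype such a test is to compare the effective sample size of the exponentiated weights under clean versus deliberately noise-corrupted advantage estimates. The noise scale and β below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.5
adv = rng.normal(size=1000)                          # stand-in "clean" advantage estimates
adv_noisy = adv + rng.normal(scale=2.0, size=1000)   # deliberately high-variance estimates

def effective_sample_size(weights):
    # Roughly how many samples actually drive a weighted regression.
    return weights.sum() ** 2 / (weights ** 2).sum()

w_clean = np.exp(adv / beta)
w_noisy = np.exp(adv_noisy / beta)
```

With high-variance estimates the exponentiated weights concentrate on a handful of transitions, so the policy regression is effectively fit to very few samples, which is the failure mode the proposed test would surface.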

read the original abstract

In this paper, we aim to develop a simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines. Our goal is an algorithm that utilizes only simple and convergent maximum likelihood loss functions, while also being able to leverage off-policy data. Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. The method is simple and general, can accommodate continuous and discrete actions, and can be implemented in just a few lines of code on top of standard supervised learning methods. We provide a theoretical motivation for AWR and analyze its properties when incorporating off-policy data from experience replay. We evaluate AWR on a suite of standard OpenAI Gym benchmark tasks, and show that it achieves competitive performance compared to a number of well-established state-of-the-art RL algorithms. AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions. Furthermore, we demonstrate our algorithm on challenging continuous control tasks with highly complex simulated characters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Advantage-Weighted Regression (AWR), an off-policy RL algorithm consisting of two supervised-learning steps: regressing a value function onto targets and regressing the policy onto actions weighted by exp(A/β), where A denotes advantage estimates. It provides a theoretical motivation linking the method to policy iteration, analyzes its behavior with experience replay, and reports competitive results against established off-policy algorithms on OpenAI Gym continuous-control benchmarks while claiming superior performance when learning from purely static datasets with no further environment interactions.

Significance. If the empirical claims hold with proper statistical support, AWR would be a meaningful contribution by reducing off-policy RL to standard, convergent maximum-likelihood regression steps that are simple to implement and scale. The static-dataset results, if robust, would be particularly notable for offline RL applications where additional interactions are unavailable.

major comments (2)
  1. [§3] Off-policy analysis: The derivation of policy improvement via advantage-weighted regression assumes that advantage estimates computed from replay data remain sufficiently accurate and unbiased to produce improvement. No importance-sampling correction or bound on distribution-shift bias appears in the update rules or analysis, yet this assumption is load-bearing for the static-dataset claims, where no fresh interactions mitigate mismatch between the behavior policy and the current policy.
  2. [Experiments] Experiments section and abstract: The central performance claims (competitive benchmark results and superiority on static datasets) are stated without quantitative numbers, error bars, or statistical tests in the provided text. This makes it impossible to verify the strength of evidence supporting the off-policy and offline claims.
minor comments (2)
  1. [Method] The temperature parameter β is introduced without a clear discussion of its sensitivity or selection procedure in the method description.
  2. [Method] Notation for the value-function target and advantage computation should be made consistent between the theoretical motivation and the algorithm pseudocode.
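For contrast with the first major comment: a density-ratio correction would multiply each AWR weight by π(a|s)/μ(a|s). A hedged sketch with one-dimensional Gaussian policies follows; the means and scales are illustrative assumptions, not the paper's method.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(2)
beta = 1.0
actions = rng.normal(loc=0.0, scale=1.0, size=512)  # drawn from behavior policy mu
advantages = rng.normal(size=512)

w_awr = np.exp(advantages / beta)  # AWR weight: no density-ratio term

# Hypothetical importance-sampling correction for a current policy pi
# whose mean has drifted away from the behavior policy's.
ratio = gaussian_pdf(actions, 0.5, 1.0) / gaussian_pdf(actions, 0.0, 1.0)
w_corrected = ratio * w_awr
```

The gap between `w_awr` and `w_corrected` grows with the drift between the two policies, which is why the omission matters most in the static-dataset setting.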

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the theoretical assumptions and strengthening the empirical presentation.

read point-by-point responses
  1. Referee: [§3] Off-policy analysis: The derivation of policy improvement via advantage-weighted regression assumes that advantage estimates computed from replay data remain sufficiently accurate and unbiased to produce improvement. No importance-sampling correction or bound on distribution-shift bias appears in the update rules or analysis, yet this assumption is load-bearing for the static-dataset claims, where no fresh interactions mitigate mismatch between the behavior policy and the current policy.

    Authors: We appreciate this observation on the analysis in Section 3. The derivation shows that the advantage-weighted regression step corresponds to a policy improvement operator (in the sense of the policy improvement theorem) when advantage estimates are accurate, providing the link to policy iteration. The off-policy analysis with experience replay examines how the method can still yield improvement when the replay buffer provides sufficient coverage. We agree that the analysis does not include an explicit importance-sampling correction or a rigorous bound on distribution-shift bias, and that this assumption is particularly relevant for the static-dataset experiments. In the revised manuscript we will expand Section 3 to state these assumptions more explicitly and discuss their implications for the offline setting. revision: yes

  2. Referee: [Experiments] Experiments section and abstract: The central performance claims (competitive benchmark results and superiority on static datasets) are stated without quantitative numbers, error bars, or statistical tests in the provided text. This makes it impossible to verify the strength of evidence supporting the off-policy and offline claims.

    Authors: The abstract and text summarize the results qualitatively, while the quantitative evidence (learning curves with error bars across multiple random seeds) appears in the figures of the Experiments section. To improve verifiability from the text itself, we will add a summary table of final performance values (means and standard deviations) for all algorithms and tasks, along with the number of seeds used. This will allow direct assessment of the competitive and offline results without relying solely on the figures. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent supervised regression steps

full rationale

The paper derives AWR from standard policy iteration and maximum-likelihood regression on advantage-weighted targets. Value function regression produces targets that are then used to weight policy regression, but these are distinct supervised learning subroutines with no reduction of the policy update to quantities defined solely by the fitted parameters themselves. Off-policy analysis is motivated via replay buffer properties without self-definitional loops or load-bearing self-citations that collapse the central claim. The static-dataset experiments rely on the same replay-derived advantages but do not create a tautological prediction by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard RL concepts (MDP, value functions, advantages) plus the domain assumption that advantage-weighted regression yields policy improvement; no new entities are introduced.

free parameters (1)
  • advantage weighting temperature
    Likely controls sharpness of advantage weighting in the policy regression loss; value not stated in abstract.
axioms (1)
  • domain assumption Advantage estimates from replay data can be used to weight policy regression targets without introducing prohibitive bias.
    Core premise stated in the abstract's theoretical motivation and off-policy analysis.
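The sensitivity flagged for the temperature can be seen directly: a small β makes the weighting nearly greedy, a large β nearly uniform. Since the paper's value is not stated in the abstract, the β values below are purely illustrative.

```python
import numpy as np

def normalized_weights(advantages, beta):
    w = np.exp(advantages / beta)
    return w / w.sum()

adv = np.array([-1.0, 0.0, 1.0])
sharp = normalized_weights(adv, beta=0.1)   # nearly all mass on the best action
soft = normalized_weights(adv, beta=10.0)   # close to uniform
```

At β = 0.1 the best action receives essentially all of the weight, while at β = 10 the weights are almost flat, so the effective behavior ranges from greedy imitation of the best replayed actions to plain behavior cloning.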

pith-pipeline@v0.9.0 · 5509 in / 1288 out tokens · 50592 ms · 2026-05-11T13:18:09.534528+00:00 · methodology


Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. Offline Reinforcement Learning with Implicit Q-Learning

    cs.LG 2021-10 unverdicted novelty 8.0

    IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

  3. D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    cs.LG 2020-04 accept novelty 8.0

    D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

  4. Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Switching successor measures extend classical successor measures to enable hierarchical zero-shot RL via the FB π-Switch algorithm that extracts subgoal-selection and control policies from forward-backward representations.

  5. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

  6. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

  7. Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights

    cs.LG 2026-05 unverdicted novelty 7.0

    AB-SID-iVAR enables Gaussian process active learning for self-induced Boltzmann distributions by closed-form approximation of the target, with high-probability error vanishing guarantees and empirical gains on PES and...

  8. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  9. Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

  10. SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Survival value learning expresses the goal-conditioned value function as a discounted sum of survival probabilities and estimates it with maximum-likelihood hazard models on censored data, matching or exceeding TD bas...

  11. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  12. KTO: Model Alignment as Prospect Theoretic Optimization

    cs.LG 2024-02 conditional novelty 7.0

    KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

  13. Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    cs.LG 2023-05 accept novelty 7.0

    DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

  14. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  15. Decaf: Improving Neural Decompilation with Automatic Feedback and Search

    cs.SE 2026-05 unverdicted novelty 6.0

    Decaf uses compiler feedback and search to improve neural decompilation, boosting semantic success rate from 26.0% to 83.9% on ExeBench Real -O2 split.

  16. Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

    cs.LG 2026-05 unverdicted novelty 6.0

    DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.

  17. Implicit Preference Alignment for Human Image Animation

    cs.CV 2026-05 unverdicted novelty 6.0

    IPA aligns animation models for superior hand quality via implicit reward maximization on self-generated samples plus hand-focused local optimization, avoiding expensive paired data.

  18. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  19. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  20. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  21. AdamO: A Collapse-Suppressed Optimizer for Offline RL

    cs.LG 2026-05 unverdicted novelty 6.0

    AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

  22. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  23. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  24. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  25. When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero succe...

  26. Beyond Importance Sampling: Rejection-Gated Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

  27. Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

    cs.RO 2026-04 unverdicted novelty 6.0

    WHOLE-MoMa improves whole-body mobile manipulation by applying offline RL with Q-chunking to demonstrations from randomized sub-optimal controllers, outperforming baselines and transferring to real robots without tele...

  28. MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

    cs.RO 2026-04 unverdicted novelty 6.0

    MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

  29. PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit--Explicit Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyMix unifies a new multi-aspect physics evaluator with implicit policy optimization and explicit test-time correction to produce single-image 3D indoor scenes that are both visually faithful and physically plausible.

  30. Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    VGM²P achieves SOTA-comparable performance in offline MARL via value-guided conditional behavior cloning with MeanFlow, enabling efficient single-step action generation insensitive to regularization coefficients.

  31. Target Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.

  32. Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.

  33. $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    cs.LG 2025-11 unverdicted novelty 6.0

    RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

  34. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  35. Training Diffusion Models with Reinforcement Learning

    cs.LG 2023-05 unverdicted novelty 6.0

    DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

  36. IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    cs.LG 2023-04 conditional novelty 6.0

    IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

  37. Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.

  38. Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    ME-AM adds mirror-descent entropy maximization and a mixture behavior prior to adjoint matching in flow-based policies to mitigate popularity bias and support binding in offline RL.

  39. Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    ME-AM adds entropy regularization and a mixture prior to adjoint matching in flow-based offline RL to extract better multi-modal policies from limited data.

  40. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

  41. Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.

  42. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    cs.LG 2020-05 unverdicted novelty 2.0

    Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
