Is Conditional Generative Modeling all you need for Decision-Making?
Pith reviewed 2026-05-15 15:30 UTC · model grok-4.3
The pith
A policy modeled as a return-conditional diffusion model generates effective decisions directly from offline data and outperforms traditional offline RL methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modeling a policy as a return-conditional diffusion model allows high-return action sequences to be generated directly from offline data without dynamic programming, producing policies that outperform existing offline RL approaches across standard benchmarks. Conditioning the same model on individual constraints or skills during training yields test-time behaviors that satisfy several constraints jointly or compose skills.
What carries the argument
Return-conditional diffusion model that generates action sequences from offline data conditioned on target returns.
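As a concrete illustration of this mechanism, the sketch below shows generic return-conditioned ancestral sampling with guided denoising. It is a minimal sketch under stated assumptions, not the paper's implementation: `eps_model` (a trained noise predictor that accepts a `None` condition), the guidance weight `w`, and the flat action-sequence parameterization are hypothetical stand-ins, and classifier-free guidance is one common conditioning choice rather than necessarily the paper's.

```python
import torch

# Minimal sketch of return-conditioned diffusion sampling with
# classifier-free guidance. All names (eps_model, w, tensor shapes)
# are hypothetical stand-ins, not the paper's exact design.
@torch.no_grad()
def sample_action_sequence(eps_model, betas, target_return, horizon, act_dim, w=1.2):
    """Reverse-diffuse Gaussian noise into an action sequence,
    steered toward the target return by guidance weight w."""
    alphas = 1.0 - betas                       # betas: 1-D tensor of noise levels
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, horizon, act_dim)       # x_K ~ N(0, I)
    for k in reversed(range(len(betas))):
        eps_c = eps_model(x, k, target_return)   # return-conditioned prediction
        eps_u = eps_model(x, k, None)            # unconditioned prediction
        eps = eps_u + w * (eps_c - eps_u)        # classifier-free guidance
        mean = (x - betas[k] / torch.sqrt(1.0 - alpha_bars[k]) * eps) / torch.sqrt(alphas[k])
        noise = torch.randn_like(x) if k > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[k]) * noise  # ancestral sampling step
    return x  # denoised action sequence a_{1:H}
```

At w = 0 this reduces to unconditional sampling; larger w pushes samples toward the conditioned return at the cost of diversity.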
If this is right
- Offline RL can be performed without explicit value functions or dynamic programming.
- A single model trained on individual constraints produces behaviors that satisfy multiple constraints jointly at test time.
- A single model trained on individual skills produces composed skill sequences at test time.
- Generative modeling advances can be applied directly to policy learning.
Where Pith is reading between the lines
- The approach may reduce the engineering overhead of maintaining separate value estimators and planners in deployed systems.
- Advances in faster or more controllable diffusion sampling could immediately improve decision-making speed without changing the RL pipeline.
- The same conditioning mechanism might be tested on datasets that mix multiple tasks to check whether one model can handle broader goal specifications.
Load-bearing premise
A diffusion model trained to generate actions conditioned only on returns can produce sequences that actually achieve those returns when executed in the environment.
What would settle it
If actions sampled from the trained diffusion model conditioned on a high target return produce substantially lower realized returns than that target when rolled out in the original or held-out environments, the central claim is false.
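This criterion is directly checkable. The sketch below assumes a Gymnasium-style environment and a hypothetical `policy.sample(obs, target_return)` that returns an action sequence; both are stand-ins, not the paper's interfaces.

```python
import numpy as np

# Hedged sketch of the falsification check: sample at a high target
# return, roll out open-loop, and measure the gap between target and
# realized return. `env` follows the Gymnasium API; `policy.sample`
# is a hypothetical stand-in for the trained diffusion policy.
def realized_vs_target(env, policy, target_return, n_rollouts=20):
    realized = []
    for _ in range(n_rollouts):
        obs, _ = env.reset()
        actions = policy.sample(obs, target_return)  # one action sequence
        total = 0.0
        for action in actions:
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            if terminated or truncated:
                break
        realized.append(total)
    gap = target_return - float(np.mean(realized))
    return float(np.mean(realized)), float(np.std(realized)), gap
```

A persistently large positive gap at high targets, across environments and seeds, would falsify the load-bearing premise.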
read the original abstract
Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that framing sequential decision-making as conditional generative modeling—specifically by training return-conditional diffusion models as policies on offline trajectories—outperforms standard offline RL methods on benchmarks while circumventing dynamic programming, value estimation, and Bellman backups. It further shows that conditioning on constraints or skills during training enables compositional behaviors (satisfying multiple constraints or combining skills) at test time.
Significance. If the central empirical claim holds and the diffusion model demonstrably generates trajectories with returns exceeding the maximum observed in the offline data, the work would represent a meaningful simplification of offline RL by removing the need for explicit value functions and backups. The compositional conditioning results would additionally strengthen the case for generative models in structured decision-making tasks.
major comments (3)
- [Abstract and §1] The claim that return-conditional diffusion models 'circumvent the need for dynamic programming' is load-bearing for the paper's contribution, yet standard score-matching training on the empirical conditional p(a_{1:T} | s_{1:T}, R) matches the support of the training distribution and supplies no explicit mechanism for reliable extrapolation to R values higher than those present in the dataset.
- [§4 (Experiments)] Benchmark outperformance is reported without accompanying analysis showing that sampled action sequences achieve returns strictly above the dataset maximum; without this check, the results are consistent with improved behavior cloning on already-high-return trajectories rather than a fundamental bypass of RL machinery.
- [§4 (Experiments)] The reported results lack details on the number of random seeds, statistical tests, and a direct comparison against a strong behavior-cloning baseline conditioned on the same high-return subset, all of which are required to rule out confounds and establish that the gains are attributable to the generative formulation (a sketch of these checks follows after this list).
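To make the requested confound checks concrete, here is a minimal sketch; all names are hypothetical, and `trajectories` is assumed to be a list of records carrying a scalar `return` field.

```python
import numpy as np

# Hedged sketch of the two checks asked for above: (a) the dataset's
# maximum trajectory return, the bar any 'beyond behavior cloning'
# claim must clear, and (b) the top-return subset a strong BC
# baseline would train on. `trajectories` is a hypothetical stand-in.
def dataset_max_and_top_subset(trajectories, quantile=0.9):
    returns = np.array([t["return"] for t in trajectories])
    dataset_max = returns.max()
    threshold = np.quantile(returns, quantile)
    top_subset = [t for t in trajectories if t["return"] >= threshold]
    return dataset_max, top_subset
```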
minor comments (2)
- [§3] Notation for the diffusion forward and reverse processes should be aligned with standard references (e.g., Ho et al.), and the conditioning variables (return, constraint, skill) should be explicitly denoted in all equations; a notation sketch follows after this list.
- [§4] Figure captions and axis labels in the experimental plots should indicate whether the plotted returns are normalized or raw and whether they reflect the maximum, mean, or median across sampled trajectories.
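For reference, a hedged sketch of the notation the first minor comment asks for, following Ho et al.'s DDPM conventions with an explicit conditioning variable y ∈ {return R, constraint, skill}; the manuscript's exact parameterization may differ.

```latex
% x_0 denotes the clean action sequence a_{1:T}; k indexes diffusion steps.
\begin{align}
  q(x_k \mid x_{k-1}) &= \mathcal{N}\!\big(x_k;\ \sqrt{1-\beta_k}\,x_{k-1},\ \beta_k I\big) \\
  p_\theta(x_{k-1} \mid x_k, y) &= \mathcal{N}\!\big(x_{k-1};\ \mu_\theta(x_k, k, y),\ \Sigma_k\big) \\
  \mathcal{L}(\theta) &= \mathbb{E}_{k,\,x_0,\,\epsilon}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_k}\,x_0 + \sqrt{1-\bar\alpha_k}\,\epsilon,\ k,\ y\big)\big\|^2\Big]
\end{align}
% with \alpha_k = 1 - \beta_k and \bar\alpha_k = \prod_{s=1}^{k} \alpha_s.
```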
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, providing clarifications on our claims and committing to revisions where empirical details or exposition can be strengthened. Our core argument is that the return-conditional diffusion formulation avoids explicit dynamic programming and value estimation during training, with empirical gains arising from the generative sampling procedure.
read point-by-point responses
-
Referee: [Abstract and §1] The claim that return-conditional diffusion models 'circumvent the need for dynamic programming' is load-bearing for the paper's contribution, yet standard score-matching training on the empirical conditional p(a_{1:T} | s_{1:T}, R) matches the support of the training distribution and supplies no explicit mechanism for reliable extrapolation to R values higher than those present in the dataset.
Authors: We agree that score-matching learns the empirical conditional and does not guarantee extrapolation. However, the circumvention claim refers to the training procedure: unlike offline RL, we perform no Bellman backups, value function learning, or dynamic programming. At inference we simply condition on a target return (including values above the dataset maximum) and sample from the learned model. Our experiments demonstrate that this yields trajectories whose realized returns exceed the dataset maximum on several benchmarks, indicating that the diffusion process can produce higher-return behavior even when trained only on observed data. We will revise §1 and the abstract to explicitly distinguish the training-time avoidance of DP from the inference-time conditioning mechanism. revision: partial
-
Referee: [§4 (Experiments)] Benchmark outperformance is reported without accompanying analysis showing that sampled action sequences achieve returns strictly above the dataset maximum; without this check, the results are consistent with improved behavior cloning on already-high-return trajectories rather than a fundamental bypass of RL machinery.
Authors: We will add a new analysis in §4 (and an accompanying figure) that reports, for each task, the maximum return present in the offline dataset versus the mean and distribution of returns obtained by sampling from the return-conditional diffusion model conditioned on a target return higher than that maximum. On the environments where we claim outperformance, the sampled trajectories do achieve returns strictly above the dataset maximum, supporting that the gains are not solely from reweighting high-return data. revision: yes
-
Referee: [§4 (Experiments)] The reported results lack details on the number of random seeds, statistical tests, and a direct comparison against a strong behavior-cloning baseline conditioned on the same high-return subset, all of which are required to rule out confounds and establish that the gains are attributable to the generative formulation.
Authors: We acknowledge these omissions. In the revised manuscript we will (i) report all results as mean ± standard deviation over 5 independent random seeds, (ii) include paired t-tests or Wilcoxon tests against baselines, and (iii) add a direct comparison to a behavior-cloning policy trained exclusively on the top-return trajectories (same return threshold used for conditioning the diffusion model). These additions will appear in §4 and the appendix. revision: yes
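A hedged sketch of the seed-level reporting promised in (i) and (ii); the arrays hold placeholder values for illustration, not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-seed returns (NOT results from the paper) to show
# the promised reporting: mean ± std over 5 seeds plus a paired
# Wilcoxon signed-rank test against the BC-on-top-returns baseline.
diffusion_returns = np.array([91.2, 88.7, 90.5, 89.9, 92.1])
bc_top_returns    = np.array([84.3, 85.1, 83.8, 86.0, 84.9])

print(f"diffusion: {diffusion_returns.mean():.1f} ± {diffusion_returns.std(ddof=1):.1f}")
print(f"BC (top-return subset): {bc_top_returns.mean():.1f} ± {bc_top_returns.std(ddof=1):.1f}")

stat, p = wilcoxon(diffusion_returns, bc_top_returns)
print(f"Wilcoxon statistic={stat:.1f}, p={p:.3f}")  # n=5: treat p as indicative only
```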
Circularity Check
No circularity; empirical training and benchmark evaluation
full rationale
The paper's central derivation consists of training a standard conditional diffusion model on offline trajectories to model p(a|s, R) and then sampling from it at test time conditioned on high returns. This is a direct modeling choice whose performance is assessed via empirical comparison to offline RL baselines on standard benchmarks. No step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise depends on self-citation chains or imported uniqueness theorems. The approach is validated against external data and benchmarks rather than against its own outputs.
Axiom & Free-Parameter Ledger
Recognition: no theorem link.
Forward citations
Cited by 20 Pith papers
-
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.
-
Muninn: Your Trajectory Diffusion Model But Faster
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
-
Decoupled Guidance Diffusion for Adaptive Offline Safe Reinforcement Learning
SDGD uses cost-conditioned classifier-free guidance plus reward guidance with feasible trajectory relabeling to generate safe high-reward trajectories that adapt to changing safety budgets in offline RL.
-
ZODIAC: Zero-shot Offline Diffusion for Inferring Multi-xApps Conflicts in Open Radio Access Networks
ZODIAC enables zero-shot inference of conflict-inducing conditions in O-RAN xApps from marginal offline data alone via uncertainty-penalized compositional diffusion reasoning.
-
Advantage-Guided Diffusion for Model-Based Reinforcement Learning
Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
-
Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation
RSBM exploits velocity field invariance across regularization levels to achieve over 94% cosine similarity and 92% success in visual navigation using only 3 integration steps.
-
Receding-Horizon Control via Drifting Models
Drifting MPC produces a unique distribution over trajectories that trades off data support against optimality and enables efficient receding-horizon planning under unknown dynamics.
-
Factorization Regret mediates compositional generalization in latent space
Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
RoboDreamer: Learning Compositional World Models for Robot Imagination
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
-
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
Accelerating trajectory optimization with Sobolev-trained diffusion policies
Sobolev-trained diffusion policies using trajectories and feedback gains provide warm-starts that reduce trajectory optimization solving time by 2x to 20x while avoiding compounding errors.
-
Real-Time Execution of Action Chunking Flow Policies
Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.
-
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
DP3 uses compact 3D representations from sparse point clouds inside diffusion policies to learn generalizable visuomotor skills from few demonstrations, reporting 24% gains in simulation and 85% success on real robots.
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
-
Insider Attacks in Multi-Agent LLM Consensus Systems
A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.
-
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.