pith. machine review for the scientific record.

arxiv: 2211.15657 · v4 · submitted 2022-11-28 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Is Conditional Generative Modeling all you need for Decision-Making?

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: diffusion models · offline reinforcement learning · conditional generative modeling · policy learning · decision making · constraints · skill composition

The pith

A policy modeled as a return-conditional diffusion model generates effective decisions directly from offline data and outperforms traditional offline RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes sequential decision-making as a problem of conditional generative modeling rather than reinforcement learning. It trains a diffusion model to produce sequences of actions conditioned on target returns, using only offline trajectories. This formulation avoids value estimation and dynamic programming, the core mechanisms of standard offline RL. The resulting policies achieve higher performance than existing offline RL methods on standard benchmarks. The same conditioning approach also supports generating behaviors that satisfy multiple constraints or compose multiple skills when trained on single constraints or skills.
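The training recipe the summary describes can be sketched in a few lines, assuming a DDPM-style noise-prediction objective; `denoiser` here is a hypothetical stand-in for the paper's network, and the scalar action is a toy simplification:

```python
import math
import random

def noising_step(action, alpha_bar):
    """DDPM forward process on a single action value:
    x_k = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    eps = random.gauss(0.0, 1.0)
    x_k = math.sqrt(alpha_bar) * action + math.sqrt(1.0 - alpha_bar) * eps
    return x_k, eps

def return_conditional_loss(denoiser, state, action, target_return, alpha_bars):
    """One training term: pick a diffusion step, noise the action, and
    score the denoiser's noise prediction conditioned on state and return.
    Note that no value function or Bellman backup appears anywhere."""
    k = random.randrange(len(alpha_bars))
    x_k, eps = noising_step(action, alpha_bars[k])
    eps_hat = denoiser(x_k, k, state, target_return)
    return (eps_hat - eps) ** 2
```

The point of the sketch is structural: the loss is plain supervised regression on noise, with the return entering only as a conditioning input.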

Core claim

By modeling a policy as a return-conditional diffusion model, high-return action sequences can be generated directly from offline data without dynamic programming, producing policies that outperform existing offline RL approaches across standard benchmarks. Conditioning the same model on a single constraint or skill during training yields test-time behaviors that satisfy several constraints together or compose multiple skills.

What carries the argument

Return-conditional diffusion model that generates action sequences from offline data conditioned on target returns.
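The abstract does not spell out the sampling rule, but a common mechanism for steering conditional diffusion models at inference is classifier-free guidance; a minimal sketch, with both noise predictions and the weight `w` as illustrative inputs rather than values from the paper:

```python
def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: blend the return-conditioned and the
    unconditioned noise predictions. w = 1 recovers the conditional
    model; w > 1 pushes samples further toward the target return."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# e.g. guided_noise(2.0, 1.0, 1.5) -> 2.5, extrapolating past the
# conditional prediction in the direction the condition indicates.
```

Each reverse diffusion step would denoise with this blended prediction instead of the raw conditional one.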

If this is right

  • Offline RL can be performed without explicit value functions or dynamic programming.
  • A single model trained with one constraint produces behaviors satisfying multiple constraints at test time.
  • A single model trained with individual skills produces composed skill sequences at test time.
  • Generative modeling advances can be applied directly to policy learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reduce the engineering overhead of maintaining separate value estimators and planners in deployed systems.
  • Advances in faster or more controllable diffusion sampling could immediately improve decision-making speed without changing the RL pipeline.
  • The same conditioning mechanism might be tested on datasets that mix multiple tasks to check whether one model can handle broader goal specifications.

Load-bearing premise

A diffusion model trained to generate actions conditioned only on returns can produce sequences that actually achieve those returns when executed in the environment.

What would settle it

If actions sampled from the trained diffusion model at a high target return achieve substantially lower realized returns when rolled out in the original or held-out environments, the central claim is false.
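That falsification test can be phrased as a small evaluation harness; `sample_actions` and `env_rollout` are hypothetical hooks into the trained model and the environment, not names from the paper:

```python
def conditioning_gap(sample_actions, env_rollout, target_return, n_rollouts=32):
    """Roll out n action sequences sampled at a given target return and
    report the mean realized return minus the conditioned return.
    A large negative gap at high targets would falsify the central claim."""
    realized = [env_rollout(sample_actions(target_return))
                for _ in range(n_rollouts)]
    return sum(realized) / n_rollouts - target_return
```

A gap near zero (or positive) at targets above the dataset maximum is exactly the evidence the load-bearing premise needs.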

read the original abstract

Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that framing sequential decision-making as conditional generative modeling—specifically by training return-conditional diffusion models as policies on offline trajectories—outperforms standard offline RL methods on benchmarks while circumventing dynamic programming, value estimation, and Bellman backups. It further shows that conditioning on constraints or skills during training enables compositional behaviors (satisfying multiple constraints or combining skills) at test time.

Significance. If the central empirical claim holds and the diffusion model demonstrably generates trajectories with returns exceeding the maximum observed in the offline data, the work would represent a meaningful simplification of offline RL by removing the need for explicit value functions and backups. The compositional conditioning results would additionally strengthen the case for generative models in structured decision-making tasks.

major comments (3)
  1. [Abstract and §1] Abstract and §1: the claim that return-conditional diffusion models 'circumvent the need for dynamic programming' is load-bearing for the paper's contribution, yet standard score-matching training on the empirical conditional p(a_{1:T}|s_{1:T},R) matches the support of the training distribution and supplies no explicit mechanism for reliable extrapolation to R values higher than those present in the dataset.
  2. [§4 (Experiments)] §4 (Experiments): benchmark outperformance is reported without accompanying analysis showing that sampled action sequences achieve returns strictly above the dataset maximum; without this check, the results are consistent with improved behavior cloning on already-high-return trajectories rather than a fundamental bypass of RL machinery.
  3. [§4 (Experiments)] §4 (Experiments): the reported results lack details on the number of random seeds, statistical tests, and direct comparison against a strong behavior-cloning baseline conditioned on the same high-return subset, which are required to rule out confounds and establish that gains are attributable to the generative formulation.
minor comments (2)
  1. [§3] Notation for the diffusion forward and reverse processes should be aligned with standard references (e.g., Ho et al.) and the conditioning variables (return, constraint, skill) should be explicitly denoted in all equations.
  2. [§4] Figure captions and axis labels in the experimental plots should indicate whether the plotted returns are normalized or raw and whether they reflect the maximum, mean, or median across sampled trajectories.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, providing clarifications on our claims and committing to revisions where empirical details or exposition can be strengthened. Our core argument is that the return-conditional diffusion formulation avoids explicit dynamic programming and value estimation during training, with empirical gains arising from the generative sampling procedure.

read point-by-point responses
  1. Referee: [Abstract and §1] the claim that return-conditional diffusion models 'circumvent the need for dynamic programming' is load-bearing for the paper's contribution, yet standard score-matching training on the empirical conditional p(a_{1:T}|s_{1:T},R) matches the support of the training distribution and supplies no explicit mechanism for reliable extrapolation to R values higher than those present in the dataset.

    Authors: We agree that score-matching learns the empirical conditional and does not guarantee extrapolation. However, the circumvention claim refers to the training procedure: unlike offline RL, we perform no Bellman backups, value function learning, or dynamic programming. At inference we simply condition on a target return (including values above the dataset maximum) and sample from the learned model. Our experiments demonstrate that this yields trajectories whose realized returns exceed the dataset maximum on several benchmarks, indicating that the diffusion process can produce higher-return behavior even when trained only on observed data. We will revise §1 and the abstract to explicitly distinguish the training-time avoidance of DP from the inference-time conditioning mechanism. revision: partial

  2. Referee: [§4 (Experiments)] benchmark outperformance is reported without accompanying analysis showing that sampled action sequences achieve returns strictly above the dataset maximum; without this check, the results are consistent with improved behavior cloning on already-high-return trajectories rather than a fundamental bypass of RL machinery.

    Authors: We will add a new analysis in §4 (and an accompanying figure) that reports, for each task, the maximum return present in the offline dataset versus the mean and distribution of returns obtained by sampling from the return-conditional diffusion model conditioned on a target return higher than that maximum. On the environments where we claim outperformance, the sampled trajectories do achieve returns strictly above the dataset maximum, supporting that the gains are not solely from reweighting high-return data. revision: yes

  3. Referee: [§4 (Experiments)] the reported results lack details on the number of random seeds, statistical tests, and direct comparison against a strong behavior-cloning baseline conditioned on the same high-return subset, which are required to rule out confounds and establish that gains are attributable to the generative formulation.

    Authors: We acknowledge these omissions. In the revised manuscript we will (i) report all results as mean ± standard deviation over 5 independent random seeds, (ii) include paired t-tests or Wilcoxon tests against baselines, and (iii) add a direct comparison to a behavior-cloning policy trained exclusively on the top-return trajectories (same return threshold used for conditioning the diffusion model). These additions will appear in §4 and the appendix. revision: yes
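The analyses the authors commit to in points 2 and 3 reduce to a few simple checks; a stdlib sketch, with per-seed scores and trajectory returns as plain lists (the dict key `"return"` is an assumed data layout):

```python
import statistics

def frac_above_dataset_max(dataset_returns, sampled_returns):
    """Point 2: fraction of sampled rollouts whose realized return is
    strictly above the best trajectory in the offline dataset."""
    d_max = max(dataset_returns)
    return sum(1 for r in sampled_returns if r > d_max) / len(sampled_returns)

def seed_summary(returns_by_seed):
    """Point 3(i): mean and standard deviation over independent seeds."""
    return statistics.mean(returns_by_seed), statistics.stdev(returns_by_seed)

def top_return_subset(trajectories, threshold):
    """Point 3(iii): data for the behavior-cloning baseline, filtered by
    the same return threshold used to condition the diffusion model."""
    return [t for t in trajectories if t["return"] >= threshold]
```

If `frac_above_dataset_max` stays at zero across tasks, the referee's behavior-cloning interpretation survives; a consistently positive fraction is what the rebuttal promises to show.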

Circularity Check

0 steps flagged

No circularity; empirical training and benchmark evaluation

full rationale

The paper's central derivation consists of training a standard conditional diffusion model on offline trajectories to model p(a|s, R) and then sampling from it at test time conditioned on high returns. This is a direct modeling choice whose performance is assessed via empirical comparison to offline RL baselines on standard benchmarks. No step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise depends on self-citation chains or imported uniqueness theorems. The claims are validated against external data and benchmarks rather than against the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract names no explicit free parameters, axioms, or invented entities; the method builds directly on existing conditional diffusion models and standard offline RL benchmarks.

pith-pipeline@v0.9.0 · 5471 in / 949 out tokens · 33130 ms · 2026-05-15T15:30:08.057734+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    cs.RO 2023-03 accept novelty 8.0

    Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.

  2. Muninn: Your Trajectory Diffusion Model But Faster

    cs.RO 2026-05 unverdicted novelty 7.0

    Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

  3. Decoupled Guidance Diffusion for Adaptive Offline Safe Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    SDGD uses cost-conditioned classifier-free guidance plus reward guidance with feasible trajectory relabeling to generate safe high-reward trajectories that adapt to changing safety budgets in offline RL.

  4. ZODIAC: Zero-shot Offline Diffusion for Inferring Multi-xApps Conflicts in Open Radio Access Networks

    cs.NI 2026-04 unverdicted novelty 7.0

    ZODIAC enables zero-shot inference of conflict-inducing conditions in O-RAN xApps from marginal offline data alone via uncertainty-penalized compositional diffusion reasoning.

  5. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  6. Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

    cs.RO 2026-04 unverdicted novelty 7.0

    RSBM exploits velocity field invariance across regularization levels to achieve over 94% cosine similarity and 92% success in visual navigation using only 3 integration steps.

  7. Receding-Horizon Control via Drifting Models

    cs.AI 2026-04 unverdicted novelty 7.0

    Drifting MPC produces a unique distribution over trajectories that trades off data support against optimality and enables efficient receding-horizon planning under unknown dynamics.

  8. Factorization Regret mediates compositional generalization in latent space

    cs.LG 2026-03 unverdicted novelty 7.0

    Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.

  9. Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

    cs.RO 2026-03 conditional novelty 7.0

    GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

  10. RoboDreamer: Learning Compositional World Models for Robot Imagination

    cs.RO 2024-04 unverdicted novelty 7.0

    RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.

  11. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    cs.LG 2022-08 unverdicted novelty 7.0

    Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.

  12. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  13. Accelerating trajectory optimization with Sobolev-trained diffusion policies

    cs.LG 2026-04 unverdicted novelty 6.0

    Sobolev-trained diffusion policies using trajectories and feedback gains provide warm-starts that reduce trajectory optimization solving time by 2x to 20x while avoiding compounding errors.

  14. Real-Time Execution of Action Chunking Flow Policies

    cs.RO 2025-06 unverdicted novelty 6.0

    Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.

  15. 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

    cs.RO 2024-03 unverdicted novelty 6.0

    DP3 uses compact 3D representations from sparse point clouds inside diffusion policies to learn generalizable visuomotor skills from few demonstrations, reporting 24% gains in simulation and 85% success on real robots.

  16. Training Diffusion Models with Reinforcement Learning

    cs.LG 2023-05 unverdicted novelty 6.0

    DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

  17. IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    cs.LG 2023-04 conditional novelty 6.0

    IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

  18. Insider Attacks in Multi-Agent LLM Consensus Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.

  19. Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.

  20. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 20 Pith papers · 22 internal anchors
