pith. machine review for the scientific record.

arxiv: 2310.06114 · v3 · submitted 2023-10-09 · 💻 cs.AI

Recognition: 2 Lean theorem links

Learning Interactive Real-World Simulators

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:03 UTC · model grok-4.3

classification 💻 cs.AI
keywords: generative modeling · real-world simulator · sim-to-real transfer · embodied AI · vision-language policies · reinforcement learning · zero-shot deployment · interactive simulation

The pith

A generative model simulates real-world interactions from static image, robotics, and navigation datasets to enable zero-shot policy transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that careful orchestration of complementary static datasets can train a single generative model to predict realistic visual outcomes of high-level instructions and low-level controls. This universal simulator supports end-to-end training of vision-language and reinforcement learning policies entirely inside simulation. Policies trained this way deploy directly in the physical world without further adaptation. The approach also improves auxiliary tasks such as video captioning by supplying additional simulated experience.
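
To make the two-phase recipe concrete, here is a minimal runnable sketch of the shape of the claim; the 1-D linear dynamics, the least-squares "simulator," and the hand-written policy are illustrative stand-ins, not the paper's video-model architecture.

    import random

    # Hedged toy of the two-phase recipe: fit a simulator from logged static
    # data, then evaluate a policy that only ever used the simulator. The real
    # UniSim is an action-conditioned video model; here the "scene" is one
    # 1-D coordinate and the "simulator" is a least-squares dynamics estimate.
    random.seed(0)

    true_step = lambda s, a: s + 0.9 * a                  # unknown real dynamics

    # Phase 0: static logged transitions (state, action, next state) with noise.
    data = [(s, a, true_step(s, a) + random.gauss(0, 0.05))
            for s in (random.uniform(-1, 1) for _ in range(200))
            for a in (-1.0, 1.0)]

    # Phase 1: fit the simulator -- least-squares estimate of the action gain.
    gain = sum((s2 - s) * a for s, a, s2 in data) / sum(a * a for _, a, _ in data)
    sim_step = lambda s, a: s + gain * a                  # the learned simulator

    # Phase 2: a policy developed purely against sim_step (here: greedy to 0).
    policy = lambda s: -1.0 if s > 0 else 1.0

    def final_error(step_fn, s=5.0, horizon=8):
        for _ in range(horizon):
            s = step_fn(s, policy(s))
        return abs(s)

    print("error inside simulator:", round(final_error(sim_step), 3))
    print("error on real dynamics:", round(final_error(true_step), 3))  # zero-shot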

Core claim

We present UniSim, a generative model that learns to simulate realistic visual responses to actions in real-world scenes. By integrating abundant objects from image data, densely sampled actions from robotics data, and diverse movements from navigation data, the model generates plausible outcomes for instructions such as opening a drawer even when starting from otherwise static scenes. High-level vision-language policies and low-level reinforcement learning policies trained exclusively in this simulator transfer to real-world deployment with no additional real-world data.

What carries the argument

UniSim, the generative simulator that predicts action-conditioned visual changes by orchestrating complementary information across image, robotics, and navigation datasets.
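
Read formally (our notation, not the paper's), the carrying claim is that a single conditional next-observation model can be fit across the pooled sources:

    p_\theta(o_{t+1} \mid o_t, a_t), \qquad
    \hat{\theta} = \arg\max_{\theta} \sum_{D \in \{D_{\mathrm{image}},\, D_{\mathrm{robot}},\, D_{\mathrm{nav}}\}} w_D \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim D} \left[ \log p_\theta(o_{t+1} \mid o_t, a_t) \right]

where a_t ranges over both text instructions and low-level controls, and the mixing weights w_D are the free parameters flagged in the ledger below.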

If this is right

  • High-level vision-language policies trained purely in simulation transfer to real-world execution without adaptation.
  • Low-level reinforcement learning policies trained in simulation transfer directly to physical robots.
  • Simulated experience generated by the model improves training of video captioning systems.
  • The same simulator supports controllable generation of interactive content for games and movies.
  • Embodied agents for both high-level planning and low-level control can be developed entirely in simulation before real-world use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Extending the orchestration to include audio or tactile data could capture richer multi-modal dynamics.
  • The method implies that dataset composition rather than explicit physics engines may be the dominant route to closing the sim-to-real gap.
  • Scaling the approach to longer-horizon multi-step tasks would test whether current dataset coverage is enough for complex sequences.
  • Policies trained this way could serve as initial seeds for continued real-world fine-tuning, reducing overall data needs.

Load-bearing premise

Careful orchestration of existing static image, robotics, and navigation datasets is sufficient to capture the full interactive dynamics needed for zero-shot real-world transfer.

What would settle it

Train a low-level policy in the simulator to perform a specific action such as drawer opening and observe whether it succeeds or fails when executed on the matching physical robot and scene.
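
A hedged sketch of that settling experiment as an evaluation harness; the toy DrawerEnv, the friction gap between "sim" and "real," and the fixed pull policy are all invented stand-ins for robot and model interfaces the page does not expose.

    import random

    # Hedged toy of the settling experiment: score the same policy inside the
    # learned simulator and on the "real" drawer, whose friction differs (the
    # sim-to-real gap). On hardware, real.run would command the physical robot.
    random.seed(1)

    class DrawerEnv:
        """State is drawer openness in [0, 1]; success means opened past 0.95."""
        def __init__(self, friction, noise=0.0):
            self.friction, self.noise = friction, noise

        def run(self, policy, steps=30):
            x = 0.0
            for _ in range(steps):
                pull = policy(x) * (1.0 - self.friction)
                x = min(1.0, max(0.0, x + pull + random.gauss(0, self.noise)))
            return x > 0.95                               # did the drawer open?

    sim = DrawerEnv(friction=0.10)                        # learned simulator
    real = DrawerEnv(friction=0.18, noise=0.02)           # physical robot + scene

    policy = lambda x: 0.05                               # steady pull tuned in sim

    trials = 100
    print("success in simulation:", sum(sim.run(policy) for _ in range(trials)) / trials)
    print("success on real robot:", sum(real.run(policy) for _ in range(trials)) / trials)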

read the original abstract

Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as "open the drawer" and low-level controls from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces UniSim, a generative model trained via careful orchestration of static image, robotics trajectory, and navigation datasets to simulate visual outcomes of both high-level instructions and low-level controls. It claims this simulator enables training of vision-language policies and RL policies that transfer zero-shot to the real world, while also providing auxiliary benefits to video captioning models.

Significance. If the zero-shot transfer results hold under rigorous evaluation, the work would be significant for embodied AI by demonstrating that orchestrated static datasets can substitute for expensive interactive data collection in sim-to-real transfer. The dataset-mixing approach is a practical contribution that could scale with existing internet-scale corpora.

major comments (2)
  1. [Abstract] Abstract and Results: The central claim of successful zero-shot real-world deployment of both high-level vision-language policies and low-level RL policies is stated without any quantitative metrics, baseline comparisons, or failure-case analysis. This leaves the fidelity of the learned dynamics unverified and the transfer guarantee unsupported by evidence.
  2. [§3 and §4] §3 (Dataset Orchestration) and §4 (Policy Training): The assumption that mixing static datasets induces accurate multi-step causal dynamics and action effects is load-bearing for the zero-shot claim, yet no explicit long-horizon prediction tests or action-effect matching experiments are reported to validate extrapolation beyond the short-sequence source data.
minor comments (1)
  1. [§3] The manuscript would benefit from a dedicated subsection detailing the exact mixing weights, sampling strategy, and loss balancing used in the orchestration step, as these are free parameters that directly affect reproducibility.
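
To illustrate why those free parameters matter for reproducibility, here is a minimal sketch of one plausible orchestration scheme, weighted source sampling; the weights and source labels are invented for illustration, not values from the paper.

    import random
    from collections import Counter

    # Hedged sketch of orchestration as weighted sampling over heterogeneous
    # sources. The weights are invented free parameters, not values from the
    # paper; shifting them changes what the simulator sees during training,
    # which is exactly the reproducibility concern raised above.
    random.seed(0)

    SOURCE_WEIGHTS = {
        "image (abundant objects)": 0.5,
        "robotics (dense actions)": 0.3,
        "navigation (diverse movements)": 0.2,
    }

    def sample_source():
        names, weights = zip(*SOURCE_WEIGHTS.items())
        return random.choices(names, weights=weights, k=1)[0]

    # One simulated epoch of mixing: the realized batch composition.
    counts = Counter(sample_source() for _ in range(10_000))
    for name, n in counts.most_common():
        print(f"{name}: {n / 10_000:.2f}")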

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and commit to revisions that provide additional quantitative support and validation experiments.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Results: The central claim of successful zero-shot real-world deployment of both high-level vision-language policies and low-level RL policies is stated without any quantitative metrics, baseline comparisons, or failure-case analysis. This leaves the fidelity of the learned dynamics unverified and the transfer guarantee unsupported by evidence.

    Authors: We thank the referee for highlighting the need for clearer quantitative support in the abstract. The full manuscript reports quantitative success rates for zero-shot real-world policy deployment in Section 5, along with baseline comparisons and discussion of failure modes. To address this directly, we will revise the abstract to summarize these key metrics and evidence. revision: yes

  2. Referee: [§3 and §4] §3 (Dataset Orchestration) and §4 (Policy Training): The assumption that mixing static datasets induces accurate multi-step causal dynamics and action effects is load-bearing for the zero-shot claim, yet no explicit long-horizon prediction tests or action-effect matching experiments are reported to validate extrapolation beyond the short-sequence source data.

    Authors: We agree that explicit long-horizon validation would strengthen the paper. While the policy training results in Section 4 demonstrate effective multi-step real-world control (which depends on accurate causal dynamics), we did not include standalone long-horizon prediction benchmarks. We will add dedicated long-horizon prediction accuracy tests and action-effect matching experiments on extended sequences in the revised manuscript. revision: yes
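
As a sense of what such a long-horizon test would measure, here is a toy open-loop rollout check; the deliberately mis-estimated one-step model is an invented stand-in for the learned simulator.

    # Hedged toy of a long-horizon prediction test: roll a slightly wrong
    # one-step model forward open-loop against the true dynamics and watch the
    # error compound with horizon. The 0.02 one-step bias is invented; a real
    # benchmark would compare generated video rollouts to held-out trajectories.
    def true_step(s):
        return 0.95 * s + 0.5

    def model_step(s):
        return 0.93 * s + 0.5                             # small one-step bias

    def open_loop_error(horizon, s0=10.0):
        s_true = s_pred = s0
        for _ in range(horizon):
            s_true, s_pred = true_step(s_true), model_step(s_pred)
        return abs(s_true - s_pred)

    for h in (1, 5, 20, 80):
        print(f"horizon {h:3d}: open-loop error {open_loop_error(h):.3f}")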

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that heterogeneous datasets can be combined to approximate full interactive experience, plus standard generative modeling assumptions; no new entities are postulated and free parameters are the usual training hyperparameters.

free parameters (1)
  • dataset mixing weights
    Relative sampling rates or loss weights across image, robotics, and navigation sources are chosen to produce the reported simulator behavior.
axioms (1)
  • domain assumption: Diverse static and action datasets can be orchestrated to simulate full real-world interaction dynamics
    Invoked when the authors state that careful orchestration enables simulation of both high-level instructions and low-level controls from otherwise static scenes.

pith-pipeline@v0.9.0 · 5563 in / 1207 out tokens · 55000 ms · 2026-05-16T02:03:50.905362+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  3. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  4. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.

  5. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.

  6. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  7. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  8. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  9. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  10. RoboDreamer: Learning Compositional World Models for Robot Imagination

    cs.RO 2024-04 unverdicted novelty 7.0

    RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.

  11. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  12. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  13. Safety, Security, and Cognitive Risks in World Models

    cs.CR 2026-04 unverdicted novelty 6.0

    World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...

  14. Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

    cs.RO 2026-03 unverdicted novelty 6.0

    SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.

  15. Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    cs.RO 2025-10 unverdicted novelty 6.0

    A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.

  16. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    cs.CV 2025-03 unverdicted novelty 6.0

    CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

  17. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  18. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  19. Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

    cs.RO 2026-04 unverdicted novelty 5.0

    Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...

  20. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

  21. Designing Digital Humans with Ambient Intelligence

    cs.HC 2026-04 unverdicted novelty 5.0

    Integrating ambient intelligence with digital humans creates context-aware virtual agents capable of anticipatory assistance based on the user's surroundings.

  22. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  23. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

276 extracted references · 276 canonical work pages · cited by 20 Pith papers · 46 internal anchors

  1. [1]

    A separation principle for control in the age of deep learning

    Alessandro Achille and Stefano Soatto. A separation principle for control in the age of deep learning. Annual Review of Control, Robotics, and Autonomous Systems, 1: 287--307, 2018

  2. [2]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3674--3683, 2018

  3. [3]

    Hindsight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems. 2017

  4. [4]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  5. [5]

    Dynamic Programming and Optimal Control

    Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995

  6. [8]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  7. [9]

    Scalable methods for computing state similarity in deterministic markov decision processes

    Pablo Samuel Castro. Scalable methods for computing state similarity in deterministic markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 10069--10076, 2020

  8. [14]

    Animating pictures with stochastic motion textures

    Yung-Yu Chuang, Dan B Goldman, Ke Colin Zheng, Brian Curless, David H Salesin, and Richard Szeliski. Animating pictures with stochastic motion textures. In ACM SIGGRAPH 2005 Papers, pp. 853--860. 2005

  9. [15]

    3d u-net: learning dense volumetric segmentation from sparse annotation

    Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention--MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pp. 424--432. Sprin...

  10. [16]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pp. 720--736, 2018

  11. [18]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  12. [20]

    Guiding pretraining in reinforcement learning with large language models

    Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models. arXiv preprint arXiv:2302.06692, 2023b

  13. [23]

    Metrics for finite markov decision processes

    Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite markov decision processes. In UAI, volume 4, pp. 162--169, 2004

  14. [24]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842--5850, 2017

  15. [25]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995--19012, 2022

  16. [28]

    Controllable video generation with sparse trajectories

    Zekun Hao, Xun Huang, and Serge Belongie. Controllable video generation with sparse trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7854--7863, 2018

  17. [30]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020

  18. [32]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022b

  19. [33]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp. 4904--4916. PMLR, 2021

  20. [34]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp. 706--715, 2017

  21. [35]

    State representation learning for control: An overview

    Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. Neural Networks, 108: 379--392, 2018

  22. [36]

    Generative image dynamics

    Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. Generative image dynamics, 2023

  23. [37]

    Modeling of dynamic systems

    Lennart Ljung and Torkel Glad. Modeling of dynamic systems. Prentice-Hall, Inc., 1994

  24. [39]

    Interactive language: Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

  25. [40]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pp. 879--893. PMLR, 2018

  26. [42]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2630--2640, 2019

  27. [43]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529--533, 2015

  28. [44]

    Spoken moments: Learning joint audio-visual representations from video descriptions

    Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken moments: Learning joint audio-visual representations from video descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14871--14881, 2021

  29. [45]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report, 2023

  30. [46]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140): 1--67, 2020

  31. [47]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In Thirty-fifth Conference on Neural Informa...

  32. [48]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  33. [49]

    Hindsight policy gradients

    Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. Hindsight policy gradients. In International Conference on Learning Representations, 2019

  34. [50]

    Sim-to-real robot learning from pixels with progressive nets

    Andrei A Rusu, Matej Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. Sim-to-real robot learning from pixels with progressive nets. In Conference on robot learning, pp. 262--270. PMLR, 2017

  35. [51]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9339--9347, 2019

  36. [53]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35: 25278--25294, 2022

  37. [54]

    Reinforcement learning with action-free pre-training from videos

    Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre-training from videos. In International Conference on Machine Learning, pp. 19561--19579. PMLR, 2022

  38. [55]

    Animating arbitrary objects via deep motion transfer

    Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2377--2386, 2019

  39. [56]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

  40. [58]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256--2265. PMLR, 2015

  41. [59]

    Learning to predict by the methods of temporal differences

    Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988

  42. [60]

    Dyna, an integrated architecture for learning, planning, and reacting

    Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4): 160--163, 1991

  43. [62]

    Ul2: Unifying language learning paradigms

    Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. Ul2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2022

  44. [63]

    End-to-end dense video captioning with parallel decoding

    Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6847--6857, 2021

  45. [64]

    Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4581--4591, 2019

  46. [66]

    Photo wake-up: 3d character animation from a single photo

    Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. Photo wake-up: 3d character animation from a single photo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5908--5917, 2019

  47. [67]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8: 229--256, 1992

  48. [70]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288--5296, 2016

  49. [71]

    Visual dynamics: Stochastic future generation via layered cross convolutional networks

    Tianfan Xue, Jiajun Wu, Katherine L Bouman, and William T Freeman. Visual dynamics: Stochastic future generation via layered cross convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 41(9): 2236--2250, 2018

  50. [72]

    Dichotomy of control: Separating what you can control from what you cannot

    Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, and Ofir Nachum. Dichotomy of control: Separating what you can control from what you cannot. arXiv preprint arXiv:2210.13435, 2022

  51. [74]

    Video probabilistic diffusion models in projected latent space

    Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18456--18466, 2023

  52. [75]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104--12113, 2022

  53. [76]

    Text-to-image diffusion model in generative ai: A survey

    Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion model in generative ai: A survey. arXiv preprint arXiv:2303.07909, 2023

  54. [78]

    Adaptive control of linear time-invariant systems

    Karl J. Åström and Björn Wittenmark. Adaptive control of linear time-invariant systems. Automatica, 9(6): 551--564, 1973

  55. [79]

    Scaling Learning Algorithms Towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  56. [80]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7): 1527--1554, 2006

  57. [81]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016

  58. [82]

    The Optimal Control of Partially Observable Markov Processes

    Edward J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971

  59. [83]

    The optimal control of partially observable Markov processes over a finite horizon

    Richard D. Smallwood and Edward J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5): 1071--1088, 1973

  60. [84]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256--2265. PMLR, 2015

  61. [85]

    Skill discovery in continuous reinforcement learning domains using skill chaining

    George Konidaris and Andrew Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, 2009

  62. [86]

    Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

    Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2): 181--211, 1999

  63. [87]

    Hierarchical task and motion planning in the now

    Leslie Pack Kaelbling and Tomás Lozano-Pérez. Hierarchical task and motion planning in the now. In 2011 IEEE International Conference on Robotics and Automation, 2011

  64. [88]

    Data-efficient hierarchical reinforcement learning

    Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018

  65. [89]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

  66. [90]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

  67. [91]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022

  68. [92]

    SRDiff: Single image super-resolution with diffusion probabilistic models

    Haoying Li, Yifan Yang, Meng Chang, et al. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 2022

  69. [93]

    RePaint: Inpainting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  70. [94]

    Denoising diffusion restoration models

    Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. arXiv preprint arXiv:2201.11793, 2022

  71. [95]

    MagicVideo: Efficient Video Generation With Latent Diffusion Models

    Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022

  72. [96]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022

  73. [97]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684--10695, 2022

  74. [98]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  75. [99]

    Adding Conditional Control to Text-to-Image Diffusion Models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

  76. [100]

    Blended diffusion for text-driven editing of natural images

    Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  77. [101]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report, 2023

  78. [102]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  79. [103]

    SDEdit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022

  80. [104]

    DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022
Showing first 80 references.