pith. machine review for the scientific record.

arxiv: 2310.06114 · v3 · submitted 2023-10-09 · 💻 cs.AI

Recognition: 2 Lean theorem links

Learning Interactive Real-World Simulators

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:03 UTC · model grok-4.3

classification 💻 cs.AI
keywords: generative modeling · real-world simulator · sim-to-real transfer · embodied AI · vision-language policies · reinforcement learning · zero-shot deployment · interactive simulation

The pith

A generative model simulates real-world interactions from static image, robotics, and navigation datasets to enable zero-shot policy transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that careful orchestration of complementary static datasets can train a single generative model to predict realistic visual outcomes of high-level instructions and low-level controls. This universal simulator supports end-to-end training of vision-language and reinforcement learning policies entirely inside simulation. Policies trained this way deploy directly in the physical world without further adaptation. The approach also improves auxiliary tasks such as video captioning by supplying additional simulated experience.
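
To make the two-phase recipe concrete, here is a minimal runnable sketch of the shape of the claim; the 1-D linear dynamics, the least-squares "simulator," and the hand-written policy are illustrative stand-ins, not the paper's video-model architecture.

    import random

    # Hedged toy of the two-phase recipe: fit a simulator from logged static
    # data, then evaluate a policy that only ever used the simulator. The real
    # UniSim is an action-conditioned video model; here the "scene" is one
    # 1-D coordinate and the "simulator" is a least-squares dynamics estimate.
    random.seed(0)

    true_step = lambda s, a: s + 0.9 * a                  # unknown real dynamics

    # Phase 0: static logged transitions (state, action, next state) with noise.
    data = [(s, a, true_step(s, a) + random.gauss(0, 0.05))
            for s in (random.uniform(-1, 1) for _ in range(200))
            for a in (-1.0, 1.0)]

    # Phase 1: fit the simulator -- least-squares estimate of the action gain.
    gain = sum((s2 - s) * a for s, a, s2 in data) / sum(a * a for _, a, _ in data)
    sim_step = lambda s, a: s + gain * a                  # the learned simulator

    # Phase 2: a policy developed purely against sim_step (here: greedy to 0).
    policy = lambda s: -1.0 if s > 0 else 1.0

    def final_error(step_fn, s=5.0, horizon=8):
        for _ in range(horizon):
            s = step_fn(s, policy(s))
        return abs(s)

    print("error inside simulator:", round(final_error(sim_step), 3))
    print("error on real dynamics:", round(final_error(true_step), 3))  # zero-shot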

Core claim

We present UniSim, a generative model that learns to simulate realistic visual responses to actions in real-world scenes. By integrating abundant objects from image data, densely sampled actions from robotics data, and diverse movements from navigation data, the model generates plausible outcomes for instructions such as opening a drawer even when starting from otherwise static scenes. High-level vision-language policies and low-level reinforcement learning policies trained exclusively in this simulator transfer to real-world deployment with no additional real-world data.

What carries the argument

UniSim, the generative simulator that predicts action-conditioned visual changes by orchestrating complementary information across image, robotics, and navigation datasets.
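
Read formally (our notation, not the paper's), the carrying claim is that a single conditional next-observation model can be fit across the pooled sources:

    p_\theta(o_{t+1} \mid o_t, a_t), \qquad
    \hat{\theta} = \arg\max_{\theta} \sum_{D \in \{D_{\mathrm{image}},\, D_{\mathrm{robot}},\, D_{\mathrm{nav}}\}} w_D \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim D} \left[ \log p_\theta(o_{t+1} \mid o_t, a_t) \right]

where a_t ranges over both text instructions and low-level controls, and the mixing weights w_D are the free parameters flagged in the ledger below.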

If this is right

  • High-level vision-language policies trained purely in simulation transfer to real-world execution without adaptation.
  • Low-level reinforcement learning policies trained in simulation transfer directly to physical robots.
  • Simulated experience generated by the model improves training of video captioning systems.
  • The same simulator supports controllable generation of interactive content for games and movies.
  • Embodied agents for both high-level planning and low-level control can be developed entirely in simulation before real-world use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Extending the orchestration to include audio or tactile data could capture richer multi-modal dynamics.
  • The method implies that dataset composition rather than explicit physics engines may be the dominant route to closing the sim-to-real gap.
  • Scaling the approach to longer-horizon multi-step tasks would test whether current dataset coverage is enough for complex sequences.
  • Policies trained this way could serve as initial seeds for continued real-world fine-tuning, reducing overall data needs.

Load-bearing premise

Careful orchestration of existing static image, robotics, and navigation datasets is sufficient to capture the full interactive dynamics needed for zero-shot real-world transfer.

What would settle it

Train a low-level policy in the simulator to perform a specific action such as drawer opening and observe whether it succeeds or fails when executed on the matching physical robot and scene.
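
A hedged sketch of that settling experiment as an evaluation harness; the toy DrawerEnv, the friction gap between "sim" and "real," and the fixed pull policy are all invented stand-ins for robot and model interfaces the page does not expose.

    import random

    # Hedged toy of the settling experiment: score the same policy inside the
    # learned simulator and on the "real" drawer, whose friction differs (the
    # sim-to-real gap). On hardware, real.run would command the physical robot.
    random.seed(1)

    class DrawerEnv:
        """State is drawer openness in [0, 1]; success means opened past 0.95."""
        def __init__(self, friction, noise=0.0):
            self.friction, self.noise = friction, noise

        def run(self, policy, steps=30):
            x = 0.0
            for _ in range(steps):
                pull = policy(x) * (1.0 - self.friction)
                x = min(1.0, max(0.0, x + pull + random.gauss(0, self.noise)))
            return x > 0.95                               # did the drawer open?

    sim = DrawerEnv(friction=0.10)                        # learned simulator
    real = DrawerEnv(friction=0.18, noise=0.02)           # physical robot + scene

    policy = lambda x: 0.05                               # steady pull tuned in sim

    trials = 100
    print("success in simulation:", sum(sim.run(policy) for _ in range(trials)) / trials)
    print("success on real robot:", sum(real.run(policy) for _ in range(trials)) / trials)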

read the original abstract

Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as "open the drawer" and low-level controls from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces UniSim, a generative model trained via careful orchestration of static image, robotics trajectory, and navigation datasets to simulate visual outcomes of both high-level instructions and low-level controls. It claims this simulator enables training of vision-language policies and RL policies that transfer zero-shot to the real world, while also providing auxiliary benefits to video captioning models.

Significance. If the zero-shot transfer results hold under rigorous evaluation, the work would be significant for embodied AI by demonstrating that orchestrated static datasets can substitute for expensive interactive data collection in sim-to-real transfer. The dataset-mixing approach is a practical contribution that could scale with existing internet-scale corpora.

major comments (2)
  1. [Abstract] Abstract and Results: The central claim of successful zero-shot real-world deployment of both high-level vision-language policies and low-level RL policies is stated without any quantitative metrics, baseline comparisons, or failure-case analysis. This leaves the fidelity of the learned dynamics unverified and the transfer guarantee unsupported by evidence.
  2. [§3 and §4] §3 (Dataset Orchestration) and §4 (Policy Training): The assumption that mixing static datasets induces accurate multi-step causal dynamics and action effects is load-bearing for the zero-shot claim, yet no explicit long-horizon prediction tests or action-effect matching experiments are reported to validate extrapolation beyond the short-sequence source data.
minor comments (1)
  1. [§3] The manuscript would benefit from a dedicated subsection detailing the exact mixing weights, sampling strategy, and loss balancing used in the orchestration step, as these are free parameters that directly affect reproducibility.
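
To illustrate why those free parameters matter for reproducibility, here is a minimal sketch of one plausible orchestration scheme, weighted source sampling; the weights and source labels are invented for illustration, not values from the paper.

    import random
    from collections import Counter

    # Hedged sketch of orchestration as weighted sampling over heterogeneous
    # sources. The weights are invented free parameters, not values from the
    # paper; shifting them changes what the simulator sees during training,
    # which is exactly the reproducibility concern raised above.
    random.seed(0)

    SOURCE_WEIGHTS = {
        "image (abundant objects)": 0.5,
        "robotics (dense actions)": 0.3,
        "navigation (diverse movements)": 0.2,
    }

    def sample_source():
        names, weights = zip(*SOURCE_WEIGHTS.items())
        return random.choices(names, weights=weights, k=1)[0]

    # One simulated epoch of mixing: the realized batch composition.
    counts = Counter(sample_source() for _ in range(10_000))
    for name, n in counts.most_common():
        print(f"{name}: {n / 10_000:.2f}")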

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and commit to revisions that provide additional quantitative support and validation experiments.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Results: The central claim of successful zero-shot real-world deployment of both high-level vision-language policies and low-level RL policies is stated without any quantitative metrics, baseline comparisons, or failure-case analysis. This leaves the fidelity of the learned dynamics unverified and the transfer guarantee unsupported by evidence.

    Authors: We thank the referee for highlighting the need for clearer quantitative support in the abstract. The full manuscript reports quantitative success rates for zero-shot real-world policy deployment in Section 5, along with baseline comparisons and discussion of failure modes. To address this directly, we will revise the abstract to summarize these key metrics and evidence. revision: yes

  2. Referee: [§3 and §4] §3 (Dataset Orchestration) and §4 (Policy Training): The assumption that mixing static datasets induces accurate multi-step causal dynamics and action effects is load-bearing for the zero-shot claim, yet no explicit long-horizon prediction tests or action-effect matching experiments are reported to validate extrapolation beyond the short-sequence source data.

    Authors: We agree that explicit long-horizon validation would strengthen the paper. While the policy training results in Section 4 demonstrate effective multi-step real-world control (which depends on accurate causal dynamics), we did not include standalone long-horizon prediction benchmarks. We will add dedicated long-horizon prediction accuracy tests and action-effect matching experiments on extended sequences in the revised manuscript. revision: yes
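
As a sense of what such a long-horizon test would measure, here is a toy open-loop rollout check; the deliberately mis-estimated one-step model is an invented stand-in for the learned simulator.

    # Hedged toy of a long-horizon prediction test: roll a slightly wrong
    # one-step model forward open-loop against the true dynamics and watch the
    # error compound with horizon. The 0.02 one-step bias is invented; a real
    # benchmark would compare generated video rollouts to held-out trajectories.
    def true_step(s):
        return 0.95 * s + 0.5

    def model_step(s):
        return 0.93 * s + 0.5                             # small one-step bias

    def open_loop_error(horizon, s0=10.0):
        s_true = s_pred = s0
        for _ in range(horizon):
            s_true, s_pred = true_step(s_true), model_step(s_pred)
        return abs(s_true - s_pred)

    for h in (1, 5, 20, 80):
        print(f"horizon {h:3d}: open-loop error {open_loop_error(h):.3f}")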

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that heterogeneous datasets can be combined to approximate full interactive experience, plus standard generative modeling assumptions; no new entities are postulated and free parameters are the usual training hyperparameters.

free parameters (1)
  • dataset mixing weights
    Relative sampling rates or loss weights across image, robotics, and navigation sources are chosen to produce the reported simulator behavior.
axioms (1)
  • domain assumption: Diverse static and action datasets can be orchestrated to simulate full real-world interaction dynamics
    Invoked when the authors state that careful orchestration enables simulation of both high-level instructions and low-level controls from otherwise static scenes.

pith-pipeline@v0.9.0 · 5563 in / 1207 out tokens · 55000 ms · 2026-05-16T02:03:50.905362+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  3. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  4. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.

  5. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.

  6. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  7. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  8. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  9. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  10. RoboDreamer: Learning Compositional World Models for Robot Imagination

    cs.RO 2024-04 unverdicted novelty 7.0

    RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.

  11. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  12. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  13. Safety, Security, and Cognitive Risks in World Models

    cs.CR 2026-04 unverdicted novelty 6.0

    World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...

  14. Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

    cs.RO 2026-03 unverdicted novelty 6.0

    SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.

  15. Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    cs.RO 2025-10 unverdicted novelty 6.0

    A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.

  16. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    cs.CV 2025-03 unverdicted novelty 6.0

    CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

  17. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  18. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  19. Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

    cs.RO 2026-04 unverdicted novelty 5.0

    Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...

  20. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

  21. Designing Digital Humans with Ambient Intelligence

    cs.HC 2026-04 unverdicted novelty 5.0

    Integrating ambient intelligence with digital humans creates context-aware virtual agents capable of anticipatory assistance based on the user's surroundings.

  22. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  23. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

276 extracted references · 276 canonical work pages · cited by 20 Pith papers · 46 internal anchors

  1. [1]

    A separation principle for control in the age of deep learning

    Alessandro Achille and Stefano Soatto. A separation principle for control in the age of deep learning. Annual Review of Control, Robotics, and Autonomous Systems, 1: 287--307, 2018

  2. [2]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3674--3683, 2018

  3. [3]

    Hindsight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems. 2017

  4. [4]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  5. [5]

    Dynamic Programming and Optimal Control

    Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995

  6. [8]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  7. [9]

    Scalable methods for computing state similarity in deterministic markov decision processes

    Pablo Samuel Castro. Scalable methods for computing state similarity in deterministic markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 10069--10076, 2020

  8. [14]

    Animating pictures with stochastic motion textures

    Yung-Yu Chuang, Dan B Goldman, Ke Colin Zheng, Brian Curless, David H Salesin, and Richard Szeliski. Animating pictures with stochastic motion textures. In ACM SIGGRAPH 2005 Papers, pp. 853--860. 2005

  9. [15]

    3d u-net: learning dense volumetric segmentation from sparse annotation

    Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention--MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pp. 424--432. Sprin...

  10. [16]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pp. 720--736, 2018

  11. [18]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  12. [20]

    Guiding pretraining in reinforcement learning with large language models

    Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models. arXiv preprint arXiv:2302.06692, 2023b

  13. [23]

    Metrics for finite markov decision processes

    Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite markov decision processes. In UAI, volume 4, pp. 162--169, 2004

  14. [24]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842--5850, 2017

  15. [25]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995--19012, 2022

  16. [28]

    Controllable video generation with sparse trajectories

    Zekun Hao, Xun Huang, and Serge Belongie. Controllable video generation with sparse trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7854--7863, 2018

  17. [30]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020

  18. [32]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022b

  19. [33]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp. 4904--4916. PMLR, 2021

  20. [34]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp. 706--715, 2017

  21. [35]

    State representation learning for control: An overview

    Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. Neural Networks, 108: 379--392, 2018

  22. [36]

    Generative image dynamics

    Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. Generative image dynamics, 2023

  23. [37]

    Modeling of dynamic systems

    Lennart Ljung and Torkel Glad. Modeling of dynamic systems. Prentice-Hall, Inc., 1994

  24. [39]

    Interactive language: Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

  25. [40]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pp. 879--893. PMLR, 2018

  26. [42]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2630--2640, 2019

  27. [43]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529--533, 2015

  28. [44]

    Spoken moments: Learning joint audio-visual representations from video descriptions

    Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken moments: Learning joint audio-visual representations from video descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14871--14881, 2021

  29. [45]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report, 2023

  30. [46]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140): 1--67, 2020

  31. [47]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In Thirty-fifth Conference on Neural Informa...

  32. [48]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  33. [49]

    Hindsight policy gradients

    Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. Hindsight policy gradients. In International Conference on Learning Representations, 2019

  34. [50]

    Sim-to-real robot learning from pixels with progressive nets

    Andrei A Rusu, Matej Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. Sim-to-real robot learning from pixels with progressive nets. In Conference on robot learning, pp. 262--270. PMLR, 2017

  35. [51]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9339--9347, 2019

  36. [53]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35: 25278--25294, 2022

  37. [54]

    Reinforcement learning with action-free pre-training from videos

    Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre-training from videos. In International Conference on Machine Learning, pp. 19561--19579. PMLR, 2022

  38. [55]

    Animating arbitrary objects via deep motion transfer

    Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2377--2386, 2019

  39. [56]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

  40. [58]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256--2265. PMLR, 2015

  41. [59]

    Learning to predict by the methods of temporal differences

    Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988

  42. [60]

    Dyna, an integrated architecture for learning, planning, and reacting

    Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4): 160--163, 1991

  43. [62]

    Ul2: Unifying language learning paradigms

    Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. Ul2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2022

  44. [63]

    End-to-end dense video captioning with parallel decoding

    Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6847--6857, 2021

  45. [64]

    Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4581--4591, 2019

  46. [66]

    Photo wake-up: 3d character animation from a single photo

    Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. Photo wake-up: 3d character animation from a single photo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5908--5917, 2019

  47. [67]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8: 229--256, 1992

  48. [70]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288--5296, 2016

  49. [71]

    Visual dynamics: Stochastic future generation via layered cross convolutional networks

    Tianfan Xue, Jiajun Wu, Katherine L Bouman, and William T Freeman. Visual dynamics: Stochastic future generation via layered cross convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 41(9): 2236--2250, 2018

  50. [72]

    Dichotomy of control: Separating what you can control from what you cannot

    Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, and Ofir Nachum. Dichotomy of control: Separating what you can control from what you cannot. arXiv preprint arXiv:2210.13435, 2022

  51. [74]

    Video probabilistic diffusion models in projected latent space

    Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18456--18466, 2023

  52. [75]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104--12113, 2022

  53. [76]

    Text-to-image diffusion model in generative ai: A survey

    Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion model in generative ai: A survey. arXiv preprint arXiv:2303.07909, 2023

  54. [78]

    Adaptive control of linear time-invariant systems

    Karl J. Åström and Björn Wittenmark. Adaptive control of linear time-invariant systems. Automatica, 9(6): 551--564, 1973

  55. [79]

    Scaling Learning Algorithms Towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  56. [80]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7): 1527--1554, 2006

  57. [81]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016

  58. [82]

    The Optimal Control of Partially Observable Markov Processes

    Edward J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971

  59. [83]

    The optimal control of partially observable Markov processes over a finite horizon

    Richard D. Smallwood and Edward J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5): 1071--1088, 1973

  60. [84]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256--2265. PMLR, 2015

  61. [85]

    Skill discovery in continuous reinforcement learning domains using skill chaining

    George Konidaris and Andrew Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, 2009

  62. [86]

    Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

    Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2): 181--211, 1999

  63. [87]

    Hierarchical task and motion planning in the now

    Leslie Pack Kaelbling and Tomás Lozano-Pérez. Hierarchical task and motion planning in the now. In 2011 IEEE International Conference on Robotics and Automation, 2011

  64. [88]

    Data-efficient hierarchical reinforcement learning

    Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018

  65. [89]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

  66. [90]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

  67. [91]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022

  68. [92]

    SRDiff: Single image super-resolution with diffusion probabilistic models

    Haoying Li, Yifan Yang, Meng Chang, et al. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 2022

  69. [93]

    RePaint: Inpainting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  70. [94]

    Denoising diffusion restoration models

    Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. arXiv preprint arXiv:2201.11793, 2022

  71. [95]

    MagicVideo: Efficient Video Generation With Latent Diffusion Models

    Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022

  72. [96]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022

  73. [97]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684--10695, 2022

  74. [98]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  75. [99]

    Adding Conditional Control to Text-to-Image Diffusion Models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

  76. [100]

    Blended diffusion for text-driven editing of natural images

    Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  77. [101]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report, 2023

  78. [102]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  79. [103]

    SDEdit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022

  80. [104]

    DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022
Showing first 80 references.