Learning Interactive Real-World Simulators
Recognition: 2 theorem links
Pith reviewed 2026-05-16 02:03 UTC · model grok-4.3
The pith
A generative model simulates real-world interactions from static image, robotics, and navigation datasets to enable zero-shot policy transfer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present UniSim, a generative model that learns to simulate realistic visual responses to actions in real-world scenes. By integrating abundant objects from image data, densely sampled actions from robotics data, and diverse movements from navigation data, the model generates plausible outcomes for instructions such as opening a drawer even when starting from otherwise static scenes. High-level vision-language policies and low-level reinforcement learning policies trained exclusively in this simulator transfer to real-world deployment with no additional real-world data.
What carries the argument
UniSim, the generative simulator that predicts action-conditioned visual changes by orchestrating complementary information across image, robotics, and navigation datasets.
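In code terms, the carrying mechanism is a history-conditioned next-observation predictor rolled out autoregressively. A minimal sketch of that interface, assuming a generic learned model; the `UniSimSketch` class and the toy integer "frames" below are illustrative stand-ins, not the paper's architecture:

```python
from collections import deque

class UniSimSketch:
    """Hypothetical action-in-video-out interface: predict the next
    observation from a finite history of frames plus one action."""

    def __init__(self, model, history_len=4):
        self.model = model              # any learned map (history, action) -> frame
        self.history_len = history_len  # finite conditioning window

    def rollout(self, first_frame, actions):
        """Autoregressively simulate a trajectory from a static scene."""
        history = deque([first_frame], maxlen=self.history_len)
        frames = [first_frame]
        for action in actions:
            nxt = self.model(list(history), action)  # one prediction step
            history.append(nxt)                      # feed prediction back in
            frames.append(nxt)
        return frames

# Toy stand-in: a "frame" is an int and an action shifts it.
toy_model = lambda hist, a: hist[-1] + a
sim = UniSimSketch(toy_model, history_len=2)
print(sim.rollout(0, [1, 1, -1]))  # [0, 1, 2, 1]
```

The point of the sketch is the closed loop: each predicted frame becomes part of the history for the next prediction, which is why dynamics fidelity compounds over a rollout.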
If this is right
- High-level vision-language policies trained purely in simulation transfer to real-world execution without adaptation.
- Low-level reinforcement learning policies trained in simulation transfer directly to physical robots.
- Simulated experience generated by the model improves training of video captioning systems.
- The same simulator supports controllable generation of interactive content for games and movies.
- Embodied agents for both high-level planning and low-level control can be developed entirely in simulation before real-world use.
Where Pith is reading between the lines
- Extending the orchestration to include audio or tactile data could capture richer multi-modal dynamics.
- The method implies that dataset composition rather than explicit physics engines may be the dominant route to closing the sim-to-real gap.
- Scaling the approach to longer-horizon multi-step tasks would test whether current dataset coverage is enough for complex sequences.
- Policies trained this way could serve as initial seeds for continued real-world fine-tuning, reducing overall data needs.
Load-bearing premise
Careful orchestration of existing static image, robotics, and navigation datasets is sufficient to capture the full interactive dynamics needed for zero-shot real-world transfer.
What would settle it
Train a low-level policy in the simulator to perform a specific action such as drawer opening and observe whether it succeeds or fails when executed on the matching physical robot and scene.
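That decisive test can be phrased as a tiny protocol sketch. Everything here is a hypothetical stand-in: `train_in_sim`, the dictionary-based "simulator", and the stubbed "robot" callable replace the real training and deployment pipeline:

```python
def train_in_sim(simulator, task):
    """Stub trainer: keep the action the simulator scores best for the task.
    Stands in for full RL training inside the learned simulator."""
    best = max(simulator["actions"], key=lambda a: simulator["score"](a, task))
    return lambda obs: best                     # trivial stationary policy

def decisive_experiment(simulator, real_robot, task, n_trials=20):
    """Train purely in simulation, then measure zero-shot success on the
    matching physical setup (here a stubbed 'robot' callable)."""
    policy = train_in_sim(simulator, task)      # no real-world data used
    successes = sum(real_robot(policy, task) for _ in range(n_trials))
    return successes / n_trials                 # zero-shot success rate

# Toy instantiation: the 'robot' succeeds iff the policy picks 'pull'.
simulator = {"actions": ["push", "pull"],
             "score": lambda a, t: 1.0 if a == t["goal_action"] else 0.0}
robot = lambda policy, task: policy(None) == task["goal_action"]
task = {"name": "open drawer", "goal_action": "pull"}
print(decisive_experiment(simulator, robot, task))  # 1.0
```

The experiment settles the claim exactly when the simulator's scoring agrees with physical reality; a simulator that rewards the wrong action would train a policy that fails every hardware trial.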
read the original abstract
Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as "open the drawer" and low-level controls from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UniSim, a generative model trained via careful orchestration of static image, robotics trajectory, and navigation datasets to simulate visual outcomes of both high-level instructions and low-level controls. It claims this simulator enables training of vision-language policies and RL policies that transfer zero-shot to the real world, while also providing auxiliary benefits to video captioning models.
Significance. If the zero-shot transfer results hold under rigorous evaluation, the work would be significant for embodied AI by demonstrating that orchestrated static datasets can substitute for expensive interactive data collection in sim-to-real transfer. The dataset-mixing approach is a practical contribution that could scale with existing internet-scale corpora.
major comments (2)
- [Abstract] Abstract and Results: The central claim of successful zero-shot real-world deployment of both high-level vision-language policies and low-level RL policies is stated without any quantitative metrics, baseline comparisons, or failure-case analysis. This leaves the fidelity of the learned dynamics unverified and the transfer guarantee unsupported by evidence.
- [§3 and §4] §3 (Dataset Orchestration) and §4 (Policy Training): The assumption that mixing static datasets induces accurate multi-step causal dynamics and action effects is load-bearing for the zero-shot claim, yet no explicit long-horizon prediction tests or action-effect matching experiments are reported to validate extrapolation beyond the short-sequence source data.
minor comments (1)
- [§3] The manuscript would benefit from a dedicated subsection detailing the exact mixing weights, sampling strategy, and loss balancing used in the orchestration step, as these are free parameters that directly affect reproducibility.
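To make the reproducibility concern concrete: dataset orchestration reduces, at minimum, to a weighted sampling step whose mixing weights are free parameters. A minimal sketch assuming simple proportional sampling; the dataset names, contents, and weight values are invented for illustration:

```python
import random

def orchestrate(datasets, weights, n_samples, seed=0):
    """Draw a training stream by sampling sub-datasets in proportion
    to mixing weights -- the free parameters the report asks to document."""
    rng = random.Random(seed)               # seeded for reproducibility
    names = list(datasets)
    probs = [weights[n] for n in names]     # rng.choices normalizes weights
    stream = []
    for _ in range(n_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        stream.append((name, rng.choice(datasets[name])))
    return stream

# Invented example corpora mirroring the paper's three data dimensions.
datasets = {
    "image": ["static scene A", "static scene B"],   # abundant objects
    "robotics": ["pick traj", "place traj"],         # densely sampled actions
    "navigation": ["hallway run", "room tour"],      # diverse movements
}
weights = {"image": 0.5, "robotics": 0.3, "navigation": 0.2}
batch = orchestrate(datasets, weights, n_samples=6)
```

Even this toy version shows why the weights matter: changing them changes the composition of every training batch, so reporting them is a precondition for reproducing the learned dynamics.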
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and commit to revisions that provide additional quantitative support and validation experiments.
read point-by-point responses
-
Referee: [Abstract] Abstract and Results: The central claim of successful zero-shot real-world deployment of both high-level vision-language policies and low-level RL policies is stated without any quantitative metrics, baseline comparisons, or failure-case analysis. This leaves the fidelity of the learned dynamics unverified and the transfer guarantee unsupported by evidence.
Authors: We thank the referee for highlighting the need for clearer quantitative support in the abstract. The full manuscript reports quantitative success rates for zero-shot real-world policy deployment in Section 5, along with baseline comparisons and discussion of failure modes. To address this directly, we will revise the abstract to summarize these key metrics and evidence. revision: yes
-
Referee: [§3 and §4] §3 (Dataset Orchestration) and §4 (Policy Training): The assumption that mixing static datasets induces accurate multi-step causal dynamics and action effects is load-bearing for the zero-shot claim, yet no explicit long-horizon prediction tests or action-effect matching experiments are reported to validate extrapolation beyond the short-sequence source data.
Authors: We agree that explicit long-horizon validation would strengthen the paper. While the policy training results in Section 4 demonstrate effective multi-step real-world control (which depends on accurate causal dynamics), we did not include standalone long-horizon prediction benchmarks. We will add dedicated long-horizon prediction accuracy tests and action-effect matching experiments on extended sequences in the revised manuscript. revision: yes
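The promised validation can be sketched as an autoregressive rollout compared against ground truth at each horizon step. In this toy version frames are scalars and the model carries a small per-step bias, so the compounding of error over the horizon is visible; all names and numbers are illustrative, not the paper's benchmark:

```python
def long_horizon_error(model, trajectory, actions, history_len=2):
    """Roll the model out autoregressively and record the absolute
    prediction error at each horizon step against ground truth."""
    history = list(trajectory[:history_len])         # seed with true frames
    errors = []
    for t, action in enumerate(actions):
        pred = model(history, action)
        truth = trajectory[history_len + t]
        errors.append(abs(pred - truth))
        history = (history + [pred])[-history_len:]  # feed back the prediction
    return errors

# Toy model with a small per-step bias: errors compound over the horizon.
model = lambda hist, a: hist[-1] + a + 0.1
traj = [0, 1, 2, 3, 4]                 # ground truth under unit actions
errs = long_horizon_error(model, traj, actions=[1, 1, 1])
print([round(e, 2) for e in errs])     # [0.1, 0.2, 0.3]
```

A flat error curve under this protocol would support the zero-shot claim; a growing one would quantify how far the simulator can be trusted before drift dominates.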
Axiom & Free-Parameter Ledger
free parameters (1)
- dataset mixing weights
axioms (1)
- domain assumption: Diverse static and action datasets can be orchestrated to simulate full real-world interaction dynamics
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
We formulate the action-in-video-out framework as an observation prediction model conditioned on finite history and parametrized by a video diffusion model... rolled out autoregressively
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
careful orchestration of diverse datasets... simulate the visual outcome of both high-level instructions and low-level controls
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
Latent State Design for World Models under Sufficiency Constraints
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
-
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.
-
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
RoboDreamer: Learning Compositional World Models for Robot Imagination
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
-
Hierarchical Planning with Latent World Models
Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
-
Safety, Security, and Cognitive Risks in World Models
World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...
-
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.
-
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.
-
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
Designing Digital Humans with Ambient Intelligence
Integrating ambient intelligence with digital humans creates context-aware virtual agents capable of anticipatory assistance based on the user's surroundings.
-
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Reference graph
Works this paper leans on
-
[1]
A separation principle for control in the age of deep learning
Alessandro Achille and Stefano Soatto. A separation principle for control in the age of deep learning. Annual Review of Control, Robotics, and Autonomous Systems, 1: 287--307, 2018
work page 2018
-
[2]
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3674--3683, 2018
work page 2018
-
[3]
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems. 2017
work page 2017
-
[4]
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023
work page · Pith review · arXiv 2023
- [5]
-
[8]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
work page · Pith review · arXiv 2023
-
[9]
Scalable methods for computing state similarity in deterministic markov decision processes
Pablo Samuel Castro. Scalable methods for computing state similarity in deterministic markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.\ 10069--10076, 2020
work page 2020
-
[14]
Animating pictures with stochastic motion textures
Yung-Yu Chuang, Dan B Goldman, Ke Colin Zheng, Brian Curless, David H Salesin, and Richard Szeliski. Animating pictures with stochastic motion textures. In ACM SIGGRAPH 2005 Papers, pp.\ 853--860. 2005
work page 2005
-
[15]
3d u-net: learning dense volumetric segmentation from sparse annotation
Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention--MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pp. 424--432. Sprin...
work page 2016
-
[16]
Scaling egocentric vision: The epic-kitchens dataset
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pp.\ 720--736, 2018
work page 2018
-
[18]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
work page · Pith review · arXiv 2023
-
[20]
Guiding pretraining in reinforcement learning with large language models
Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models. arXiv preprint arXiv:2302.06692, 2023b
-
[23]
Metrics for finite markov decision processes
Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite markov decision processes. In UAI, volume 4, pp.\ 162--169, 2004
work page 2004
-
[24]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842--5850, 2017
work page 2017
-
[25]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18995--19012, 2022
work page 2022
-
[28]
Controllable video generation with sparse trajectories
Zekun Hao, Xun Huang, and Serge Belongie. Controllable video generation with sparse trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 7854--7863, 2018
work page 2018
-
[30]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020
work page 2020
-
[32]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022b
work page · Pith review · arXiv 2022
-
[33]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021
work page 2021
-
[34]
Dense-captioning events in videos
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp.\ 706--715, 2017
work page 2017
-
[35]
State representation learning for control: An overview
Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. Neural Networks, 108: 379--392, 2018
work page 2018
-
[36]
Generative image dynamics, 2023
Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. Generative image dynamics, 2023
work page 2023
-
[37]
Lennart Ljung and Torkel Glad. Modeling of dynamic systems. Prentice-Hall, Inc., 1994
work page 1994
-
[39]
Interactive language: Talking to robots in real time
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023
work page 2023
-
[40]
Roboturk: A crowdsourcing platform for robotic skill learning through imitation
Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pp.\ 879--893. PMLR, 2018
work page 2018
-
[42]
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 2630--2640, 2019
work page 2019
-
[43]
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529--533, 2015
work page 2015
-
[44]
Spoken moments: Learning joint audio-visual representations from video descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken moments: Learning joint audio-visual representations from video descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 14871--14881, 2021
work page 2021
- [45]
-
[46]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140): 1--67, 2020
work page 2020
-
[47]
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In Thirty-fifth Conference on Neural Informa...
work page · Pith review · arXiv 2021
-
[48]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022
work page · Pith review · arXiv 2022
-
[49]
Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. Hindsight policy gradients. In International Conference on Learning Representations, 2019
work page 2019
-
[50]
Sim-to-real robot learning from pixels with progressive nets
Andrei A Rusu, Matej Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. Sim-to-real robot learning from pixels with progressive nets. In Conference on robot learning, pp. 262--270. PMLR, 2017
work page 2017
-
[51]
Habitat: A platform for embodied ai research
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 9339--9347, 2019
work page 2019
-
[53]
Laion-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35: 25278--25294, 2022
work page 2022
-
[54]
Reinforcement learning with action-free pre-training from videos
Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre-training from videos. In International Conference on Machine Learning, pp.\ 19561--19579. PMLR, 2022
work page 2022
-
[55]
Animating arbitrary objects via deep motion transfer
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2377--2386, 2019
work page 2019
-
[56]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017
work page · Pith review · arXiv
-
[58]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp.\ 2256--2265. PMLR, 2015
work page 2015
-
[59]
Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988
work page 1988
-
[60]
Dyna, an integrated architecture for learning, planning, and reacting
Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4): 160--163, 1991
work page 1991
-
[62]
Ul2: Unifying language learning paradigms
Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. Ul2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2022
work page 2022
-
[63]
End-to-end dense video captioning with parallel decoding
Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 6847--6857, 2021
work page 2021
-
[64]
Vatex: A large-scale, high-quality multilingual dataset for video-and-language research
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 4581--4591, 2019
work page 2019
-
[66]
Photo wake-up: 3d character animation from a single photo
Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. Photo wake-up: 3d character animation from a single photo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 5908--5917, 2019
work page 2019
-
[67]
Simple statistical gradient-following algorithms for connectionist reinforcement learning
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8: 229--256, 1992
work page 1992
-
[70]
Msr-vtt: A large video description dataset for bridging video and language
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 5288--5296, 2016
work page 2016
-
[71]
Visual dynamics: Stochastic future generation via layered cross convolutional networks
Tianfan Xue, Jiajun Wu, Katherine L Bouman, and William T Freeman. Visual dynamics: Stochastic future generation via layered cross convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 41(9): 2236--2250, 2018
work page 2018
-
[72]
Dichotomy of control: Separating what you can control from what you cannot
Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, and Ofir Nachum. Dichotomy of control: Separating what you can control from what you cannot. arXiv preprint arXiv:2210.13435, 2022
-
[74]
Video probabilistic diffusion models in projected latent space
Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18456--18466, 2023
work page 2023
-
[75]
Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 12104--12113, 2022
work page 2022
-
[76]
Text-to-image diffusion model in generative ai: A survey
Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion model in generative ai: A survey. arXiv preprint arXiv:2303.07909, 2023
-
[78]
Karl J. Åström and Björn Wittenmark. Adaptive control of linear time-invariant systems. Automatica, 9(6):551–564, 1973
-
[79]
Scaling learning algorithms towards AI
Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007
-
[80]
A fast learning algorithm for deep belief nets
Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006
- [81]
-
[82]
Edward J. Sondik. The optimal control of partially observable Markov processes. PhD thesis, Stanford University, 1971
-
[83]
Richard D. Smallwood and Edward J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088, 1973
-
[84]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015
-
[85]
Skill discovery in continuous reinforcement learning domains using skill chaining
George Konidaris and Andrew Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems, 2009
-
[86]
Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999
-
[87]
Hierarchical task and motion planning in the now
Leslie Pack Kaelbling and Tomás Lozano-Pérez. Hierarchical task and motion planning in the now. In 2011 IEEE International Conference on Robotics and Automation, 2011
-
[88]
Data-efficient hierarchical reinforcement learning
Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018
-
[89]
Matterport3D: Learning from RGB-D Data in Indoor Environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017
-
[90]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Yogesh Balaji et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022
-
[91]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal et al. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022
-
[92]
Haoying Li et al. SRDiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 2022
-
[93]
RePaint: Inpainting using denoising diffusion probabilistic models
Andreas Lugmayr et al. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
-
[94]
Denoising diffusion restoration models
Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. arXiv preprint arXiv:2201.11793, 2022
-
[95]
MagicVideo: Efficient Video Generation With Latent Diffusion Models
Daquan Zhou et al. MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022
-
[96]
Prompt-to-Prompt Image Editing with Cross Attention Control
Amir Hertz et al. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022
-
[97]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
-
[98]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol et al. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021
-
[99]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023
-
[100]
Blended diffusion for text-driven editing of natural images
Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
- [101]
-
[102]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
-
[103]
SDEdit: Guided image synthesis and editing with stochastic differential equations
Chenlin Meng et al. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022
-
[104]
DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz et al. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022