Training Agents Inside of Scalable World Models
Pith reviewed 2026-05-15 02:01 UTC · model grok-4.3
The pith
Dreamer 4 obtains diamonds in Minecraft by training reinforcement learning behaviors inside a world model learned from offline videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dreamer 4 learns to solve control tasks by reinforcement learning inside a fast and accurate world model. In Minecraft the model predicts object interactions and game mechanics over long horizons, outperforming prior world models by a large margin. It achieves real-time inference on a single GPU via a shortcut forcing objective and an efficient transformer architecture, and it extracts most of its knowledge from diverse unlabeled videos through general action conditioning learned from a small amount of action-labeled data. The result is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction.
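To make the recipe concrete, the sketch below shows the basic shape of imagination training: roll a frozen world model forward in latent space under the current policy and update the policy on imagined returns, with no environment steps. Everything here is a toy stand-in rather than the paper's method: the module sizes, the ToyWorldModel class, the REINFORCE-style update, and the mean-return baseline are illustrative placeholders, whereas Dreamer 4 uses a transformer world model trained with shortcut forcing and an actor-critic learned in imagination.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    LATENT, N_ACTIONS, HORIZON = 32, 8, 15  # toy sizes, far smaller than the real agent

    class ToyWorldModel(nn.Module):
        """Stand-in latent dynamics and reward head; Dreamer 4 instead uses an
        efficient transformer trained on video with a shortcut forcing objective."""
        def __init__(self):
            super().__init__()
            self.dynamics = nn.Sequential(
                nn.Linear(LATENT + N_ACTIONS, 128), nn.ELU(), nn.Linear(128, LATENT))
            self.reward = nn.Linear(LATENT, 1)

        def step(self, z, a_onehot):
            z_next = self.dynamics(torch.cat([z, a_onehot], dim=-1))
            return z_next, self.reward(z_next).squeeze(-1)

    policy = nn.Sequential(nn.Linear(LATENT, 128), nn.ELU(), nn.Linear(128, N_ACTIONS))
    world_model = ToyWorldModel()
    for p in world_model.parameters():      # the world model stays frozen during imagination
        p.requires_grad_(False)
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

    for _ in range(100):                    # imagination training: zero environment interaction
        z = torch.randn(64, LATENT)         # stand-in for latents encoded from offline videos
        log_probs, rewards = [], []
        for _ in range(HORIZON):
            dist = torch.distributions.Categorical(logits=policy(z))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            z, r = world_model.step(z, F.one_hot(action, N_ACTIONS).float())
            rewards.append(r)
        returns = torch.stack(rewards).sum(0)
        advantage = returns - returns.mean()   # crude baseline; Dreamer agents learn a critic
        loss = -(torch.stack(log_probs).sum(0) * advantage).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

The point of the structure is that the inner loop touches only the learned model: the cost of gathering experience is decoupled from the real environment, which is what makes the offline diamond setting possible in the first place.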
What carries the argument
A scalable world model with a shortcut forcing objective and an efficient transformer architecture, enabling accurate long-horizon prediction and imagination-based reinforcement learning.
If this is right
- Complex tasks requiring thousands of actions become solvable from raw pixels using only imagined experience.
- The majority of knowledge can be extracted from diverse unlabeled videos rather than task-specific interaction.
- Real-time interactive inference on a single GPU becomes feasible for world-model-based agents.
- Imagination training offers a scalable recipe that aligns with safety constraints in robotics and similar domains.
Where Pith is reading between the lines
- The same world-model approach could let robots learn manipulation skills from passive video footage without risking hardware damage.
- Extending prediction horizons further might enable agents to plan toward open-ended goals beyond fixed tasks like diamond collection.
- If the model generalizes across game variants or real-world videos, it could reduce the need for any environment-specific data collection.
Load-bearing premise
The world model must accurately predict object interactions and game mechanics over the long action sequences required for the diamond task.
What would settle it
Deploy the trained agent in the actual Minecraft environment and observe whether it successfully obtains diamonds; failure despite high video prediction accuracy on held-out data would falsify the claim.
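As an illustration of that test, the sketch below computes a diamond success rate with a simple binomial error bar over evaluation seeds. The run_episode helper, the step budget, and the inventory check are hypothetical placeholders, not calls into MineRL, Malmo, or the paper's evaluation harness.

    import random

    def run_episode(policy, seed, max_steps=24_000):
        """Placeholder for a real rollout: step the Minecraft environment with the
        trained policy and return True if a diamond ever enters the inventory.
        The 24,000-step budget is a made-up cap above the ~20,000-action horizon."""
        random.seed(seed)
        return random.random() < 0.5        # stand-in outcome; a real check reads the inventory

    def diamond_success_rate(policy, n_seeds=20):
        outcomes = [run_episode(policy, seed=s) for s in range(n_seeds)]
        rate = sum(outcomes) / n_seeds
        stderr = (rate * (1.0 - rate) / n_seeds) ** 0.5   # binomial standard error
        return rate, stderr

    rate, stderr = diamond_success_rate(policy=None)      # policy unused by the placeholder
    print(f"diamond success: {rate:.2f} +/- {stderr:.2f} over 20 seeds")

High held-out video-prediction accuracy combined with a success rate indistinguishable from zero under this kind of test is exactly the falsifying outcome described above.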
Original abstract
World models learn general knowledge from videos and simulate experience for training behaviors in imagination, offering a path towards intelligent agents. However, previous world models have been unable to accurately predict object interactions in complex environments. We introduce Dreamer 4, a scalable agent that learns to solve control tasks by reinforcement learning inside of a fast and accurate world model. In the complex video game Minecraft, the world model accurately predicts object interactions and game mechanics, outperforming previous world models by a large margin. The world model achieves real-time interactive inference on a single GPU through a shortcut forcing objective and an efficient transformer architecture. Moreover, the world model learns general action conditioning from only a small amount of data, allowing it to extract the majority of its knowledge from diverse unlabeled videos. We propose the challenge of obtaining diamonds in Minecraft from only offline data, aligning with practical applications such as robotics where learning from environment interaction can be unsafe and slow. This task requires choosing sequences of over 20,000 mouse and keyboard actions from raw pixels. By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction. Our work provides a scalable recipe for imagination training, marking a step towards intelligent agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Dreamer 4, a scalable agent built around a world model with an efficient transformer architecture and a shortcut forcing objective. Trained primarily on unlabeled videos, the world model is claimed to deliver accurate long-horizon prediction of object interactions and game mechanics in Minecraft while supporting real-time inference on a single GPU. Policies are trained entirely via reinforcement learning inside this model, yielding the first reported agent to obtain diamonds from raw pixels using only offline data, over action sequences exceeding 20,000 steps and without any environment interaction.
Significance. If the long-horizon prediction accuracy and policy transfer claims hold under rigorous validation, the work would mark a meaningful advance in scalable imagination-based training for high-dimensional, long-horizon control. It supplies a concrete recipe for extracting general knowledge from diverse offline videos and demonstrates practical applicability to domains where real interaction is unsafe or expensive, such as robotics.
major comments (2)
- [§4 and §5] §4 (World Model Evaluation) and §5 (Diamond Task Results): No quantitative rollout metrics (pixel-level, object-level, or state-level prediction error) are reported for sequences whose length and complexity match the >20,000-action diamond horizon. This omission is load-bearing for the central claim that imagination-trained policies transfer successfully to the real environment.
- [§5.2] §5.2 (Offline Diamond Success): The headline result that Dreamer 4 is the first agent to obtain diamonds purely from offline data lacks error bars, ablation details on data exclusion criteria, and explicit success criteria (e.g., exact definition of 'obtaining a diamond'). This weakens confidence in the outperformance and novelty assertions.
minor comments (2)
- [Figure 3] Figure 3 and associated text: Direct numerical comparisons to prior world models (e.g., DreamerV3) should be tabulated with exact margins rather than described only qualitatively as 'large margin'.
- [§3.2] §3.2 (Shortcut Forcing): The objective is introduced without an explicit equation contrasting it to standard reconstruction or latent prediction losses, which would aid reproducibility.
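For orientation, one plausible shape of such an objective, extrapolated from the cited shortcut-model work [21] rather than taken from the manuscript, is a flow-matching loss plus a step-size-conditioned self-consistency term. Here x is a clean frame or latent, ε is noise, τ the noise level, s_θ the network, d the step size, and λ a weighting; all of this notation is illustrative, and the "forcing" aspect (per-frame noise levels in a causal sequence model, as in diffusion forcing [14]) is omitted:

    \mathcal{L}(\theta) \;=\;
      \mathbb{E}_{x,\varepsilon,\tau}
        \big\| s_\theta(x_\tau, \tau, 0) - (\varepsilon - x) \big\|^2
      \;+\; \lambda\, \mathbb{E}_{x,\varepsilon,\tau,d}
        \big\| s_\theta(x_\tau, \tau, 2d)
             - \mathrm{sg}\!\left[\tfrac{1}{2}\big(s_\theta(x_\tau, \tau, d)
             + s_\theta(x'_{\tau+d}, \tau+d, d)\big)\right] \big\|^2,

    \text{with}\quad x_\tau = (1-\tau)\,x + \tau\,\varepsilon,
    \qquad x'_{\tau+d} = x_\tau + d\, s_\theta(x_\tau, \tau, d).

Here sg[·] denotes stop-gradient; the d = 0 term recovers standard flow matching, while the self-consistency term trains the model to take large denoising jumps, which is the property the real-time inference claim rests on.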
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and describe the revisions we will make to improve the manuscript.
Point-by-point responses
-
Referee: [§4 and §5] §4 (World Model Evaluation) and §5 (Diamond Task Results): No quantitative rollout metrics (pixel-level, object-level, or state-level prediction error) are reported for sequences whose length and complexity match the >20,000-action diamond horizon. This omission is load-bearing for the central claim that imagination-trained policies transfer successfully to the real environment.
Authors: We agree that additional quantitative rollout metrics would strengthen the central claims. The current manuscript reports short-horizon prediction accuracies and relies on qualitative long-horizon visualizations plus downstream policy success as indirect evidence. Full 20,000-step quantitative evaluation is computationally prohibitive and subject to rapid compounding of errors in pixel space. In revision we will add pixel-level and object-level prediction errors for rollouts of 500–1,000 steps (a sketch of one such metric follows these responses), together with an analysis of how key game mechanics (object positions, inventory state) remain predictable over longer horizons. We will also clarify that policy transfer success in the real environment provides the primary empirical validation for the long-horizon regime. revision: partial
-
Referee: [§5.2] §5.2 (Offline Diamond Success): The headline result that Dreamer 4 is the first agent to obtain diamonds purely from offline data lacks error bars, ablation details on data exclusion criteria, and explicit success criteria (e.g., exact definition of 'obtaining a diamond'). This weakens confidence in the outperformance and novelty assertions.
Authors: We accept this criticism and will revise accordingly. The success criterion is collecting at least one diamond item in the agent’s inventory, consistent with the standard Minecraft achievement definition; we will state this explicitly. Results are already averaged over multiple random seeds; we will add standard-deviation error bars to all reported success rates. We will also expand the data section with a precise description of the exclusion criteria (no interactive trajectories were used) and include an ablation table showing performance when subsets of the offline video data are withheld, thereby supporting the claim that the majority of knowledge comes from unlabeled videos. revision: yes
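To pin down what the promised rollout metric would look like, here is a minimal sketch of per-step pixel error over a 500-step rollout under logged actions. The encode / predict_next / decode interface is a hypothetical stand-in for the actual model API, and the paper may prefer PSNR, LPIPS, or object-level scores over raw MSE.

    import numpy as np

    def rollout_error(world_model, frames, actions, horizon=500):
        """Roll the model forward from the first frame under the logged actions and
        compare each decoded prediction to ground truth. `frames` has shape
        (horizon+1, H, W, C), `actions` has shape (horizon, ...); the methods on
        `world_model` are hypothetical placeholders."""
        state = world_model.encode(frames[0])
        errors = []
        for t in range(horizon):
            state = world_model.predict_next(state, actions[t])
            pred = world_model.decode(state)
            errors.append(float(np.mean((pred - frames[t + 1]) ** 2)))  # pixel-level MSE
        return np.array(errors)   # growth of this curve exposes compounding error

    # Averaging curves over many held-out clips gives the error-vs-horizon plot the
    # referee asks for, e.g.:
    #   mean_curve = np.mean([rollout_error(wm, f, a) for f, a in heldout_clips], axis=0)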
Circularity Check
No circularity: empirical task success measured in the Minecraft environment
Full rationale
The paper's central claim is an empirical result: Dreamer 4 obtains diamonds in Minecraft from offline data by training policies inside a learned world model. This outcome is measured by actual environment interaction after imagination training, not by any quantity defined inside the paper's equations or by self-citation. World-model training uses standard reconstruction and prediction losses on video data; policy optimization uses standard RL objectives inside the model. No step reduces a prediction to a fitted input by construction, and no uniqueness theorem or ansatz is smuggled via self-citation. The claim is ultimately checked against an external benchmark rather than against quantities the paper itself defines.
Axiom & Free-Parameter Ledger
free parameters (1)
- transformer hyperparameters and shortcut forcing weight
axioms (1)
- domain assumption: A learned world model can accurately simulate object interactions and game mechanics over sequences of 20,000+ actions.
Forward citations
Cited by 23 Pith papers
-
ASH: Agents that Self-Hone via Embodied Learning
ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
-
Learning POMDP World Models from Observations with Language-Model Priors
Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.
-
3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.
-
Learning Visual Feature-Based World Models via Residual Latent Action
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
-
AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites
AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
-
Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.
-
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
-
Envisioning the Future, One Step at a Time
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
-
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
On Training in Imagination
The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...
-
Fisher Decorator: Refining Flow Policy via a Local Transport Map
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
Back to Basics: Let Denoising Generative Models Denoise
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari
Transformer world models on Atari exhibit game-specific scaling regimes, but joint training on 26 environments produces consistent monotonic gains that improve downstream control policies to a median normalized score ...
-
On Training in Imagination
The paper derives the optimal dynamics-to-reward sample ratio minimizing return error under power-law scaling and proves that zero-mean reward noise in REINFORCE adds only variance that shrinks with more rollouts.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
Reference graph
Works this paper leans on
-
[1]
Mastering diverse control tasks through world models.Nature, pages 1–7, 2025
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, pages 1–7, 2025
work page 2025
-
[2]
Daydreamer: World models for physical robot learning
Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. InConference on robot learning, pages 2226–2240. PMLR, 2023
work page 2023
-
[3]
TD-MPC2: Scalable, Robust World Models for Continuous Control
Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024
work page 2024
-
[5]
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.arXiv preprint arXiv:1911.08265, 2019
-
[6]
Muesli: Combining improvements in policy optimization
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado Van Hasselt. Muesli: Combining improvements in policy optimization. InInternational Conference on Machine Learning, pages 4214–4226. PMLR, 2021
work page 2021
-
[7]
Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Cip Bae...
work page 2025
-
[8]
Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025
Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, and Hengshuang Zhao. Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025
-
[9]
Matrix-game 2.0: An open-source real-time and streaming interactive world model
Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
From virtual games to real-world play.arXiv preprint arXiv:2506.18901, 2025
Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen, Hongyang Zhang, Jun Zhang, and Yan Lu. From virtual games to real-world play. arXiv preprint arXiv:2506.18901, 2025
-
[11]
Whole- body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025
Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole- body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025
-
[12]
Yan: Foundational interactive video generation.arXiv preprint arXiv:2508.08601, 2025
Yan Team. Yan: Foundational interactive video generation.arXiv preprint arXiv:2508.08601, 2025
-
[13]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[14]
Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024
work page 2024
-
[15]
Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos.Advances in Neural Information Processing Systems, 35: 24639–24654, 2022
work page 2022
-
[16]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015
work page 2015
-
[17]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
work page 2020
-
[18]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[20]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
work page 2024
-
[21]
One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024
-
[22]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018
work page 2018
-
[23]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022
work page 2022
-
[24]
Masked autoencoders are effective tokenizers for diffusion models
Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InForty-second International Conference on Machine Learning, 2025
work page 2025
-
[25]
Vision Transformers Need Registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023
work page 2023
-
[27]
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024
-
[28]
Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988
Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988
work page 1988
-
[29]
Preference optimization as probabilistic inference.arXiv e-prints, pages arXiv–2410, 2024
Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Tobias Springenberg, Tim Hertweck, Rishabh Joshi, Junhyuk Oh, Michael Bloesch, Thomas Lampe, Nicolas Heess, et al. Preference optimization as probabilistic inference.arXiv e-prints, pages arXiv–2410, 2024
work page 2024
-
[30]
A mathematical theory of communication.Bell system technical journal, 27 (3):379–423, 1948
Claude E Shannon. A mathematical theory of communication.Bell system technical journal, 27 (3):379–423, 1948
work page 1948
-
[31]
Attention is all you need.Advances in Neural Information Processing Systems, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017
work page 2017
-
[32]
Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019
Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[33]
Enhanced transformer with rotary position embedding
J Su, H Zhang, X Li, J Zhang, and Y Li. RoFormer: Enhanced transformer with rotary position embedding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Association for Computational Linguistics, Online, pages 1–6, 2021
work page 2021
-
[34]
GLU Variants Improve Transformer
Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[35]
Scaling vision transformers to 22 billion parameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational Conference on Machine Learning, pages 7480–7512. PMLR, 2023
work page 2023
-
[36]
Neural Combinatorial Optimization with Reinforcement Learning
Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[37]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Axial attention in multidimensional transformers
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers.arXiv preprint arXiv:1912.12180, 2019
-
[39]
Meta llama 4: The future of multimodal ai.Available at SSRN 5208228, 2025
Ajit Singh. Meta llama 4: The future of multimodal ai.Available at SSRN 5208228, 2025
work page 2025
-
[40]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020
work page 2020
-
[46]
Oasis: A universe in a transformer
Decart and Etched. Oasis: A universe in a transformer. https://www.decart.ai/articles/oasis-interactive-ai-video-game-model, 2024
work page 2024
-
[47]
Rami Seid and Alberto Hojel. Lucid v1: Real-time latent world models. International Journal of Current Research in Science, Engineering & Technology, 2024
work page 2024
-
[48]
Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: a real-time and open-source interactive world model on minecraft.arXiv preprint arXiv:2504.08388, 2025
-
[49]
Zhiyuan Zhou, Pranav Atreya, Abraham Lee, Homer Walke, Oier Mees, and Sergey Levine. Autonomous improvement of instruction following skills via foundation models.arXiv preprint arXiv:2407.20635, 2024
-
[50]
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019
work page 2019
-
[51]
The malmo platform for artificial intelligence experimentation
Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. InIJCAI, pages 4246–4247. Citeseer, 2016
work page 2016
-
[52]
William H Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, et al. The minerl competition on sample efficient reinforcement learning using human priors.arXiv e-prints, pages arXiv–1904, 2019
work page 1904
-
[53]
Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, et al. Minerl diamond 2021 competition: Overview, results, and lessons learned.NeurIPS 2021 Competitions and Demonstrations Track, pages 13–28, 2022
work page 2021
-
[54]
Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft
Ingmar Kanitscheider, Joost Huizinga, David Farhi, William Hebgen Guss, Brandon Houghton, Raul Sampedro, Peter Zhokhov, Bowen Baker, Adrien Ecoffet, Jie Tang, et al. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft.arXiv preprint arXiv:2106.14876, 2021
-
[55]
Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila McIlraith. Steve-1: A generative model for text-to-behavior in minecraft.Advances in Neural Information Processing Systems, 36:69900–69929, 2023
work page 2023
-
[56]
Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, and Yitao Liang. Groot: Learning to follow instructions by watching gameplay videos. arXiv preprint arXiv:2310.08235, 2023
-
[57]
Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, and Jing Shao. Minedreamer: Learning to follow instructions via chain-of-imagination for simulated-world control.arXiv preprint arXiv:2403.12037, 2024
-
[58]
Unsupervised skill-discovery and skill- learning in minecraft.arXiv preprint arXiv:2107.08398, 2021
Juan José Nieto, Roger Creus, and Xavier Giro-i Nieto. Unsupervised skill-discovery and skill- learning in minecraft.arXiv preprint arXiv:2107.08398, 2021
-
[59]
Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting.ACM SIGART Bulletin, 2(4):160–163, 1991
work page 1991
-
[60]
Pilco: A model-based and data-efficient approach to policy search
Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. InProceedings of the 28th International Conference on machine learning (ICML- 11), pages 465–472, 2011
work page 2011
-
[61]
Embed to control: A locally linear latent dynamics model for control from raw images
Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. InAdvances in neural information processing systems, pages 2746–2754, 2015
work page 2015
-
[62]
Deep visual foresight for planning robot motion
Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017
work page 2017
-
[63]
David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[64]
Learning Latent Dynamics for Planning from Pixels
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels.arXiv preprint arXiv:1811.04551, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[65]
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[66]
Mastering Atari with Discrete World Models
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[67]
Transformers are sample efficient world models.arXiv preprint arXiv:2209.00588, 2022
Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample efficient world models.arXiv preprint arXiv:2209.00588, 2022
-
[68]
Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions.arXiv preprint arXiv:2303.07109, 2023
-
[69]
Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning.Advances in Neural Information Processing Systems, 36:27147–27166, 2023
work page 2023
-
[70]
A survey of interactive generative video.arXiv preprint arXiv:2504.21853, 2025
Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, and Xihui Liu. A survey of interactive generative video.arXiv preprint arXiv:2504.21853, 2025
-
[71]
Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024
Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024
-
[72]
GAIA-1: A Generative World Model for Autonomous Driving
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[73]
Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025
-
[74]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022
work page 2022
-
[75]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[76]
Streamdit: Real-time streaming text-to-video generation
Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation.arXiv preprint arXiv:2507.03745, 2025
-
[77]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. International Conference on Machine Learning, 2023
work page 2023
-
[78]
Improved techniques for training consistency models.arXiv preprint arXiv:2310.14189, 2023
Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models.arXiv preprint arXiv:2310.14189, 2023
-
[79]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[80]
Videolcm: Video latent consistency model.arXiv preprint arXiv:2312.09109, 2023
Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model.arXiv preprint arXiv:2312.09109, 2023