pith. machine review for the scientific record.

arxiv: 2509.24527 · v1 · submitted 2025-09-29 · 💻 cs.AI · cs.LG · cs.RO · stat.ML

Recognition: 2 theorem links

Training Agents Inside of Scalable World Models

Danijar Hafner, Wilson Yan, Timothy Lillicrap

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:01 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.RO · stat.ML
keywords world models · reinforcement learning · Minecraft · imagination training · offline learning · video prediction · control tasks · transformer architecture

The pith

Dreamer 4 obtains diamonds in Minecraft by training reinforcement learning behaviors inside a world model learned from offline videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a scalable world model that accurately simulates object interactions and long game sequences in Minecraft from video data. Agents then learn control policies entirely through imagined rollouts using reinforcement learning, without ever stepping in the real environment. This setup solves a task that requires selecting over 20,000 mouse and keyboard actions from raw pixels. A sympathetic reader would care because it shows how model-based imagination training can replace dangerous or slow real-world interaction in domains such as robotics.

Core claim

Dreamer 4 learns to solve control tasks by reinforcement learning inside a fast and accurate world model. In Minecraft, the model predicts object interactions and game mechanics over long horizons, outperforming prior world models by a large margin. It achieves real-time inference on a single GPU via a shortcut forcing objective and an efficient transformer architecture, while extracting most of its knowledge from diverse unlabeled videos through general action conditioning learned from limited data. The result is the first agent to obtain diamonds purely from offline data, without environment interaction.
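The core loop of imagination training can be made concrete with a deliberately tiny sketch: a placeholder dynamics model stands in for the learned transformer world model, and a linear-Gaussian policy is updated by REINFORCE on imagined returns, with no real-environment calls anywhere. Everything below (the dynamics, the reward head, the policy form, the step sizes) is an illustrative assumption, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.1  # fixed policy noise (illustrative)

def world_model_step(state, action):
    # Toy stand-in for the learned dynamics and reward head; Dreamer 4's
    # real model is a video-trained transformer, not shown here.
    next_state = np.tanh(state + 0.1 * action)
    reward = -float(np.sum(next_state ** 2))
    return next_state, reward

def sample_action(state, theta):
    # Linear-Gaussian policy over a scalar action, purely illustrative.
    mean = float(theta @ state)
    action = mean + SIGMA * rng.standard_normal()
    score = (action - mean) / SIGMA**2 * state  # d log pi / d theta
    return action, score

theta = np.zeros(4)
for _ in range(300):
    state = rng.standard_normal(4)      # imagined start state
    total_score = np.zeros_like(theta)
    ret = 0.0
    for _ in range(10):                 # imagined horizon; no real env steps
        action, score = sample_action(state, theta)
        state, reward = world_model_step(state, action)
        total_score += score
        ret += reward
    theta += 1e-4 * ret * total_score   # REINFORCE on the imagined return
```

The point of the sketch is structural: the inner rollout only ever calls the model, which is why the approach scales to settings where real interaction is slow or unsafe.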

What carries the argument

A scalable world model with a shortcut forcing objective and an efficient transformer architecture, enabling accurate long-horizon prediction and imagination-based reinforcement learning.
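The paper does not reproduce its objective here, but the shortcut family of objectives it builds on has a characteristic shape: a step-size-conditioned network is trained with a flow-matching term at vanishing step size plus a self-consistency term tying one large step to two chained half steps (with a stop-gradient on the target). The following is a generic sketch of that idea, not the paper's exact loss:

```latex
% s_\theta(x_t, t, d): step-size-conditioned velocity network (sketch).
\mathcal{L}(\theta) =
  \underbrace{\mathbb{E}\,\| s_\theta(x_t, t, 0) - v_t \|^2}_{\text{flow matching at } d \to 0}
  + \underbrace{\mathbb{E}\,\| s_\theta(x_t, t, 2d) - \mathrm{sg}(\bar{s}) \|^2}_{\text{self-consistency}},
\qquad
\bar{s} = \tfrac{1}{2}\big( s_\theta(x_t, t, d) + s_\theta(x'_{t+d},\, t+d,\, d) \big),
\qquad
x'_{t+d} = x_t + d\, s_\theta(x_t, t, d).
```

Conditioning on the step size $d$ is what lets the same network run with few (or single) sampling steps at inference, which is the mechanism behind the real-time single-GPU claim.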

If this is right

  • Complex tasks requiring thousands of actions become solvable from raw pixels using only imagined experience.
  • The majority of knowledge can be extracted from diverse unlabeled videos rather than task-specific interaction.
  • Real-time interactive inference on a single GPU becomes feasible for world-model-based agents.
  • Imagination training offers a scalable recipe that aligns with safety constraints in robotics and similar domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same world-model approach could let robots learn manipulation skills from passive video footage without risking hardware damage.
  • Extending prediction horizons further might enable agents to plan toward open-ended goals beyond fixed tasks like diamond collection.
  • If the model generalizes across game variants or real-world videos, it could reduce the need for any environment-specific data collection.

Load-bearing premise

The world model must accurately predict object interactions and game mechanics over the long action sequences required for the diamond task.

What would settle it

Deploy the trained agent in the actual Minecraft environment and observe whether it successfully obtains diamonds; failure despite high video prediction accuracy on held-out data would falsify the claim.

Original abstract

World models learn general knowledge from videos and simulate experience for training behaviors in imagination, offering a path towards intelligent agents. However, previous world models have been unable to accurately predict object interactions in complex environments. We introduce Dreamer 4, a scalable agent that learns to solve control tasks by reinforcement learning inside of a fast and accurate world model. In the complex video game Minecraft, the world model accurately predicts object interactions and game mechanics, outperforming previous world models by a large margin. The world model achieves real-time interactive inference on a single GPU through a shortcut forcing objective and an efficient transformer architecture. Moreover, the world model learns general action conditioning from only a small amount of data, allowing it to extract the majority of its knowledge from diverse unlabeled videos. We propose the challenge of obtaining diamonds in Minecraft from only offline data, aligning with practical applications such as robotics where learning from environment interaction can be unsafe and slow. This task requires choosing sequences of over 20,000 mouse and keyboard actions from raw pixels. By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction. Our work provides a scalable recipe for imagination training, marking a step towards intelligent agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Dreamer 4, a scalable world model using an efficient transformer architecture and shortcut forcing objective. Trained primarily on unlabeled videos, it claims accurate long-horizon prediction of object interactions and game mechanics in Minecraft, enabling real-time inference on a single GPU. Policies are trained entirely via reinforcement learning inside this model, yielding the first reported agent to obtain diamonds from raw pixels using only offline data and sequences exceeding 20,000 actions, without any environment interaction.

Significance. If the long-horizon prediction accuracy and policy transfer claims hold under rigorous validation, the work would mark a meaningful advance in scalable imagination-based training for high-dimensional, long-horizon control. It supplies a concrete recipe for extracting general knowledge from diverse offline videos and demonstrates practical applicability to domains where real interaction is unsafe or expensive, such as robotics.

major comments (2)
  1. [§4 and §5] §4 (World Model Evaluation) and §5 (Diamond Task Results): No quantitative rollout metrics (pixel-level, object-level, or state-level prediction error) are reported for sequences whose length and complexity match the >20,000-action diamond horizon. This omission is load-bearing for the central claim that imagination-trained policies transfer successfully to the real environment.
  2. [§5.2] §5.2 (Offline Diamond Success): The headline result that Dreamer 4 is the first agent to obtain diamonds purely from offline data lacks error bars, ablation details on data exclusion criteria, and explicit success criteria (e.g., exact definition of 'obtaining a diamond'). This weakens confidence in the outperformance and novelty assertions.
minor comments (2)
  1. [Figure 3] Figure 3 and associated text: Direct numerical comparisons to prior world models (e.g., DreamerV3) should be tabulated with exact margins rather than described only qualitatively as 'large margin'.
  2. [§3.2] §3.2 (Shortcut Forcing): The objective is introduced without an explicit equation contrasting it to standard reconstruction or latent prediction losses, which would aid reproducibility.
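Major comment 1's request can be made operational. A minimal version of the missing metric is per-step pixel-level error over a rollout, computed against held-out ground-truth video; the array shapes, the synthetic data, and the drift model below are all assumptions for illustration (object- or state-level metrics would need task-specific extractors on top).

```python
import numpy as np

def rollout_mse(true_frames, pred_frames):
    """Per-step pixel-level MSE for a predicted rollout.

    Both arrays have shape (T, H, W, C) with values in [0, 1].
    Returns an array of length T: the mean squared error at each step.
    """
    err = (true_frames - pred_frames) ** 2
    return err.reshape(err.shape[0], -1).mean(axis=1)

# Synthetic illustration: rollout errors typically compound with horizon,
# which is exactly why long-horizon metrics are load-bearing here.
rng = np.random.default_rng(1)
T, H, W, C = 8, 4, 4, 3
truth = rng.random((T, H, W, C))
drift = np.cumsum(rng.normal(0.0, 0.05, (T, H, W, C)), axis=0)
pred = np.clip(truth + drift, 0.0, 1.0)
per_step = rollout_mse(truth, pred)  # shape (T,)
```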

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and describe the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (World Model Evaluation) and §5 (Diamond Task Results): No quantitative rollout metrics (pixel-level, object-level, or state-level prediction error) are reported for sequences whose length and complexity match the >20,000-action diamond horizon. This omission is load-bearing for the central claim that imagination-trained policies transfer successfully to the real environment.

    Authors: We agree that additional quantitative rollout metrics would strengthen the central claims. The current manuscript reports short-horizon prediction accuracies and relies on qualitative long-horizon visualizations plus downstream policy success as indirect evidence. Full 20,000-step quantitative evaluation is computationally prohibitive and subject to rapid compounding of errors in pixel space. In revision we will add pixel-level and object-level prediction errors for rollouts of 500–1,000 steps, together with an analysis of how key game mechanics (object positions, inventory state) remain predictable over longer horizons. We will also clarify that policy transfer success in the real environment provides the primary empirical validation for the long-horizon regime. revision: partial

  2. Referee: [§5.2] §5.2 (Offline Diamond Success): The headline result that Dreamer 4 is the first agent to obtain diamonds purely from offline data lacks error bars, ablation details on data exclusion criteria, and explicit success criteria (e.g., exact definition of 'obtaining a diamond'). This weakens confidence in the outperformance and novelty assertions.

    Authors: We accept this criticism and will revise accordingly. The success criterion is collecting at least one diamond item in the agent’s inventory, consistent with the standard Minecraft achievement definition; we will state this explicitly. Results are already averaged over multiple random seeds; we will add standard-deviation error bars to all reported success rates. We will also expand the data section with a precise description of the exclusion criteria (no interactive trajectories were used) and include an ablation table showing performance when subsets of the offline video data are withheld, thereby supporting the claim that the majority of knowledge comes from unlabeled videos. revision: yes
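The error bars the rebuttal promises reduce to a mean and sample standard deviation of per-seed success rates. A minimal sketch with made-up numbers (the success rates below are hypothetical, not the paper's results):

```python
import statistics

# Hypothetical per-seed success rates: fraction of evaluation episodes in
# which at least one diamond enters the inventory. Values are invented.
success_by_seed = [0.12, 0.08, 0.15, 0.10, 0.11]

mean = statistics.mean(success_by_seed)
std = statistics.stdev(success_by_seed)  # sample std across seeds
print(f"success rate: {mean:.3f} +/- {std:.3f} over {len(success_by_seed)} seeds")
```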

Circularity Check

0 steps flagged

No circularity: empirical task success on held-out Minecraft data

Full rationale

The paper's central claim is an empirical result: Dreamer 4 obtains diamonds in Minecraft from offline data by training policies inside a learned world model. This outcome is measured by actual environment interaction after imagination training, not by any quantity defined inside the paper's equations or by self-citation. World-model training uses standard reconstruction and prediction losses on video data; policy optimization uses standard RL objectives inside the model. No step reduces a prediction to a fitted input by construction, and no uniqueness theorem or ansatz is smuggled via self-citation. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on standard assumptions that world models trained on video can generalize to long-horizon control and that reinforcement learning inside simulation transfers to the real environment. No new physical entities are introduced.

free parameters (1)
  • transformer hyperparameters and shortcut forcing weight
    Chosen to achieve real-time inference and accurate prediction; values are tuned on Minecraft data.
axioms (1)
  • domain assumption: A learned world model can accurately simulate object interactions and game mechanics over sequences of 20,000+ actions
    Invoked to justify training entirely in imagination without environment interaction.

pith-pipeline@v0.9.0 · 5524 in / 1169 out tokens · 45266 ms · 2026-05-15T02:01:18.530066+00:00 · methodology


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ASH: Agents that Self-Hone via Embodied Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

  2. Learning POMDP World Models from Observations with Language-Model Priors

    cs.LG 2026-05 unverdicted novelty 7.0

    Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.

  3. 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

  4. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  5. AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

    cs.AI 2026-05 unverdicted novelty 7.0

    AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.

  6. Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes

    cs.CV 2026-04 unverdicted novelty 7.0

    Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.

  7. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  8. Envisioning the Future, One Step at a Time

    cs.CV 2026-04 unverdicted novelty 7.0

    An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.

  9. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...

  10. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  11. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  12. On Training in Imagination

    cs.LG 2026-05 unverdicted novelty 6.0

    The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...

  13. Fisher Decorator: Refining Flow Policy via a Local Transport Map

    cs.LG 2026-04 unverdicted novelty 6.0

    Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

  14. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  15. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  16. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  17. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  18. Back to Basics: Let Denoising Generative Models Denoise

    cs.CV 2025-11 unverdicted novelty 6.0

    Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.

  19. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  20. Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari

    cs.LG 2026-05 unverdicted novelty 5.0

    Transformer world models on Atari exhibit game-specific scaling regimes, but joint training on 26 environments produces consistent monotonic gains that improve downstream control policies to a median normalized score ...

  21. On Training in Imagination

    cs.LG 2026-05 unverdicted novelty 5.0

    The paper derives the optimal dynamics-to-reward sample ratio minimizing return error under power-law scaling and proves that zero-mean reward noise in REINFORCE adds only variance that shrinks with more rollouts.

  22. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  23. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 21 Pith papers · 21 internal anchors

  1. [1]

    Mastering diverse control tasks through world models.Nature, pages 1–7, 2025

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, pages 1–7, 2025

  2. [2]

    Daydreamer: World models for physical robot learning

    Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. InConference on robot learning, pages 2226–2240. PMLR, 2023

  3. [3]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2023

  4. [4]

    Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  5. [5]

    Mastering atari, go, chess and shogi by planning with a learned model.arXiv preprint arXiv:1911.08265, 2019

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.arXiv preprint arXiv:1911.08265, 2019

  6. [6]

    Muesli: Combining improvements in policy optimization

    Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado Van Hasselt. Muesli: Combining improvements in policy optimization. InInternational Conference on Machine Learning, pages 4214–4226. PMLR, 2021

  7. [7]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, AgrimGupta, KristianHolsheimer, AleksanderHolynski, JiriHron, ChristosKaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Cip Bae...

  8. [8]

    Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025

    Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, and Hengshuang Zhao. Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025

  9. [9]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

  10. [10]

    From virtual games to real-world play.arXiv preprint arXiv:2506.18901, 2025

    Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen, Hongyang Zhang, Jun Zhang, and Yan Lu. From virtual games to real-world play.arXiv preprint arXiv:2506.18901, 2025. 19

  11. [11]

    Whole- body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

    Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole- body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

  12. [12]

    Yan: Foundational interactive video generation.arXiv preprint arXiv:2508.08601, 2025

    Yan Team. Yan: Foundational interactive video generation.arXiv preprint arXiv:2508.08601, 2025

  13. [13]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  14. [14]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  15. [15]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos.Advances in Neural Information Processing Systems, 35: 24639–24654, 2022

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos.Advances in Neural Information Processing Systems, 35: 24639–24654, 2022

  16. [16]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

  17. [17]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  18. [18]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  19. [19]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  20. [20]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  21. [21]

    One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

  22. [22]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  23. [23]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  24. [24]

    Masked autoencoders are effective tokenizers for diffusion models

    Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InForty-second International Conference on Machine Learning, 2025

  25. [25]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023

  26. [26]

    Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023

    Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023. 20

  27. [27]

    Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

  28. [28]

    Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

    Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

  29. [29]

    Preference optimization as probabilistic inference.arXiv e-prints, pages arXiv–2410, 2024

    Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Tobias Springenberg, Tim Hertweck, Rishabh Joshi, Junhyuk Oh, Michael Bloesch, Thomas Lampe, Nicolas Heess, et al. Preference optimization as probabilistic inference.arXiv e-prints, pages arXiv–2410, 2024

  30. [30]

    A mathematical theory of communication.Bell system technical journal, 27 (3):379–423, 1948

    Claude E Shannon. A mathematical theory of communication.Bell system technical journal, 27 (3):379–423, 1948

  31. [31]

    Attention is all you need.Advances in Neural Information Processing Systems, 2017

    A Vaswani. Attention is all you need.Advances in Neural Information Processing Systems, 2017

  32. [32]

    Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

  33. [33]

    Enhanced transformer with rotary position embedding

    J Su, H Zhang, X Li, J Zhang, and Y RoFormer Li. Enhanced transformer with rotary position embedding. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL- IJCNLP), Association for Computational Linguistics, Online, pages 1–6, 2021

  34. [34]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  35. [35]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational Conference on Machine Learning, pages 7480–7512. PMLR, 2023

  36. [36]

    Neural Combinatorial Optimization with Reinforcement Learning

    Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

  37. [37]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  38. [38]

    Ax- ial attention in multidimensional transformers,

    Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers.arXiv preprint arXiv:1912.12180, 2019

  39. [39]

    Meta llama 4: The future of multimodal ai.Available at SSRN 5208228, 2025

    Ajit Singh. Meta llama 4: The future of multimodal ai.Available at SSRN 5208228, 2025

  40. [40]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023

  41. [41]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  42. [42]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025. 21

  43. [43]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

  44. [44]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

  45. [45]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020

  46. [46]

    Oasis: A universe in a transformer

    Decart and Etched. Oasis: A universe in a transformer. https://www.decart.ai/articles/oasis-interactive-ai-video-game-model, 2024

  47. [47]

    Lucid v1: Real-Time Latent World Models

    Rami Seid and Alberto Hojel. Lucid v1: Real-time latent world models. International Journal of Current Research in Science, Engineering & Technology, 2024

  48. [48]

    MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388, 2025

  49. [49]

    Autonomous Improvement of Instruction Following Skills via Foundation Models

    Zhiyuan Zhou, Pranav Atreya, Abraham Lee, Homer Walke, Oier Mees, and Sergey Levine. Autonomous improvement of instruction following skills via foundation models. arXiv preprint arXiv:2407.20635, 2024

  50. [50]

    FVD: A New Metric for Video Generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019

  51. [51]

    The malmo platform for artificial intelligence experimentation

    Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247. Citeseer, 2016

  52. [52]

    The MineRL Competition on Sample Efficient Reinforcement Learning Using Human Priors

    William H Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, et al. The minerl competition on sample efficient reinforcement learning using human priors. arXiv e-prints, pages arXiv–1904, 2019

  53. [53]

    MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

    Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, et al. Minerl diamond 2021 competition: Overview, results, and lessons learned. NeurIPS 2021 Competitions and Demonstrations Track, pages 13–28, 2022

  54. [54]

    Multi-Task Curriculum Learning in a Complex, Visual, Hard-Exploration Domain: Minecraft

    Ingmar Kanitscheider, Joost Huizinga, David Farhi, William Hebgen Guss, Brandon Houghton, Raul Sampedro, Peter Zhokhov, Bowen Baker, Adrien Ecoffet, Jie Tang, et al. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. arXiv preprint arXiv:2106.14876, 2021

  55. [55]

    STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

    Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila McIlraith. Steve-1: A generative model for text-to-behavior in minecraft. Advances in Neural Information Processing Systems, 36:69900–69929, 2023

  56. [56]

    GROOT: Learning to Follow Instructions by Watching Gameplay Videos

    Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, and Yitao Liang. Groot: Learning to follow instructions by watching gameplay videos. arXiv preprint arXiv:2310.08235, 2023

  57. [57]

    MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

    Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, and Jing Shao. Minedreamer: Learning to follow instructions via chain-of-imagination for simulated-world control. arXiv preprint arXiv:2403.12037, 2024

  58. [58]

    Unsupervised Skill-Discovery and Skill-Learning in Minecraft

    Juan José Nieto, Roger Creus, and Xavier Giro-i-Nieto. Unsupervised skill-discovery and skill-learning in minecraft. arXiv preprint arXiv:2107.08398, 2021

  59. [59]

    Dyna, an Integrated Architecture for Learning, Planning, and Reacting

    Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991

  60. [60]

    Pilco: A model-based and data-efficient approach to policy search

    Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011

  61. [61]

    Embed to control: A locally linear latent dynamics model for control from raw images

    Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015

  62. [62]

    Deep Visual Foresight for Planning Robot Motion

    Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017

  63. [63]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  64. [64]

    Learning Latent Dynamics for Planning from Pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018

  65. [65]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  66. [66]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  67. [67]

    Transformers Are Sample Efficient World Models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample efficient world models. arXiv preprint arXiv:2209.00588, 2022

  68. [68]

    Transformer-Based World Models Are Happy with 100k Interactions

    Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions. arXiv preprint arXiv:2303.07109, 2023

  69. [69]

    STORM: Efficient Stochastic Transformer Based World Models for Reinforcement Learning

    Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36:27147–27166, 2023

  70. [70]

    A Survey of Interactive Generative Video

    Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, and Xihui Liu. A survey of interactive generative video. arXiv preprint arXiv:2504.21853, 2025

  71. [71]

    Diffusion Models Are Real-Time Game Engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

  72. [72]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

  73. [73]

    GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025

  74. [74]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  75. [75]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

  76. [76]

    Streamdit: Real-time streaming text-to-video generation

    Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. arXiv preprint arXiv:2507.03745, 2025

  77. [77]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. International Conference on Machine Learning, 2023

  78. [78]

    Improved Techniques for Training Consistency Models

    Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023

  79. [79]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024

  80. [80]

    VideoLCM: Video Latent Consistency Model

    Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model. arXiv preprint arXiv:2312.09109, 2023

Showing first 80 references.