pith. machine review for the scientific record.

arxiv: 2509.24527 · v1 · submitted 2025-09-29 · 💻 cs.AI · cs.LG · cs.RO · stat.ML

Recognition: 2 theorem links

Training Agents Inside of Scalable World Models

Danijar Hafner, Wilson Yan, Timothy Lillicrap

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:01 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.RO · stat.ML
keywords world models · reinforcement learning · Minecraft · imagination training · offline learning · video prediction · control tasks · transformer architecture

The pith

Dreamer 4 obtains diamonds in Minecraft by training reinforcement learning behaviors inside a world model learned from offline videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a scalable world model that accurately simulates object interactions and long game sequences in Minecraft from video data. Agents then learn control policies entirely through imagined rollouts using reinforcement learning, without ever stepping in the real environment. This setup solves a task that requires selecting over 20,000 mouse and keyboard actions from raw pixels. A sympathetic reader would care because it shows how model-based imagination training can replace dangerous or slow real-world interaction in domains such as robotics.

Core claim

Dreamer 4 learns to solve control tasks by reinforcement learning inside a fast and accurate world model. In Minecraft, the model predicts object interactions and game mechanics over long horizons, outperforming prior world models by a large margin. It achieves real-time inference on a single GPU via a shortcut forcing objective and an efficient transformer architecture, while extracting most of its knowledge from diverse unlabeled videos through general action conditioning learned from limited data. The result is the first agent to obtain diamonds purely from offline data, without environment interaction.
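The core loop of imagination training can be made concrete with a deliberately tiny sketch: a placeholder dynamics model stands in for the learned transformer world model, and a linear-Gaussian policy is updated by REINFORCE on imagined returns, with no real-environment calls anywhere. Everything below (the dynamics, the reward head, the policy form, the step sizes) is an illustrative assumption, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.1  # fixed policy noise (illustrative)

def world_model_step(state, action):
    # Toy stand-in for the learned dynamics and reward head; Dreamer 4's
    # real model is a video-trained transformer, not shown here.
    next_state = np.tanh(state + 0.1 * action)
    reward = -float(np.sum(next_state ** 2))
    return next_state, reward

def sample_action(state, theta):
    # Linear-Gaussian policy over a scalar action, purely illustrative.
    mean = float(theta @ state)
    action = mean + SIGMA * rng.standard_normal()
    score = (action - mean) / SIGMA**2 * state  # d log pi / d theta
    return action, score

theta = np.zeros(4)
for _ in range(300):
    state = rng.standard_normal(4)      # imagined start state
    total_score = np.zeros_like(theta)
    ret = 0.0
    for _ in range(10):                 # imagined horizon; no real env steps
        action, score = sample_action(state, theta)
        state, reward = world_model_step(state, action)
        total_score += score
        ret += reward
    theta += 1e-4 * ret * total_score   # REINFORCE on the imagined return
```

The point of the sketch is structural: the inner rollout only ever calls the model, which is why the approach scales to settings where real interaction is slow or unsafe.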

What carries the argument

A scalable world model with a shortcut forcing objective and an efficient transformer architecture, enabling accurate long-horizon prediction and imagination-based reinforcement learning.
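The paper does not reproduce its objective here, but the shortcut family of objectives it builds on has a characteristic shape: a step-size-conditioned network is trained with a flow-matching term at vanishing step size plus a self-consistency term tying one large step to two chained half steps (with a stop-gradient on the target). The following is a generic sketch of that idea, not the paper's exact loss:

```latex
% s_\theta(x_t, t, d): step-size-conditioned velocity network (sketch).
\mathcal{L}(\theta) =
  \underbrace{\mathbb{E}\,\| s_\theta(x_t, t, 0) - v_t \|^2}_{\text{flow matching at } d \to 0}
  + \underbrace{\mathbb{E}\,\| s_\theta(x_t, t, 2d) - \mathrm{sg}(\bar{s}) \|^2}_{\text{self-consistency}},
\qquad
\bar{s} = \tfrac{1}{2}\big( s_\theta(x_t, t, d) + s_\theta(x'_{t+d},\, t+d,\, d) \big),
\qquad
x'_{t+d} = x_t + d\, s_\theta(x_t, t, d).
```

Conditioning on the step size $d$ is what lets the same network run with few (or single) sampling steps at inference, which is the mechanism behind the real-time single-GPU claim.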

If this is right

  • Complex tasks requiring thousands of actions become solvable from raw pixels using only imagined experience.
  • The majority of knowledge can be extracted from diverse unlabeled videos rather than task-specific interaction.
  • Real-time interactive inference on a single GPU becomes feasible for world-model-based agents.
  • Imagination training offers a scalable recipe that aligns with safety constraints in robotics and similar domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same world-model approach could let robots learn manipulation skills from passive video footage without risking hardware damage.
  • Extending prediction horizons further might enable agents to plan toward open-ended goals beyond fixed tasks like diamond collection.
  • If the model generalizes across game variants or real-world videos, it could reduce the need for any environment-specific data collection.

Load-bearing premise

The world model must accurately predict object interactions and game mechanics over the long action sequences required for the diamond task.

What would settle it

Deploy the trained agent in the actual Minecraft environment and observe whether it successfully obtains diamonds; failure despite high video prediction accuracy on held-out data would falsify the claim.

Original abstract

World models learn general knowledge from videos and simulate experience for training behaviors in imagination, offering a path towards intelligent agents. However, previous world models have been unable to accurately predict object interactions in complex environments. We introduce Dreamer 4, a scalable agent that learns to solve control tasks by reinforcement learning inside of a fast and accurate world model. In the complex video game Minecraft, the world model accurately predicts object interactions and game mechanics, outperforming previous world models by a large margin. The world model achieves real-time interactive inference on a single GPU through a shortcut forcing objective and an efficient transformer architecture. Moreover, the world model learns general action conditioning from only a small amount of data, allowing it to extract the majority of its knowledge from diverse unlabeled videos. We propose the challenge of obtaining diamonds in Minecraft from only offline data, aligning with practical applications such as robotics where learning from environment interaction can be unsafe and slow. This task requires choosing sequences of over 20,000 mouse and keyboard actions from raw pixels. By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction. Our work provides a scalable recipe for imagination training, marking a step towards intelligent agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Dreamer 4, a scalable world model using an efficient transformer architecture and shortcut forcing objective. Trained primarily on unlabeled videos, it claims accurate long-horizon prediction of object interactions and game mechanics in Minecraft, enabling real-time inference on a single GPU. Policies are trained entirely via reinforcement learning inside this model, yielding the first reported agent to obtain diamonds from raw pixels using only offline data and sequences exceeding 20,000 actions, without any environment interaction.

Significance. If the long-horizon prediction accuracy and policy transfer claims hold under rigorous validation, the work would mark a meaningful advance in scalable imagination-based training for high-dimensional, long-horizon control. It supplies a concrete recipe for extracting general knowledge from diverse offline videos and demonstrates practical applicability to domains where real interaction is unsafe or expensive, such as robotics.

major comments (2)
  1. [§4 and §5] §4 (World Model Evaluation) and §5 (Diamond Task Results): No quantitative rollout metrics (pixel-level, object-level, or state-level prediction error) are reported for sequences whose length and complexity match the >20,000-action diamond horizon. This omission is load-bearing for the central claim that imagination-trained policies transfer successfully to the real environment.
  2. [§5.2] §5.2 (Offline Diamond Success): The headline result that Dreamer 4 is the first agent to obtain diamonds purely from offline data lacks error bars, ablation details on data exclusion criteria, and explicit success criteria (e.g., exact definition of 'obtaining a diamond'). This weakens confidence in the outperformance and novelty assertions.
minor comments (2)
  1. [Figure 3] Figure 3 and associated text: Direct numerical comparisons to prior world models (e.g., DreamerV3) should be tabulated with exact margins rather than described only qualitatively as 'large margin'.
  2. [§3.2] §3.2 (Shortcut Forcing): The objective is introduced without an explicit equation contrasting it to standard reconstruction or latent prediction losses, which would aid reproducibility.
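Major comment 1's request can be made operational. A minimal version of the missing metric is per-step pixel-level error over a rollout, computed against held-out ground-truth video; the array shapes, the synthetic data, and the drift model below are all assumptions for illustration (object- or state-level metrics would need task-specific extractors on top).

```python
import numpy as np

def rollout_mse(true_frames, pred_frames):
    """Per-step pixel-level MSE for a predicted rollout.

    Both arrays have shape (T, H, W, C) with values in [0, 1].
    Returns an array of length T: the mean squared error at each step.
    """
    err = (true_frames - pred_frames) ** 2
    return err.reshape(err.shape[0], -1).mean(axis=1)

# Synthetic illustration: rollout errors typically compound with horizon,
# which is exactly why long-horizon metrics are load-bearing here.
rng = np.random.default_rng(1)
T, H, W, C = 8, 4, 4, 3
truth = rng.random((T, H, W, C))
drift = np.cumsum(rng.normal(0.0, 0.05, (T, H, W, C)), axis=0)
pred = np.clip(truth + drift, 0.0, 1.0)
per_step = rollout_mse(truth, pred)  # shape (T,)
```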

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and describe the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (World Model Evaluation) and §5 (Diamond Task Results): No quantitative rollout metrics (pixel-level, object-level, or state-level prediction error) are reported for sequences whose length and complexity match the >20,000-action diamond horizon. This omission is load-bearing for the central claim that imagination-trained policies transfer successfully to the real environment.

    Authors: We agree that additional quantitative rollout metrics would strengthen the central claims. The current manuscript reports short-horizon prediction accuracies and relies on qualitative long-horizon visualizations plus downstream policy success as indirect evidence. Full 20,000-step quantitative evaluation is computationally prohibitive and subject to rapid compounding of errors in pixel space. In revision we will add pixel-level and object-level prediction errors for rollouts of 500–1,000 steps, together with an analysis of how key game mechanics (object positions, inventory state) remain predictable over longer horizons. We will also clarify that policy transfer success in the real environment provides the primary empirical validation for the long-horizon regime. revision: partial

  2. Referee: [§5.2] §5.2 (Offline Diamond Success): The headline result that Dreamer 4 is the first agent to obtain diamonds purely from offline data lacks error bars, ablation details on data exclusion criteria, and explicit success criteria (e.g., exact definition of 'obtaining a diamond'). This weakens confidence in the outperformance and novelty assertions.

    Authors: We accept this criticism and will revise accordingly. The success criterion is collecting at least one diamond item in the agent’s inventory, consistent with the standard Minecraft achievement definition; we will state this explicitly. Results are already averaged over multiple random seeds; we will add standard-deviation error bars to all reported success rates. We will also expand the data section with a precise description of the exclusion criteria (no interactive trajectories were used) and include an ablation table showing performance when subsets of the offline video data are withheld, thereby supporting the claim that the majority of knowledge comes from unlabeled videos. revision: yes
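The error bars the rebuttal promises reduce to a mean and sample standard deviation of per-seed success rates. A minimal sketch with made-up numbers (the success rates below are hypothetical, not the paper's results):

```python
import statistics

# Hypothetical per-seed success rates: fraction of evaluation episodes in
# which at least one diamond enters the inventory. Values are invented.
success_by_seed = [0.12, 0.08, 0.15, 0.10, 0.11]

mean = statistics.mean(success_by_seed)
std = statistics.stdev(success_by_seed)  # sample std across seeds
print(f"success rate: {mean:.3f} +/- {std:.3f} over {len(success_by_seed)} seeds")
```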

Circularity Check

0 steps flagged

No circularity: empirical task success on held-out Minecraft data

Full rationale

The paper's central claim is an empirical result: Dreamer 4 obtains diamonds in Minecraft from offline data by training policies inside a learned world model. This outcome is measured by actual environment interaction after imagination training, not by any quantity defined inside the paper's equations or by self-citation. World-model training uses standard reconstruction and prediction losses on video data; policy optimization uses standard RL objectives inside the model. No step reduces a prediction to a fitted input by construction, and no uniqueness theorem or ansatz is smuggled via self-citation. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on standard assumptions that world models trained on video can generalize to long-horizon control and that reinforcement learning inside simulation transfers to the real environment. No new physical entities are introduced.

free parameters (1)
  • transformer hyperparameters and shortcut forcing weight
    Chosen to achieve real-time inference and accurate prediction; values are tuned on Minecraft data.
axioms (1)
  • domain assumption: A learned world model can accurately simulate object interactions and game mechanics over sequences of 20,000+ actions
    Invoked to justify training entirely in imagination without environment interaction.

pith-pipeline@v0.9.0 · 5524 in / 1169 out tokens · 45266 ms · 2026-05-15T02:01:18.530066+00:00 · methodology


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ASH: Agents that Self-Hone via Embodied Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

  2. Learning POMDP World Models from Observations with Language-Model Priors

    cs.LG 2026-05 unverdicted novelty 7.0

    Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.

  3. 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

  4. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  5. AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

    cs.AI 2026-05 unverdicted novelty 7.0

    AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.

  6. Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes

    cs.CV 2026-04 unverdicted novelty 7.0

    Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.

  7. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  8. Envisioning the Future, One Step at a Time

    cs.CV 2026-04 unverdicted novelty 7.0

    An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.

  9. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...

  10. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  11. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  12. On Training in Imagination

    cs.LG 2026-05 unverdicted novelty 6.0

    The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...

  13. Fisher Decorator: Refining Flow Policy via a Local Transport Map

    cs.LG 2026-04 unverdicted novelty 6.0

    Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

  14. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  15. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  16. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  17. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  18. Back to Basics: Let Denoising Generative Models Denoise

    cs.CV 2025-11 unverdicted novelty 6.0

    Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.

  19. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  20. Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari

    cs.LG 2026-05 unverdicted novelty 5.0

    Transformer world models on Atari exhibit game-specific scaling regimes, but joint training on 26 environments produces consistent monotonic gains that improve downstream control policies to a median normalized score ...

  21. On Training in Imagination

    cs.LG 2026-05 unverdicted novelty 5.0

    The paper derives the optimal dynamics-to-reward sample ratio minimizing return error under power-law scaling and proves that zero-mean reward noise in REINFORCE adds only variance that shrinks with more rollouts.

  22. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  23. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 21 Pith papers · 21 internal anchors

  1. [1]

    Mastering diverse control tasks through world models.Nature, pages 1–7, 2025

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, pages 1–7, 2025

  2. [2]

    Daydreamer: World models for physical robot learning

    Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. InConference on robot learning, pages 2226–2240. PMLR, 2023

  3. [3]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2023

  4. [4]

    Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  5. [5]

    Mastering atari, go, chess and shogi by planning with a learned model.arXiv preprint arXiv:1911.08265, 2019

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.arXiv preprint arXiv:1911.08265, 2019

  6. [6]

    Muesli: Combining improvements in policy optimization

    Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado Van Hasselt. Muesli: Combining improvements in policy optimization. InInternational Conference on Machine Learning, pages 4214–4226. PMLR, 2021

  7. [7]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, AgrimGupta, KristianHolsheimer, AleksanderHolynski, JiriHron, ChristosKaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Cip Bae...

  8. [8]

    Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025

    Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, and Hengshuang Zhao. Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025

  9. [9]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

  10. [10]

    From virtual games to real-world play.arXiv preprint arXiv:2506.18901, 2025

    Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen, Hongyang Zhang, Jun Zhang, and Yan Lu. From virtual games to real-world play.arXiv preprint arXiv:2506.18901, 2025. 19

  11. [11]

    Whole- body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

    Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole- body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

  12. [12]

    Yan: Foundational interactive video generation.arXiv preprint arXiv:2508.08601, 2025

    Yan Team. Yan: Foundational interactive video generation.arXiv preprint arXiv:2508.08601, 2025

  13. [13]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  14. [14]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  15. [15]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos.Advances in Neural Information Processing Systems, 35: 24639–24654, 2022

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos.Advances in Neural Information Processing Systems, 35: 24639–24654, 2022

  16. [16]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

  17. [17]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  18. [18]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  19. [19]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  20. [20]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  21. [21]

    One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

  22. [22]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  23. [23]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  24. [24]

    Masked autoencoders are effective tokenizers for diffusion models

    Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InForty-second International Conference on Machine Learning, 2025

  25. [25]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023

  26. [26]

    Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023

    Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023. 20

  27. [27]

    Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

  28. [28]

    Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

    Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

  29. [29]

    Preference optimization as probabilistic inference.arXiv e-prints, pages arXiv–2410, 2024

    Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Tobias Springenberg, Tim Hertweck, Rishabh Joshi, Junhyuk Oh, Michael Bloesch, Thomas Lampe, Nicolas Heess, et al. Preference optimization as probabilistic inference.arXiv e-prints, pages arXiv–2410, 2024

  30. [30]

    A mathematical theory of communication.Bell system technical journal, 27 (3):379–423, 1948

    Claude E Shannon. A mathematical theory of communication.Bell system technical journal, 27 (3):379–423, 1948

  31. [31]

    Attention is all you need.Advances in Neural Information Processing Systems, 2017

    A Vaswani. Attention is all you need.Advances in Neural Information Processing Systems, 2017

  32. [32]

    Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

  33. [33]

    Enhanced transformer with rotary position embedding

    J Su, H Zhang, X Li, J Zhang, and Y RoFormer Li. Enhanced transformer with rotary position embedding. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL- IJCNLP), Association for Computational Linguistics, Online, pages 1–6, 2021

  34. [34]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  35. [35]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational Conference on Machine Learning, pages 7480–7512. PMLR, 2023

  36. [36]

    Neural Combinatorial Optimization with Reinforcement Learning

    Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

  37. [37]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  38. [38]

    Ax- ial attention in multidimensional transformers,

    Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers.arXiv preprint arXiv:1912.12180, 2019

  39. [39]

    Meta llama 4: The future of multimodal ai.Available at SSRN 5208228, 2025

    Ajit Singh. Meta llama 4: The future of multimodal ai.Available at SSRN 5208228, 2025

  40. [40]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023

  41. [41]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  42. [42]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025. 21

  43. [43]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

  44. [44]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

  45. [45]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020

  46. [46]

    Oasis: A universe in a transformer

    Decart and Etched. Oasis: A universe in a transformer. https://www.decart.ai/articles/oasis-interactive-ai-video-game-model, 2024

  47. [47]

    Lucid v1: Real-Time Latent World Models

    Rami Seid and Alberto Hojel. Lucid v1: Real-time latent world models. International Journal of Current Research in Science, Engineering & Technology, 2024

  48. [48]

    MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388, 2025

  49. [49]

    Autonomous Improvement of Instruction Following Skills via Foundation Models

    Zhiyuan Zhou, Pranav Atreya, Abraham Lee, Homer Walke, Oier Mees, and Sergey Levine. Autonomous improvement of instruction following skills via foundation models. arXiv preprint arXiv:2407.20635, 2024

  50. [50]

    FVD: A New Metric for Video Generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019

  51. [51]

    The malmo platform for artificial intelligence experimentation

    Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247. Citeseer, 2016

  52. [52]

    The MineRL Competition on Sample Efficient Reinforcement Learning Using Human Priors

    William H Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, et al. The minerl competition on sample efficient reinforcement learning using human priors. arXiv e-prints, pages arXiv–1904, 2019

  53. [53]

    MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

    Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, et al. Minerl diamond 2021 competition: Overview, results, and lessons learned. NeurIPS 2021 Competitions and Demonstrations Track, pages 13–28, 2022

  54. [54]

    Multi-Task Curriculum Learning in a Complex, Visual, Hard-Exploration Domain: Minecraft

    Ingmar Kanitscheider, Joost Huizinga, David Farhi, William Hebgen Guss, Brandon Houghton, Raul Sampedro, Peter Zhokhov, Bowen Baker, Adrien Ecoffet, Jie Tang, et al. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. arXiv preprint arXiv:2106.14876, 2021

  55. [55]

    STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

    Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila McIlraith. Steve-1: A generative model for text-to-behavior in minecraft. Advances in Neural Information Processing Systems, 36:69900–69929, 2023

  56. [56]

    GROOT: Learning to Follow Instructions by Watching Gameplay Videos

    Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, and Yitao Liang. Groot: Learning to follow instructions by watching gameplay videos. arXiv preprint arXiv:2310.08235, 2023

  57. [57]

    MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

    Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, and Jing Shao. Minedreamer: Learning to follow instructions via chain-of-imagination for simulated-world control. arXiv preprint arXiv:2403.12037, 2024

  58. [58]

    Unsupervised Skill-Discovery and Skill-Learning in Minecraft

    Juan José Nieto, Roger Creus, and Xavier Giro-i-Nieto. Unsupervised skill-discovery and skill-learning in minecraft. arXiv preprint arXiv:2107.08398, 2021

  59. [59]

    Dyna, an Integrated Architecture for Learning, Planning, and Reacting

    Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991

  60. [60]

    Pilco: A model-based and data-efficient approach to policy search

    Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011

  61. [61]

    Embed to control: A locally linear latent dynamics model for control from raw images

    Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015

  62. [62]

    Deep Visual Foresight for Planning Robot Motion

    Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017

  63. [63]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  64. [64]

    Learning Latent Dynamics for Planning from Pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018

  65. [65]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  66. [66]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  67. [67]

    Transformers Are Sample Efficient World Models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample efficient world models. arXiv preprint arXiv:2209.00588, 2022

  68. [68]

    Transformer-Based World Models Are Happy with 100k Interactions

    Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions. arXiv preprint arXiv:2303.07109, 2023

  69. [69]

    STORM: Efficient Stochastic Transformer Based World Models for Reinforcement Learning

    Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36:27147–27166, 2023

  70. [70]

    A Survey of Interactive Generative Video

    Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, and Xihui Liu. A survey of interactive generative video. arXiv preprint arXiv:2504.21853, 2025

  71. [71]

    Diffusion Models Are Real-Time Game Engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

  72. [72]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

  73. [73]

    GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025

  74. [74]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  75. [75]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

  76. [76]

    Streamdit: Real-time streaming text-to-video generation

    Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. arXiv preprint arXiv:2507.03745, 2025

  77. [77]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. International Conference on Machine Learning, 2023

  78. [78]

    Improved Techniques for Training Consistency Models

    Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023

  79. [79]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024

  80. [80]

    VideoLCM: Video Latent Consistency Model

    Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model. arXiv preprint arXiv:2312.09109, 2023

Showing first 80 references.