Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
Pith reviewed 2026-05-10 01:32 UTC · model grok-4.3
The pith
Training 3D diffusion models on billions of Minecraft blocks produces generators that create and edit voxel worlds directly in block space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dream-Cubed supplies a large-scale voxel dataset and trains diffusion models that treat blocks as the basic compositional units. These models generate Minecraft-style 3D environments efficiently while supporting direct interactive operations such as inpainting and outpainting from user-provided blocks, with quality measured by an adapted FID metric on rendered views and by human preference tests.
What carries the argument
3D diffusion models that generate by operating directly in the discrete space of individual Minecraft blocks as compositional units.
If this is right
- Models support interactive user workflows such as inpainting and outpainting from blocks supplied by the user.
- Generation runs efficiently because it stays inside the native block representation rather than converting to other formats.
- Large-scale comparisons identify effective choices for discrete versus continuous diffusion and for data composition in voxel tasks.
- Releasing the full dataset, code, and pretrained models provides a base for further work on structured 3D generative modeling.
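To make "blocks as compositional units" concrete, here is a minimal sketch, entirely our illustration rather than the paper's released code: a voxel chunk of categorical block IDs serves directly as the state of a discrete diffusion process, with a uniform-noise forward corruption step. The palette and chunk size are assumptions.

```python
import numpy as np

# Hypothetical palette: each voxel holds one categorical block ID.
PALETTE = ["air", "stone", "dirt", "grass", "water", "oak_log", "oak_leaves"]
V = len(PALETTE)  # vocabulary size

rng = np.random.default_rng(0)

# A 16x16x16 chunk of block IDs is the model's native state; no
# conversion to meshes, point clouds, or latents is required.
chunk = rng.integers(0, V, size=(16, 16, 16))

def corrupt(x0, t, T=1000):
    """Uniform discrete forward process: each voxel is independently
    resampled from the uniform distribution with probability t/T."""
    mask = rng.random(x0.shape) < t / T
    noise = rng.integers(0, V, size=x0.shape)
    return np.where(mask, noise, x0)

x_t = corrupt(chunk, t=500)
# At t=500 roughly half the voxels are resampled; a denoiser would be
# trained to predict the original block IDs from x_t and t.
changed = (x_t != chunk).mean()
```

A trained reverse model then inverts this corruption step by step, staying in block space throughout.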
Where Pith is reading between the lines
- Block-level diffusion could be extended by generating local chunks and assembling them into larger consistent worlds without retraining.
- The same direct-block approach might transfer to other voxel or grid-based domains such as architectural layouts or scientific volume data.
- Efficiency gains from staying in block space could enable real-time or on-device generation loops inside game engines.
Load-bearing premise
The approach rests on two assumptions: that training diffusion models at this scale on a carefully chosen mix of procedural terrain and human-authored maps will produce outputs that are both efficient to generate and semantically meaningful for interactive 3D use, and that an adapted FID metric plus human preference tests will reliably measure that quality.
What would settle it
A human preference study in which participants consistently rate generated worlds as less desirable than real Minecraft maps, or an adapted FID score that shows no reliable separation between real and generated renderings, would show the approach does not deliver controllable, semantically grounded generation.
Original abstract
We introduce Dream-Cubed, a large-scale dataset of Minecraft worlds at voxel resolution, and a family of models using cubes as powerful compositional units for efficient generation of interactive 3D environments. Dream-Cubed comprises tens of billions of tokens from a carefully curated mixture of procedural biome terrain and high-quality human-authored maps. We use this dataset to conduct the first large-scale study of 3D diffusion models for voxel generation, analyzing discrete and continuous diffusion formulations, data compositions, and architectural design choices. Our models operate directly in the space of blocks, enabling efficient and semantically grounded generation while supporting interactive user workflows such as inpainting and outpainting from user-authored blocks. To quantitatively evaluate our models, we adapt the FID metric to assess semantic differences between real and generated world renderings, and validate generation quality through a human preference study. We release the full dataset, code, and all our pretrained models, which we hope will provide a foundation for future research in efficient generative modeling for structured, interactive 3D environments.
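The inpainting/outpainting workflow described in the abstract can be sketched as a RePaint-style conditional sampling loop [25], where user-authored blocks are clamped back in at every denoising step. This is an illustrative sketch with a dummy denoiser; the function names and shapes are ours, not the released API.

```python
import numpy as np

rng = np.random.default_rng(1)
V = 7  # hypothetical block vocabulary size

def denoise_step(x_t, t):
    """Stand-in for one reverse-diffusion step of a trained model.
    Here it just resamples a few voxels so the loop is runnable."""
    out = x_t.copy()
    flip = rng.random(x_t.shape) < 0.1
    out[flip] = rng.integers(0, V, size=int(flip.sum()))
    return out

def inpaint(user_blocks, known_mask, T=50):
    """Sample the unknown region while clamping user-authored blocks.
    known_mask is True wherever the user supplied a block."""
    x = rng.integers(0, V, size=user_blocks.shape)  # start from noise
    for t in range(T, 0, -1):
        x = denoise_step(x, t)
        # Clamp: known voxels always carry the user's blocks, so the
        # model only ever fills in the unknown region around them.
        x[known_mask] = user_blocks[known_mask]
    return x

user = rng.integers(0, V, size=(8, 8, 8))
mask = np.zeros((8, 8, 8), dtype=bool)
mask[:, :4, :] = True  # the user authored the bottom half
result = inpaint(user, mask)
```

Outpainting is the same loop with the known region at the center and the unknown region extending outward.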
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dream-Cubed, a large-scale dataset of Minecraft voxel worlds comprising tens of billions of tokens from a curated mix of procedural biome terrain and human-authored maps. It conducts the first large-scale empirical study of 3D diffusion models for voxel generation, analyzing discrete versus continuous formulations, data compositions, and architectural choices. The models operate directly in block space to enable efficient, semantically grounded generation and support interactive workflows such as inpainting and outpainting from user-provided blocks. Evaluation adapts the FID metric to semantic differences in 2D renderings of real and generated worlds and includes a human preference study. The full dataset, code, and pretrained models are released.
Significance. If the central claims hold, this work provides a valuable open foundation for scalable generative modeling of structured, interactive 3D environments, with direct relevance to games, simulations, and voxel-based design tools. The release of billions of tokens of data and all pretrained models is a clear strength that enables reproducibility and follow-on research. The systematic ablations on diffusion formulations and data mixes offer practical insights for the community working on 3D generative models.
major comments (1)
- [Evaluation] The evaluation section relies on an adapted FID computed on 2D renderings of the 3D voxel outputs together with human preference ratings. It is unclear whether this metric captures 3D-specific properties such as block adjacency consistency, topological coherence, or spatial structure, which are load-bearing for the claims of 'semantically grounded generation' and reliable support for interactive inpainting/outpainting workflows. Without 3D-aware metrics or explicit validation on structural fidelity, the quantitative evidence for the headline efficiency and controllability claims remains incomplete.
minor comments (2)
- [Abstract] The abstract states 'tens of billions of tokens' without precise counts or a breakdown of the procedural versus human-authored portions; adding these statistics and a table summarizing dataset composition would improve clarity and allow readers to assess the data mix.
- [Evaluation] The description of the FID adaptation does not specify the rendering procedure (e.g., camera angles, resolution, or feature extractor) or how semantic differences are quantified; providing these implementation details is necessary for reproducibility.
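For reference, the base FID computation that the paper adapts [14] compares Gaussian fits to feature vectors of real and generated renderings. The sketch below is a generic implementation of that metric, not the paper's; the rendering procedure and feature extractor (replaced here by placeholder feature arrays) are exactly the unspecified pieces the comment asks for.

```python
import numpy as np

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fit to two feature sets.
    feats_*: (n_samples, feature_dim) arrays, e.g. features of an image
    network applied to rendered views of real and generated worlds."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    diff = mu1 - mu2
    # trace of sqrtm(s1 @ s2) via eigenvalues, which are real and
    # nonnegative in theory; clip guards against numerical noise.
    ev = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sum(np.sqrt(np.clip(ev.real, 0.0, None)))
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(2)
same = fid(rng.normal(size=(500, 8)), rng.normal(size=(500, 8)))
shifted = fid(rng.normal(size=(500, 8)), rng.normal(3.0, 1.0, size=(500, 8)))
# Identical distributions score near 0; a shifted one scores far higher.
```

Reporting the camera setup, render resolution, and extractor alongside this formula would make the adapted metric reproducible.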
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the value of the Dream-Cubed dataset and models. We address the evaluation concern point by point below and outline targeted revisions.
Point-by-point responses
- Referee: The evaluation section relies on an adapted FID computed on 2D renderings of the 3D voxel outputs together with human preference ratings. It is unclear whether this metric captures 3D-specific properties such as block adjacency consistency, topological coherence, or spatial structure, which are load-bearing for the claims of 'semantically grounded generation' and reliable support for interactive inpainting/outpainting workflows. Without 3D-aware metrics or explicit validation on structural fidelity, the quantitative evidence for the headline efficiency and controllability claims remains incomplete.
Authors: We acknowledge that dedicated 3D structural metrics are not reported and that this leaves the quantitative support for certain claims incomplete. The adapted FID on 2D renderings serves as a semantic proxy because Minecraft worlds are experienced through their visual and functional structure, and renderings from multiple viewpoints capture adjacency patterns and overall coherence. The human preference study explicitly includes ratings on structural plausibility and usability for editing tasks. Because our models operate natively in discrete voxel space, block adjacency and topology are enforced by the representation itself rather than learned as soft constraints. We will revise the manuscript to add a short subsection with simple quantitative checks (e.g., empirical distributions of neighboring block types in generated versus training data) and to expand the discussion of qualitative inpainting/outpainting results that demonstrate local and global structural fidelity. These additions address the referee's concern without requiring a full redesign of the evaluation protocol.
Revision: partial
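The "simple quantitative checks" proposed in the rebuttal could look like the following sketch (our illustration, with toy data standing in for real chunks): compare the empirical distribution of neighboring block-type pairs in generated versus training volumes via total variation distance.

```python
import numpy as np
from collections import Counter

def adjacency_histogram(chunk):
    """Count ordered (block, neighbor) type pairs along the x, y, z axes."""
    counts = Counter()
    for axis in range(3):
        a = np.moveaxis(chunk, axis, 0)
        for u, v in zip(a[:-1].ravel(), a[1:].ravel()):
            counts[(int(u), int(v))] += 1
    return counts

def total_variation(h1, h2):
    """TV distance between two normalized pair histograms (0 = identical)."""
    n1, n2 = sum(h1.values()), sum(h2.values())
    keys = set(h1) | set(h2)
    return 0.5 * sum(abs(h1[k] / n1 - h2[k] / n2) for k in keys)

rng = np.random.default_rng(3)
train = rng.integers(0, 4, size=(12, 12, 12))
good = rng.integers(0, 4, size=(12, 12, 12))  # same distribution as train
bad = np.zeros((12, 12, 12), dtype=int)       # degenerate: one block type
tv_good = total_variation(adjacency_histogram(train), adjacency_histogram(good))
tv_bad = total_variation(adjacency_histogram(train), adjacency_histogram(bad))
# tv_bad is far larger than tv_good, flagging broken local structure.
```

A low TV distance between generated and training adjacency histograms would be direct, 3D-aware evidence of local structural fidelity.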
Circularity Check
Empirical ML study with no derivation chain or self-referential reductions
full rationale
The paper describes dataset curation from procedural biomes and human maps, training of voxel-space diffusion models, architectural ablations, and evaluation via adapted FID on 2D renderings plus human preference studies. No equations, first-principles derivations, or predictions are present that reduce by construction to fitted inputs, self-citations, or renamed quantities. Central claims rest on external training data, model outputs, and independent human validation rather than internal tautologies. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion schedule and architectural hyperparameters
axioms (1)
- Domain assumption: diffusion models can effectively capture distributions over discrete voxel blocks when trained at large scale.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NeurIPS '21). Curran Associates Inc., 2021.
- [3] Maren Awiszus, Frederik Schubert, and Bodo Rosenhahn. World-GAN: a generative model for Minecraft worlds. In 2021 IEEE Conference on Games (CoG), pages 1–8. IEEE Press, 2021.
- [4] (Continuation of [3].) doi: 10.1109/CoG52621.2021.9619133.
- [5] Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P. Breckon, and Chris G. Willcocks. Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In European Conference on Computer Vision, pages 170–188. Springer, 2022.
- [6] Etched Decart, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. URL https://oasis-model.github.io, 2024.
- [7] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [8] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. MineDojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022.
- [9] Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, and Jiang Bian. Beyond pixel histories: World models with persistent 3D state. arXiv preprint arXiv:2603.03482, 2026.
- [10] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. https://skylion007.github.io/OpenWebTextCorpus/, 2019. Accessed 2026-04-10.
- [11] Djordje Grbic, Rasmus Berg Palm, Elias Najarro, Claire Glanois, and Sebastian Risi. EvoCraft: A new challenge for open-endedness. In International Conference on the Applications of Evolutionary Computation (Part of EvoStar), pages 325–340. Springer, 2021.
- [12] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10686–10696, 2022. doi: 10.1109/CVPR52688.2022.01043.
- [13] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.
- [14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018. URL https://arxiv.org/abs/1706.08500.
- [15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS '20). Curran Associates Inc., 2020.
- [16] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021.
- [17] Justin Jung. Scaffold diffusion: Sparse multi-category voxel structure generation with discrete diffusion. In NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling, 2025. URL https://openreview.net/forum?id=wdsPGJi40p.
- [18] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
- [19] Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusions. arXiv preprint arXiv:2502.06768, 2025.
- [20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
- [21] Lilyyy411. Pyubiomes, a simple (WIP) Python wrapper for cubiomes by cubitect and other seedfinding utilities, 2021. URL https://github.com/lilyyy411/Pyubiomes.
- [22] Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, and Xiaofeng Gao. TeamCraft: A benchmark for multi-modal multi-agent systems in Minecraft. arXiv preprint arXiv:2412.05255, 2024.
- [23] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101.
- [24] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
- [25] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
- [26] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3D point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2837–2845, 2021.
- [27] Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
- [28] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
- [29] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.
- [30] Yuhe Nie, Michael Middleton, Tim Merino, Nidhushan Kanagaraja, Ashutosh Kumar, Zhan Zhuang, and Julian Togelius. Moonshine: Distilling game content generators into steerable generative models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14344–14351, 2025.
- [31] OpenAI. New embedding models and API updates. https://openai.com/index/new-embedding-models-and-api-updates/, January 2024.
- [32] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [33] Querz. mcaselector. URL https://github.com/Querz/mcaselector.
- [34] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [35] Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. XCube: Large-scale 3D generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024.
- [36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [37] Subham S. Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
- [38] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
- [39] Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, and Saining Xie. Solaris: Building a multiplayer video world model in Minecraft. arXiv preprint arXiv:2602.22208, 2026.
- [40] Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K. Titsias. Simplified and generalized masked diffusion for discrete data. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS '24). Curran Associates Inc., 2024.
- [41] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- [42] Zhicong Tang, Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Improved vector quantized diffusion models. arXiv preprint arXiv:2205.16007, 2022. URL https://api.semanticscholar.org/CorpusID:249209888.
- [43] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [44] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- [45] Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang. JARVIS-1: Open-world multi-task agents with memory-augmented multimodal language models. arXiv preprint arXiv:2311.05997, 2023.
- [46] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [47] Linqi Zhou, Yilun Du, and Jiajun Wu. 3D shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.
Appendix excerpts
- A.1 Discussion. Blocks as tokens. A recurring theme across our results is that operating directly on Minecraft's native block representation u...
- Procedural data filtering: we discard samples that are primarily underground via an air-content threshold (retaining chunks with >15% air).
- We discard chunks that contain fewer than 60 village-indicator blocks, such as doors and stairs, ensuring chunks contain structures rather than the natural terrain surrounding a village.
- We discard samples where village-indicator blocks touch the boundary of the volume, ensuring that our model learns to generate complete structures rather than truncated ones. Overall, this data-collection level yields 178,271 village chunks.
- Human data collection: we extract chunk data from human-authored maps by extending the "mcaselector" tool [32], to allow f...
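The filtering rules quoted above can be sketched as predicates over a chunk of categorical block IDs. This is our illustration, not the released pipeline: the thresholds follow the quoted text, while the block-ID constants (`AIR`, `VILLAGE_INDICATORS`) are hypothetical stand-ins.

```python
import numpy as np

AIR = 0
# Hypothetical IDs standing in for village-indicator blocks (doors, stairs).
VILLAGE_INDICATORS = {10, 11}

def keep_chunk(chunk):
    """Apply the three quoted filters to one candidate volume of block IDs."""
    # 1. Discard mostly-underground samples: retain chunks with >15% air.
    if (chunk == AIR).mean() <= 0.15:
        return False
    indicator = np.isin(chunk, list(VILLAGE_INDICATORS))
    # 2. Require at least 60 village-indicator blocks, so the chunk
    #    contains a structure rather than surrounding terrain.
    if indicator.sum() < 60:
        return False
    # 3. Reject volumes where indicators touch the boundary, so only
    #    complete (untruncated) structures are kept.
    boundary = np.ones(chunk.shape, dtype=bool)
    boundary[1:-1, 1:-1, 1:-1] = False
    if (indicator & boundary).any():
        return False
    return True

rng = np.random.default_rng(4)
chunk = rng.integers(0, 4, size=(16, 16, 16))  # air-rich terrain, no indicators
chunk[4:8, 4:8, 4:8] = 10                      # interior "village" blocks
```

Here `keep_chunk(chunk)` accepts the example volume, while an all-stone volume with no air would be rejected by the first rule.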
- Human study instructions: Each candidate contains four terrain chunks. Both candidates are intended to represent the same target biome, which will be shown at the top of the screen. Your task is to choose the better overall set. You will be asked to complete 60 trials as part of this study. When making your choice, use the following priority: 1. Visual quality and match to the target biome: prefer the set that looks more like plausible Minecraft terrain for the given biome and has fewer obvious artifacts, unnatural structures, broken patterns, or other visual errors. 2. Overall preference if still tied in your mind: if the two sets seem equally good with respect to both quality and biome match, cho...
- Analysis: We also construct a pairwise win-rate matrix for each pairing of sources (table 14), showing how each source performed against all others. We finally compute per-biome win rates for each model, reported in table 15. We report 95% Wilson confidence intervals and the number of trials for tables 13 and 14. We also conduct a one-sided binomial test under the null...
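The Wilson intervals and one-sided binomial test mentioned in the excerpt are standard statistics and can be computed as below. This is a generic formulation, not the paper's code; the example counts (42 wins in 60 trials) are invented for illustration.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate of wins out of n trials
    (z = 1.96 gives approximately 95% coverage)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def binom_test_one_sided(wins, n, p0=0.5):
    """P(X >= wins) under Binomial(n, p0): one-sided test against the
    null hypothesis of no preference between the two candidates."""
    return sum(math.comb(n, k) * p0**k * (1 - p0) ** (n - k)
               for k in range(wins, n + 1))

lo, hi = wilson_interval(42, 60)   # hypothetical: 42 wins in 60 trials
pval = binom_test_one_sided(42, 60)
# An interval excluding 0.5 together with a small p-value would both
# indicate a reliable preference for one source over the other.
```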