pith. sign in

arxiv: 2606.27326 · v1 · pith:XXJGFHIYnew · submitted 2026-06-25 · 💻 cs.LG · cs.CV· cs.RO

Hallucination in World Models is Predictable and Preventable

Pith reviewed 2026-06-26 04:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CVcs.RO
keywords world modelshallucinationdata coveragevisual predictionreinforcement learninggenerative modelscuriosity rewardsmodel adaptation
0
0 comments X

The pith

Hallucination in world models arises from low data coverage in state-action space and the same lightweight signals can both detect and fix it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that generative world models produce fluent but inaccurate rollouts mainly because certain regions of possible states and actions receive too little training data. Three distinct failure modes are tied to different stages of the generation pipeline, and each can be flagged in advance by simple signals derived from coverage gaps. Coverage-aware sampling during initial training and predictor-driven curiosity rewards for online data collection then let a pretrained model adapt to completely new environments using only about 50 real trajectories. If this view is correct, hallucination stops being an opaque architectural flaw and becomes a measurable, fixable data problem.

Core claim

Hallucination concentrates in low-coverage regions of the state-action space. On the new 427-hour MMBench2 dataset the authors identify three modes—perceptual, action-marginalized, and scene-diverging—each anchored to a pipeline stage. Three lightweight data-centric signals accurately predict where rollouts will diverge from ground-truth dynamics. Coverage-aware sampling closes gaps at training time; the same signals used as curiosity rewards close gaps online, enabling data-efficient adaptation to unseen environments with as few as 50 trajectories. The findings establish that hallucination is inherently a coverage issue and that detection signals double as mitigation tools.

What carries the argument

Coverage-based hallucination predictors that both flag low-coverage regions and supply curiosity rewards for targeted data collection.

If this is right

  • Hallucination rates drop when training data is actively balanced toward low-coverage regions.
  • The same predictors enable reliable finetuning to new environments with only tens of real trajectories.
  • Three separate hallucination modes can be isolated and addressed at their respective pipeline stages.
  • Detection and mitigation become two uses of one set of coverage signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to other autoregressive generative models whose errors also cluster in under-represented input regions.
  • Systematic coverage measurement could become a standard diagnostic for any learned dynamics model used in planning or control.
  • The MMBench2 benchmark supplies a concrete testbed for comparing coverage-aware methods against purely architectural fixes.

Load-bearing premise

That the observed hallucinations are caused primarily by insufficient coverage rather than by model capacity, optimization choices, or other factors.

What would settle it

A controlled test in which the proposed signals fail to predict actual rollout errors on held-out simulator trajectories, or in which hallucinations occur at comparable rates inside and outside the identified low-coverage regions.

Figures

Figures reproduced from arXiv: 2606.27326 by Nicklas Hansen, Xiaolong Wang.

Figure 1
Figure 1. Figure 1: Hallucination. We categorize hallucinations as three distinct failure modes: perceptual, action marginalization, and scene divergence, and develop methods for detection and mitigation. In this work, we set out to characterize different types of hallucinations, predict when they will occur using signals internal to the model, and use that information to close the underlying cover￾age gap. We explore two con… view at source ↗
Figure 2
Figure 2. Figure 2: MMBench2 tasks. A sample of 36 of the 210 tasks in MMBench2, illustrating the visual and morphological diversity of the corpus. All observations are 224×224 RGB frames. 2 MMBench2: A Dataset for Visual World Modeling To fully understand hallucination in large generative world models, it is necessary to have control over the training process as well as access to its training data, but (to our knowledge) no … view at source ↗
Figure 3
Figure 3. Figure 3: Dataset composition. Per-task frame counts across the 210 tasks in our training corpus (20M frames), sorted in descending order and colored by domain. The distribution is heavy-tailed: the top 20 tasks account for 26% of all frames (dominated by Atari) while the bottom 20 contribute only 0.7% (short-horizon manipulation tasks). The dashed line marks the per-task median (65k). matching over spatial latent t… view at source ↗
Figure 4
Figure 4. Figure 4: Data coverage and hallucinations. From left to right: sample frame, state density of key agent/object position, and tokenizer round-trip residual ur of the world model. We use dark blue and dark purple to denote high state density and high ur, respectively. Hallucinations concentrate in low-coverage regions, especially near the periphery of the visited state distribution. Controlling for scene motion. In p… view at source ↗
Figure 5
Figure 5. Figure 5: Our three hallucination predictors track the same realized rollout error. Each point corresponds to one of 9k held-out 24-frame sequences from any of 200 training tasks, computed using our pretrained base model; the purple curve shows the median for each of 8 bins. Evaluation. We evaluate our world model and its variants across four successive training phases: (1) action-conditioned pretraining without rew… view at source ↗
Figure 6
Figure 6. Figure 6: Data coverage by collection method. We show state densities for three online data collection policies (expert, curiosity, and human) across two tasks in the unseen task set [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of task domains (1 of 2). We show sample tasks from each of the 10 task domains that we consider. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of task domains (2 of 2). We show sample tasks from each of the 10 task domains that we consider. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Web interface for human play data collection. We develop a simple web interface for data collection. A human user interacts with the environment using the keyboard, and interaction data is saved and later used for world model training. We collect a total of 1,400 trajectories using this interface. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Data coverage by collection method. We show state densities for five online data col￾lection policies (no-op, random, expert, curiosity, and human) across three additional tasks besides those in [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
read the original abstract

Modern generative world models render increasingly realistic action-controllable futures, yet they frequently hallucinate: rollouts remain visually fluent while drifting from the ground-truth dynamics. We hypothesize that hallucination concentrates in low-coverage regions of the state-action space, where lightweight data-centric signals can both detect it and guide mitigation. To test this, we introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling with ground-truth actions, rewards, and live simulators, and train a 350M-parameter world model on it. We identify three distinct hallucination modes: perceptual, action-marginalized, and scene-diverging -- each anchored to a different stage of the pipeline, and develop three signals that accurately predict where the model will fail. To close coverage gaps at training time, we develop a coverage-aware sampling technique; to close them online, our hallucination predictors serve as curiosity rewards for targeted data collection, yielding a data-efficient finetuning recipe that adapts the pretrained world model to entirely unseen environments with as few as 50 real environment trajectories. Overall, our findings reveal that hallucination in world models is inherently a data coverage issue, and that the same signals used to detect it can also be used for mitigation. An interactive web version of our paper is available at https://www.nicklashansen.com/mmbench2

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that hallucination in generative world models concentrates in low-coverage regions of state-action space and can be both detected and mitigated using lightweight data-centric signals. It introduces the MMBench2 dataset (427 hours, 210 tasks with ground-truth actions/rewards/simulators), trains a 350M-parameter world model, identifies three hallucination modes (perceptual, action-marginalized, scene-diverging) tied to pipeline stages, develops three corresponding predictors, proposes coverage-aware sampling at training time, and uses the predictors as curiosity rewards for online data collection, enabling adaptation to unseen environments with as few as 50 trajectories. The central conclusion is that hallucination is inherently a data coverage issue and that the same signals serve for both detection and mitigation.

Significance. If the empirical claims hold, the work supplies a concrete, data-centric recipe for improving reliability of visual world models, which is relevant to model-based RL and robotics. Strengths include the scale of the new MMBench2 dataset, the explicit linkage of hallucination modes to pipeline stages, the dual use of predictors for detection and as rewards, and the reported data efficiency of the 50-trajectory adaptation result. These elements provide testable, practical contributions beyond purely architectural fixes.

minor comments (2)
  1. [Abstract] Abstract: the claims that signals 'accurately predict where the model will fail' and that mitigation succeeds 'with as few as 50 real environment trajectories' are stated without any accompanying metrics, baselines, or statistical details; a one-sentence summary of the key quantitative results should be added for immediate clarity.
  2. [Methods / Signal definitions] The three hallucination modes and their associated signals are described at a high level; explicit pseudocode or equations showing how each signal is computed from coverage statistics (without reference to model outputs) would improve reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, accurate summary of our contributions, and recommendation for minor revision. The highlighted strengths align with our intended focus on data-centric signals for hallucination detection and mitigation.

Circularity Check

0 steps flagged

No significant circularity; empirical validation on new dataset

full rationale

The paper's core argument is an empirical hypothesis tested via construction of MMBench2 (427-hour dataset with ground-truth), training a 350M world model, identification of three hallucination modes tied to pipeline stages, development of three predictive signals, coverage-aware sampling, and 50-trajectory curiosity-driven finetuning. All load-bearing claims (hallucination concentration in low-coverage regions, signals for detection/mitigation) are directly measured against live simulators and held-out environments rather than derived by definition, fitted parameters renamed as predictions, or self-citation chains. No equations, uniqueness theorems, or ansatzes reduce results to inputs by construction. The work is self-contained against external benchmarks (new dataset, quantitative adaptation results).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review limited to abstract; no free parameters, additional axioms, or invented entities can be identified beyond the stated hypothesis.

axioms (1)
  • domain assumption Hallucination concentrates in low-coverage regions of the state-action space
    Explicitly stated as the central hypothesis in the abstract.

pith-pipeline@v0.9.1-grok · 5772 in / 1249 out tokens · 44453 ms · 2026-06-26T04:56:10.064904+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 3 canonical work pages

  1. [1]

    Lillicrap, Jimmy Ba, and Mohammad Norouzi

    Danijar Hafner, Timothy P. Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behav- iors by latent imagination.ArXiv, abs/1912.01603,

  2. [2]

    Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104,

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104,

  3. [3]

    Training agents inside of scalable world models.arXiv preprint arXiv:2509.24527,

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models.arXiv preprint arXiv:2509.24527,

  4. [4]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4246– 4253,

  5. [5]

    Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10,

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10,

  6. [6]

    doi: 10.1145/3703155. Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision...

  7. [7]

    When to trust your model: Model-based policy optimization.ArXiv, abs/1906.08253,

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization.ArXiv, abs/1906.08253,

  8. [8]

    J., Madotto, A., and Fung, P

    doi: 10.1145/3571730. Harini Kannan, Danijar Hafner, Chelsea Finn, and Dumitru Erhan. Robodesk: A multi-task reinforcement learning benchmark.https://github.com/google-research/robodesk,

  9. [9]

    Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643,

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643,

  10. [10]

    Evaluating object hallu- cination in large vision-language models

    11 Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallu- cination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP),

  11. [11]

    Vincent Micheli, Eloi Alonso, and François Fleuret

    URLhttps://arxiv.org/abs/ 2601.02427. Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. InThe Eleventh International Conference on Learning Representations,

  12. [12]

    URLhttps://openreview.net/ forum?id=vhFu1Acb0xb. NVIDIA. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

  13. [13]

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, ...

  14. [14]

    Todorov, T

    doi: 10.1109/IROS.2012.6386109. Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game en- gines,

  15. [15]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin

    URLhttps://arxiv.org/abs/2408.14837. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30,

  16. [16]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  17. [17]

    14 A Related Work World models for control.World models that learn environment dynamics from data have a long history in model-based RL, ranging from abstract latent dynamics models [Ha and Schmidhuber, 2018, Hafner et al., 2020, Hansen et al., 2022, 2024] to high-capacity generative models that render full pixel observations [Micheli et al., 2023, Alonso...

  18. [18]

    Each data source contains 50 trajectories per task

    22 Table 8.Targeted data collection for finetuning on 10 unseen tasks (experttest set).We finetune our world model on a set of 10 seen + 10 unseen tasks, varying data source and finetuning strategy. Each data source contains 50 trajectories per task. The offline results in Table 2 are computed over a test set that consists of trajectories from both expert...