pith. sign in

arxiv: 2605.19341 · v1 · pith:HPZPRGEPnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI· cs.LG· stat.ML

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LGstat.ML
keywords hallucinationbenchmarklanguage modelsreference worldsynthetic environmentsstate trackingabstentiongridworld
5
0 comments X p. Extension
pith:HPZPRGEP Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{HPZPRGEP}

Prints a linked pith:HPZPRGEP badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

HalluWorld defines hallucinations as false observable claims against a fully specified reference world.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks fragment hallucination measurement across summarization, question answering, and other tasks, often relying on human labels or non-reproducible settings. The paper introduces HalluWorld, which grounds evaluation in explicit reference worlds where any model output that contradicts the world counts as a hallucination. Synthetic and semi-synthetic environments such as gridworlds, chess, and terminal tasks make the reference world fully known, the model's observations controllable, and labels automatic. Frontier models nearly solve direct perceptual errors but continue to fail at multi-step state tracking, causal simulation, and deciding when to abstain. The pattern of failures indicates that hallucinations stem from separate mechanisms rather than one general deficit.

Core claim

The paper establishes that a model hallucinates precisely when it produces an observable claim false with respect to an explicitly specified reference world. Synthetic and semi-synthetic environments are built in which the full world state is known, the model's partial view is controlled, and hallucination labels are generated without human annotation. These environments span gridworlds, chess, and realistic terminal tasks, allowing systematic variation of complexity, observability, temporal dynamics, and source conflicts while producing fine-grained error categories. Evaluation of frontier and open-weight models shows perceptual hallucinations on directly observed facts are largely solved,

What carries the argument

The reference-world formulation that treats hallucination as any observable claim contradicting the fully known world state, allowing automatic labeling in controlled synthetic settings.

If this is right

  • Hallucinations can be measured reproducibly across domains without fixed memorized references or human annotation.
  • Error types can be separated into perceptual, multi-step tracking, and abstention failures for targeted study.
  • Varying world complexity and observability reveals which capabilities remain unsolved even with extended thinking.
  • Consistent failure profiles across models suggest hallucinations are not a single capability problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If synthetic reference worlds capture essential dynamics, interventions that reduce specific error categories here could be tested for transfer to production systems.
  • The same reference-world approach might be applied to agentic or multi-turn interactions to measure cumulative state errors over time.
  • Combining automatic labels with targeted human review on edge cases could refine the benchmark while keeping scalability.

Load-bearing premise

An explicitly specified reference world can be built and used to produce automatic hallucination labels that match the same phenomenon observed in real open-ended language model use.

What would settle it

Human judges marking the same model outputs on overlapping real-world examples would show low agreement with the automatic HalluWorld labels, or model improvements on HalluWorld would fail to reduce hallucinations on standard non-synthetic benchmarks.

Figures

Figures reproduced from arXiv: 2605.19341 by Emmy Liu, Karan Singh, Michael Yu, Sachin Kumar, Steven Y. Feng, Varun Gangal, Zhuofu Tao.

Figure 1
Figure 1. Figure 1: The HALLUWORLD benchmark spans three domains (gridworlds, chess, and terminals) and tests models using five probe categories targeting distinct cognitive skills: Causal (C) tests understanding of cause-effect relationships, Perceptual (P) tests spatial reasoning and object tracking, Memory (M) tests retention of past observations, Uncertainty (U) tests reasoning under partial observability, and Compound/X … view at source ↗
Figure 2
Figure 2. Figure 2: HALLUWORLD-TERMINAL Hallucination Rate vs. Trajectory Depth. Trajectory depth is defined as the index of the trigger command, i.e., the number of shell commands executed before the probe injection. Hallucination rates are smoothed using a Gaussian kernel in log-depth space. Context is truncated to 60k characters using middle truncation (retaining the start and end of trajectories). Uncertainty probes (red)… view at source ↗
Figure 3
Figure 3. Figure 3: Navigation impact on trajectory depth effects. (a) All 13 models show differential slopes between INNAV and CTRLSTATIC modes, revealing model-specific balance between epistemic grounding and cognitive load. Positive difference (red) indicates cognitive load dominates; negative (green) indicates grounding dominates. Frontier reasoning models paradoxically show MORE load accumulation, while efficiency-optimi… view at source ↗
Figure 4
Figure 4. Figure 4: Trajectory depth effects by world category for HALLUWORLD-GRID. Depth dynamics vary considerably across domains, reflecting the balance between epistemic grounding and cognitive load. Perception and Causal worlds show reversed effects (grounding dominates: navigation helps), Memory and Uncertainty worlds show minimal depth effects (balanced), while X-level multi-zone facilities show large cognitive load ac… view at source ↗
Figure 5
Figure 5. Figure 5: Serialization comparison in In-Navigation mode for HALLUWORLD-GRID. Each bar shows the hallucination difference between the winning and losing serializer, sorted by margin. Green indicates canonical serializer wins (15/30 worlds); red indicates non-canonical wins (15/30 worlds). Notable inversions include Perception worlds (symbolic dominates despite grid being canonical: P3_rotation_challenge 21.5%, P2_co… view at source ↗
Figure 6
Figure 6. Figure 6: Level Editor interfaces across modalities. (a) Web-based editor showing C6 Flood Fire Escape’s complex mechanics. (b) Terminal ncurses editor for SSH-only environments. (c) Custom created new level demonstrating rapid prototyping. All export to the same plain-text .txt format. J.2 Trajectory Recorder: Ground Truth by Construction The Trajectory Recorder is a Flask based interactive web application (accessi… view at source ↗
Figure 7
Figure 7. Figure 7: Trajectory Recorder workflow. (a) Interactive web UI for manual navigation and probe annotation. (b) Deterministic JSON format consumed by evaluation harness. Ground truth determined by operator’s direct observation, not post-hoc adjudication. wrong flood timing claims (“flood reaches fire at step 2”, actual: step 4)—all configured visually in the web editor. 3. Experience: Trajectory Recorder loads c6_flo… view at source ↗
read the original abstract

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces HalluWorld, an extensible benchmark for LLM hallucinations grounded in explicit reference-world models. Hallucination is defined as producing an observable claim false with respect to a fully specified world state. The authors construct synthetic and semi-synthetic environments (gridworlds, chess, realistic terminal tasks) that control complexity, observability, temporal change, and source-conflict policy, with automatic label generation. Evaluations of frontier and open-weight models show perceptual hallucinations on directly observed information are near-solved, while multi-step state tracking and causal forward simulation remain difficult even with extended thinking; models also struggle with abstention in terminal settings. The uneven error profile is taken to indicate distinct failure modes rather than a single capability gap.

Significance. If the reference-world construction and automatic labeling hold, the benchmark supplies a reproducible, scalable method for dissecting hallucinations that avoids reliance on human annotation or potentially memorized fixed references. The automatic, fine-grained error categorization and controlled variation of observability and source conflict are clear strengths that enable targeted diagnosis. The reported consistent patterns across models lend support to the claim of multiple underlying mechanisms and could guide more precise mitigation strategies.

major comments (1)
  1. The central empirical claim that perceptual errors are near-solved while multi-step state tracking and causal simulation remain difficult (reported in the evaluation across gridworld, chess, and terminal settings) rests on the untested assumption that error distributions in these fully specified synthetic worlds match those in open-ended, real-world LLM use. No cross-benchmark correlation with existing human-annotated hallucination datasets or human validation of label fidelity is described, which is load-bearing for interpreting the patterns as general insights rather than environment-specific observations.
minor comments (2)
  1. The abstract refers to 'extended thinking' without specifying whether this denotes chain-of-thought prompting, self-consistency, or a particular inference-time method; a short clarification in the methods or evaluation section would improve reproducibility.
  2. Implementation details for the semi-synthetic terminal tasks (e.g., how the reference world state is encoded and how observable claims are extracted from model outputs) could be expanded or accompanied by pseudocode to aid independent reproduction.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for raising this important point about the scope of our empirical claims. We address the major comment below and describe the revisions we will make.

read point-by-point responses
  1. Referee: The central empirical claim that perceptual errors are near-solved while multi-step state tracking and causal simulation remain difficult (reported in the evaluation across gridworld, chess, and terminal settings) rests on the untested assumption that error distributions in these fully specified synthetic worlds match those in open-ended, real-world LLM use. No cross-benchmark correlation with existing human-annotated hallucination datasets or human validation of label fidelity is described, which is load-bearing for interpreting the patterns as general insights rather than environment-specific observations.

    Authors: We appreciate this observation. HalluWorld is deliberately constructed around fully specified reference worlds to support automatic, reproducible labeling and controlled variation of factors such as observability and source conflict—features that existing human-annotated benchmarks often lack. The consistent separation between near-solved perceptual errors and persistent difficulties in multi-step tracking and causal simulation across three qualitatively different environments (gridworlds, chess, and terminal tasks) is presented as evidence for distinct failure modes within these controlled settings. We do not claim that the precise error distributions transfer identically to open-ended real-world use. We agree that an explicit discussion of this scope would strengthen the manuscript and will add a dedicated limitations subsection in the revised version that (a) clarifies the intended role of HalluWorld as a diagnostic rather than exhaustive proxy and (b) outlines potential future work on cross-benchmark correlations. No new experiments are required for this clarification. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction is self-contained

full rationale

The paper introduces HalluWorld as a new benchmark using explicitly specified reference worlds (gridworlds, chess, terminal tasks) to generate automatic hallucination labels by checking claims against known world states. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. The central claims rest on the construction of synthetic environments and empirical model evaluations, which are independent of any prior fitted quantities or author-specific uniqueness theorems. Results are presented as observed patterns rather than predictions forced by the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces new synthetic environments and an automatic labeling procedure but does not rely on fitted parameters or new physical entities. It rests on the domain assumption that hallucinations can be operationalized as observable false claims relative to a fully specified world.

axioms (1)
  • domain assumption A model hallucinates when it produces an observable claim that is false with respect to the reference world.
    This definition is introduced in the abstract as the grounding for the entire benchmark.

pith-pipeline@v0.9.0 · 5843 in / 1448 out tokens · 41093 ms · 2026-05-20T05:49:11.168168+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 3 internal anchors

  1. [1]

    Hallucination of Multimodal Large Language Models: A Survey

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of Multimodal Large Language Models: A Survey. 4 2025

  2. [2]

    HalluLens: LLM Hallucination Benchmark

    Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. HalluLens: LLM Hallucination Benchmark. 4 2025

  3. [3]

    FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

    Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, and Amin Ahmad. FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs. 10 2024

  4. [4]

    A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649:1139–1146, 2026

    Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649:1139–1146, 2026. doi: 10.1038/s415 86-025-09962-4. URLhttps://doi.org/10.1038/s41586-025-09962-4

  5. [5]

    Benchmarking Large Language Models in Retrieval-Augmented Generation

    Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking Large Language Models in Retrieval-Augmented Generation. 12 2023

  6. [6]

    BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning

    Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning. 12 2019

  7. [7]

    Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks

    Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan Terry. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. In Advances in Neural Information Processing Systems 36, New Orleans, LA, USA, December 2023

  8. [8]

    Transformers as Soft Reasoners over Language

    Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as Soft Reasoners over Language. 5 2020. 10

  9. [9]

    TextWorld: A Learning Environment for Text-based Games

    Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Ruo Yu Tao, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. TextWorld: A Learning Environment for Text-based Games. 11 2019

  10. [10]

    Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? 7 2024

  11. [11]

    Standard: Portable game notation specification and implementation guide

    Steven J Edwards, SD Forsyth, J Stanback, and A Saremba. Standard: Portable game notation specification and implementation guide. 1994.URL https://ia802908.us.archive.org/26/items/pgn-standard-1994-03-12/PGN_standard_1994-03- 12.txt, 1994

  12. [12]

    Feng, Jessica Huynh, Chaitanya Prasad Narisetty, Eduard Hovy, and Varun Gangal

    Steven Y . Feng, Jessica Huynh, Chaitanya Prasad Narisetty, Eduard Hovy, and Varun Gangal. SAPPHIRE: Approaches for enhanced concept-to-text generation. In Anya Belz, Angela Fan, Ehud Reiter, and Yaji Sripada, editors,Proceedings of the 14th International Conference on Natural Language Generation, pages 212–225, Aberdeen, Scotland, UK, August 2021. Associ...

  13. [13]

    Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, and Varun Gangal

    Steven Y . Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, and Varun Gangal. Retrieve, caption, generate: Visual grounding for enhancing commonsense in text generation models.Proceedings of the AAAI Conference on Artificial Intelligence, 36(10): 10618–10626, Jun. 2022. doi: 10.1609/aaai.v36i10.21306. URL https://ojs.aaai.org/i n...

  14. [14]

    Feng, Vivek Khetan, Bogdan Sacaleanu, Anatole Gershman, and Eduard Hovy

    Steven Y . Feng, Vivek Khetan, Bogdan Sacaleanu, Anatole Gershman, and Eduard Hovy. CHARD: Clinical health-aware reasoning across dimensions for text generation models. In Andreas Vlachos and Isabelle Augenstein, editors,Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 313–327, Dubrovnik, C...

  15. [15]

    Feng, Noah D

    Steven Y . Feng, Noah D. Goodman, and Michael C. Frank. Is child-directed speech effective training data for language models? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22055–22071, Miami, Florida, USA, November 2024. Association for Computatio...

  16. [16]

    Feng, Alvin W

    Steven Y . Feng, Alvin W. M. Tan, and Michael C. Frank. Baby scale: Investigating models trained on individual children’s language input, 2026. URL https://arxiv.org/abs/2603 .29522

  17. [17]

    RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

    Robert Friel, Masha Belyi, and Atindriyo Sanyal. RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. 1 2025

  18. [18]

    Hallu- sionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallu- sionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. 3 2024

  19. [19]

    World Models

    David Ha and Jürgen Schmidhuber. World Models. 5 2018. doi: 10.5281/zenodo.1207631

  20. [20]

    Learning Latent Dynamics for Planning from Pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. 6 2019

  21. [21]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. 3 2020

  22. [22]

    Tracking World States with Language Models: State-Based Evaluation Using Chess

    Romain Harang, Jason Naradowsky, Yaswitha Gujju, and Yusuke Miyao. Tracking World States with Language Models: State-Based Evaluation Using Chess. 8 2025. 11

  23. [23]

    Tracking the World State with Recurrent Entity Networks

    Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the World State with Recurrent Entity Networks. 5 2017

  24. [24]

    Feng, and Michael C

    Jennifer Hu, Alvin Wei Ming Tan, Steven Y . Feng, and Michael C. Frank. Language production is harder than comprehension for children and language models. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025. URL https://escholarship.o rg/uc/item/5rz8b9jg

  25. [25]

    Findings of the second babylm challenge: Sample-efficient pretraining on developmentally plausible corpora

    Michael Y Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Ryan Cotterell, Leshem Choshen, Alex Warstadt, and Ethan Gotlieb Wilcox. Findings of the second babylm challenge: Sample-efficient pretraining on developmentally plausible corpora. InThe 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learn...

  26. [26]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 11 2024. doi: 10.1145/3703155

  27. [27]

    The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input

    Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshte...

  28. [28]

    Carlos E

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation. 7 2024. doi: 10.1145/3571730

  29. [29]

    BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information

    Mehran Kazemi, Quan Yuan, Deepti Bhatia, Najoung Kim, Xin Xu, Vaiva Imbrasaite, and Deepak Ramachandran. BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information. 6 2023

  30. [30]

    ACPBench: Reasoning about Action, Change, and Planning

    Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi. ACPBench: Reasoning about Action, Change, and Planning. 2 2026. doi: 10.1609/aaai.v39i25.34857

  31. [31]

    Evaluating the Factual Consistency of Abstractive Text Summarization

    Wojciech Kry´sci´nski, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the Factual Consistency of Abstractive Text Summarization. 10 2019

  32. [32]

    Hallucina- tions in Neural Machine Translation, 2019

    Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. Hallucina- tions in Neural Machine Translation, 2019. URL https://openreview.net/forum?id=Sk xJ-309FQ

  33. [33]

    Li, Zifan Carl Guo, and Jacob Andreas

    Belinda Z. Li, Zifan Carl Guo, and Jacob Andreas. (How) Do Language Models Track State? 10 2025

  34. [34]

    HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

    Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. 10 2023

  35. [35]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. 10 2023

  36. [36]

    Lichess open database, 2022

    Lichess. Lichess open database, 2022. URL https://database.lichess.org/. Accessed: 2026-05-04

  37. [37]

    CommonGen: A constrained text generation challenge for generative commonsense reasoning

    Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November

  38. [38]

    doi: 10.18653/v1/2020.findings-emnlp.165

    Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.165. URLhttps://aclanthology.org/2020.findings-emnlp.165/. 12

  39. [39]

    Learning to Model the World with Language

    Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to Model the World with Language. 5 2024

  40. [40]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. 5 2022

  41. [41]

    arXiv preprint arXiv:2505.21523 (2025) 13

    Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models, 2025. URLhttps://arxiv.org/abs/2505.21523

  42. [42]

    Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, and Steven Y . Feng. A unified definition of hallucination: It’s the world model, stupid!, 2026. URLhttps://arxiv.org/abs/2512.21577

  43. [43]

    AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

    Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, and Ran He. AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents. 1 2026

  44. [44]

    The babyview dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences.arXiv preprint arXiv:2406.10447, 2024

    Bria Long, Robert Z Sparks, Violet Xiang, Stefan Stojanov, Zi Yin, Grace E Keene, Alvin WM Tan, Steven Y Feng, Chengxu Zhuang, Virginia A Marchman, et al. The babyview dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences.arXiv preprint arXiv:2406.10447, 2024

  45. [45]

    On Faithfulness and Factuality in Abstractive Summarization

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. 5 2020

  46. [46]

    Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

  47. [47]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

  48. [48]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants. 11 2023

  49. [49]

    FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. 10 2023

  50. [50]

    RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval- Augmented Language Models

    Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval- Augmented Language Models. 5 2024

  51. [51]

    Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

    Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics. 7 2021

  52. [52]

    Language models use lookbacks to track beliefs

    Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, and Atticus Geiger. Language models use lookbacks to track beliefs. InThe Fourteenth International Conference on Learning Representations, 2026. 13

  53. [53]

    HALoGEN: Fantastic LLM hallucinations and where to find them

    Abhilasha Ravichander, Shrusti Ghela, David Wadden, and Yejin Choi. HALoGEN: Fantastic LLM hallucinations and where to find them. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1402–1425, Vienna, A...

  54. [54]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucination in Image Captioning. 3 2019

  55. [55]

    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. 2 2020. doi: 10.1038/s41586-020-03051-4

  56. [56]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. 3 2021

  57. [57]

    Linearly decoding refused knowledge in aligned language models.arXiv preprint arXiv:2507.00239, 2025

    Aryan Shrivastava and Ari Holtzman. Linearly decoding refused knowledge in aligned language models.arXiv preprint arXiv:2507.00239, 2025

  58. [58]

    Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, and Steven Y . Feng. To memorize or to retrieve: Scaling laws for rag-considerate pretraining, 2026. URL https://arxiv.org/abs/2604.00715

  59. [59]

    Hamilton

    Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text. 9 2019

  60. [60]

    ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

    Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, and Yu Cheng. ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM. 8 2024

  61. [61]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning Large Multimodal Models with Factually Augmented RLHF. 9 2023

  62. [62]

    MIT press Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  63. [63]

    ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language

    Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language. 6 2021

  64. [64]

    Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmark- ing

    Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson Liu, Reut Tsarfaty, and Dafna Shahaf. Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmark- ing. 11 2021

  65. [65]

    Gridworlds as testbeds for planning with incomplete information

    Craig Tovey and Sven Koenig. Gridworlds as testbeds for planning with incomplete information. InAAAI/IAAI, pages 819–824, 2000

  66. [66]

    PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change

    Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. 11 2023

  67. [67]

    An Audit on the Perspectives and Challenges of Hallucinations in NLP

    Pranav Narayanan Venkit, Tatiana Chakravorti, Vipul Gupta, Heidi Biggs, Mukund Srinath, Koustava Goswami, Sarah Rajtmajer, and Shomir Wilson. An Audit on the Perspectives and Challenges of Hallucinations in NLP. 9 2024

  68. [68]

    Asking and Answering Questions to Evaluate the Factual Consistency of Summaries

    Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries. 4 2020

  69. [69]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. 2 2024. 14

  70. [70]

    Resolving Knowledge Conflicts in Large Language Models

    Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. Resolving Knowledge Conflicts in Large Language Models. 10 2024

  71. [71]

    HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

    Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, and Golnaz Ghiasi. HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning. 7 2024

  72. [72]

    Findings of the babylm challenge: Sample-efficient pretraining on developmentally plausible corpora

    Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjabe, Adina Williams, Tal Linzen, et al. Findings of the babylm challenge: Sample-efficient pretraining on developmentally plausible corpora. InProceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Le...

  73. [73]

    Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov

    Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. 12 2015

  74. [74]

    Hallucination diversity-aware active learning for text summarization

    Yu Xia, Xu Liu, Tong Yu, Sungchul Kim, Ryan Rossi, Anup Rao, Tung Mai, and Shuai Li. Hallucination diversity-aware active learning for text summarization. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8665–8677, 2024

  75. [75]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. 5 2024

  76. [76]

    Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

    Linda Zeng, Steven Y . Feng, and Michael C. Frank. Bringing up a bilingual babylm: In- vestigating multilingual language acquisition using small-scale models, 2026. URL https: //arxiv.org/abs/2603.29552

  77. [77]

    Language-Guided World Models: A Model-Based Approach to AI Control

    Alex Zhang, Khanh Nguyen, Jens Tuyls, Albert Lin, and Karthik Narasimhan. Language-Guided World Models: A Model-Based Approach to AI Control. 9 2024

  78. [78]

    MIRAGE- Bench: LLM Agent is Hallucinating and Where to Find Them

    Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, and Dawn Song. MIRAGE- Bench: LLM Agent is Hallucinating and Where to Find Them. 7 2025

  79. [79]

    Is there a green ball in your current field of view?

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents. 4 2024. 15 A HALLUWORLDQualitative Examples A.1 HALLUWORLD-GRID Component Details World:P1 Dense ArrayGrid View (Canonical Locati...

  80. [80]

    Navigation activelyhelps state tracking in these domains

    Grounding-dominated regimes(Perception, Causal): Negative slopes (-1.0% to -1.8% per quintile) indicate that additional observations provide more benefit (disambiguation of spatial/causal relationships) than cost (working memory load). Navigation activelyhelps state tracking in these domains. 2.Balanced regimes(Memory, Uncertainty): Near-zero slopes (+0.2...

Showing first 80 references.