HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

arXiv: 2605.19341 · v1 · pith:HPZPRGEP · submitted 2026-05-19 · cs.CL · cs.AI · cs.LG · stat.ML

Emmy Liu, Varun Gangal, Michael Yu, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3

keywords: hallucination · benchmark · language models · reference world · synthetic environments · state tracking · abstention · gridworld

The pith

HalluWorld defines hallucinations as false observable claims against a fully specified reference world.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks fragment hallucination measurement across summarization, question answering, and other tasks, often relying on human labels or non-reproducible settings. The paper introduces HalluWorld, which grounds evaluation in explicit reference worlds where any model output that contradicts the world counts as a hallucination. Synthetic and semi-synthetic environments such as gridworlds, chess, and terminal tasks make the reference world fully known, the model's observations controllable, and labels automatic. Frontier models nearly solve direct perceptual errors but continue to fail at multi-step state tracking, causal simulation, and deciding when to abstain. The pattern of failures indicates that hallucinations stem from separate mechanisms rather than one general deficit.

Core claim

The paper defines a hallucination as an observable claim that is false with respect to an explicitly specified reference world. It builds synthetic and semi-synthetic environments in which the full world state is known, the model's partial view is controlled, and hallucination labels are generated without human annotation. These environments span gridworlds, chess, and realistic terminal tasks, allowing systematic variation of complexity, observability, temporal dynamics, and source conflicts while producing fine-grained error categories. Evaluation of frontier and open-weight models shows that perceptual hallucinations on directly observed facts are near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult.

What carries the argument

The reference-world formulation, which treats as a hallucination any observable claim that contradicts the fully known world state and so allows automatic labeling in controlled synthetic settings.
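To make the labeling idea concrete, here is a minimal Python sketch of checking one claim against a toy gridworld. It is not the paper's implementation: the `Claim` type, the `label_claim` function, the cell/object encoding, and the label names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Cell = Tuple[int, int]


@dataclass(frozen=True)
class Claim:
    """Hypothetical observable claim: 'the object at `cell` is `obj`'."""
    cell: Cell
    obj: str


def label_claim(world: Dict[Cell, str], visible: Set[Cell],
                claim: Optional[Claim]) -> str:
    """Label one model answer against the reference world.

    world:   full ground-truth state (cell -> object name)
    visible: cells the model was actually shown (its controlled view)
    claim:   the model's claim, or None if the model abstained
    """
    if claim is None:
        return "abstained"
    truth = world.get(claim.cell, "empty")
    if claim.obj == truth:
        return "correct"
    # The claim is false. Split by whether the model was shown the cell.
    if claim.cell in visible:
        return "hallucination:perceptual"
    return "hallucination:unobserved"


# Toy world, invented for this example only.
world = {(0, 0): "key", (2, 1): "door", (3, 3): "fire"}
visible = {(0, 0), (0, 1), (1, 0)}

print(label_claim(world, visible, Claim((0, 0), "key")))      # correct
print(label_claim(world, visible, Claim((0, 0), "door")))     # hallucination:perceptual
print(label_claim(world, visible, Claim((3, 3), "water")))    # hallucination:unobserved
print(label_claim(world, visible, None))                      # abstained
```

Because the label is a pure function of the world state, the view, and the claim, no human annotator is involved, which is the property the summary attributes to the benchmark.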

If this is right

Hallucinations can be measured reproducibly across domains without fixed memorized references or human annotation.
Error types can be separated into perceptual, multi-step tracking, and abstention failures for targeted study.
Varying world complexity and observability reveals which capabilities remain unsolved even with extended thinking.
Consistent failure profiles across models suggest hallucinations are not a single capability problem.
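The separation of error types above amounts to reporting one hallucination rate per probe category instead of a single aggregate. A minimal sketch, assuming each evaluated probe is stored as a `(category, is_hallucination)` pair; that record format and the function name are assumptions, and the sample records are toy inputs, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rates_by_category(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of hallucinated answers for each probe category."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for category, is_hallucination in records:
        totals[category] += 1
        errors[category] += int(is_hallucination)
    return {category: errors[category] / totals[category] for category in totals}


# Toy records for illustration only (not benchmark data).
toy_records = [
    ("perceptual", False), ("perceptual", False),
    ("tracking", True), ("tracking", False),
    ("abstention", True),
]
print(rates_by_category(toy_records))
```

An uneven profile across the returned keys is what the summary reads as evidence for distinct failure modes.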

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If synthetic reference worlds capture essential dynamics, interventions that reduce specific error categories here could be tested for transfer to production systems.
The same reference-world approach might be applied to agentic or multi-turn interactions to measure cumulative state errors over time.
Combining automatic labels with targeted human review on edge cases could refine the benchmark while keeping scalability.

Load-bearing premise

An explicitly specified reference world can be built and used to produce automatic hallucination labels that match the same phenomenon observed in real open-ended language model use.

What would settle it

Human judges marking the same model outputs on overlapping real-world examples would show low agreement with the automatic HalluWorld labels, or model improvements on HalluWorld would fail to reduce hallucinations on standard non-synthetic benchmarks.
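Agreement between human judges and the automatic labels could be quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below is one possible choice, not a procedure described in the paper; the label strings are placeholders.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa between two label sequences over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each rater's marginals.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


human = ["halluc", "ok", "halluc", "ok"]
auto = ["halluc", "ok", "halluc", "ok"]
print(cohens_kappa(human, auto))  # 1.0
```

A kappa near zero on overlapping examples would be the "low agreement" outcome described above.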

Figures

Figures reproduced from arXiv: 2605.19341 by Emmy Liu, Karan Singh, Michael Yu, Sachin Kumar, Steven Y. Feng, Varun Gangal, Zhuofu Tao.

**Figure 1.** The HALLUWORLD benchmark spans three domains (gridworlds, chess, and terminals) and tests models using five probe categories targeting distinct cognitive skills: Causal (C) tests understanding of cause-effect relationships, Perceptual (P) tests spatial reasoning and object tracking, Memory (M) tests retention of past observations, Uncertainty (U) tests reasoning under partial observability, and Compound/X …

**Figure 2.** HALLUWORLD-TERMINAL Hallucination Rate vs. Trajectory Depth. Trajectory depth is defined as the index of the trigger command, i.e., the number of shell commands executed before the probe injection. Hallucination rates are smoothed using a Gaussian kernel in log-depth space. Context is truncated to 60k characters using middle truncation (retaining the start and end of trajectories). Uncertainty probes (red)…
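The two preprocessing steps named in the Figure 2 caption, middle truncation and Gaussian smoothing in log-depth space, can be sketched as follows. The caption gives only the 60k-character budget; the truncation marker, the even head/tail split, the kernel bandwidth, and the requirement that depths be at least 1 are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def middle_truncate(text: str, max_chars: int = 60_000,
                    marker: str = "\n[... truncated ...]\n") -> str:
    """Keep the start and end of `text`, dropping the middle if too long."""
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)
    if keep <= 0:
        raise ValueError("max_chars must exceed the marker length")
    head = keep - keep // 2  # head gets the extra character when keep is odd
    tail = keep // 2
    # Guard tail == 0, because text[-0:] would return the whole string.
    return text[:head] + marker + (text[-tail:] if tail else "")


def smooth_in_log_depth(points: Sequence[Tuple[int, float]],
                        query_depths: Sequence[int],
                        bandwidth: float = 0.5) -> List[float]:
    """Gaussian-kernel weighted average of outcomes over log(depth).

    points: (depth, outcome) pairs with depth >= 1; outcome is 1.0 for a
            hallucinated probe and 0.0 otherwise.
    """
    smoothed = []
    for q in query_depths:
        log_q = math.log(q)
        weights = [
            math.exp(-0.5 * ((math.log(depth) - log_q) / bandwidth) ** 2)
            for depth, _ in points
        ]
        total = sum(weights)
        smoothed.append(
            sum(w * outcome for w, (_, outcome) in zip(weights, points)) / total
        )
    return smoothed


print(len(middle_truncate("x" * 100_000)))  # 60000
```

Smoothing over log(depth) rather than raw depth gives shallow and deep trajectories comparable resolution when depths span several orders of magnitude.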

**Figure 3.** Navigation impact on trajectory depth effects. (a) All 13 models show differential slopes between INNAV and CTRLSTATIC modes, revealing model-specific balance between epistemic grounding and cognitive load. Positive difference (red) indicates cognitive load dominates; negative (green) indicates grounding dominates. Frontier reasoning models paradoxically show MORE load accumulation, while efficiency-optimi…

**Figure 4.** Trajectory depth effects by world category for HALLUWORLD-GRID. Depth dynamics vary considerably across domains, reflecting the balance between epistemic grounding and cognitive load. Perception and Causal worlds show reversed effects (grounding dominates: navigation helps), Memory and Uncertainty worlds show minimal depth effects (balanced), while X-level multi-zone facilities show large cognitive load ac…

**Figure 5.** Serialization comparison in In-Navigation mode for HALLUWORLD-GRID. Each bar shows the hallucination difference between the winning and losing serializer, sorted by margin. Green indicates canonical serializer wins (15/30 worlds); red indicates non-canonical wins (15/30 worlds). Notable inversions include Perception worlds (symbolic dominates despite grid being canonical: P3_rotation_challenge 21.5%, P2_co…

**Figure 6.** Level Editor interfaces across modalities. (a) Web-based editor showing C6 Flood Fire Escape’s complex mechanics. (b) Terminal ncurses editor for SSH-only environments. (c) Custom created new level demonstrating rapid prototyping. All export to the same plain-text .txt format. J.2 Trajectory Recorder: Ground Truth by Construction. The Trajectory Recorder is a Flask-based interactive web application (accessi…

**Figure 7.** Trajectory Recorder workflow. (a) Interactive web UI for manual navigation and probe annotation. (b) Deterministic JSON format consumed by evaluation harness. Ground truth determined by operator’s direct observation, not post-hoc adjudication. … wrong flood timing claims (“flood reaches fire at step 2”, actual: step 4); all configured visually in the web editor. 3. Experience: Trajectory Recorder loads c6_flo…

Original abstract

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HalluWorld offers controlled synthetic benchmarks for hallucination with automatic labels, but real-world generalization needs more evidence.

HalluWorld introduces a benchmark using explicit reference world models in synthetic environments to automatically label and categorize hallucinations. This is a practical step for controlled testing, though how well it captures real LLM hallucinations remains to be seen.

The new part is the reference-world approach combined with environments like gridworlds, chess, and terminal tasks. These allow full specification of the world state, control over what the model sees, and automatic generation of hallucination labels by checking claims against the state. The authors vary factors such as complexity, observability, and source conflicts, and they report results from evaluating several language models. The findings show that perceptual errors on directly observed information are mostly handled by frontier models, but multi-step state tracking, causal simulation, and deciding when to abstain are still problematic. Breaking down the errors into fine-grained categories helps show that different types of hallucinations may stem from separate issues.

A softer area is the connection to actual use cases. The paper's definition focuses on false claims relative to a known reference world, which works well for these setups but might miss hallucinations involving unstated or external information common in open-ended outputs. There is no mention of validating the automatic labels against human judgments or checking consistency with other benchmarks, so the reported patterns could be specific to these synthetic probes. That said, the consistency across models is a positive sign.

This paper targets researchers developing methods to reduce hallucinations who need reproducible and dissectible tests. Anyone interested in evaluation design or capability-specific analysis would get something out of it. The work is grounded enough in its construction and results to warrant a full peer review.

I recommend accepting it for peer review, asking the authors to address the generalization question with additional checks if possible.

Referee Report

1 major / 2 minor

Summary. The paper introduces HalluWorld, an extensible benchmark for LLM hallucinations grounded in explicit reference-world models. Hallucination is defined as producing an observable claim false with respect to a fully specified world state. The authors construct synthetic and semi-synthetic environments (gridworlds, chess, realistic terminal tasks) that control complexity, observability, temporal change, and source-conflict policy, with automatic label generation. Evaluations of frontier and open-weight models show perceptual hallucinations on directly observed information are near-solved, while multi-step state tracking and causal forward simulation remain difficult even with extended thinking; models also struggle with abstention in terminal settings. The uneven error profile is taken to indicate distinct failure modes rather than a single capability gap.

Significance. If the reference-world construction and automatic labeling hold, the benchmark supplies a reproducible, scalable method for dissecting hallucinations that avoids reliance on human annotation or potentially memorized fixed references. The automatic, fine-grained error categorization and controlled variation of observability and source conflict are clear strengths that enable targeted diagnosis. The reported consistent patterns across models lend support to the claim of multiple underlying mechanisms and could guide more precise mitigation strategies.

major comments (1)

The central empirical claim that perceptual errors are near-solved while multi-step state tracking and causal simulation remain difficult (reported in the evaluation across gridworld, chess, and terminal settings) rests on the untested assumption that error distributions in these fully specified synthetic worlds match those in open-ended, real-world LLM use. No cross-benchmark correlation with existing human-annotated hallucination datasets or human validation of label fidelity is described, which is load-bearing for interpreting the patterns as general insights rather than environment-specific observations.
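One way to run the cross-benchmark check requested here is a rank correlation between per-model hallucination rates on HalluWorld and on a human-annotated benchmark. The sketch below computes Spearman's rho with average ranks for ties, using only the standard library. It illustrates the kind of analysis meant, not anything reported in the paper, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `x[i]` and `y[i]` would hold the hallucination rates of the same model on the two benchmarks; a rho close to 1 would suggest the synthetic ranking tracks the human-annotated one.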

minor comments (2)

The abstract refers to 'extended thinking' without specifying whether this denotes chain-of-thought prompting, self-consistency, or a particular inference-time method; a short clarification in the methods or evaluation section would improve reproducibility.
Implementation details for the semi-synthetic terminal tasks (e.g., how the reference world state is encoded and how observable claims are extracted from model outputs) could be expanded or accompanied by pseudocode to aid independent reproduction.
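As an illustration of the kind of pseudocode the second comment asks for, the following Python sketch encodes a terminal reference world as a path-to-content mapping and extracts claims from a structured answer format. Everything here is hypothetical: the `EXISTS/MISSING/UNKNOWN <path>` answer format, the function names, and the label names are invented for the example and are not taken from the paper.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical answer format: one "<VERDICT> <path>" claim per line.
CLAIM_RE = re.compile(r"^(EXISTS|MISSING|UNKNOWN)\s+(\S+)$", re.MULTILINE)


def extract_claims(output: str) -> List[Tuple[str, str]]:
    """Pull (verdict, path) pairs out of a model's raw text output."""
    return [(m.group(1), m.group(2)) for m in CLAIM_RE.finditer(output)]


def label_claim(files: Dict[str, str], observed: Set[str],
                verdict: str, path: str) -> str:
    """Label one claim against the reference file system.

    files:    ground-truth mapping of existing paths to their contents
    observed: paths whose status the model has seen evidence for
    """
    if path not in observed:
        # No evidence was shown, so abstaining is the expected answer.
        if verdict == "UNKNOWN":
            return "correct_abstention"
        return "hallucination:should_abstain"
    if verdict == "UNKNOWN":
        return "over_abstention"
    truth = "EXISTS" if path in files else "MISSING"
    return "correct" if verdict == truth else "hallucination:state"


files = {"/etc/hosts": "127.0.0.1 localhost"}
observed = {"/etc/hosts", "/tmp/build.log"}
answer = "EXISTS /etc/hosts\nEXISTS /tmp/build.log\nMISSING /var/data.db"
for verdict, path in extract_claims(answer):
    print(path, label_claim(files, observed, verdict, path))
```

Separating extraction from labeling keeps the truth check deterministic even if the parsing step is later replaced by something more tolerant of free-form output.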

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for raising this important point about the scope of our empirical claims. We address the major comment below and describe the revisions we will make.

Referee: The central empirical claim that perceptual errors are near-solved while multi-step state tracking and causal simulation remain difficult (reported in the evaluation across gridworld, chess, and terminal settings) rests on the untested assumption that error distributions in these fully specified synthetic worlds match those in open-ended, real-world LLM use. No cross-benchmark correlation with existing human-annotated hallucination datasets or human validation of label fidelity is described, which is load-bearing for interpreting the patterns as general insights rather than environment-specific observations.

Authors: We appreciate this observation. HalluWorld is deliberately constructed around fully specified reference worlds to support automatic, reproducible labeling and controlled variation of factors such as observability and source conflict, features that existing human-annotated benchmarks often lack. The consistent separation between near-solved perceptual errors and persistent difficulties in multi-step tracking and causal simulation across three qualitatively different environments (gridworlds, chess, and terminal tasks) is presented as evidence for distinct failure modes within these controlled settings. We do not claim that the precise error distributions transfer identically to open-ended real-world use. We agree that an explicit discussion of this scope would strengthen the manuscript and will add a dedicated limitations subsection in the revised version that (a) clarifies the intended role of HalluWorld as a diagnostic rather than exhaustive proxy and (b) outlines potential future work on cross-benchmark correlations. No new experiments are required for this clarification.

revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction is self-contained

The paper introduces HalluWorld as a new benchmark using explicitly specified reference worlds (gridworlds, chess, terminal tasks) to generate automatic hallucination labels by checking claims against known world states. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. The central claims rest on the construction of synthetic environments and empirical model evaluations, which are independent of any prior fitted quantities or author-specific uniqueness theorems. Results are presented as observed patterns rather than predictions forced by the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces new synthetic environments and an automatic labeling procedure but does not rely on fitted parameters or new physical entities. It rests on the domain assumption that hallucinations can be operationalized as observable false claims relative to a fully specified world.

axioms (1)

domain assumption: A model hallucinates when it produces an observable claim that is false with respect to the reference world.
This definition is introduced in the abstract as the grounding for the entire benchmark.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: a model hallucinates when it produces an observable claim that is false with respect to this reference world... W=(S,H,R), view function V, and truth function T_{W,P}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 3 internal anchors

[1]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of Multimodal Large Language Models: A Survey. 4 2025

work page 2025
[2]

HalluLens: LLM Hallucination Benchmark

Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. HalluLens: LLM Hallucination Benchmark. 4 2025

work page 2025
[3]

FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, and Amin Ahmad. FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs. 10 2024

work page 2024
[4]

A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649:1139–1146, 2026

Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649:1139–1146, 2026. doi: 10.1038/s415 86-025-09962-4. URLhttps://doi.org/10.1038/s41586-025-09962-4

work page doi:10.1038/s415 2026
[5]

Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking Large Language Models in Retrieval-Augmented Generation. 12 2023

work page 2023
[6]

BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning. 12 2019

work page 2019
[7]

Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks

Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan Terry. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. In Advances in Neural Information Processing Systems 36, New Orleans, LA, USA, December 2023

work page 2023
[8]

Transformers as Soft Reasoners over Language

Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as Soft Reasoners over Language. 5 2020. 10

work page 2020
[9]

TextWorld: A Learning Environment for Text-based Games

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Ruo Yu Tao, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. TextWorld: A Learning Environment for Text-based Games. 11 2019

work page 2019
[10]

Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? 7 2024

work page 2024
[11]

Standard: Portable game notation specification and implementation guide

Steven J Edwards, SD Forsyth, J Stanback, and A Saremba. Standard: Portable game notation specification and implementation guide. 1994.URL https://ia802908.us.archive.org/26/items/pgn-standard-1994-03-12/PGN_standard_1994-03- 12.txt, 1994

work page 1994
[12]

Feng, Jessica Huynh, Chaitanya Prasad Narisetty, Eduard Hovy, and Varun Gangal

Steven Y . Feng, Jessica Huynh, Chaitanya Prasad Narisetty, Eduard Hovy, and Varun Gangal. SAPPHIRE: Approaches for enhanced concept-to-text generation. In Anya Belz, Angela Fan, Ehud Reiter, and Yaji Sripada, editors,Proceedings of the 14th International Conference on Natural Language Generation, pages 212–225, Aberdeen, Scotland, UK, August 2021. Associ...

work page doi:10.18653/v1/2021.inlg-1.21 2021
[13]

Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, and Varun Gangal

Steven Y . Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, and Varun Gangal. Retrieve, caption, generate: Visual grounding for enhancing commonsense in text generation models.Proceedings of the AAAI Conference on Artificial Intelligence, 36(10): 10618–10626, Jun. 2022. doi: 10.1609/aaai.v36i10.21306. URL https://ojs.aaai.org/i n...

work page doi:10.1609/aaai.v36i10.21306 2022
[14]

Feng, Vivek Khetan, Bogdan Sacaleanu, Anatole Gershman, and Eduard Hovy

Steven Y . Feng, Vivek Khetan, Bogdan Sacaleanu, Anatole Gershman, and Eduard Hovy. CHARD: Clinical health-aware reasoning across dimensions for text generation models. In Andreas Vlachos and Isabelle Augenstein, editors,Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 313–327, Dubrovnik, C...

work page doi:10.18653/v1/2023.eacl-m 2023
[15]

Feng, Noah D

Steven Y . Feng, Noah D. Goodman, and Michael C. Frank. Is child-directed speech effective training data for language models? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22055–22071, Miami, Florida, USA, November 2024. Association for Computatio...

work page doi:10.18653/v1/2024.emnlp-main.1231 2024
[16]

Feng, Alvin W

Steven Y . Feng, Alvin W. M. Tan, and Michael C. Frank. Baby scale: Investigating models trained on individual children’s language input, 2026. URL https://arxiv.org/abs/2603 .29522

work page 2026
[17]

RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

Robert Friel, Masha Belyi, and Atindriyo Sanyal. RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. 1 2025

work page 2025
[18]

Hallu- sionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallu- sionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. 3 2024

work page 2024
[19]

World Models

David Ha and Jürgen Schmidhuber. World Models. 5 2018. doi: 10.5281/zenodo.1207631

work page doi:10.5281/zenodo.1207631 2018
[20]

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. 6 2019

work page 2019
[21]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. 3 2020

work page 2020
[22]

Tracking World States with Language Models: State-Based Evaluation Using Chess

Romain Harang, Jason Naradowsky, Yaswitha Gujju, and Yusuke Miyao. Tracking World States with Language Models: State-Based Evaluation Using Chess. 8 2025. 11

work page 2025
[23]

Tracking the World State with Recurrent Entity Networks

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the World State with Recurrent Entity Networks. 5 2017

work page 2017
[24]

Feng, and Michael C

Jennifer Hu, Alvin Wei Ming Tan, Steven Y . Feng, and Michael C. Frank. Language production is harder than comprehension for children and language models. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025. URL https://escholarship.o rg/uc/item/5rz8b9jg

work page 2025
[25]

Findings of the second babylm challenge: Sample-efficient pretraining on developmentally plausible corpora

Michael Y Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Ryan Cotterell, Leshem Choshen, Alex Warstadt, and Ethan Gotlieb Wilcox. Findings of the second babylm challenge: Sample-efficient pretraining on developmentally plausible corpora. InThe 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learn...

work page 2024
[26]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 11 2024. doi: 10.1145/3703155

work page doi:10.1145/3703155 2024
[27]

The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input

Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshte...

work page 2025
[28]

Carlos E

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation. 7 2024. doi: 10.1145/3571730

work page doi:10.1145/3571730 2024
[29]

BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information

Mehran Kazemi, Quan Yuan, Deepti Bhatia, Najoung Kim, Xin Xu, Vaiva Imbrasaite, and Deepak Ramachandran. BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information. 6 2023

work page 2023
[30]

ACPBench: Reasoning about Action, Change, and Planning

Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi. ACPBench: Reasoning about Action, Change, and Planning. 2 2026. doi: 10.1609/aaai.v39i25.34857

work page doi:10.1609/aaai.v39i25.34857 2026
[31]

Evaluating the Factual Consistency of Abstractive Text Summarization

Wojciech Kry´sci´nski, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the Factual Consistency of Abstractive Text Summarization. 10 2019

work page 2019
[32]

Hallucina- tions in Neural Machine Translation, 2019

Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. Hallucina- tions in Neural Machine Translation, 2019. URL https://openreview.net/forum?id=Sk xJ-309FQ

work page 2019
[33]

Li, Zifan Carl Guo, and Jacob Andreas

Belinda Z. Li, Zifan Carl Guo, and Jacob Andreas. (How) Do Language Models Track State? 10 2025

work page 2025
[34]

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. 10 2023

work page 2023
[35]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. 10 2023

work page 2023
[36]

Lichess open database, 2022

Lichess. Lichess open database, 2022. URL https://database.lichess.org/. Accessed: 2026-05-04

work page 2022
[37]

CommonGen: A constrained text generation challenge for generative commonsense reasoning

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November

work page 2020
[38]

doi: 10.18653/v1/2020.findings-emnlp.165

Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.165. URLhttps://aclanthology.org/2020.findings-emnlp.165/. 12

work page doi:10.18653/v1/2020.findings-emnlp.165 2020
[39]

Learning to Model the World with Language

Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to Model the World with Language. 5 2024

work page 2024
[40]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. 5 2022

work page 2022
[41]

More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models

Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models, 2025. URL https://arxiv.org/abs/2505.21523

work page arXiv 2025
[42]

Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, and Steven Y. Feng. A unified definition of hallucination: It’s the world model, stupid!, 2026. URL https://arxiv.org/abs/2512.21577

work page arXiv 2026
[43]

AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, and Ran He. AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents. 1 2026

work page 2026
[44]

The babyview dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences

Bria Long, Robert Z Sparks, Violet Xiang, Stefan Stojanov, Zi Yin, Grace E Keene, Alvin WM Tan, Steven Y Feng, Chengxu Zhuang, Virginia A Marchman, et al. The babyview dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences. arXiv preprint arXiv:2406.10447, 2024

work page arXiv 2024
[45]

On Faithfulness and Factuality in Abstractive Summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. 5 2020

work page 2020
[46]

Locating and editing factual associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in neural information processing systems, 35:17359–17372, 2022

work page 2022
[47]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants. 11 2023

work page 2023
[49]

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. 10 2023

work page 2023
[50]

RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. 5 2024

work page 2024
[51]

Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics. 7 2021

work page 2021
[52]

Language models use lookbacks to track beliefs

Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, and Atticus Geiger. Language models use lookbacks to track beliefs. In The Fourteenth International Conference on Learning Representations, 2026.

work page 2026
[53]

HALoGEN: Fantastic LLM hallucinations and where to find them

Abhilasha Ravichander, Shrusti Ghela, David Wadden, and Yejin Choi. HALoGEN: Fantastic LLM hallucinations and where to find them. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1402–1425, Vienna, A...

work page doi:10.18653/v1/2025.acl-long.71 2025
[54]

Object Hallucination in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucination in Image Captioning. 3 2019

work page 2019
[55]

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. 2 2020. doi: 10.1038/s41586-020-03051-4

work page internal anchor Pith review doi:10.1038/s41586-020-03051-4 2020
[56]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. 3 2021

work page 2021
[57]

Linearly decoding refused knowledge in aligned language models

Aryan Shrivastava and Ari Holtzman. Linearly decoding refused knowledge in aligned language models. arXiv preprint arXiv:2507.00239, 2025

work page arXiv 2025
[58]

Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, and Steven Y. Feng. To memorize or to retrieve: Scaling laws for rag-considerate pretraining, 2026. URL https://arxiv.org/abs/2604.00715

work page arXiv 2026
[59]

CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text. 9 2019

work page 2019
[60]

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, and Yu Cheng. ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM. 8 2024

work page 2024
[61]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning Large Multimodal Models with Factually Augmented RLHF. 9 2023

work page 2023
[62]

Reinforcement learning: An introduction

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press Cambridge, 1998

work page 1998
[63]

ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language

Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language. 6 2021

work page 2021
[64]

Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmarking

Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson Liu, Reut Tsarfaty, and Dafna Shahaf. Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmarking. 11 2021

work page 2021
[65]

Gridworlds as testbeds for planning with incomplete information

Craig Tovey and Sven Koenig. Gridworlds as testbeds for planning with incomplete information. In AAAI/IAAI, pages 819–824, 2000

work page 2000
[66]

PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change

Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. 11 2023

work page 2023
[67]

An Audit on the Perspectives and Challenges of Hallucinations in NLP

Pranav Narayanan Venkit, Tatiana Chakravorti, Vipul Gupta, Heidi Biggs, Mukund Srinath, Koustava Goswami, Sarah Rajtmajer, and Shomir Wilson. An Audit on the Perspectives and Challenges of Hallucinations in NLP. 9 2024

work page 2024
[68]

Asking and Answering Questions to Evaluate the Factual Consistency of Summaries

Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries. 4 2020

work page 2020
[69]

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. 2 2024.

work page 2024
[70]

Resolving Knowledge Conflicts in Large Language Models

Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. Resolving Knowledge Conflicts in Large Language Models. 10 2024

work page 2024
[71]

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, and Golnaz Ghiasi. HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning. 7 2024

work page 2024
[72]

Findings of the babylm challenge: Sample-efficient pretraining on developmentally plausible corpora

Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjabe, Adina Williams, Tal Linzen, et al. Findings of the babylm challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Le...

work page 2023
[73]

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. 12 2015

work page 2015
[74]

Hallucination diversity-aware active learning for text summarization

Yu Xia, Xu Liu, Tong Yu, Sungchul Kim, Ryan Rossi, Anup Rao, Tung Mai, and Shuai Li. Hallucination diversity-aware active learning for text summarization. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8665–8677, 2024

work page 2024
[75]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. 5 2024

work page 2024
[76]

Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

Linda Zeng, Steven Y. Feng, and Michael C. Frank. Bringing up a bilingual babylm: Investigating multilingual language acquisition using small-scale models, 2026. URL https://arxiv.org/abs/2603.29552

work page internal anchor Pith review Pith/arXiv arXiv 2026
[77]

Language-Guided World Models: A Model-Based Approach to AI Control

Alex Zhang, Khanh Nguyen, Jens Tuyls, Albert Lin, and Karthik Narasimhan. Language-Guided World Models: A Model-Based Approach to AI Control. 9 2024

work page 2024
[78]

MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them

Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, and Dawn Song. MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them. 7 2025

work page 2025
[79]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents. 4 2024.

A HALLUWORLD Qualitative Examples

A.1 HALLUWORLD-GRID Component Details

World:P1 Dense ArrayGrid View (Canonical Locati...

work page 2024
1. Grounding-dominated regimes (Perception, Causal): Negative slopes (-1.0% to -1.8% per quintile) indicate that additional observations provide more benefit (disambiguation of spatial/causal relationships) than cost (working memory load). Navigation actively helps state tracking in these domains.

2. Balanced regimes (Memory, Uncertainty): Near-zero slopes (+0.2...


[1]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of Multimodal Large Language Models: A Survey. 4 2025

work page 2025

[2]

HalluLens: LLM Hallucination Benchmark

Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. HalluLens: LLM Hallucination Benchmark. 4 2025

work page 2025

[3]

FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, and Amin Ahmad. FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs. 10 2024

work page 2024

[4]

A benchmark of expert-level academic questions to assess AI capabilities

Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146, 2026. doi: 10.1038/s41586-025-09962-4. URL https://doi.org/10.1038/s41586-025-09962-4

work page doi:10.1038/s41586-025-09962-4 2026

[5]

Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking Large Language Models in Retrieval-Augmented Generation. 12 2023

work page 2023

[6]

BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning. 12 2019

work page 2019

[7]

Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks

Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan Terry. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. In Advances in Neural Information Processing Systems 36, New Orleans, LA, USA, December 2023

work page 2023

[8]

Transformers as Soft Reasoners over Language

Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as Soft Reasoners over Language. 5 2020.

work page 2020

[9]

TextWorld: A Learning Environment for Text-based Games

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Ruo Yu Tao, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. TextWorld: A Learning Environment for Text-based Games. 11 2019

work page 2019

[10]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? 7 2024

work page 2024

[11]

Standard: Portable game notation specification and implementation guide

Steven J Edwards, SD Forsyth, J Stanback, and A Saremba. Standard: Portable game notation specification and implementation guide. 1994. URL https://ia802908.us.archive.org/26/items/pgn-standard-1994-03-12/PGN_standard_1994-03-12.txt, 1994

work page 1994

[12]

SAPPHIRE: Approaches for enhanced concept-to-text generation

Steven Y. Feng, Jessica Huynh, Chaitanya Prasad Narisetty, Eduard Hovy, and Varun Gangal. SAPPHIRE: Approaches for enhanced concept-to-text generation. In Anya Belz, Angela Fan, Ehud Reiter, and Yaji Sripada, editors, Proceedings of the 14th International Conference on Natural Language Generation, pages 212–225, Aberdeen, Scotland, UK, August 2021. Associ...

work page doi:10.18653/v1/2021.inlg-1.21 2021

[13]

Retrieve, caption, generate: Visual grounding for enhancing commonsense in text generation models

Steven Y. Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, and Varun Gangal. Retrieve, caption, generate: Visual grounding for enhancing commonsense in text generation models. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10): 10618–10626, Jun. 2022. doi: 10.1609/aaai.v36i10.21306. URL https://ojs.aaai.org/in...

work page doi:10.1609/aaai.v36i10.21306 2022

[14]

CHARD: Clinical health-aware reasoning across dimensions for text generation models

Steven Y. Feng, Vivek Khetan, Bogdan Sacaleanu, Anatole Gershman, and Eduard Hovy. CHARD: Clinical health-aware reasoning across dimensions for text generation models. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 313–327, Dubrovnik, C...

work page doi:10.18653/v1/2023.eacl-m 2023

[15]

Is child-directed speech effective training data for language models?

Steven Y. Feng, Noah D. Goodman, and Michael C. Frank. Is child-directed speech effective training data for language models? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22055–22071, Miami, Florida, USA, November 2024. Association for Computatio...

work page doi:10.18653/v1/2024.emnlp-main.1231 2024

[16]

Baby scale: Investigating models trained on individual children’s language input

Steven Y. Feng, Alvin W. M. Tan, and Michael C. Frank. Baby scale: Investigating models trained on individual children’s language input, 2026. URL https://arxiv.org/abs/2603.29522

work page 2026

[17]

RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

Robert Friel, Masha Belyi, and Atindriyo Sanyal. RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. 1 2025

work page 2025

[18]

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. 3 2024

work page 2024

[19]

World Models

David Ha and Jürgen Schmidhuber. World Models. 5 2018. doi: 10.5281/zenodo.1207631

work page doi:10.5281/zenodo.1207631 2018

[20]

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. 6 2019

work page 2019

[21]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. 3 2020

work page 2020

[22]

Tracking World States with Language Models: State-Based Evaluation Using Chess

Romain Harang, Jason Naradowsky, Yaswitha Gujju, and Yusuke Miyao. Tracking World States with Language Models: State-Based Evaluation Using Chess. 8 2025.

work page 2025

[23]

Tracking the World State with Recurrent Entity Networks

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the World State with Recurrent Entity Networks. 5 2017

work page 2017

[24]

Language production is harder than comprehension for children and language models

Jennifer Hu, Alvin Wei Ming Tan, Steven Y. Feng, and Michael C. Frank. Language production is harder than comprehension for children and language models. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025. URL https://escholarship.org/uc/item/5rz8b9jg

work page 2025

[25]

Findings of the second babylm challenge: Sample-efficient pretraining on developmentally plausible corpora

Michael Y Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Ryan Cotterell, Leshem Choshen, Alex Warstadt, and Ethan Gotlieb Wilcox. Findings of the second babylm challenge: Sample-efficient pretraining on developmentally plausible corpora. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learn...

work page 2024

[26]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 11 2024. doi: 10.1145/3703155

work page doi:10.1145/3703155 2024

[27]

The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input

Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshte...

work page 2025

[28]

Survey of Hallucination in Natural Language Generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation. 7 2024. doi: 10.1145/3571730

work page doi:10.1145/3571730 2024

[29]

BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information

Mehran Kazemi, Quan Yuan, Deepti Bhatia, Najoung Kim, Xin Xu, Vaiva Imbrasaite, and Deepak Ramachandran. BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information. 6 2023

work page 2023

[30]

ACPBench: Reasoning about Action, Change, and Planning

Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi. ACPBench: Reasoning about Action, Change, and Planning. 2 2026. doi: 10.1609/aaai.v39i25.34857

work page doi:10.1609/aaai.v39i25.34857 2026

[31]

Evaluating the Factual Consistency of Abstractive Text Summarization

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the Factual Consistency of Abstractive Text Summarization. 10 2019

work page 2019

[32] [32]

Hallucina- tions in Neural Machine Translation, 2019

Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. Hallucina- tions in Neural Machine Translation, 2019. URL https://openreview.net/forum?id=Sk xJ-309FQ

work page 2019

[33] [33]

Li, Zifan Carl Guo, and Jacob Andreas

Belinda Z. Li, Zifan Carl Guo, and Jacob Andreas. (How) Do Language Models Track State? 10 2025

work page 2025

[34] [34]

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. 10 2023

work page 2023

[35] [35]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. 10 2023

work page 2023

[36] [36]

Lichess open database, 2022

Lichess. Lichess open database, 2022. URL https://database.lichess.org/. Accessed: 2026-05-04

work page 2022

[37] [37]

CommonGen: A constrained text generation challenge for generative commonsense reasoning

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November

work page 2020

[38] [38]

doi: 10.18653/v1/2020.findings-emnlp.165

Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.165. URLhttps://aclanthology.org/2020.findings-emnlp.165/. 12

work page doi:10.18653/v1/2020.findings-emnlp.165 2020

[39] [39]

Learning to Model the World with Language

Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to Model the World with Language. 5 2024

work page 2024

[40] [40]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. 5 2022

work page 2022

[41] [41]

arXiv preprint arXiv:2505.21523 (2025) 13

Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models, 2025. URLhttps://arxiv.org/abs/2505.21523

work page arXiv 2025

[42] [42]

Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, and Steven Y . Feng. A unified definition of hallucination: It’s the world model, stupid!, 2026. URLhttps://arxiv.org/abs/2512.21577

work page arXiv 2026

[43] [43]

AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, and Ran He. AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents. 1 2026

work page 2026

[44] [44]

The babyview dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences.arXiv preprint arXiv:2406.10447, 2024

Bria Long, Robert Z Sparks, Violet Xiang, Stefan Stojanov, Zi Yin, Grace E Keene, Alvin WM Tan, Steven Y Feng, Chengxu Zhuang, Virginia A Marchman, et al. The babyview dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences.arXiv preprint arXiv:2406.10447, 2024

work page arXiv 2024

[45] [45]

On Faithfulness and Factuality in Abstractive Summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. 5 2020

work page 2020

[46] [46]

Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

work page 2022

[47] [47]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[48] [48]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants. 11 2023

work page 2023

[49] [49]

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. 10 2023

work page 2023

[50] [50]

RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval- Augmented Language Models

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval- Augmented Language Models. 5 2024

work page 2024

[51] [51]

Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics. 7 2021

work page 2021

[52] [52]

Language models use lookbacks to track beliefs

Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, and Atticus Geiger. Language models use lookbacks to track beliefs. InThe Fourteenth International Conference on Learning Representations, 2026. 13

work page 2026

[53] [53]

HALoGEN: Fantastic LLM hallucinations and where to find them

Abhilasha Ravichander, Shrusti Ghela, David Wadden, and Yejin Choi. HALoGEN: Fantastic LLM hallucinations and where to find them. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1402–1425, Vienna, A...

work page doi:10.18653/v1/2025.acl-long.71 2025

[54] [54]

Object Hallucination in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucination in Image Captioning. 3 2019

work page 2019

[55] [55]

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. 2 2020. doi: 10.1038/s41586-020-03051-4

work page internal anchor Pith review doi:10.1038/s41586-020-03051-4 2020

[56] [56]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. 3 2021

work page 2021

[57] [57]

Linearly decoding refused knowledge in aligned language models.arXiv preprint arXiv:2507.00239, 2025

Aryan Shrivastava and Ari Holtzman. Linearly decoding refused knowledge in aligned language models.arXiv preprint arXiv:2507.00239, 2025

work page arXiv 2025

[58] [58]

Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, and Steven Y . Feng. To memorize or to retrieve: Scaling laws for rag-considerate pretraining, 2026. URL https://arxiv.org/abs/2604.00715

work page arXiv 2026

[59] [59]

Hamilton

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text. 9 2019

work page 2019

[60] [60]

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, and Yu Cheng. ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM. 8 2024

work page 2024

[61] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning Large Multimodal Models with Factually Augmented RLHF. 9 2023

[62] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

[63] Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language. 6 2021

[64] Ronen Tamari, Kyle Richardson, Aviad Sar-Shalom, Noam Kahlon, Nelson Liu, Reut Tsarfaty, and Dafna Shahaf. Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmarking. 11 2021

[65] Craig Tovey and Sven Koenig. Gridworlds as testbeds for planning with incomplete information. In AAAI/IAAI, pages 819–824, 2000

[66] Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. 11 2023

[67] Pranav Narayanan Venkit, Tatiana Chakravorti, Vipul Gupta, Heidi Biggs, Mukund Srinath, Koustava Goswami, Sarah Rajtmajer, and Shomir Wilson. An Audit on the Perspectives and Challenges of Hallucinations in NLP. 9 2024

[68] Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries. 4 2020

[69] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. 2 2024

[70] Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. Resolving Knowledge Conflicts in Large Language Models. 10 2024

[71] Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, and Golnaz Ghiasi. HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning. 7 2024

[72] Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, et al. Findings of the babylm challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Le...

[73] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. 12 2015

[74] Yu Xia, Xu Liu, Tong Yu, Sungchul Kim, Ryan Rossi, Anup Rao, Tung Mai, and Shuai Li. Hallucination diversity-aware active learning for text summarization. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8665–8677, 2024

[75] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. 5 2024

[76] Linda Zeng, Steven Y. Feng, and Michael C. Frank. Bringing up a bilingual babylm: Investigating multilingual language acquisition using small-scale models, 2026. URL https://arxiv.org/abs/2603.29552

[77] Alex Zhang, Khanh Nguyen, Jens Tuyls, Albert Lin, and Karthik Narasimhan. Language-Guided World Models: A Model-Based Approach to AI Control. 9 2024

[78] Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, and Dawn Song. MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them. 7 2025

[79] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents. 4 2024

A HALLUWORLD Qualitative Examples

A.1 HALLUWORLD-GRID Component Details

World:P1 Dense ArrayGrid View (Canonical Locati...

1. Grounding-dominated regimes (Perception, Causal): Negative slopes (-1.0% to -1.8% per quintile) indicate that additional observations provide more benefit (disambiguation of spatial/causal relationships) than cost (working memory load). Navigation actively helps state tracking in these domains.
2. Balanced regimes (Memory, Uncertainty): Near-zero slopes (+0.2...