Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

2025 · cs.CL · arXiv 2506.19089

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, or rely on an LLM for generation, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of LLMs show that most models achieve higher accuracy on WM tasks than on ToM tasks, and that models tend to reason more accurately when the subject of reasoning is a person rather than an inanimate object. Additionally, our framework enabled us to find evidence of heuristic behavior and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.

representative citing papers

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

Introduces NCP-ExploreToM framework to evaluate LLMs on inducing belief states via planning and action, with GPT-5 succeeding on ~80% of tasks and outperforming humans.

