MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Fanfei Li; Felix Wichmann; Jana Zeller; Matthias Bethge; Prasanna Mayilvahanan; Ryan Cotterell; Thaddäus Wiedemer; Thomas Klein; Wieland Brendel

arxiv: 2602.02465 · v2 · pith:I7K5D3PYnew · submitted 2026-02-02 · 💻 cs.AI · cs.CV · cs.LG

keywords reasoning · visual · models · imagery · mentisoculi · they · fail · frontier

Abstract

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do multimodal models imagine electric sheep?
cs.CV 2026-05 conditional novelty 6.0

Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.