MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror
Pith reviewed 2026-05-10 10:49 UTC · model grok-4.3
The pith
MirrorBench shows that leading MLLMs perform far below humans on basic self-recognition tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MirrorBench extends the Mirror Self-Recognition test into a tiered simulation framework for embodied MLLMs, with tasks that range from basic visual perception to high-level self-representation. Experiments on leading MLLMs demonstrate performance substantially inferior to human levels even at the lowest tier, revealing fundamental limitations in self-referential understanding.
What carries the argument
MirrorBench, a simulation-based benchmark that uses progressively harder tasks modeled on the Mirror Self-Recognition test to assess self-centric intelligence.
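As an illustration of what a tiered task ladder could look like in code, here is a minimal sketch; the tier names and task fields are hypothetical stand-ins, since the paper's actual schema is not given in this excerpt:

```python
from dataclasses import dataclass

@dataclass
class MirrorTask:
    """One simulated task; the fields are illustrative, not the paper's schema."""
    tier: int       # 1 = basic visual perception ... 4 = high-level self-representation
    name: str
    prompt: str     # instruction shown to the embodied agent
    max_steps: int  # step budget for correct self-identification

# Hypothetical tier ladder mirroring the paper's description of tasks that
# range from basic visual perception to high-level self-representation.
TIERS = {
    1: "basic visual perception",
    2: "mirror-image correspondence (hypothetical intermediate tier)",
    3: "self/other discrimination (hypothetical intermediate tier)",
    4: "high-level self-representation",
}

example = MirrorTask(tier=1, name="locate-mirror",
                     prompt="Describe what you see in the mirror.", max_steps=10)
```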
If this is right
- MLLMs exhibit fundamental limitations in self-referential understanding.
- The benchmark supplies a systematic way to track progress toward general intelligence in large models.
- Psychological tests of self-awareness can be directly adapted to evaluate embodied AI systems.
Where Pith is reading between the lines
- Future models that close the gap on MirrorBench may also handle real-world tasks that require tracking their own position and identity.
- The same tiered approach could be applied to other psychological measures of self-modeling to create a broader evaluation suite.
- Persistent shortfalls might indicate that current training methods need explicit components for building internal self-representations.
Load-bearing premise
The simulation tasks in MirrorBench accurately capture the psychological construct of self-centric intelligence as measured by the classical Mirror Self-Recognition test.
What would settle it
An MLLM that reaches human-level success rates across all tiers of MirrorBench tasks would show that the reported limitations in self-referential understanding have been resolved.
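Read as a concrete decision rule, this criterion amounts to a tier-wise pass check. A minimal sketch, assuming per-tier success rates and an illustrative tolerance margin not taken from the paper:

```python
def resolves_limitation(model_rates, human_rates, margin=0.05):
    """True if the model reaches human-level success on every tier.

    model_rates, human_rates: dicts mapping tier -> success rate in [0, 1].
    margin: allowed shortfall; 0.05 is an arbitrary illustrative choice.
    """
    return all(model_rates[t] >= human_rates[t] - margin for t in human_rates)

# Example with made-up numbers: a model failing the lowest tier does not settle it.
human = {1: 0.98, 2: 0.95, 3: 0.90, 4: 0.85}
model = {1: 0.55, 2: 0.40, 3: 0.30, 4: 0.20}
assert not resolves_limitation(model, human)
```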
Original abstract
Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated remarkable advances in perception and reasoning, suggesting their potential for embodied intelligence. While recent studies have evaluated embodied MLLMs in interactive settings, current benchmarks mainly target capabilities to perceive, understand, and interact with external objects, lacking a systematic evaluation of self-centric intelligence. To address this, we introduce MirrorBench, a simulation-based benchmark inspired by the classical Mirror Self-Recognition (MSR) test in psychology. MirrorBench extends this paradigm to embodied MLLMs through a tiered framework of progressively challenging tasks, assessing agents from basic visual perception to high-level self-representation. Experiments on leading MLLMs show that even at the lowest level, their performance remains substantially inferior to human performance, revealing fundamental limitations in self-referential understanding. Our study bridges psychological paradigms and embodied intelligence, offering a principled framework for evaluating the emergence of general intelligence in large models. Project page: https://fflahm.github.io/mirror-bench-page/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MirrorBench, a simulation-based benchmark for MLLMs inspired by the classical Mirror Self-Recognition (MSR) test in psychology. It defines a tiered framework of tasks progressing from basic visual perception to high-level self-representation in embodied settings, and reports that leading MLLMs perform substantially below human levels even on the easiest tasks, indicating fundamental limitations in self-referential understanding.
Significance. If the benchmark tasks validly isolate self-centric intelligence rather than general multimodal reasoning, the work would offer a useful psychology-AI bridge for tracking progress toward embodied general intelligence. The tiered structure is a constructive design choice that could support incremental evaluation.
Major comments (2)
- [Abstract and §4 (Experiments)] The central empirical claim, that MLLMs remain substantially inferior to humans even at the lowest tier, is reported without task definitions, evaluation metrics, trial counts, statistical tests, or the procedure for collecting human baselines, leaving the performance-gap result without visible evidentiary support.
- [§3 (MirrorBench framework)] The tiered simulation tasks are not accompanied by controls, ablations, or arguments showing that success requires an internal self-model rather than external object tracking, instruction following, or multimodal pattern matching; without such evidence, the inference that low scores reveal missing self-referential understanding does not hold.
Minor comments (1)
- [Abstract] The project page is referenced, but the manuscript does not state whether benchmark code, prompts, or evaluation scripts are released, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully considered each major comment and revised the paper to strengthen the evidentiary basis and validation of our claims. Below we provide point-by-point responses.
Point-by-point responses
Referee: [Abstract and §4 (Experiments)] The central empirical claim, that MLLMs remain substantially inferior to humans even at the lowest tier, is reported without task definitions, evaluation metrics, trial counts, statistical tests, or the procedure for collecting human baselines, leaving the performance-gap result without visible evidentiary support.
Authors: We agree that the original manuscript did not provide sufficient explicit detail to fully support the central claim. In the revised version, we have expanded §4 with complete task definitions for each tier (including exact simulation parameters and input formats), evaluation metrics (success rate, defined as correct self-identification within a fixed number of steps), the number of trials (100 independent trials per model per task), statistical tests (paired t-tests with reported p-values < 0.001 for MLLM vs. human comparisons), and the human baseline procedure (data collected from 25 adult participants using the identical simulation interface and instructions, with IRB approval noted). Updated tables and figures now present these details transparently, substantiating the reported performance gaps.
Revision: yes
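A minimal sketch of the comparison the authors describe, assuming per-task success rates aggregated over the stated 100 trials; the rates below are placeholders, not results from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical per-task success rates (fraction of 100 trials succeeded),
# paired by task between one MLLM and the human baseline.
mllm_rates = np.array([0.32, 0.28, 0.41, 0.25, 0.37])
human_rates = np.array([0.96, 0.99, 0.94, 0.97, 0.98])

# Paired t-test across tasks, the test family the rebuttal reports
# ("paired t-tests with reported p-values < 0.001").
t_stat, p_value = stats.ttest_rel(mllm_rates, human_rates)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```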
Referee: [§3 (MirrorBench framework)] The tiered simulation tasks are not accompanied by controls, ablations, or arguments showing that success requires an internal self-model rather than external object tracking, instruction following, or multimodal pattern matching; without such evidence, the inference that low scores reveal missing self-referential understanding does not hold.
Authors: We acknowledge that additional controls and ablations are necessary to isolate the self-referential component. In the revised §3, we have added a new subsection on task validation that includes: (i) explicit arguments for why mirror-based tasks require an internal self-model (the agent must map the reflected image to its own embodied state, which cannot be solved by external object tracking or simple pattern matching alone); and (ii) ablation experiments comparing performance on self-centric mirror tasks against matched non-self object-tracking controls. Results show that MLLMs achieve high accuracy on external tracking but remain near chance on self-referential variants, supporting the conclusion that the low scores reflect limitations in self-referential understanding rather than deficits in general multimodal reasoning or instruction following. These ablations are also summarized in §4.
Revision: yes
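A minimal sketch of the described ablation analysis, testing whether accuracy on the self-referential variant is distinguishable from chance; the trial counts are hypothetical, and a 0.5 chance level is assumed for a binary identification task:

```python
from scipy.stats import binomtest

# Hypothetical counts: 100 trials each on the matched object-tracking control
# and on the self-referential mirror variant; chance level 0.5 assumed for a
# binary identification task.
control_successes = 92  # high accuracy on external tracking
mirror_successes = 53   # near chance on the self-referential variant

for label, k in [("control", control_successes), ("mirror", mirror_successes)]:
    result = binomtest(k, n=100, p=0.5, alternative="greater")
    print(f"{label}: accuracy {k / 100:.2f}, p vs. chance = {result.pvalue:.3g}")
```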
Circularity Check
No circularity: new benchmark with empirical evaluation only
Full rationale
The paper introduces MirrorBench as a tiered simulation benchmark inspired by the classical MSR test and reports direct experimental results on MLLMs versus human baselines. No equations, fitted parameters, self-citations, or derivations are present in the provided text. The central claim rests on observed performance gaps rather than any reduction of outputs to internal definitions or prior author work by construction. This is a standard empirical benchmark paper with no load-bearing circular steps.
Reference graph
Works this paper leans on
- [1] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916, 2023.
- [2] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., "Gemini: A family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
- [3] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
- [4] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, "ALFRED: A benchmark for interpreting grounded instructions for everyday tasks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10740–10749.
- [5] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín, C. Wang, G. Levine, W. Ai, B. Martinez et al., "BEHAVIOR-1K: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation," arXiv preprint arXiv:2403.09227, 2024.
- [6] Z. Cheng, Y. Tu, R. Li, S. Dai, J. Hu, S. Hu, J. Li et al., "EmbodiedEval: Evaluate multimodal LLMs as embodied agents," arXiv preprint arXiv:2501.11858, 2025.
- [7] Y. Qiao, H. Hong, W. Lyu, D. An, S. Zhang, Y. Xie, X. Wang, and Q. Wu, "NavBench: Probing multimodal large language models for embodied navigation," arXiv preprint arXiv:2506.01031, 2025.
- [8] S. Zhang, Z. Xu, P. Liu, X. Yu, Y. Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y.-G. Jiang et al., "VLABench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 11142–11152.
- [9] S. Duval and R. A. Wicklund, "A theory of objective self awareness," 1972.
- [10] G. G. Gallup Jr., "Chimpanzees: Self-recognition," Science, vol. 167, no. 3914, pp. 86–87, 1970.
- [11] B. Amsterdam, "Mirror self-image reactions before age two," Developmental Psychobiology, vol. 5, no. 4, pp. 297–305, 1972.
- [12] Z. Zhang, J. Wang, Y. Guo et al., "AIBench: Towards trustworthy evaluation under the 45° law," Displays, p. 103255, 2025.
- [13] Z. Zhang, J. Wang, F. Wen, Y. Guo et al., "Large multimodal models evaluation: A survey," Science China Information Sciences, vol. 68, no. 12, pp. 221301–221369, 2025.
- [14] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, "OK-VQA: A visual question answering benchmark requiring external knowledge," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3195–3204.
- [15] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang, "MM-Vet: Evaluating large multimodal models for integrated capabilities," arXiv preprint arXiv:2308.02490, 2023.
- [16] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun et al., "MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9556–9567.
- [17] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson, "nocaps: Novel object captioning at scale," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8948–8957.
- [18] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun et al., "MME: A comprehensive evaluation benchmark for multimodal large language models," in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- [19] D. A. Hudson and C. D. Manning, "GQA: A new dataset for real-world visual reasoning and compositional question answering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6700–6709.
- [20] M.-H. Guo, X. Chu, Q. Yang, Z.-H. Mo, Y. Shen, P.-L. Li, X. Lin, J. Zhang, X.-S. Chen, Y. Zhang et al., "RBench-V: A primary assessment for visual reasoning models with multi-modal outputs," arXiv preprint arXiv:2505.16770, 2025.
- [21] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, "Embodied question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1–10.
- [22] R. Yang, H. Chen, J. Zhang, M. Zhao, C. Qian, K. Wang, Q. Wang, T. V. Koripella, M. Movahedi, M. Li et al., "EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents," arXiv preprint arXiv:2502.09560, 2025.
- [23] N. S. Govindarajulu, "Towards a logic-based analysis and simulation of the mirror test," Proceedings of the European Agent Systems Summer School Student Session, 2011.
- [24] A. Pipitone and A. Chella, "Robot passes the mirror test by inner speech," Robotics and Autonomous Systems, vol. 144, p. 103838, 2021.
- [25] L. Steels and M. Spranger, "The robot in the mirror," Connection Science, vol. 20, no. 4, pp. 337–358, 2008.
- [26] M. Hoffmann, S. Wang, V. Outrata, E. Alzueta, and P. Lanillos, "Robot in the mirror: Toward an embodied computational model of mirror self-recognition," KI-Künstliche Intelligenz, vol. 35, no. 1, pp. 37–51, 2021.
- [27] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., "Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities," arXiv preprint arXiv:2507.06261, 2025.
- [28] OpenAI, "GPT-4o system card," https://openai.com/index/gpt-4o-system-card/, 2024.
- [29] Anthropic, "Introducing Claude Sonnet 4.5," https://www.anthropic.com/news/claude-sonnet-4-5/, 2025.
- [30] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, "Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond," arXiv preprint arXiv:2308.12966, 2023.
- [31] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang et al., "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.
- [32] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu et al., "InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency," arXiv preprint arXiv:2508.18265, 2025.
- [33] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee, "LLaVA-NeXT: Improved reasoning, OCR, and world knowledge," https://llava-vl.github.io/blog/2024-01-30-llava-next/, January 2024.
- [34] V. Team, W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang et al., "GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning," arXiv preprint arXiv:2507.01006, 2025.