ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

arxiv: 2605.18746 · v1 · pith:EBSGR3OZ · submitted 2026-05-18 · cs.CV · cs.AI · cs.CL · cs.LG · cs.RO

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi

Pith reviewed 2026-05-20 10:47 UTC · model grok-4.3

keywords: embodied spatial intelligence, perception-action loop, active exploration, action blindness, multimodal large language models, benchmark, OmniGibson, core knowledge systems

The pith

Spatial intelligence in agents improves by actively choosing actions to gather evidence and close the perception-action loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that spatial intelligence unfolds as a perception-action loop in which agents must select and sequence actions to uncover hidden structures and relations that passive sensing cannot resolve. ESI-Bench tests this across 10 categories and 29 subcategories on OmniGibson by requiring models to deploy perception, locomotion, and manipulation in ways that accumulate task-relevant evidence. Experiments show active exploration outperforms passive or random multi-view approaches, with models spontaneously developing spatial strategies. Failures occur mainly from action blindness, where suboptimal action choices produce poor observations that trigger cascading errors. Human comparisons expose a metacognitive gap: models commit to answers with high confidence without seeking contradictory evidence, unlike humans who revise beliefs under falsification.

Core claim

The central claim is that recasting the observer as an actor in the perception-action loop reveals action blindness as the dominant failure mode in spatial tasks. In ESI-Bench, agents decide which abilities to use and in what order to resolve ambiguities involving occlusion, dynamics, containment, and functionality. Active exploration yields better performance than passive baselines, while random multi-view inputs add noise; explicit 3D grounding offers partial stabilization on depth tasks yet can distort relations when imperfect. Most errors trace to poor action selections rather than perception limits, and models do not exhibit the human-like behavior of seeking falsifying viewpoints to revise their beliefs.

What carries the argument

The perception-action loop, in which agents actively sequence perception, locomotion, and manipulation to accumulate evidence and update spatial reasoning.
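
The loop described above can be illustrated with a toy sketch. This is not the benchmark's implementation: `ToyScene`, `passive_count`, and `active_count` are hypothetical names, and the scene is a hand-written visibility table standing in for an OmniGibson environment. The fixed visiting order is only a stand-in for whatever action policy a model would choose.

```python
from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical toy scene: maps each viewpoint to the objects visible from it."""
    visible_from: dict

    def observe(self, viewpoint):
        # Perception: return the set of object ids visible from this viewpoint.
        return set(self.visible_from[viewpoint])


def passive_count(scene, start):
    """Count objects using only the initial observation (no actions taken)."""
    return len(scene.observe(start))


def active_count(scene, start, step_budget):
    """Count objects by moving to unvisited viewpoints and accumulating evidence.

    Returns (count, steps_used). The visiting order is the dict's insertion
    order, which is an assumption made purely for illustration.
    """
    seen = scene.observe(start)
    unvisited = [v for v in scene.visible_from if v != start]
    steps = 0
    while unvisited and steps < step_budget:
        viewpoint = unvisited.pop(0)       # locomotion: move to the next viewpoint
        seen |= scene.observe(viewpoint)   # perception: merge the new observation
        steps += 1
    return len(seen), steps


scene = ToyScene(visible_from={
    "front": {"cup", "bowl"},
    "left": {"cup", "plate"},
    "behind": {"bowl", "plate", "spoon"},
})

print(passive_count(scene, "front"))    # objects counted from the starting view only
print(active_count(scene, "front", 2))  # objects counted after two exploration steps
```

In this toy setup the spoon is visible only from behind, so a counting answer based on the starting view alone cannot be correct; the agent has to act to obtain the missing evidence.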

If this is right

Active exploration substantially outperforms passive counterparts and random multi-view inputs.
Most failures stem from action blindness that produces poor observations and cascading errors rather than from weak perception.
Explicit 3D grounding stabilizes reasoning on depth-sensitive tasks but imperfect representations can harm spatial relations more than 2D baselines.
Models commit prematurely with high confidence regardless of evidence quality, unlike humans who seek falsifying viewpoints.
Agents spontaneously discover emergent spatial strategies without explicit instructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future models may need built-in mechanisms to monitor uncertainty and actively seek disconfirming evidence.
Benchmarks in navigation or manipulation could gain from similar requirements to close the action loop.
Robotic systems would likely benefit from training that rewards information-gathering actions over pure perception accuracy.
The metacognitive gap may require new architectures rather than scaling existing perception or interaction alone.

Load-bearing premise

The 10 task categories and 29 subcategories sufficiently isolate the perception-action loop without confounding effects from simulator physics or task design choices.

What would settle it

A test that supplies models with oracle action sequences matching human strategies and measures whether performance gaps and premature high-confidence commitments disappear.
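
One way to read such a test is as an error decomposition over the same tasks run twice, once with model-chosen actions and once with oracle actions. The sketch below is an assumption about how that bookkeeping could look, not a procedure taken from the paper; `decompose_errors` and the example outcome lists are hypothetical.

```python
def decompose_errors(active_correct, oracle_correct):
    """Split failures of the active agent by whether oracle actions fix them.

    active_correct[i] / oracle_correct[i] are booleans for the same task i,
    answered with model-chosen actions and with oracle actions respectively.
    """
    if len(active_correct) != len(oracle_correct) or not active_correct:
        raise ValueError("need two non-empty outcome lists of equal length")

    n = len(active_correct)
    pairs = list(zip(active_correct, oracle_correct))

    # Wrong with own actions, right with oracle actions: attributable to action choice.
    action_attributable = sum(1 for a, o in pairs if not a and o)
    # Wrong in both conditions: not fixed by better actions alone.
    residual = sum(1 for a, o in pairs if not a and not o)
    # Active-to-oracle gap in accuracy points.
    gap_points = 100.0 * (sum(oracle_correct) - sum(active_correct)) / n

    return {
        "action_attributable": action_attributable,
        "residual": residual,
        "gap_points": gap_points,
    }


# Made-up outcomes for five tasks, for illustration only.
active = [True, False, False, True, False]
oracle = [True, True, False, True, True]
print(decompose_errors(active, oracle))
```

If most failures land in `action_attributable`, that is consistent with the action-blindness reading; a large `residual` would instead point to limits that better actions do not remove.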

Figures

Figures reproduced from arXiv: 2605.18746 by Han Yin, Jiageng Liu, Jiajun Wu, Leonidas Guibas, Li Fei-Fei, Manling Li, Yejin Choi, Yining Hong.

**Figure 1.** ESI-BENCH is a comprehensive benchmark for embodied spatial intelligence, spanning 10 task categories and 29 subcategories organized around Spelke’s four core knowledge systems [Spelke and Kinzler, 2007]: object representation, layout and geometry, number representation, and agents and goal-directed actions.

**Figure 2.** Overview of ESI-BENCH: dataset example, agent action space, and task distribution. … determines the initial positions of both the objects and the agent within the scene, and generates a ground-truth action trajectory providing the optimal sequence of actions needed to resolve the task. The selected objects and their spatial configuration implicitly define the task, with the ground-truth answer y* derived di…

**Figure 3.** ESI-Bench task categories (left). Combination and level of embodied action types (right).

**Figure 4.** Qualitative Study, showing success & failure modes and reasons behind model behavior. … to identify the correct real-world correspondence altogether (Figure 4c). These cases indicate hard perceptual limits that no action strategy can overcome. The active-to-oracle gap further shows that action and perception failures cascade and compound: on Counting w Occlusion, the GPT-5 gap reaches 43.4 points, and on Str…

**Figure 5.** Average number of active exploration steps to reach a correct answer for GPT-5 (solid) and …

**Figure 6.** Subcategory distribution within each of the 10 ESI-B…

**Figure 7.** Additional benchmark examples from ESI-BENCH, organized by core knowledge systems: object representation, layout and geometry, number representation, and agents and goal-directed actions.

**Figure 8.** Additional qualitative examples illustrating emergent agent behaviors and failure modes: …

**Figure 9.** Step budget ablation for Gemini 3.1 Active. Performance rises quickly up to 15–20 steps, …

Original abstract

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ESI-Bench adds a new active-exploration angle to spatial benchmarks in OmniGibson, with results that flag action selection as a bigger limiter than perception for current models.

The main thing here is that the paper builds ESI-Bench to test spatial intelligence as a perception-action loop rather than passive viewing, and their runs on MLLMs show active policies beating passive and random multi-view baselines while most errors trace to poor action choices that produce bad observations downstream. Human comparisons add that models lock in answers too soon without hunting for contradictory views the way people do.

What is actually new is the benchmark's 10 categories and 29 subcategories, grounded in Spelke's core knowledge systems, that require agents to decide when to perceive, move, or manipulate to accumulate evidence on occlusion, containment, dynamics, and the like. The setup in OmniGibson forces sequencing that prior passive spatial tasks skipped. The paper does well at documenting the directional gains from active exploration and at noting that imperfect 3D grounding can distort relations more than plain 2D on some tasks. The empirical patterns are straightforward and the metacognitive gap with humans is a useful observation.

The soft spot is whether the tasks cleanly separate action blindness from simulator-specific factors. If passive views suffer systematically from object instability, occlusion patterns, or locomotion costs that active policies happen to sidestep, the attribution to action choice alone needs more support. The stress-test note on missing ablations that hold perception fixed while varying only action sequencing is fair; without those or oracle-navigation checks for the passive case, the central claim stays preliminary even if the overall direction holds.

This paper is for people working on embodied agents, robotics, and multimodal models that must operate under uncertainty. Readers who care about moving benchmarks past static scenes would find the task design and human-model contrasts useful. It has enough new ground and clear empirical results to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The paper introduces ESI-Bench, a benchmark for embodied spatial intelligence with 10 task categories and 29 subcategories grounded in Spelke's core knowledge systems and implemented in OmniGibson. It evaluates MLLMs on active vs. passive observation, reporting that active exploration substantially outperforms passive baselines, that most failures arise from 'action blindness' (poor action choices leading to poor observations and cascading errors) rather than weak perception, that explicit 3D grounding can harm performance on some tasks, and that models commit prematurely with high confidence unlike humans who seek falsifying viewpoints.

Significance. If the central empirical claims hold after addressing controls, the work would be significant for highlighting the perception-action loop as a key bottleneck in current multimodal models and for providing a cognitively grounded benchmark that distinguishes active strategies from passive or random multi-view approaches. The direct human-model comparisons and identification of a metacognitive gap offer concrete directions for embodied AI development.

major comments (2)

[Experiments / Results] The attribution of most failures to action blindness rather than perception (abstract and results sections) is load-bearing for the central claim but rests on the assumption that passive observation quality is not systematically degraded by OmniGibson dynamics. No ablations are reported that hold perception fixed while varying only action sequencing or that quantify passive performance gains under oracle navigation or stabilized physics.
[Benchmark Construction] The claim that the 10 categories / 29 subcategories cleanly isolate the perception-action loop (task design section) requires evidence that simulator-specific factors (object stability, occlusion patterns, locomotion costs) do not confound passive baselines in ways that active policies incidentally mitigate. Without such checks, the performance gap cannot be confidently attributed to action choice.

minor comments (2)

[Experiments] Clarify the exact statistical tests and sample sizes used for active vs. passive and model vs. human comparisons; the abstract reports directional results but details on controls and significance levels are needed for reproducibility.
[Results] The statement that 'random multi-view often adds noise rather than signal' should be supported with quantitative metrics (e.g., accuracy deltas or error breakdowns) rather than qualitative description.
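
As an illustration of the kind of reporting these minor comments ask for, the sketch below computes a paired bootstrap confidence interval for the accuracy difference between two conditions evaluated on the same tasks. It is a generic statistical sketch, not the authors' analysis; `paired_bootstrap_delta`, the resample count, the seed, and the example outcomes are all assumptions.

```python
import random


def paired_bootstrap_delta(cond_a, cond_b, n_resamples=2000, seed=0):
    """Percentile-bootstrap 95% interval for accuracy(cond_a) - accuracy(cond_b).

    cond_a[i] and cond_b[i] are 0/1 (or bool) outcomes on the same task i,
    e.g. active vs. passive, or single-view vs. random multi-view.
    Returns (observed_delta, lower, upper) as fractions in [-1, 1].
    """
    if len(cond_a) != len(cond_b) or not cond_a:
        raise ValueError("need two non-empty outcome lists of equal length")

    rng = random.Random(seed)
    n = len(cond_a)
    # Per-task differences keep the pairing between conditions intact.
    diffs = [int(a) - int(b) for a, b in zip(cond_a, cond_b)]
    observed = sum(diffs) / n

    samples = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and recompute the mean difference.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()

    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Made-up outcomes on four tasks, for illustration only.
active = [1, 1, 1, 0]
passive = [1, 0, 0, 0]
delta, lower, upper = paired_bootstrap_delta(active, passive)
print(f"accuracy delta = {delta:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Reporting the delta together with its interval and the number of tasks would let readers judge both the active vs. passive gap and the claim that random multi-view adds noise.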

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address the major concerns regarding the attribution of failures to action blindness and the potential confounding factors in the benchmark construction. We have revised the manuscript to include additional ablations and analyses to strengthen these claims.

Referee: [Experiments / Results] The attribution of most failures to action blindness rather than perception (abstract and results sections) is load-bearing for the central claim but rests on the assumption that passive observation quality is not systematically degraded by OmniGibson dynamics. No ablations are reported that hold perception fixed while varying only action sequencing or that quantify passive performance gains under oracle navigation or stabilized physics.

Authors: We agree that these additional controls would bolster the central claim. In the revised manuscript, we include new experiments that hold the perception component fixed and vary only the action sequencing strategy. We also report passive baseline performance under oracle navigation (perfect path to target viewpoints) and with stabilized physics simulation. These results confirm that the performance gap persists, with active exploration still outperforming even oracle-assisted passive observation by a substantial margin. This supports that the failures are indeed primarily due to action blindness rather than degraded passive observations from simulator dynamics.
revision: yes

Referee: [Benchmark Construction] The claim that the 10 categories / 29 subcategories cleanly isolate the perception-action loop (task design section) requires evidence that simulator-specific factors (object stability, occlusion patterns, locomotion costs) do not confound passive baselines in ways that active policies incidentally mitigate. Without such checks, the performance gap cannot be confidently attributed to action choice.

Authors: We take this concern seriously. To address it, we have added a new section in the revised paper with quantitative analyses of simulator-specific factors. Specifically, we measure object stability, occlusion statistics, and locomotion costs across active and passive trials and show they are balanced. Furthermore, we demonstrate that the active-passive gap remains significant even after normalizing for these factors. We argue that the task categories, grounded in Spelke's core knowledge systems, are designed to require active evidence accumulation, and the controls confirm that the gap is attributable to action choice.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements and no derivations

This is a benchmark introduction and empirical evaluation paper. It defines ESI-Bench tasks explicitly from Spelke's core knowledge systems and OmniGibson simulator, then reports measured performance differences between active exploration policies and passive baselines on MLLMs. No equations, parameter fitting, or first-principles derivations appear; all central claims (active > passive, action blindness as dominant failure mode) rest on direct experimental comparisons that are externally replicable. No self-citation chains, ansatzes, or renamings reduce the results to inputs by construction. The design is self-contained against the stated simulator and task categories.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the existing OmniGibson simulator and Spelke's core knowledge systems as background; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption: Spelke's core knowledge systems provide a valid grounding for defining spatial intelligence tasks.
The abstract states the benchmark is grounded in Spelke's core knowledge systems without further justification or alternative framings.

pith-pipeline@v0.9.0 · 5843 in / 1407 out tokens · 33455 ms · 2026-05-20T10:47:42.423500+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "active exploration substantially outperforms passive counterparts... Most failures stem not from weak perception but from action blindness"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

