pith. machine review for the scientific record.

arxiv: 2605.02130 · v1 · submitted 2026-05-04 · 💻 cs.CV

Recognition: unknown

From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

Aishwarya Agrawal, Bo-Hsiang Tseng, Cindy Pan, Dhivya Piraviperumal, Hong Yu, Jihan Yang, Jimit Majmudar, Le Zhang, Prasoon Puri, Prathamesh Nandkishor Saraf, Shruti Bhargava, Soundarya Krishnan, Xiou Ge, Yinan Ling

Pith reviewed 2026-05-09 16:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial reasoning · functional reasoning · multimodal LLMs · video benchmark · egocentric video · object affordances · grounded intelligence

The pith

Multimodal LLMs struggle to connect where objects are located with what they are used for in videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new benchmark to measure whether multimodal large language models can advance from basic geometric perception to understanding object purposes and utilities. It divides the evaluation into structured spatial reasoning about room layouts and functional reasoning about affordances and context-specific uses, using questions from real egocentric indoor video scans. Experiments demonstrate that existing models cannot reliably combine spatial memory across frames with functional inference or external knowledge. This integration gap matters for creating AI agents that can act effectively in physical environments rather than merely describing what they see.

Core claim

SFI-Bench is a video-based benchmark containing over 1,500 expert-annotated questions from diverse egocentric indoor scans that evaluates two dimensions of advanced reasoning: structured spatial reasoning for understanding complex layouts and forming coherent spatial representations, and functional reasoning for inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting. Experiments show that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, which the authors identify as a key bottleneck in achieving grounded intelligence.
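To make the task format concrete, here is a minimal sketch of what a benchmark item and a per-task scoring loop could look like; the field names, task labels, frame sampler, and model interface are hypothetical illustrations, not the paper's released schema or code.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class SFIBenchItem:
    # Hypothetical question record; field names are illustrative, not the released format.
    video_id: str   # egocentric indoor scan the question is grounded in
    task: str       # e.g. "conditional_counting", "multi_hop_path", "functional_pairing", "troubleshooting"
    question: str
    options: list   # multiple-choice candidates
    answer: str     # expert-verified ground truth

def evaluate(model, items, sample_frames, frames_per_video=32):
    """Query a multimodal model on each item and report per-task accuracy.

    `model.answer(frames, question, options)` and `sample_frames(video_id, k)` are
    assumed interfaces supplied by the caller, not APIs defined by the paper.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        frames = sample_frames(item.video_id, k=frames_per_video)
        prediction = model.answer(frames, item.question, item.options)
        total[item.task] += 1
        correct[item.task] += int(prediction == item.answer)
    return {task: correct[task] / total[task] for task in total}
```

Reporting accuracy per task, rather than a single pooled score, is what lets a benchmark like this separate failures of spatial memory from failures of functional inference.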

What carries the argument

SFI-Bench, a video benchmark with expert-annotated questions that tests integration of spatial layout understanding and functional affordance inference in multimodal LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The performance gap may arise because current training data rarely requires models to link remembered positions with later functional uses.
  • Extending the benchmark to test whether high scores predict success in physical robotic tasks would clarify its relevance for embodied agents.
  • This points to a need for memory mechanisms in multimodal models that retain spatial details long enough to support purpose-based reasoning.
  • The findings connect to broader challenges in creating AI that can plan actions based on object utilities rather than surface descriptions.

Load-bearing premise

The expert-annotated questions and tasks in the benchmark accurately isolate and measure spatial-functional integration without being confounded by video processing limits, language biases, or annotation subjectivity.

What would settle it

A multimodal model that achieves high accuracy across the functional reasoning and knowledge-grounded troubleshooting tasks by correctly recalling and applying spatial positions from the provided video would indicate the reported integration bottleneck has been resolved.
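One way to operationalize that test, assuming each functional or troubleshooting item could be paired with a spatial-recall probe about the same object, is to condition functional accuracy on correct spatial recall; this pairing is an editorial construction rather than a metric the paper defines.

```python
def integration_score(outcomes):
    """Functional accuracy conditioned on correct spatial recall.

    `outcomes` is an iterable of (spatial_recall_correct, functional_correct) boolean
    pairs, one per question; both flags are assumed to come from separate probes on
    the same item (a hypothetical protocol, not one reported in the paper).
    """
    functional_given_recall = [f for s, f in outcomes if s]
    if not functional_given_recall:
        return 0.0
    return sum(functional_given_recall) / len(functional_given_recall)

# Hypothetical outcomes for four items: spatial recall succeeded on three, and the
# functional answer was also correct on two of those -> 2/3.
print(integration_score([(True, True), (True, False), (False, True), (True, True)]))
```

A model scoring high on this conditional measure while also scoring high on raw spatial recall would be the kind of evidence the criterion above describes.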

Figures

Figures reproduced from arXiv: 2605.02130 by Aishwarya Agrawal, Bo-Hsiang Tseng, Cindy Pan, Dhivya Piraviperumal, Hong Yu, Jihan Yang, Jimit Majmudar, Le Zhang, Prasoon Puri, Prathamesh Nandkishor Saraf, Shruti Bhargava, Soundarya Krishnan, Xiou Ge, Yinan Ling.

Figure 1. From Spatial Cognition to Intelligent Agents. Left: Task pipeline. A video is provided as input, and a multimodal model must reason across frames over both spatial and temporal context to answer a question. Right: Two complementary reasoning abilities evaluated in our benchmark. Top (Where Things Are): spatial reasoning that requires understanding the scene layout and geometric relationships among objects …
Figure 2. Task examples in SFI-Bench. Required grounding cues and reasoning evidence are highlighted using bounding boxes and accompanying text. For functional reasoning tasks, potentially confounding options are underlined. Answers are simplified for readability. For the final two functional reasoning tasks, successful completion requires online search to access relevant operational manuals. the model must composit…
Figure 3. Benchmark curation pipeline. Metadata is extracted and consolidated across multiple MLLM passes, and combined with task-specific templates and few-shot examples to generate candidate questions. Annotators verify all questions and provide answers for the first four tasks, while answers for the two knowledge-grounded tasks are derived from expert-retrieved manuals. Finally, all questions undergo multi-turn h…
Figure 4. Dataset Statistics. Task and video length distribution. Task names are abbreviated for brevity.
Figure 5. Human-annotated analysis of MLLM reasoning chain. The central panel visualizes a conceptual cognitive map reconstructed from the video. Arrows denote navigation trajectories required to answer the question: red arrows indicate the ground-truth path in the environment, while blue arrows represent the model’s inferred reasoning path. Experts annotate errors by projecting the model’s reasoning steps onto the …
Figure 6. Left: Human analysis categorizing task-specific failures. Right: Relationship between reasoning compactness and task accuracy. Colors denote task types, while point size encodes Qwen3-VL model scale (8B, 32B, 235B). Regression lines indicate that larger models tend to produce shorter reasoning chains, which consistently correlate with higher accuracy.
Figure 7. Left: Average reasoning lengths across Qwen3-VL model scales (8B, 32B, 235B). Wrong denotes cases where the reasoning model failed while the instruction model succeeded. Right: Task-level reasoning length comparison across models. The Y-axis shows the average token length for Wrong cases, and the X-axis for all cases. Points above the diagonal and that lie farther away indicate tasks where models tend to p…
Figure 8. (Left) Performance of Gemini-2.5 Pro and GPT-5 under different reasoning budgets. (Right) Illustration of how increased reasoning budgets alter reasoning patterns: higher budgets reveal frequent behaviors such as self-rechecking and complex reasoning chains.
Figure 9. Simplified example of the structured metadata generated for each video.
Figure 10. System prompt used to drive the MLLM for meta information generation.
Figure 11. System prompt for generating Global Conditional Counting tasks.
Figure 12. System prompt for generating Cross-View Multi-hop Path Reasoning tasks.
Figure 13. System prompt for generating Layout Inference tasks.
Figure 14. System prompt for generating Functional Association tasks.
Figure 15. Annotation platform for human verification of questions and answers.
Original abstract

Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenging models to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence. SFI-Bench therefore provides a diagnostic tool for measuring progress toward more cognitively capable and truly grounded multimodal agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SFI-Bench, a video-based benchmark with over 1,500 expert-annotated questions derived from egocentric indoor video scans. It evaluates multimodal LLMs on two dimensions of advanced reasoning—Structured Spatial Reasoning (complex layouts and spatial representations) and Functional Reasoning (object affordances and context-dependent utility)—via tasks including conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting. The central claim is that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, identifying a bottleneck for grounded intelligence beyond existing geometric perception benchmarks.

Significance. If the benchmark tasks validly isolate higher-order spatial-functional integration without low-level confounds, the work would be significant for the field by providing a diagnostic tool that shifts evaluation from low-level perception to cognitively grounded abilities in MLLMs. The independent, expert-annotated construction from real video scans is a clear strength, offering falsifiable predictions about model capabilities.

major comments (2)
  1. [Abstract] The claim that 'our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge' is presented without any details on model selection, quantitative metrics, error bars, statistical tests, or baseline comparisons. This leaves the central empirical claim without visible support from the reported data or methods.
  2. [Benchmark construction and evaluation sections] The tasks (e.g., multi-hop relational reasoning, knowledge-grounded troubleshooting) assume that performance gaps reflect deficits in spatial-functional integration. However, no controls or ablations are described to separate these from documented MLLM weaknesses in long-context video encoding, object tracking across frames, or frame sampling (e.g., no static keyframe baselines, oracle spatial graphs, or perception-only controls). This directly undermines attribution of the observed struggles to the intended higher-order bottleneck.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement of the number of models evaluated and the primary metrics used to quantify 'struggle.'
  2. Consider including a summary table of benchmark statistics (questions per task category, video sources, annotation process) for clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our central claims and strengthen the attribution of results. We have revised the manuscript accordingly, updating the abstract to better support the empirical findings and adding controls and discussion to the evaluation sections.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge' is presented without any details on model selection, quantitative metrics, error bars, statistical tests, or baseline comparisons. This leaves the central empirical claim without visible support from the reported data or methods.

    Authors: We agree that the abstract would benefit from greater specificity to make the central claim more immediately supported. In the revised manuscript, we have updated the abstract to briefly reference the models evaluated (a range of state-of-the-art open-source and proprietary MLLMs), summarize key quantitative performance metrics from the experiments, and direct readers to the detailed tables, baselines, and analyses in Section 4. Full error bars, statistical tests, and comparisons remain in the main body, as they exceed abstract length limits, but the revision ensures the claim is visibly grounded in the reported data. revision: yes

  2. Referee: [Benchmark construction and evaluation sections] The tasks (e.g., multi-hop relational reasoning, knowledge-grounded troubleshooting) assume that performance gaps reflect deficits in spatial-functional integration. However, no controls or ablations are described to separate these from documented MLLM weaknesses in long-context video encoding, object tracking across frames, or frame sampling (e.g., no static keyframe baselines, oracle spatial graphs, or perception-only controls). This directly undermines attribution of the observed struggles to the intended higher-order bottleneck.

    Authors: We acknowledge that the original manuscript did not explicitly describe ablations for low-level video factors. We have added a dedicated paragraph in the revised evaluation section (Section 4) that includes new static keyframe baseline comparisons and perception-only control tasks using simplified recognition questions. These show that low-level issues contribute modestly but that the primary deficits align with spatial-functional integration demands. We have also clarified in Section 3 how the expert-annotated, high-level question design reduces reliance on precise frame-to-frame tracking. Full oracle spatial graphs are noted as future work due to the substantial additional annotation required, but the current results and task structure support the intended attribution. revision: partial
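To make the controls discussed in this exchange concrete, here is a minimal sketch of a harness that scores the same questions under a full-video condition and a static-keyframe baseline; the condition names, frame utilities, and model interface are hypothetical, and this is an editorial illustration rather than the ablation the authors describe adding.

```python
def run_controls(model, items, sample_frames, pick_keyframe, k=32):
    """Score identical questions under two input conditions (hypothetical harness).

    A large gap between the conditions suggests the failures require integrating
    information across views, rather than reflecting single-frame perception limits.
    `model`, `sample_frames`, and `pick_keyframe` are assumed caller-supplied interfaces.
    """
    conditions = {
        "full_video": lambda vid: sample_frames(vid, k=k),    # frames spanning the whole scan
        "static_keyframe": lambda vid: [pick_keyframe(vid)],  # one frame: no cross-view spatial memory
    }
    scores = {}
    for name, frames_for in conditions.items():
        hits = sum(
            int(model.answer(frames_for(item.video_id), item.question, item.options) == item.answer)
            for item in items
        )
        scores[name] = hits / len(items)
    return scores
```

A perception-only control would reuse the same harness but swap each question for a simplified recognition probe about the same objects, isolating pure detection errors from the integration deficit the referee raises.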

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark evaluation

Full rationale

The paper introduces SFI-Bench as an independent, expert-annotated video benchmark with >1500 questions targeting spatial and functional reasoning tasks. It reports direct empirical performance of existing MLLMs on these tasks without any mathematical derivations, equations, fitted parameters, or self-referential definitions that reduce claims to inputs by construction. No load-bearing steps invoke self-citations for uniqueness theorems, ansatzes, or renamed known results. The central claim (model struggles on spatial-functional integration) is an observation from benchmark runs, not a tautological prediction. This matches the default expectation for non-circular empirical benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that expert annotations validly capture functional affordances and that model failures reflect the claimed integration bottleneck rather than unrelated processing issues.

axioms (1)
  • domain assumption: Expert annotations on video scans accurately reflect human-like functional reasoning and spatial relations without significant subjectivity or bias.
    Invoked implicitly when claiming the benchmark probes higher-order cognitive abilities.

pith-pipeline@v0.9.0 · 5582 in / 1228 out tokens · 45453 ms · 2026-05-09T16:56:41.300809+00:00 · methodology

