SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Bohan Zeng; Bo Wang; Guoqing Huang; Hailong Qu; Haoyang Huang; Hengkang Qiao; Hongcheng Gao; Hongyixuan Yuan; Jiahao Wang; Jianhui Liu

arxiv: 2606.09669 · v1 · pith:WSDI4MEDnew · submitted 2026-06-08 · 💻 cs.AI · cs.CL


Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong


Pith reviewed 2026-06-27 16:33 UTC · model grok-4.3

classification: cs.AI · cs.CL

keywords: spatial reasoning · multimodal agents · interactive benchmark · simulation environments · task success rate · partial observability · long-horizon planning


The pith

A new benchmark across eight simulators shows even top multimodal agents succeed on fewer than 18 percent of interactive spatial tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpatialWorld as a unified evaluation platform for how well multimodal agents handle interactive spatial reasoning when they must actively explore partially observed environments and carry out real-world tasks. It supplies 760 human-annotated tasks, reference trajectories, and terminal-state verifiers that run on eight different simulation engines through one shared protocol and text-action interface. When fifteen current agents are tested, the best performer reaches only 17.4 percent average task success while the strongest open-source model reaches 14.1 percent. The results point to clear shortfalls in active exploration and long-horizon planning. The benchmark is offered as a shared testbed that future agents must clear before claims of robust spatial competence can be accepted.

Core claim

SpatialWorld integrates eight heterogeneous simulation backends under a simulator-agnostic protocol and supplies 760 tasks with human-validated initial states, reference trajectories, and terminal verifiers; under vision-only partial observability and a unified text action space, fifteen advanced agents achieve at most 17.4 percent average task success rate, exposing persistent gaps in active exploration and long-horizon planning.

What carries the argument

SpatialWorld benchmark, a simulator-agnostic collection of tasks and verifiers that forces agents to gather egocentric visual evidence and issue decisions through a single text-based action interface.
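To make the idea of a single text-based action interface concrete, here is a minimal sketch of how one unified action string might be translated into different backend commands. Everything in it is an assumption for illustration: the action syntax, the backend names `indoor_sim` and `driving_sim`, and the command strings are hypothetical and are not taken from the paper or from any real simulator API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Action:
    """One unified action parsed from agent text (hypothetical syntax)."""
    name: str
    args: Tuple[str, ...]


def parse_action(text: str) -> Action:
    """Split a line such as 'move_forward 0.5' into a name and string arguments."""
    parts = text.strip().split()
    if not parts:
        raise ValueError("empty action")
    return Action(name=parts[0].lower(), args=tuple(parts[1:]))


# Hypothetical translators; the command strings are placeholders,
# not the API of any real simulator.
BACKENDS: Dict[str, Dict[str, Callable[[Action], str]]] = {
    "indoor_sim": {
        "move_forward": lambda a: f"agent.step(forward={float(a.args[0])})",
        "pickup": lambda a: f"agent.grasp({a.args[0]!r})",
    },
    "driving_sim": {
        "move_forward": lambda a: f"vehicle.drive(distance={float(a.args[0])})",
    },
}


def to_backend_command(backend: str, text: str) -> str:
    """Translate one unified text action into a backend-specific command string."""
    action = parse_action(text)
    table = BACKENDS[backend]
    if action.name not in table:
        raise KeyError(f"{action.name!r} is not supported by {backend!r}")
    return table[action.name](action)


# The same agent output is routed to two different backends.
print(to_backend_command("indoor_sim", "move_forward 0.5"))
print(to_backend_command("driving_sim", "move_forward 0.5"))
```

The point of the design is that the agent only ever emits the unified string; per-simulator differences live entirely in the lookup table, which is what would allow comparison without simulator-specific prompting.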

If this is right

Task success rates and execution efficiency are often mismatched, so efficiency metrics must be tracked separately.
Performance varies sharply across domains such as household routines and social collaboration, indicating domain-specific weaknesses.
Active exploration under partial observability and long-horizon planning remain the dominant bottlenecks for current agents.
A shared protocol across simulators allows direct comparison of agents without simulator-specific tuning.
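The first point, that success and efficiency can diverge, is easy to see with a toy calculation. The sketch below computes a task success rate alongside a step-efficiency score. The efficiency definition used here (reference-trajectory length divided by agent steps, capped at 1 and averaged over successful episodes) is an assumption chosen for illustration; the source does not spell out how the paper defines its efficiency metric, and the episode data are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    success: bool          # did the terminal-state check pass?
    agent_steps: int       # actions the agent actually took
    reference_steps: int   # length of the reference trajectory


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def mean_step_efficiency(episodes: List[Episode]) -> Optional[float]:
    """Mean of reference_steps / agent_steps over successful episodes, capped at 1.

    Returns None when there are no successful episodes to average over.
    """
    ratios = [
        min(1.0, e.reference_steps / e.agent_steps)
        for e in episodes
        if e.success and e.agent_steps > 0
    ]
    return sum(ratios) / len(ratios) if ratios else None


# Invented data: agent A succeeds more often but wanders; agent B is direct but rarely succeeds.
agent_a = [Episode(True, 40, 10), Episode(True, 20, 10),
           Episode(False, 50, 10), Episode(False, 50, 10)]
agent_b = [Episode(True, 10, 10), Episode(False, 50, 10),
           Episode(False, 50, 10), Episode(False, 50, 10)]

for name, eps in (("A", agent_a), ("B", agent_b)):
    print(name, task_success_rate(eps), mean_step_efficiency(eps))
```

In this invented example agent A has the higher success rate while agent B has the higher efficiency, which is why a single combined score would hide the trade-off.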

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of future agents may need to add explicit spatial memory or mapping modules rather than relying solely on larger models.
The benchmark could be extended by adding human performance baselines on the same tasks to quantify the remaining gap.
Because the action interface is text-only, improvements in language-to-action grounding could raise scores without changing the visual pipeline.

Load-bearing premise

The 760 tasks, reference trajectories, and verifiers across the eight simulators accurately and representatively measure interactive spatial understanding needed for real-world tasks.

What would settle it

An agent that achieves greater than 50 percent average task success rate on the full set of 760 tasks while following the same vision-only and text-action rules would show the reported performance ceiling is not fundamental.
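For a sense of how far the reported 17.4 percent sits from that 50 percent threshold, a rough interval estimate helps. The sketch below computes a Wilson score interval under a strong simplifying assumption: that the 760 tasks can be treated as independent pass/fail trials pooled into one success count. The paper reports an average task success rate, which may be averaged per environment rather than pooled, so this is a back-of-the-envelope check and not a reproduction of the paper's statistics.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: pooled successes approximated as round(0.174 * 760).
n_tasks = 760
successes = round(0.174 * n_tasks)
lo, hi = wilson_interval(successes, n_tasks)
print(f"{successes}/{n_tasks} successes, approx. 95% interval: [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the upper end of the interval stays well below 0.5, so sampling noise over the task set alone would not be expected to close the gap.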

Figures

Figures reproduced from arXiv: 2606.09669 by Bohan Zeng, Bo Wang, Guoqing Huang, Hailong Qu, Haoyang Huang, Hengkang Qiao, Hongcheng Gao, Hongyixuan Yuan, Jiahao Wang, Jianhui Liu, Jingyi Tang, Junming Yang, Nan Duan, Olive Huang, Shihong Huang, Wenbo Li, Wenjie Li, Wentao Zhang, Yi Li, Yinpeng Dong, Zihao Huang.

**Figure 2.** Data construction pipeline of SpatialWorld. We first collect a series of environments, have annotators learn tutorials and write instructions, define success conditions, and then calibrate the data through automated execution validation in virtual environments and human cross-validation.

**Figure 3.** The Observation and Action Interfaces. (a) Flexible environment initialization via direct state loading or action-list execution. (b) A unified interface providing standardized egocentric RGB observations. (c) A structured, unified action space A. (d) Action-to-code mapping that translates unified actions into environment-specific commands, enabling cross-simulator deployment.

**Figure 4.** Task-category counts. Task distribution across different categories.

**Figure 5.** Indoor and outdoor physical domains. Overall TSR across indoor and outdoor physical environments, with environment-level bars for the top-five models in each domain.

**Figure 6.** Complexity profile. Task counts, mean TSR, and mean SE across the three parallel complexity modes in the physical benchmark.

**Figure 7.** Ablation trends. TSR under temperature and history-window settings, together with the signed TSR gap between continuous and discrete action parameterizations.

**Figure 8.** Social and perceptual profiles. Three complementary additional observations: multi-agent social performance, image-resolution sensitivity, and field-of-view sensitivity.

**Figure 9.** Observation Sensitivity Analysis under the Same Viewpoint with Varying Resolutions. We progressively increase the resolution ratio along the x-axis, reaching the highest clarity at 1.0.

**Figure 10.** Why GPT-5 currently outperforms GPT-5.4. GPT-5 achieves higher shared-task TSR in most physical environments, while GPT-5.4 exhibits a stronger tendency toward premature termination. The step-count plots further show that GPT-5 typically spends more actions both when it succeeds and when it fails, consistent with a slower but more persistent search strategy.

**Figure 11.** Failure case of GPT-5 in the AI2-THOR environment. The failure modes include Spatial Disorientation and Premature Termination.

**Figure 12.** Failure case of Gemini-3.1-Pro in the VirtualHome environment. The failure modes include Object Hallucination and Action Loop.

**Figure 13.** Failure cases of Gemini-3.1-Pro in the CARLA environment. The failure modes are Spatial Disorientation and Premature Termination.

**Figure 14.**

Original abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpatialWorld gives a practical multi-simulator benchmark for interactive spatial reasoning and shows current agents top out at low success rates, but the abstract skimps on task construction details.


The main thing to know is that the paper builds SpatialWorld as a single protocol running across eight different simulators, with 760 human-annotated tasks that require agents to explore under partial vision and output text actions. The headline result is that even the best model they test hits only 17.4% task success rate on average.

What is actually new is the attempt to make the evaluation simulator-agnostic while keeping vision-only inputs and a shared action format. That moves past the usual static VQA setups or single-simulator papers. They also supply reference trajectories and terminal-state verifiers for each task, which lets them run consistent scoring.
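A terminal-state verifier of the kind mentioned here can be pictured as a set of predicates evaluated on the final simulator state. The sketch below is one plausible shape for such a check, not the paper's implementation: the state layout (a dictionary with `objects` and `agent` keys), the predicate helpers, and the example task are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Predicate = Callable[[State], bool]


def object_in(obj: str, receptacle: str) -> Predicate:
    """True when `obj` is recorded as being inside or on `receptacle`."""
    return lambda s: s.get("objects", {}).get(obj, {}).get("parent") == receptacle


def agent_within(target: Tuple[float, float], radius: float) -> Predicate:
    """True when the agent's 2D position is within `radius` of `target`."""
    def check(s: State) -> bool:
        x, y = s["agent"]["position"]
        return ((x - target[0]) ** 2 + (y - target[1]) ** 2) ** 0.5 <= radius
    return check


def verify(final_state: State, predicates: List[Predicate]) -> bool:
    """A task counts as solved only if every predicate holds on the final state."""
    return all(p(final_state) for p in predicates)


# Hypothetical task: "put the mug in the sink and stay next to it".
task_predicates = [object_in("mug", "sink"), agent_within((1.0, 2.5), 1.0)]

solved_state = {"objects": {"mug": {"parent": "sink"}},
                "agent": {"position": (1.0, 2.0)}}
failed_state = {"objects": {"mug": {"parent": "table"}},
                "agent": {"position": (1.0, 2.0)}}

print(verify(solved_state, task_predicates))
print(verify(failed_state, task_predicates))
```

Scoring on the final state rather than on the action sequence means any route to the goal counts, which is what makes execution-based scoring consistent across agents that explore differently.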

The evaluation of 15 agents and the follow-up notes on efficiency gaps plus domain differences are straightforward and useful to see. The low numbers line up with the claim that active spatial planning is still hard.

The soft spot is that the abstract gives little on how the tasks were sampled across domains or exactly how the verifiers were built and validated. Without those steps it is harder to judge whether the 17% figure reflects agent limits or benchmark quirks. The full paper may cover this, but it is not visible here.

This is for people working on embodied or multimodal agents who need a broader testbed than current options. The construction is concrete enough and the results are reported plainly, so it deserves a serious referee even if revisions will be needed on the protocol description.

Referee Report

1 major / 1 minor

Summary. The paper introduces SpatialWorld, a unified benchmark for interactive spatial reasoning in multimodal agents. It integrates eight heterogeneous simulation backends under a simulator-agnostic protocol, with 760 human-annotated tasks across domains like household routines and social collaboration. Each task includes a human-validated initial state, reference trajectory, and terminal-state verifier. Evaluation of 15 agents shows low performance, with GPT-5 achieving 17.4% average TSR and Qwen-3.5 at 14.1%, exposing mismatches between success and efficiency plus domain variations.

Significance. If the tasks and verifiers hold, the work is significant for providing the first large-scale, cross-simulator test of active spatial understanding under partial observability. The low TSR results and identified bottlenecks in exploration/planning offer concrete evidence of current MLLM limitations, positioning the benchmark as a useful testbed. The shared protocol across backends is a clear strength for generalizability.

major comments (1)

[Benchmark construction] Benchmark construction (methods section describing task creation and verifiers): The criteria for selecting the 760 tasks, potential annotation biases, and exact implementation of the terminal-state verifiers are not detailed. This is load-bearing for the central TSR claims, as the reported performance gaps (e.g., 17.4% for GPT-5) cannot be interpreted without confirming that the tasks accurately and representatively measure interactive spatial understanding.

minor comments (1)

[Abstract] The abstract mentions 'eight heterogeneous simulation backends' but does not list them or their domains explicitly; adding this would improve clarity without altering the claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of SpatialWorld's significance and for highlighting the need for greater transparency in benchmark construction. We agree that additional detail on task selection, annotation processes, and verifier implementation is warranted to strengthen interpretability of the TSR results and will revise the methods section accordingly.


Referee: [Benchmark construction] Benchmark construction (methods section describing task creation and verifiers): The criteria for selecting the 760 tasks, potential annotation biases, and exact implementation of the terminal-state verifiers are not detailed. This is load-bearing for the central TSR claims, as the reported performance gaps (e.g., 17.4% for GPT-5) cannot be interpreted without confirming that the tasks accurately and representatively measure interactive spatial understanding.

Authors: We acknowledge that the current manuscript provides only high-level descriptions of task creation and verifiers. In the revised version we will insert a new subsection (Methods 3.2) that explicitly details: (1) the multi-stage selection criteria used to curate the 760 tasks across the eight simulators (diversity in domain, horizon length, and required spatial operations, with explicit balancing to avoid over-representation of any single simulator); (2) the annotation protocol, including the number of annotators per task, inter-annotator agreement metrics, and steps taken to reduce selection and confirmation biases (e.g., blind review of initial states and reference trajectories); and (3) the precise implementation of each terminal-state verifier, including the predicate logic, simulator-specific APIs invoked, and example verification traces for representative tasks. These additions will directly support the validity of the reported performance gaps. revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper introduces SpatialWorld as an external benchmark with 760 human-annotated tasks, reference trajectories, and verifiers across eight simulators, then reports direct empirical TSR results from evaluating 15 agents (e.g., GPT-5 at 17.4%). No equations, fitted parameters, derivations, or self-citation chains exist that reduce any claim to prior inputs by construction. The work is self-contained as a benchmark construction and evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms beyond standard evaluation practices, or invented entities are introduced; the contribution is empirical benchmark design and testing.



Reference graph

Works this paper leans on

94 extracted references · 1 canonical work pages

[1]

Introducing claude opus 4.5.https://www.anthropic.com/news/claude-opus-4-5, 2025

Anthropic. Introducing claude opus 4.5.https://www.anthropic.com/news/claude-opus-4-5, 2025

2025
[2]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022

2022
[3]

Qwen2.5-vl technical report,

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,
[4]

URLhttps://arxiv.org/abs/2502.13923

Pith/arXiv arXiv
[5]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025
[6]

Seed2.0, 2026

ByteDance. Seed2.0, 2026. URLhttps://seed.bytedance.com/en/seed2

2026
[7]

Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

arXiv 2025
[8]

Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025

arXiv 2025
[9]

Spider2-v: How far are multi- modal agents from automating data science and engineering workflows?Advances in Neural Information Processing Systems, 37:107703–107744, 2024

Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, et al. Spider2-v: How far are multi- modal agents from automating data science and engineering workflows?Advances in Neural Information Processing Systems, 37:107703–107744, 2024

2024
[10]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

2024
[11]

Robogpt: an llm-based long-term decision-making embodied agent for instruction following tasks.IEEE Transactions on Cognitive and Developmental Systems, 2025

Yaran Chen, Wenbo Cui, Yuanwen Chen, Mining Tan, Xinyao Zhang, Jinrui Liu, Haoran Li, Dongbin Zhao, and He Wang. Robogpt: an llm-based long-term decision-making embodied agent for instruction following tasks.IEEE Transactions on Cognitive and Developmental Systems, 2025

2025
[12]

EmbodiedEval: Evaluate multimodal LLMs as embodied agents.arXiv preprint arXiv:2501.11858, 2025

Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, and Maosong Sun. EmbodiedEval: Evaluate multimodal LLMs as embodied agents.arXiv preprint arXiv:2501.11858, 2025

arXiv 2025
[13]

Gemini 3 pro best for complex tasks and bringing creative concepts to life

Google Deepmind. Gemini 3 pro best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/, 2025

2025
[14]

Proc- thor: Large-scale embodied AI using procedural generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Sal- vador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Proc- thor: Large-scale embodied AI using procedural generation. In Sanmi Koyejo, S. Mo- hamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neu- ral Information Processing Sys...

2022
[15]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017. 12

2017
[16]

Palm-e: an embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. InProceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023

2023
[17]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355, 2024

2024
[18]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024

2024
[19]

Vlm-gronav: Robot naviga- tion using physically grounded vision-language models in outdoor environments

Mohamed Elnoor, Kasun Weerakoon, Gershom Seneviratne, Ruiqi Xian, Tianrui Guan, Mo- hamed Khalid M Jaffar, Vignesh Rajagopal, and Dinesh Manocha. Vlm-gronav: Robot naviga- tion using physically grounded vision-language models in outdoor environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 2391–2398. IEEE, 2025

2025
[20]

Minedojo: Building open-ended embodied agents with internet-scale knowledge.Advances in Neural Information Processing Systems, 35:18343–18362, 2022

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge.Advances in Neural Information Processing Systems, 35:18343–18362, 2022

2022
[21]

Videoagent: A memory-augmented multimodal agent for video understanding

Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. InEuropean Conference on Computer Vision, pages 75–92. Springer, 2024

2024
[22]

EmbodiedCity: A benchmark platform for embodied agent in real-world city environment.arXiv preprint arXiv:2410.09604, 2024

Chen Gao, Baining Zhao, Weichen Zhang, Jinzhu Mao, Jun Zhang, Zhiheng Zheng, Fan- hang Man, Jianjie Fang, Zile Zhou, Jinqiang Cui, Xinlei Chen, and Yong Li. EmbodiedCity: A benchmark platform for embodied agent in real-world city environment.arXiv preprint arXiv:2410.09604, 2024

arXiv 2024
[23]

Spatial reasoning with vision-language models in ego-centric multi-view scenes.arXiv preprint arXiv:2509.06266, 2025

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Mohammad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes.arXiv preprint arXiv:2509.06266, 2025

arXiv 2025
[24]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

Pith/arXiv arXiv 2025
[25]

Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919, 2024

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919, 2024

Pith/arXiv arXiv 2024
[26]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

2024
[27]

Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

Pith/arXiv arXiv 2025
[28]

3d concept learning and reasoning from multi-view images

Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. 3d concept learning and reasoning from multi-view images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9202–9212, 2023

2023
[29]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019. 13

2019
[30]

OmniSpatial: Towards comprehensive spatial reasoning benchmark for vision language models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. OmniSpatial: Towards comprehensive spatial reasoning benchmark for vision language models. InInternational Conference on Learning Representations, 2026

2026
[31]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

2017
[32]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

2024
[33]

Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

Pith/arXiv arXiv 2017
[34]

Autobio: A simulation and benchmark for robotic automation in digital biology laboratory.arXiv preprint arXiv:2505.14030, 2025

Zhiqian Lan, Yuxuan Jiang, Ruiqi Wang, Xuanbing Xie, Rongkui Zhang, Yicheng Zhu, Peihang Li, Tianshuo Yang, Tianxing Chen, Haoyu Gao, et al. Autobio: A simulation and benchmark for robotic automation in digital biology laboratory.arXiv preprint arXiv:2505.14030, 2025

arXiv 2025
[35]

igibson 2.0: Object-centric simulation for robot learning of everyday household tasks.arXiv preprint arXiv:2108.03272, 2021

Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, et al. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks.arXiv preprint arXiv:2108.03272, 2021

arXiv 2021
[36]

Embodied agent interface: Benchmarking LLMs for embodied decision making.arXiv preprint arXiv:2410.07166, 2024

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, and Jiajun Wu. Embodied agent interface: Benchmarking LLMs for embodied decision making.arXiv preprint arXiv:2410.07166, 2024

arXiv 2024
[37]

M3dbench: Let’s instruct large models with multi-modal 3d prompts.arXiv preprint arXiv:2312.10763, 2023

Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, and Tao Chen. M3dbench: Let’s instruct large models with multi-modal 3d prompts.arXiv preprint arXiv:2312.10763, 2023

arXiv 2023
[38]

Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning

Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

2022
[39]

From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

Pith/arXiv arXiv 2025
[40]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024

[41]

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI-Video-Bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025

[42]

JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding. arXiv preprint arXiv:2507.07984, 2025

[43]

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

[44]

Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. In European conference on computer vision, pages 126–142. Springer, 2024

[45]

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722, 2025

[46]

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv preprint arXiv:2501.10074, 2025

[47]

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3DSRBench: A comprehensive 3D spatial reasoning benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025

[48]

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474, 2022

[49]

OpenAI. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-2/, 2025

[50]

OpenAI. Gpt-5.4 thinking system card, 2026. URL https://openai.com/index/gpt-5-4-thinking-system-card/

[51]

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8494–8502. Computer Vision Foundation / IEEE Computer Society, 2018. doi: 1...

doi:10.1109/cvpr.2018.00886

[52]

Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. arXiv preprint arXiv:2109.08238, 2021

[53]

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

[54]

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

[55]

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020

[56]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

[57]

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, and Eric Sax. Embodied4c: Measuring what matters for embodied vision-language navigation, 2025. URL https://arxiv.org/abs/2512.18028

[58]

Gemini Team. Gemini 3 pro: the frontier of vision ai, 2025b. URL https://blog.google/technology/developers/gemini-3-pro-vision

[59]

Gemini Team. Gemini 3 flash, 2025b. URL https://deepmind.google/models/gemini/flash/

[60]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[61]

Gemini 2.5 Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

[62]

GLM-V Team. Glm-4.6v: Open source multimodal models with native tool use, 2025a. URL https://z.ai/blog/glm-4.6v

[63]

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

[64]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

[65]

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

[66]

URL https://qwen.ai/blog?id=qwen3.5

[67]

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

[68]

Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In Advances in Neural Information Processing Systems, volume 37, 2024

[69]

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

[70]

Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv preprint arXiv:2312.09245, 2023

[71]

Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, and Boqing Gong. SITE: Towards spatial intelligence thorough evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9058–9069, 2025

[72]

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025

[73]

Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. SpatialScore: Towards unified evaluation for multimodal spatial understanding. arXiv preprint arXiv:2505.17012, 2025

[74]

Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9068–9079, 2018

[75]

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

[76]

Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024

[77]

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

[78]

Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, and Yunjian Zhang. SpatialBench: Benchmarking multimodal large language models for spatial cognition. arXiv preprint arXiv:2511.21471, 2025

[79]

Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. In European Conference on Computer Vision, pages 131–147. Springer, 2024

[80]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...


[35] [35]

igibson 2.0: Object-centric simulation for robot learning of everyday household tasks.arXiv preprint arXiv:2108.03272, 2021

Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, et al. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks.arXiv preprint arXiv:2108.03272, 2021

arXiv 2021

[36] [36]

Embodied agent interface: Benchmarking LLMs for embodied decision making.arXiv preprint arXiv:2410.07166, 2024

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, and Jiajun Wu. Embodied agent interface: Benchmarking LLMs for embodied decision making.arXiv preprint arXiv:2410.07166, 2024

arXiv 2024

[37] [37]

M3dbench: Let’s instruct large models with multi-modal 3d prompts.arXiv preprint arXiv:2312.10763, 2023

Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, and Tao Chen. M3dbench: Let’s instruct large models with multi-modal 3d prompts.arXiv preprint arXiv:2312.10763, 2023

arXiv 2023

[38] [38]

Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning

Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

2022

[39] [39]

From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

Pith/arXiv arXiv 2025

[40] [40]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024

2024

[41] [41]

MMSI-Video-Bench: A holistic benchmark for video-based spatial intelligence.arXiv preprint arXiv:2512.10863, 2025

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI-Video-Bench: A holistic benchmark for video-based spatial intelligence.arXiv preprint arXiv:2512.10863, 2025

arXiv 2025

[42] [42]

Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding

JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding. arXiv preprint arXiv:2507.07984, 2025

arXiv 2025

[43] [43]

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

2023

[44] [44]

Llava-plus: Learning to use tools for creating multimodal agents

Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean conference on computer vision, pages 126–142. Springer, 2024. 14

2024

[45] [45]

Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722, 2025

arXiv 2025

[46] [46]

Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning.arXiv preprint arXiv:2501.10074, 2025

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning.arXiv preprint arXiv:2501.10074, 2025

arXiv 2025

[47] [47]

3DSRBench: A comprehensive 3D spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3DSRBench: A comprehensive 3D spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025

2025

[48] [48]

Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

arXiv 2022

[49] [49]

Introducing gpt-5.2.https://openai.com/index/introducing-gpt-5-2/, 2025

OpenAI. Introducing gpt-5.2.https://openai.com/index/introducing-gpt-5-2/, 2025

2025

[50] [50]

Gpt -5.4 thinking system card, 2026

OpenAI. Gpt -5.4 thinking system card, 2026. URL https://openai.com/index/ gpt-5-4-thinking-system-card/

2026

[51] [51]

Virtualhome: Simulating household activities via programs

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. In2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8494–8502. Computer Vision Foundation / IEEE Computer Society, 2018. doi: 1...

work page doi:10.1109/cvpr.2018.00886 2018

[52] [52]

Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021

Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021

Pith/arXiv arXiv 2021

[53] [53]

Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573, 2024

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573, 2024

Pith/arXiv arXiv 2024

[54] [54]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

2019

[55] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020

[56] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

[57] Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, and Eric Sax. Embodied4c: Measuring what matters for embodied vision-language navigation, 2025. URL https://arxiv.org/abs/2512.18028

[58] Gemini Team. Gemini 3 pro: the frontier of vision ai, 2025b. URL https://blog.google/technology/developers/gemini-3-pro-vision

[59] Gemini Team. Gemini 3 flash, 2025b. URL https://deepmind.google/models/gemini/flash/

[60] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[61] Gemini 2.5 Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

[62] GLM-V Team. Glm-4.6v: Open source multimodal models with native tool use, 2025a. URL https://z.ai/blog/glm-4.6v

[63] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

[64] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

[65] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

[66] URL https://qwen.ai/blog?id=qwen3.5

[67] Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

[68] Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In Advances in Neural Information Processing Systems, volume 37, 2024

[69] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

[70] Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv preprint arXiv:2312.09245, 2023

[71] Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, and Boqing Gong. SITE: Towards spatial intelligence thorough evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9058–9069, 2025

[72] Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025

[73] Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. SpatialScore: Towards unified evaluation for multimodal spatial understanding. arXiv preprint arXiv:2505.17012, 2025

[74] Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9068–9079, 2018

[75] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

[76] Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024

[77] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

[78] Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, and Yunjian Zhang. SpatialBench: Benchmarking multimodal large language models for spatial cognition. arXiv preprint arXiv:2511.21471, 2025

[79] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. In European Conference on Computer Vision, pages 131–147. Springer, 2024

[80] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 technical report, 2025