DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

Anh Nguyen; Chase Rainwater; Duc Minh Nguyen; Duy Minh Ho Nguyen; Gladys Gawugah; Hao Vo; Khoa Vo; Ngan Le; Nghi D. Q. Bui; Ngo Xuan Cuong

arxiv: 2605.23176 · v1 · pith:FKTQIMTMnew · submitted 2026-05-22 · 💻 cs.CV

Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

Pith reviewed 2026-05-25 05:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords: DriveSpatial, vision-language models, autonomous driving, spatiotemporal reasoning, scene construction, multi-view understanding, temporal reasoning, benchmark evaluation

The pith

Vision-language models trail humans by 28.4 points on a benchmark for spatiotemporal intelligence in autonomous driving

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DriveSpatial, a benchmark of 15.6K human-verified QA pairs drawn from five large-scale autonomous driving datasets to evaluate whether vision-language models can integrate multi-view observations into coherent scenes and reason about spatial relations, interactions, and future dynamics. Questions are generated from a dynamic multi-relational scene graph to enforce genuine cross-view and temporal reasoning across four targeted abilities. Evaluation of 15 representative VLMs shows the strongest model falls 28.4 points short of human performance, with cognitive scene construction identified as the primary shortfall. These findings indicate that current models do not yet possess the scene-construction capacity required for reliable spatiotemporal driving intelligence.

Core claim

The paper establishes that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence, as shown by a 28.4-point gap between the strongest model and humans on the DriveSpatial benchmark, where cognitive scene construction emerges as the key bottleneck.

What carries the argument

The dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences to generate QA pairs enforcing genuine cross-view and spatiotemporal reasoning.
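
To make the ingredients named above concrete, here is a minimal Python sketch of one way such a graph could be laid out. The class and field names (`ObjectState`, `Relation`, `SceneGraph`, `track`, `relations_at`), the camera names, and the predicate labels are illustrative assumptions, not the authors' schema; this page does not describe the paper's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectState:
    """One object at one timestep (hypothetical schema)."""
    object_id: str
    t: int
    category: str
    position: tuple[float, float]  # ego-frame x, y in metres (assumed convention)
    cameras: frozenset[str]        # cameras in which the object is visible


@dataclass(frozen=True)
class Relation:
    """A typed, directed edge between two objects at one timestep."""
    t: int
    subject_id: str
    predicate: str                 # e.g. "in_front_of" (assumed label)
    object_id: str


@dataclass
class SceneGraph:
    states: dict[tuple[str, int], ObjectState] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add_state(self, s: ObjectState) -> None:
        self.states[(s.object_id, s.t)] = s

    def add_relation(self, r: Relation) -> None:
        self.relations.append(r)

    def track(self, object_id: str) -> list[ObjectState]:
        """Temporal correspondence: all states of one object, ordered by time."""
        return sorted(
            (s for (oid, _), s in self.states.items() if oid == object_id),
            key=lambda s: s.t,
        )

    def relations_at(self, t: int) -> list[Relation]:
        """All relation edges recorded for timestep t."""
        return [r for r in self.relations if r.t == t]


# Toy scene, invented purely for illustration.
g = SceneGraph()
g.add_state(ObjectState("car_1", 0, "car", (10.0, 2.0), frozenset({"CAM_FRONT"})))
g.add_state(ObjectState("car_1", 1, "car", (8.0, 2.5), frozenset({"CAM_FRONT", "CAM_FRONT_LEFT"})))
g.add_state(ObjectState("ped_7", 0, "pedestrian", (-6.0, -3.0), frozenset({"CAM_BACK"})))
g.add_relation(Relation(0, "car_1", "in_front_of", "ped_7"))
```

Keying states by `(object_id, t)` keeps the per-object track and the per-timestep relations queryable from the same structure, which is the property a question generator would need in order to ask about both views and time.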

If this is right

Explicit BEV grounding consistently improves VLM performance on the benchmark tasks.
Language-only prompting is insufficient for achieving strong results on spatiotemporal driving questions.
The benchmark isolates four distinct abilities that can be measured separately through its construction pipeline.
Releasing DriveSpatial and its pipeline will enable targeted research on improving VLM scene construction for driving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures with explicit mechanisms for maintaining object continuity across views and time may be needed beyond current prompting approaches.
The identified gap could motivate training methods that reward internal reconstruction of dynamic scenes rather than direct QA accuracy.
Comparable limitations may appear in other multi-sensor temporal reasoning settings such as robotics navigation.

Load-bearing premise

The QA pairs generated from the dynamic multi-relational scene graph accurately isolate and test the four claimed abilities without allowing models to exploit shortcuts or dataset artifacts.

What would settle it

A VLM that reaches human-level accuracy on DriveSpatial using only its standard training and without any explicit scene construction mechanisms or graph-derived supervision would challenge the conclusion that current models inherently lack scene-construction ability.

Figures

Figures reproduced from arXiv: 2605.23176 by Anh Nguyen, Chase Rainwater, Duc Minh Nguyen, Duy Minh Ho Nguyen, Gladys Gawugah, Hao Vo, Khoa Vo, Ngan Le, Nghi D. Q. Bui, Ngo Xuan Cuong, Phu Loc Nguyen, Sieu Tran, Sreevenkata Anjani Tishita Godavarthi.

**Figure 1.** We present DRIVESPATIAL: A spatiotemporal intelligence evaluation benchmark for Autonomous Driving that mirrors human navigation cognition. (I, Top) In driving scenarios, humans gather observations from multiple viewpoints to mentally construct an internal representation (Cognitive Scene Construction), infer spatial relationships between objects (Multi-view Relational Understanding), and connect these pe…

**Figure 2.** Representative question samples from DRIVESPATIAL across nine selected tasks (out of 20). Each cell shows a multiple-choice question with its visual input and answer options; correct answers are bold. Tasks are grouped by spatiotemporal ability: Const., Unders., Reas. Spatial and Spatiotemporal Intelligence in VLMs. A growing body of work probes whether VLMs possess genuine spatial intelligence. General-…

**Figure 3.** DRIVESPATIAL statistics. (Left) Sunburst view of the 20 tasks under abilities Const., Unders., and Reas. (Right) Scene-level diversity distribution (Gen.). … relationships across viewpoints, Reas. asks whether it can leverage temporal context to infer dynamics and anticipate future events, and Gen. measures whether these abilities remain reliable across datasets and driving conditions. Task Taxonomy & S…

**Figure 4.** DRIVESPATIAL construction pipeline. (1) standardize five AV datasets into a unified schema; (2) complete scene-level metadata; (3) construct a dynamic multi-relational graph; and (4) apply 20 rule-based algorithms to generate QA pairs. To ensure quality, human-in-the-loop is applied. … $\mathrm{cam}(v_i^t) \cap \mathrm{cam}(v_j^t) = \emptyset$ for pairwise relation queries. These constraints prevent the answer from being recovered from …

**Figure 5.** Per-task comparison against human performance. (left …

**Figure 6.** Breakdown of VLM performance for testing …
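
The text fragment attached to the Figure 4 caption mentions a constraint that the camera sets of two objects be disjoint for pairwise relation queries. The sketch below shows one plausible reading of that filter in Python. The function name `cross_view_pairs`, the dictionary layout, the camera names, and the choice to skip objects that no camera sees are all assumptions made for illustration; the paper's actual rule may differ.

```python
from itertools import combinations


def cross_view_pairs(visibility: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return object pairs whose camera sets are disjoint at one timestep.

    `visibility` maps an object id to the set of cameras that see it.
    Objects seen by no camera are skipped (an assumption of this sketch),
    since no image would show them.
    """
    visible = {oid: cams for oid, cams in visibility.items() if cams}
    return [
        (a, b)
        for a, b in combinations(sorted(visible), 2)
        if visible[a].isdisjoint(visible[b])
    ]


# Toy frame, invented purely for illustration.
frame = {
    "car_1": {"CAM_FRONT"},
    "truck_2": {"CAM_FRONT", "CAM_FRONT_RIGHT"},
    "ped_7": {"CAM_BACK"},
    "bike_9": set(),
}
pairs = cross_view_pairs(frame)
# car_1 and truck_2 share CAM_FRONT, so that pair is excluded.
```

A pair that passes this filter never appears together in a single image, so a question about its relation cannot be answered from one view alone.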

Original abstract

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.
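
As a rough illustration of what rule-based QA generation from a graph fact can look like, the Python sketch below turns one relation triple into a four-way multiple-choice item. It is not the authors' pipeline: the predicate list, the question template, and the function name `make_relation_question` are assumptions chosen only to show the idea.

```python
import random

# Assumed label set; the benchmark's real predicates are not listed on this page.
PREDICATES = ["in front of", "behind", "to the left of", "to the right of"]


def make_relation_question(subject, predicate, obj, rng):
    """Turn one (subject, predicate, object) fact into a multiple-choice item."""
    if predicate not in PREDICATES:
        raise ValueError(f"unknown predicate: {predicate}")
    options = PREDICATES[:]   # the true predicate plus three distractors
    rng.shuffle(options)      # randomize answer position to avoid a positional bias
    letters = "ABCD"
    return {
        "question": f"Where is the {subject} relative to the {obj}?",
        "options": {letters[i]: f"{options[i].capitalize()} the {obj}" for i in range(4)},
        "answer": letters[options.index(predicate)],
    }


# Toy fact, invented purely for illustration.
item = make_relation_question("white van", "behind", "cyclist", random.Random(0))
```

Passing an explicit `random.Random` instance keeps generation reproducible, which matters when generated items are later verified by humans.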

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DriveSpatial uses dynamic scene graphs to build a 15.6K-question benchmark that flags scene construction as the main VLM weakness in driving, but the evidence that questions truly isolate that ability rather than permit shortcuts is thin.

DriveSpatial is worth knowing about because it introduces a benchmark generated from dynamic multi-relational scene graphs that tests four specific abilities in VLMs for autonomous driving. Its results point to cognitive scene construction as the main area where models fall short, with the strongest model trailing human performance by 28.4 points.

The paper takes five large AD datasets and builds scene graphs that encode object states, spatial relations, interactions, camera visibility, and temporal correspondences. From these graphs the authors create 15.6K human-verified QA pairs across 20 tasks, evaluate 15 VLMs, and compare them with humans. They also test language-only against BEV prompting and find that the latter improves results. The approach is new in how it uses the graph to enforce cross-view and temporal questions at scale; earlier benchmarks tended to be more limited in scope. The work is solid in its scale and in its planned release of the data and pipeline, and the human baseline and the prompting diagnostic add value.

Where it is softer is in the details of task validation. The central claim depends on the QA pairs actually requiring the four abilities without allowing shortcuts such as language patterns or dataset biases. The abstract states that the graph enables genuine reasoning, but if the verification process focuses mainly on factual accuracy rather than confirming that each question demands the target cognitive skill, then the gap might not fully reflect model limitations. The stress-test concern about shortcut exploitation is reasonable given the information in the abstract.

This paper is aimed at researchers in vision-language models applied to robotics and autonomous driving. Anyone building or evaluating such models will get concrete numbers and a new test set from it. It deserves serious peer review because the benchmark construction is a clear step forward and the results are worth examining closely, even with the need for more evidence on the task design.

Referee Report

2 major / 2 minor

Summary. The paper introduces DriveSpatial, a benchmark of 15.6K human-verified QA pairs spanning 20 tasks drawn from five large-scale autonomous driving datasets. The benchmark is constructed from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences. It targets four abilities (Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization) and evaluates 15 VLMs, reporting that the strongest model trails the human baseline by 28.4 points with Cognitive Scene Construction identified as the primary bottleneck. Additional diagnostics indicate that language-only prompting is insufficient while explicit BEV grounding improves results.

Significance. If the QA pairs validly isolate the four target abilities without permitting shortcut solutions, the reported human-model gap would constitute a clear empirical demonstration that current VLMs lack reliable scene-construction capabilities for spatiotemporal driving tasks. The public release of the benchmark and its construction pipeline would provide a reusable resource for the community.

major comments (2)

[§3] Benchmark Construction: The claim that the dynamic multi-relational scene graph 'enables QA pairs that enforce genuine cross-view and spatiotemporal reasoning' is central to attributing the 28.4-point gap to missing scene-construction ability rather than benchmark artifacts. The manuscript describes human verification of factual correctness but supplies no explicit controls, adversarial testing, or ablation showing that individual questions cannot be solved via language priors, single-view cues, or dataset biases.
[§4.3] Per-Ability Breakdown and Table 2: The identification of Cognitive Scene Construction as the key bottleneck rests on the per-task accuracy tables. Without reported human error analysis or item-level validation that each of the 20 tasks requires the stated ability and resists non-spatiotemporal shortcuts, the attribution of the gap to this specific ability remains under-supported.
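
One common control of the kind requested in the first major comment is a "blind" baseline: answer each multiple-choice item without the images and compare per-task accuracy with chance. The Python sketch below shows the arithmetic under stated assumptions. The record keys (`task`, `n_options`, `blind_right`), the task names, and the 0.10 margin are hypothetical, and the toy records are invented; nothing here comes from the paper's results.

```python
from collections import defaultdict


def flag_shortcut_tasks(records, margin=0.10):
    """Flag tasks where a text-only model beats chance by more than `margin`.

    Each record is a dict with hypothetical keys:
      task        - task name
      n_options   - number of answer choices for the item
      blind_right - True if the model answered correctly without images
    Returns {task: excess accuracy over chance} for flagged tasks.
    """
    hits = defaultdict(int)
    chance = defaultdict(float)
    count = defaultdict(int)
    for r in records:
        hits[r["task"]] += int(r["blind_right"])
        chance[r["task"]] += 1.0 / r["n_options"]  # expected hits from guessing
        count[r["task"]] += 1
    flagged = {}
    for task, n in count.items():
        excess = (hits[task] - chance[task]) / n
        if excess > margin:
            flagged[task] = round(excess, 3)
    return flagged


# Toy records, invented purely to exercise the function.
toy = (
    [{"task": "relative_position", "n_options": 4, "blind_right": i < 2} for i in range(8)]
    + [{"task": "object_count", "n_options": 4, "blind_right": i < 6} for i in range(8)]
)
result = flag_shortcut_tasks(toy)
```

A task flagged this way would be a candidate for closer inspection, since above-chance accuracy without visual input suggests that language priors or answer-distribution bias carry some of the signal.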

minor comments (2)

[§4.1] The selection criteria and full list of the 15 evaluated VLMs are referenced in §4.1 but would benefit from an explicit table or appendix entry for reproducibility.
[Figure 2] The example QA pairs would be clearer with explicit annotation of which scene-graph relations are required to answer each question correctly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, proposing revisions to strengthen the supporting evidence for our claims.

Referee: [§3] Benchmark Construction: The claim that the dynamic multi-relational scene graph 'enables QA pairs that enforce genuine cross-view and spatiotemporal reasoning' is central to attributing the 28.4-point gap to missing scene-construction ability rather than benchmark artifacts. The manuscript describes human verification of factual correctness but supplies no explicit controls, adversarial testing, or ablation showing that individual questions cannot be solved via language priors, single-view cues, or dataset biases.

Authors: The dynamic multi-relational scene graph encodes cross-camera visibility, temporal object correspondences, and multi-relational facts by construction, so that questions are generated only when the required information spans multiple views or time steps; human verification then confirms factual correctness of the resulting QA pairs. We agree, however, that this design argument would be substantially strengthened by explicit controls. We will add an ablation subsection that includes (i) performance of language-only models on the full benchmark, (ii) single-view versus multi-view question subsets, and (iii) a small set of adversarial questions designed to be solvable by dataset bias alone. These additions will appear in the revised §3. revision: yes

Referee: [§4.3] Per-Ability Breakdown and Table 2: The identification of Cognitive Scene Construction as the key bottleneck rests on the per-task accuracy tables. Without reported human error analysis or item-level validation that each of the 20 tasks requires the stated ability and resists non-spatiotemporal shortcuts, the attribution of the gap to this specific ability remains under-supported.

Authors: Task groupings were derived directly from the scene-graph properties used to generate each question (e.g., tasks requiring object-state fusion across cameras are labeled Cognitive Scene Construction). Human performance was collected on the identical item set to establish the reference. We acknowledge that an item-level analysis of shortcut resistance and human error patterns would provide stronger corroboration. We will therefore append, in the revised §4.3, (a) representative question examples annotated with the minimal scene-graph facts needed to answer them correctly and (b) a brief human-error categorization on a 200-item subsample. These changes will be reflected in an updated Table 2 caption and accompanying text. revision: yes
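
The per-ability comparison discussed in this exchange reduces to a simple computation: accuracy per ability for humans and for a model on the same items, then the difference in percentage points. The Python sketch below illustrates it. The function names are hypothetical and the outcomes are toy data invented for illustration; they are not the paper's numbers and do not reproduce the reported 28.4-point gap.

```python
from collections import defaultdict


def accuracy_by_ability(items):
    """Return {ability: accuracy in percent} from (ability, is_correct) pairs."""
    right = defaultdict(int)
    total = defaultdict(int)
    for ability, ok in items:
        right[ability] += int(ok)
        total[ability] += 1
    return {a: 100.0 * right[a] / total[a] for a in total}


def gap_in_points(human_items, model_items):
    """Human minus model accuracy per ability, in percentage points."""
    human = accuracy_by_ability(human_items)
    model = accuracy_by_ability(model_items)
    return {a: human[a] - model[a] for a in human if a in model}


# Toy outcomes, invented for illustration only (ability labels follow the figure captions).
human = [("Const.", True)] * 9 + [("Const.", False)] + [("Reas.", True)] * 8 + [("Reas.", False)] * 2
model = [("Const.", True)] * 4 + [("Const.", False)] * 6 + [("Reas.", True)] * 6 + [("Reas.", False)] * 4
gaps = gap_in_points(human, model)
widest = max(gaps, key=gaps.get)  # ability with the largest human-model gap
```

On real data, the ability with the widest gap is what a per-ability table would single out as the bottleneck, which is why item-level validation of the ability labels matters for that conclusion.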

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or self-referential reductions

The paper constructs a benchmark of 15.6K QA pairs from a dynamic multi-relational scene graph and reports empirical VLM performance gaps against humans. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The claim that the graph enables QA pairs enforcing spatiotemporal reasoning is a construction description, not a reduction of an output to its inputs by definition or self-citation. The evaluation is externally benchmarked by human baselines and model scores, rendering the result self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the constructed scene graph faithfully encodes the required spatiotemporal elements for generating unbiased QA pairs that isolate the four abilities.

axioms (1)

domain assumption: The dynamic multi-relational scene graph accurately encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences from the source AD datasets.
Invoked to generate QA pairs that enforce genuine cross-view and spatiotemporal reasoning (abstract).

Reference graph

Works this paper leans on

111 extracted references · 111 canonical work pages · 18 internal anchors

[1]

Studies in spatial learning

Edward C Tolman, Benbow F Ritchie, and Donald Kalish. Studies in spatial learning. ii. place learning versus response learning.Journal of experimental psychology, 36(3):221, 1946

work page 1946
[2]

Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

work page 1948
[3]

The hippocampus and context revisited

Lynn Nadel. The hippocampus and context revisited. 2008

work page 2008
[4]

The effect of vehicle navigation systems on the formation of cognitive maps

Gary E Burnett and Kate Lee. The effect of vehicle navigation systems on the formation of cognitive maps. InInternational conference of traffic and transport psychology, 2005

work page 2005
[5]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025

work page 2025
[6]

Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

work page 2024
[7]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

work page 2024
[8]

Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6585–6597, October 2025

work page 2025
[9]

Vlm-ad: End-to-end autonomous driving through vision-language model supervision.Conference on Robot Learning (CoRL), 2024

Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M Wolff, and Xin Huang. Vlm-ad: End-to-end autonomous driving through vision-language model supervision.Conference on Robot Learning (CoRL), 2024

work page 2024
[10]

Robotron-drive: All-in-one large multimodal model for autonomous driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025

work page 2025
[11]

Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction- based token pruning.the Association for the Advancement of Artificial Intelligence (AAAI), 2026

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, et al. Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction- based token pruning.the Association for the Advancement of Artificial Intelligence (AAAI), 2026

work page 2026
[12]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024

work page 2024
[13]

Covla: Comprehensive vision-language-action dataset for autonomous driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

work page 1933
[14]

Emma: End-to-end multimodal model for autonomous driving.TMLR, 2025

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.TMLR, 2025

work page 2025
[15]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

work page 2024
[16]

DriveVLM: The convergence of autonomous driving and large vision-language models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. In8th Annual Conference on Robot Learning, 2024. 10

work page 2024
[17]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[18]

Nuscenes- spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving.arXiv preprint arXiv:2504.03164, 2025

Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes- spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving.arXiv preprint arXiv:2504.03164, 2025

work page arXiv 2025
[19]

Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024

work page 2024
[20]

Spatial reasoning with vision-language models in ego-centric multi-view scenes.arXiv preprint arXiv:2509.06266, 2025

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Moham- mad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes.arXiv preprint arXiv:2509.06266, 2025

work page arXiv 2025
[21]

Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models

Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models. InNeurIPS, 2025

work page 2025
[22]

Automated evaluation of large vision-language models on self-driving corner cases

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7817–7826. IEEE, 2025

work page 2025
[23]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

work page 2024
[24]

Towards physics- informed spatial intelligence with human priors: An autonomous driving pilot study.International Conference on Learning Representations (ICLR), 2025

Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, and Hao Frank Yang. Towards physics- informed spatial intelligence with human priors: An autonomous driving pilot study.International Conference on Learning Representations (ICLR), 2025

work page 2025
[25]

Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving.Conference and Workshop on Neural Information Processing Systems, 2025

Christian Fruhwirth-Reisinger, Dušan Mali´c, Wei Lin, David Schinagl, Samuel Schulter, and Horst Possegger. Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving.Conference and Workshop on Neural Information Processing Systems, 2025

work page 2025
[26]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

work page 2024
[27]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krish- nan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

work page 2020
[28]

Argoverse 2: Next generation datasets for self-driving perception and forecasting.Conference on Neural Information Processing Systems, 202

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting.Conference on Neural Information Processing Systems, 202

work page
[29]

Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions.Advances in Neural Information Processing Systems, 37:62062–62082, 2024

Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, Andrea Perl, Ulrich V oll, Min Yan, et al. Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions.Advances in Neural Information Processing Systems, 37:62062–62082, 2024

work page 2024
[30]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020. 11

work page 2020
[31]

One million scenes for autonomous driving: Once dataset

Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, et al. One million scenes for autonomous driving: Once dataset. Conference and Workshop on Neural Information Processing Systems, 2021

work page 2021
[32]

Vision language models in autonomous driving: A survey and outlook.arXiv preprint arXiv:2310.14414, 2023

Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C Knoll. Vision language models in autonomous driving: A survey and outlook.arXiv preprint arXiv:2310.14414, 2023

work page arXiv 2023
[33]

A survey on multimodal large language models for autonomous driving.arXiv preprint arXiv:2311.12320, 2023

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, et al. A survey on multimodal large language models for autonomous driving.arXiv preprint arXiv:2311.12320, 2023

work page arXiv 2023
[34]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Llava-onevision: Easy visual task transfer.TMLR, 2024

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.TMLR, 2024

work page 2024
[40]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

[41]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[42]

Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model

Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, and Ngan Le. Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model. Advances in Neural Information Processing Systems, 37:86483–86499, 2024

[43]

Directed-tokens: A robust multi-modality alignment approach to large language-vision models

Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, and Khoa Luu. Directed-tokens: A robust multi-modality alignment approach to large language-vision models. arXiv preprint arXiv:2508.14264, 2025

[44]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

[45]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

[46]

Drivedreamer: Towards real-world-driven world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023

[47]

Dolphins: Multimodal language model for driving

Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. In European Conference on Computer Vision, pages 403–420. Springer, 2024

[48]

Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024

[49]

Visual spatial reasoning

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

[50]

An empirical analysis on spatial reasoning capabilities of large multimodal models

Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, and Yuan-Fang Li. An empirical analysis on spatial reasoning capabilities of large multimodal models. arXiv preprint arXiv:2411.06048, 2024

[51]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. arXiv preprint arXiv:2406.05756, 2024

[52]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

[53]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

[54]

Spatialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025

[55]

Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models

Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17249–17260, 2025

[56]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

[57]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

[58]

Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv preprint arXiv:2501.10074, 2025

[60]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135, 2025

[61]

Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, et al. Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922, 2025

[62]

Spatial-dise: A unified benchmark for evaluating spatial reasoning in vision-language models

Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, and Yi Dong. Spatial-dise: A unified benchmark for evaluating spatial reasoning in vision-language models. arXiv preprint arXiv:2510.13394, 2025

[63]

Mind the gap: Benchmarking spatial reasoning in vision-language models

Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707, 2025

[65]

Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, and Chenming Zhu. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025

[66]

Cambrian-S: Towards Spatial Supersensing in Video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, and Zihan Zhen. Cambrian-s: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025

[67]

Agqa: A benchmark for compositional spatio-temporal reasoning

Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[68]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[69]

Visuospatial perspective taking in multimodal language models

Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, and Lucy Cheke. Visuospatial perspective taking in multimodal language models. arXiv preprint arXiv:2603.23510, 2026

[70]

Egocentric bias in vision-language models

Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, and Hokin Deng. Egocentric bias in vision-language models. arXiv preprint arXiv:2602.15892, 2026

[71]

Allocentric perceiver: Disentangling allocentric reasoning from egocentric visual priors via frame instantiation

Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, and Weiming Zhang. Allocentric perceiver: Disentangling allocentric reasoning from egocentric visual priors via frame instantiation. arXiv preprint arXiv:2602.05789, 2026

[72]

Keep it sympl: Symbolic projective layout for allocentric spatial reasoning in vision-language models

Jaeyun Jang, Seunghui Shin, Taeho Park, and Hyoseok Hwang. Keep it sympl: Symbolic projective layout for allocentric spatial reasoning in vision-language models. arXiv preprint arXiv:2602.19117, 2026

[73]

Capture: Evaluating spatial reasoning in vision language models via occluded object counting

Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, and Mohit Bansal. Capture: Evaluating spatial reasoning in vision language models via occluded object counting. arXiv preprint arXiv:2504.15485, 2025

[74]

Beyond the visible: Benchmarking occlusion perception in multimodal large language models

Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, and Tingting Jiang. Beyond the visible: Benchmarking occlusion perception in multimodal large language models. arXiv preprint arXiv:2508.04059, 2025

[75]

Mind over space: Can multimodal large language models mentally navigate?

Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang, Shiji Zhao, Hanwei Fan, Hang Su, and Xingxing Wei. Mind over space: Can multimodal large language models mentally navigate? arXiv preprint arXiv:2603.21577, 2026

[76]

Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, and Conghui Zhu. Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning. arXiv preprint arXiv:2511.16160, 2025

[77]

Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning

Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, and Xinlei Chen. Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. arXiv preprint arXiv:2504.12680, 2025

[78]

Talk2car: Taking control of your self-driving car

Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie Francine Moens. Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2088–2098, 2019

[79]

Textual explanations for self-driving vehicles

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Proceedings of the European conference on computer vision (ECCV), pages 563–578, 2018

[80]

Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations

Yuichi Inoue, Yuki Yada, Kotaro Tanahashi, and Yu Yamaguchi. Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations. arXiv preprint arXiv:2312.06352, 2023

[81]

Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, and Yu Yamaguchi. Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes. arXiv preprint arXiv:2508.10427, 2025

[82]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. arXiv preprint arXiv:2406.03877, 2024

[1]

Studies in spatial learning

Edward C Tolman, Benbow F Ritchie, and Donald Kalish. Studies in spatial learning. ii. place learning versus response learning. Journal of experimental psychology, 36(3):221, 1946

[2]

Cognitive maps in rats and men

Edward C Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189, 1948

[3]

The hippocampus and context revisited

Lynn Nadel. The hippocampus and context revisited. 2008

[4]

The effect of vehicle navigation systems on the formation of cognitive maps

Gary E Burnett and Kate Lee. The effect of vehicle navigation systems on the formation of cognitive maps. In International conference of traffic and transport psychology, 2005

[5]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025

[6]

Is ego status all you need for open-loop end-to-end autonomous driving?

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

[7]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

[8]

Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6585–6597, October 2025

[9]

Vlm-ad: End-to-end autonomous driving through vision-language model supervision

Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M Wolff, and Xin Huang. Vlm-ad: End-to-end autonomous driving through vision-language model supervision. Conference on Robot Learning (CoRL), 2024

[10]

Robotron-drive: All-in-one large multimodal model for autonomous driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025

[11]

Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, et al. Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning. The Association for the Advancement of Artificial Intelligence (AAAI), 2026

[12]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024

[13]

Covla: Comprehensive vision-language-action dataset for autonomous driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

[14]

Emma: End-to-end multimodal model for autonomous driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. TMLR, 2025

[15]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024

[16]

DriveVLM: The convergence of autonomous driving and large vision-language models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. In 8th Annual Conference on Robot Learning, 2024

[17]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[18]

Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving

Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. arXiv preprint arXiv:2504.03164, 2025

[19]

Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024

[20]

Spatial reasoning with vision-language models in ego-centric multi-view scenes

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Mohammad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes. arXiv preprint arXiv:2509.06266, 2025

[21]

Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models

Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models. In NeurIPS, 2025

[22]

Automated evaluation of large vision-language models on self-driving corner cases

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7817–7826. IEEE, 2025

[23]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. In European Conference on Computer Vision, pages 252–269. Springer, 2024

[24]

Towards physics-informed spatial intelligence with human priors: An autonomous driving pilot study

Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, and Hao Frank Yang. Towards physics-informed spatial intelligence with human priors: An autonomous driving pilot study. International Conference on Learning Representations (ICLR), 2025

[25]

Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving

Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, and Horst Possegger. Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving. Conference and Workshop on Neural Information Processing Systems, 2025

[26]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

[27]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

[28]

Argoverse 2: Next generation datasets for self-driving perception and forecasting

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. Conference on Neural Information Processing Systems, 202

[29]

Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions

Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, Andrea Perl, Ulrich Voll, Min Yan, et al. Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions. Advances in Neural Information Processing Systems, 37:62062–62082, 2024

[30]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

[31]

One million scenes for autonomous driving: Once dataset

Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, et al. One million scenes for autonomous driving: Once dataset. Conference and Workshop on Neural Information Processing Systems, 2021

[32]

Vision language models in autonomous driving: A survey and outlook

Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C Knoll. Vision language models in autonomous driving: A survey and outlook. arXiv preprint arXiv:2310.14414, 2023

[33]

A survey on multimodal large language models for autonomous driving

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, et al. A survey on multimodal large language models for autonomous driving. arXiv preprint arXiv:2311.12320, 2023

[34]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Llava-onevision: Easy visual task transfer.TMLR, 2024

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.TMLR, 2024

work page 2024

[40] [40]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model.Advances in Neural Information Processing Systems, 37:86483–86499, 2024

Khoa V o, Thinh Phan, Kashu Yamazaki, Minh Tran, and Ngan Le. Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model.Advances in Neural Information Processing Systems, 37:86483–86499, 2024

work page 2024

[43] [43]

Directed- tokens: A robust multi-modality alignment approach to large language-vision models.arXiv preprint arXiv:2508.14264, 2025

Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, and Khoa Luu. Directed- tokens: A robust multi-modality alignment approach to large language-vision models.arXiv preprint arXiv:2508.14264, 2025

work page arXiv 2025

[44] [44]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [45]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024

[46] [46]

Drivedreamer: Towards real-world-driven world models for autonomous driving.arXiv preprint arXiv:2309.09777, 2023

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving.arXiv preprint arXiv:2309.09777, 2023. 12

work page arXiv 2023

[47]

Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. In European Conference on Computer Vision, pages 403–420. Springer, 2024

[48]

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024

[49]

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

[50]

Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, and Yuan-Fang Li. An empirical analysis on spatial reasoning capabilities of large multimodal models. arXiv preprint arXiv:2411.06048, 2024

[51]

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. arXiv preprint arXiv:2406.05756, 2024

[52]

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

[53]

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

[54]

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025

[55]

Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17249–17260, 2025

[56]

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

[57]

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

[58]

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv preprint arXiv:2501.10074, 2025

[59] [60]

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135, 2025

[60] [61]

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, et al. Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922, 2025

[61] [62]

Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, and Yi Dong. Spatial-dise: A unified benchmark for evaluating spatial reasoning in vision-language models. arXiv preprint arXiv:2510.13394, 2025

[62] [63]

Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707, 2025

[63] [65]

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, and Chenming Zhu. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025

[64] [66]

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, and Zihan Zhen. Cambrian-s: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025

[65] [67]

Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[66] [68]

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[67] [69]

Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, and Lucy Cheke. Visuospatial perspective taking in multimodal language models. arXiv preprint arXiv:2603.23510, 2026

[68] [70]

Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, and Hokin Deng. Egocentric bias in vision-language models. arXiv preprint arXiv:2602.15892, 2026

[69] [71]

Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, and Weiming Zhang. Allocentric perceiver: Disentangling allocentric reasoning from egocentric visual priors via frame instantiation. arXiv preprint arXiv:2602.05789, 2026

[70] [72]

Jaeyun Jang, Seunghui Shin, Taeho Park, and Hyoseok Hwang. Keep it sympl: Symbolic projective layout for allocentric spatial reasoning in vision-language models. arXiv preprint arXiv:2602.19117, 2026

[71] [73]

Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, and Mohit Bansal. Capture: Evaluating spatial reasoning in vision language models via occluded object counting. arXiv preprint arXiv:2504.15485, 2025

[72] [74]

Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, and Tingting Jiang. Beyond the visible: Benchmarking occlusion perception in multimodal large language models. arXiv preprint arXiv:2508.04059, 2025

[73] [75]

Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang, Shiji Zhao, Hanwei Fan, Hang Su, and Xingxing Wei. Mind over space: Can multimodal large language models mentally navigate? arXiv preprint arXiv:2603.21577, 2026

[74] [76]

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, and Conghui Zhu. Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning. arXiv preprint arXiv:2511.16160, 2025

[75] [77]

Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, and Xinlei Chen. Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. arXiv preprint arXiv:2504.12680, 2025

[76] [78]

Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie Francine Moens. Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2088–2098, 2019

[77] [79]

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–578, 2018

[78] [80]

Yuichi Inoue, Yuki Yada, Kotaro Tanahashi, and Yu Yamaguchi. Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations. arXiv preprint arXiv:2312.06352, 2023

[79] [81]

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, and Yu Yamaguchi. Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes. arXiv preprint arXiv:2508.10427, 2025

[80] [82]

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. arXiv preprint arXiv:2406.03877, 2024