pith. sign in

arxiv: 2606.22617 · v1 · pith:GF535LANnew · submitted 2026-06-21 · 💻 cs.CV

OmniSpace: Efficient Geometry Awareness for Autonomous Vehicles MLLMs

Pith reviewed 2026-06-26 10:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language modelsautonomous vehiclesgeometry awarenessspatial reasoningepipolar attentiongeometric distillationcross-view correspondence2D observations
0
0 comments X

The pith

OmniSpace adds a Camera Pose Injector, Multi-view Epipolar Attention, and 3D Geometric Distillation to MLLMs so they can reason about 3D space for autonomous driving using only 2D images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that MLLMs can acquire usable spatial intelligence for autonomous vehicles without calling separate 3D models during inference. It starts from the observation that current models struggle with matching points across camera views and estimating depth. The three added modules are meant to move geometric knowledge into the language model directly from 2D training data. If this works, the pipeline becomes simpler and avoids the risk that an auxiliary 3D model will fail and break the whole system. Readers would care because real-world driving systems need reliable 3D understanding yet often cannot afford extra models at test time.

Core claim

OmniSpace is a plug-and-play method that transfers geometric knowledge into MLLMs from purely 2D observations. It does so by injecting camera poses, applying multi-view epipolar attention to strengthen cross-view correspondence, and using a 3D geometric distillation objective to improve depth estimation. The result is higher performance on planning, risk detection, language, and generalization benchmarks without auxiliary 3D models at inference.

What carries the argument

The joint action of the Camera Pose Injector, Multi-view Epipolar Attention module, and 3D Geometric Distillation objective, which together target cross-view correspondence and depth estimation.

If this is right

  • Planning accuracy rises on nuScenes and Bench2Drive because the model now handles depth and view consistency better.
  • Risk detection improves on nuInstruct since geometric cues help identify hazards from language descriptions.
  • Language performance increases on Omnidrive because spatial relations are represented more reliably inside the model.
  • Generalization strengthens on DriveBench as the injected geometry reduces reliance on dataset-specific patterns.
  • The overall pipeline avoids cascading failures that occur when an external 3D model is required at runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three modules could be tested on non-driving 3D tasks such as indoor navigation or robotic manipulation where only 2D images are available.
  • If the distillation step proves stable, similar objectives might be applied to other multimodal models that need implicit 3D structure.
  • Deployment cost drops because inference no longer requires running a separate depth or pose network alongside the language model.

Load-bearing premise

That the main bottlenecks in current MLLMs for spatial AV tasks are weak cross-view correspondence and depth estimation, and that the three modules can transfer the missing geometric knowledge without creating new failure modes.

What would settle it

A controlled test in which the three modules are added to the same base MLLM yet produce no gain (or a loss) on the nuScenes planning metric or introduce measurable new errors in cross-view point matching.

Figures

Figures reproduced from arXiv: 2606.22617 by Anh Nguyen, Duc Minh Nguyen, Duy Minh Ho Nguyen, Hao Vo, Khoa Vo, Ngan Le, Nghi D. Q. Bui, Ngo Xuan Cuong, Phu Loc Nguyen, Sieu Tran.

Figure 1
Figure 1. Figure 1: Comparison between existing paradigms and [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Correlation between cross-view correspondence, depth estimation, and downstream AV [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overall architecture of the proposed OmniSpace. Given synchronized multi-view images, OmniSpace encodes 2D visual tokens with camera-pose-aware ray embeddings, refines them through epipolar-constrained cross-view attention, and distills 3D geometric priors during training. The 3D teacher is discarded at inference. Preliminaries Given multi-view (N) images It = {I 1 t , . . . , I N t } at timestamp t, a sta… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative and quantitative results of VGGT [58] on nuScenes [4] validation sets under current-frame and temporally aggregated inputs. While camera pose injection and multi-view epipolar attention introduce explicit geometric structure into the visual-token pathway, they do not by themselves specify what a 3D-aware rep￾resentation should encode. A straightforward solution is to use LiDAR or an external 3D… view at source ↗
Figure 5
Figure 5. Figure 5: Generalization comparison on DriveBench [68], evaluated using GPT-Score. Evaluation on Risk Detection on NuInstruct. To evaluate the performance of OmniSpace on risk sce￾nario perception, we conduct experiments on the NuInstruct benchmark. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable performance on 2D visual tasks, yet enhancing their spatial intelligence for real-world applications such as Autonomous Vehicles (AV) remains an open challenge. Existing geometry-aware MLLMs typically rely on auxiliary 3D models at inference time, introducing pipeline complexity and the risk of cascading failures. In this paper, we present OmniSpace, a simple yet effective plug-and-play paradigm for geometry-aware spatial reasoning from purely 2D observations. Motivated by our finding that current MLLMs are bottlenecked by weak cross-view correspondence and depth estimation, OmniSpace introduces a Camera Pose Injector, a Multi-view Epipolar Attention module, and a 3D Geometric Distillation objective that jointly address these two limitations by transferring geometric knowledge into the model. Extensive experiments show that OmniSpace surpasses existing methods on planning benchmarks (nuScenes, Bench2Drive), risk detection (nuInstruct), language (Omnidrive), and generalization (DriveBench).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes OmniSpace, a plug-and-play approach to add geometry awareness to MLLMs for autonomous vehicle tasks using only 2D observations. It identifies weak cross-view correspondence and depth estimation as primary bottlenecks, then introduces three components (Camera Pose Injector, Multi-view Epipolar Attention module, and 3D Geometric Distillation objective) to transfer geometric knowledge. The abstract claims that extensive experiments demonstrate superiority over prior methods on planning (nuScenes, Bench2Drive), risk detection (nuInstruct), language (Omnidrive), and generalization (DriveBench) benchmarks.

Significance. If the claimed performance gains are substantiated with proper controls and the modules are shown not to introduce compensating errors, the work would offer a practical alternative to auxiliary 3D models at inference time. The explicit avoidance of cascading failures from external 3D pipelines is a clear practical strength if the results hold.

major comments (3)
  1. Abstract: the claim of surpassing existing methods on five named benchmarks is stated without any quantitative numbers, baselines, error bars, or ablation tables, so the central empirical claim cannot be evaluated from the provided text.
  2. Motivation section (implied by 'Motivated by our finding'): the premise that cross-view correspondence and depth estimation are the dominant bottlenecks is presented as an established finding, yet no supporting analysis, diagnostic experiments, or citations to prior measurements are referenced, leaving the justification for the three modules unsupported.
  3. Method description: the three invented modules (Camera Pose Injector, Multi-view Epipolar Attention, 3D Geometric Distillation) are asserted to jointly solve the stated limitations, but no ablation isolating each module's contribution or testing for new failure modes is mentioned, which is required to substantiate the transfer claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of empirical claims, motivation, and validation. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: the claim of surpassing existing methods on five named benchmarks is stated without any quantitative numbers, baselines, error bars, or ablation tables, so the central empirical claim cannot be evaluated from the provided text.

    Authors: We agree that the abstract would be strengthened by including quantitative highlights. In the revision we will add specific performance deltas (with baseline references) drawn from the experimental tables to make the superiority claims directly evaluable while remaining within abstract length constraints. revision: yes

  2. Referee: Motivation section (implied by 'Motivated by our finding'): the premise that cross-view correspondence and depth estimation are the dominant bottlenecks is presented as an established finding, yet no supporting analysis, diagnostic experiments, or citations to prior measurements are referenced, leaving the justification for the three modules unsupported.

    Authors: The stated motivation rests on observations from our preliminary diagnostics. To address the gap, the revised manuscript will include an expanded motivation subsection with the supporting diagnostic experiments, metrics, and visualizations that identified weak cross-view correspondence and depth estimation as primary limitations. revision: yes

  3. Referee: Method description: the three invented modules (Camera Pose Injector, Multi-view Epipolar Attention, 3D Geometric Distillation) are asserted to jointly solve the stated limitations, but no ablation isolating each module's contribution or testing for new failure modes is mentioned, which is required to substantiate the transfer claim.

    Authors: We acknowledge that isolating each module's contribution and checking for introduced failure modes strengthens the transfer claim. The revision will add dedicated ablation tables and failure-mode analysis experiments that separately evaluate the Camera Pose Injector, Multi-view Epipolar Attention, and 3D Geometric Distillation components. revision: yes

Circularity Check

0 steps flagged

No significant circularity in claimed derivation chain

full rationale

The paper describes OmniSpace as introducing three independent modules (Camera Pose Injector, Multi-view Epipolar Attention, 3D Geometric Distillation) motivated by an external observation on MLLM bottlenecks, with performance claims validated on separate external benchmarks (nuScenes, Bench2Drive, etc.). No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text that would reduce any result to its own inputs by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The central claim depends on an unverified domain assumption about MLLM bottlenecks and on three newly introduced components whose internal parameters and training dynamics are not specified in the available text.

axioms (1)
  • domain assumption Current MLLMs are bottlenecked by weak cross-view correspondence and depth estimation
    Explicitly stated as the motivation for the work in the abstract.
invented entities (3)
  • Camera Pose Injector no independent evidence
    purpose: Inject camera pose information to enable geometry awareness
    New component introduced to address the identified bottlenecks
  • Multi-view Epipolar Attention module no independent evidence
    purpose: Improve cross-view correspondence using epipolar geometry
    New attention module proposed in the method
  • 3D Geometric Distillation objective no independent evidence
    purpose: Transfer geometric knowledge into the MLLM during training
    New training objective introduced to distill 3D information

pith-pipeline@v0.9.1-grok · 5737 in / 1478 out tokens · 34085 ms · 2026-06-26T10:56:16.062333+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

82 extracted references · 22 linked inside Pith

  1. [1]

    Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Large language model-assisted autonomous vehicle recovery from immobi- lization.arXiv preprint arXiv:2510.26023, 2025

    Zhipeng Bao and Qianwen Li. Large language model-assisted autonomous vehicle recovery from immobi- lization.arXiv preprint arXiv:2510.26023, 2025

  4. [4]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020

  5. [5]

    Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

    Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

  6. [6]

    Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

    Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

  7. [7]

    Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions.Advances in Neural Information Processing Systems, 37:62062–62082, 2024

    Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, Andrea Perl, Ulrich V oll, Min Yan, et al. Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions.Advances in Neural Information Processing Systems, 37:62062–62082, 2024

  8. [8]

    Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation.arXiv preprint arXiv:2503.19755, 2025

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation.arXiv preprint arXiv:2503.19755, 2025

  9. [9]

    Spatial reasoning with vision-language models in ego-centric multi-view scenes.arXiv preprint arXiv:2509.06266, 2025

    Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Mohammad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes.arXiv preprint arXiv:2509.06266, 2025

  10. [10]

    End-to-end autonomous driving without costly modularization and 3d manual annotation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Mingzhe Guo, Zhipeng Zhang, Yuan He, Ke Wang, Liping Jing, and Haibin Ling. End-to-end autonomous driving without costly modularization and 3d manual annotation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  11. [11]

    Vdrive: Leveraging reinforced vla and diffusion policy for end-to-end autonomous driving.arXiv preprint arXiv:2510.15446, 2025

    Ziang Guo and Zufeng Zhang. Vdrive: Leveraging reinforced vla and diffusion policy for end-to-end autonomous driving.arXiv preprint arXiv:2510.15446, 2025

  12. [12]

    Eta: Efficiency through thinking ahead, a dual approach to self-driving with large models

    Shadi Hamdan, Chonghao Sima, Zetong Yang, Hongyang Li, and Fatma Guney. Eta: Efficiency through thinking ahead, a dual approach to self-driving with large models. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  13. [13]

    Cambridge university press, 2003

    Richard Hartley and Andrew Zisserman.Multiple view geometry in computer vision. Cambridge university press, 2003

  14. [14]

    Lora: Low-rank adaptation of large language models.ICLR, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 2022

  15. [15]

    St-p3: End-to- end vision-based autonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to- end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, 2022

  16. [16]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  17. [17]

    3drs: Mllms need 3d-aware representation supervision for scene understanding.arXiv preprint arXiv:2506.01946, 2025

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. 3drs: Mllms need 3d-aware representation supervision for scene understanding.arXiv preprint arXiv:2506.01946, 2025

  18. [18]

    Robotron-drive: All-in-one large multimodal model for autonomous driving

    Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025. 10

  19. [19]

    Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  20. [20]

    Emma: End-to-end multimodal model for autonomous driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

  21. [21]

    Large scale multi-view stereopsis evaluation

    Rasmus Jensen, Anders Dahl, George V ogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014

  22. [22]

    Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving

    Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  23. [23]

    Think twice before driving: Towards scalable decoders for end-to-end autonomous driving

    Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  24. [24]

    Bench2drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 2024

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 2024

  25. [25]

    Drivetransformer: Unified transformer for scalable end-to-end autonomous driving

    Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving. InInternational Conference on Learning Representations (ICLR), 2025

  26. [26]

    Senna: Bridging large vision-language models and end-to-end autonomous driving

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

  27. [27]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  28. [28]

    Vlr-driver: Large vision-language-reasoning models for embodied autonomous driving

    Fanjie Kong, Yitong Li, Weihuang Chen, Chen Min, Yizhe Li, Zhiqiang Gao, Haoyang Li, Zhongyu Guo, and Hongbin Sun. Vlr-driver: Large vision-language-reasoning models for embodied autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  29. [29]

    Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  30. [30]

    Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

    Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

  31. [31]

    End-to-end driving with online trajectory evaluation via bev world model.arXiv preprint arXiv:2504.01941, 2025

    Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model.arXiv preprint arXiv:2504.01941, 2025

  32. [32]

    Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  33. [33]

    Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

  34. [34]

    Vila: On pre- training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre- training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024

  35. [35]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  36. [36]

    X-driver: Explainable autonomous driving with vision-language models.arXiv preprint arXiv:2505.05098, 2025

    Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, and Zengfeng Zeng. X-driver: Explainable autonomous driving with vision-language models.arXiv preprint arXiv:2505.05098, 2025. 11

  37. [37]

    Real-ad: Towards human-like reasoning in end-to-end autonomous driving

    Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real-ad: Towards human-like reasoning in end-to-end autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  38. [38]

    Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

    Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

  39. [39]

    One million scenes for autonomous driving: Once dataset

    Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, et al. One million scenes for autonomous driving: Once dataset. arXiv preprint arXiv:2106.11037, 2021

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, 2021

  41. [41]

    Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. InProceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021

  42. [42]

    Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025

  43. [43]

    A multi-view stereo benchmark with high-resolution images and multi- camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi- camera videos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017

  44. [44]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024

  45. [45]

    Light field networks: Neural scene representations with single-evaluation rendering.Advances in Neural Information Processing Systems, 34:19313–19325, 2021

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering.Advances in Neural Information Processing Systems, 34:19313–19325, 2021

  46. [46]

    Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving

    Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025

  47. [47]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

  48. [48]

    Sparsedrive: End- to-end autonomous driving via sparse scene representation

    Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End- to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), 2025

  49. [49]

    Hip-ad: Hierarchical and multi- granularity planning with deformable attention for autonomous driving in a single decoder.arXiv preprint arXiv:2503.08612, 2025

    Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. Hip-ad: Hierarchical and multi- granularity planning with deformable attention for autonomous driving in a single decoder.arXiv preprint arXiv:2503.08612, 2025

  50. [50]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  51. [51]

    Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  52. [52]

    Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

  53. [53]

    Drivevlm: The convergence of autonomous driving and large vision-language models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. InConference on Robot Learning, 2025. 12

  54. [54]

    Directed-tokens: A robust multi-modality alignment approach to large language-vision models.arXiv preprint arXiv:2508.14264, 2025

    Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, and Khoa Luu. Directed-tokens: A robust multi-modality alignment approach to large language-vision models.arXiv preprint arXiv:2508.14264, 2025

  55. [55]

    Semlt3d: Semantic-guided expert distillation for camera-only long-tailed 3d object detection

    Hao V o, Khoa V o, Thinh Phan, Ngo Xuan Cuong, Gianfranco Doretto, Hien Nguyen, Anh Nguyen, and Ngan Le. Semlt3d: Semantic-guided expert distillation for camera-only long-tailed 3d object detection. arXiv preprint arXiv:2604.18476, 2026

  56. [56]

    Henasy: Learning to assemble scene- entities for interpretable egocentric video-language model.Advances in Neural Information Processing Systems, 37:86483–86499, 2024

    Khoa V o, Thinh Phan, Kashu Yamazaki, Minh Tran, and Ngan Le. Henasy: Learning to assemble scene- entities for interpretable egocentric video-language model.Advances in Neural Information Processing Systems, 37:86483–86499, 2024

  57. [57]

    Geminus: Dual-aware global and scene-adaptive mixture-of-experts for end-to-end autonomous driving.arXiv preprint arXiv:2507.14456, 2025

    Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Peng Yi, Nan Li, and Yanjun Huang. Geminus: Dual-aware global and scene-adaptive mixture-of-experts for end-to-end autonomous driving.arXiv preprint arXiv:2507.14456, 2025

  58. [58]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  59. [59]

    Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

    Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, and Long Chen. Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

  60. [60]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  61. [61]

    Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

  62. [62]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  63. [63]

    Internvl3

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  64. [64]

    Para-drive: Parallelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

  65. [65]

    Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023

  66. [66]

    Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

  67. [67]

    Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

  68. [68]

    Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

    Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6585–6597, 2025

  69. [69]

    Vlm-ad: End-to-end autonomous driving through vision-language model supervision.arXiv preprint arXiv:2412.14446, 2024

    Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M Wolff, and Xin Huang. Vlm-ad: End-to-end autonomous driving through vision-language model supervision.arXiv preprint arXiv:2412.14446, 2024

  70. [70]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024. 13

  71. [71]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

  72. [72]

    Cambrian-s: Towards spatial supersensing in video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis L Brown II, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersensing in video. InThe Fourteenth International Conference on Learning Representations, 2025

  73. [73]

    Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving

    Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278, 2025

  74. [74]

    Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

  75. [75]

    Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

  76. [76]

    Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes

    Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023

  77. [77]

    Theory of space: Can foundation models construct spatial beliefs through active exploration?arXiv preprint arXiv:2602.07055, 2026

    Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, et al. Theory of space: Can foundation models construct spatial beliefs through active exploration?arXiv preprint arXiv:2602.07055, 2026

  78. [78]

    Dual-aeb: Synergizing rule-based and multimodal large language models for effective emergency braking

    Wei Zhang, Pengfei Li, Junli Wang, Bingchuan Sun, Qihao Jin, Guangjun Bao, Shibo Rui, Yang Yu, Wenchao Ding, Peng Li, et al. Dual-aeb: Synergizing rule-based and multimodal large language models for effective emergency braking. In2025 IEEE International Conference on Robotics and Automation (ICRA), 2025

  79. [79]

    Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

    Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

  80. [80]

    Genad: Generative end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. InEuropean Conference on Computer Vision, 2024

Showing first 80 references.