VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving
Pith reviewed 2026-06-27 09:38 UTC · model grok-4.3
The pith
VLGA adds geometry as a fourth modality to vision-language-action models, supervised by per-pixel pointmap regression against LiDAR to ground driving actions in dense 3D space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLGA is the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. Geometry enters as a fourth modality via a dedicated expert trained with a per-pixel pointmap regression loss against LiDAR. Extensive open-loop and closed-loop experiments on nuScenes and Bench2Drive show this yields state-of-the-art results among VLA methods: 0.50 m average L2 error and 0.18 percent 3-second collision rate on nuScenes, plus a 79.08 driving score on Bench2Drive.
What carries the argument
The geometry expert, a module that outputs dense 3D pointmaps and receives direct per-pixel regression supervision from LiDAR to supply spatial signal to the action policy.
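To make the supervision signal concrete, here is a minimal sketch of a masked per-pixel pointmap regression loss. It assumes an L1 penalty and a validity mask marking the pixels that received a projected LiDAR return; the text above does not give the paper's actual loss form, normalization, or any confidence weighting, and the name `pointmap_l1_loss` is illustrative.

```python
import numpy as np


def pointmap_l1_loss(pred, target, valid):
    """Mean L1 distance between predicted and LiDAR pointmaps over valid pixels.

    pred, target: (H, W, 3) arrays of per-pixel XYZ coordinates in metres.
    valid: (H, W) boolean mask, True where a LiDAR return projects to the pixel.
    The L1 form is an assumption, not the paper's stated loss.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        return 0.0  # no LiDAR supervision available for this image
    per_pixel = np.abs(pred - target).sum(axis=-1)  # (H, W) L1 norm per pixel
    return float(per_pixel[valid].mean())


# Toy example: every valid pixel is off by 1 m along each axis.
pred = np.zeros((2, 2, 3))
target = np.ones((2, 2, 3))
valid = np.array([[True, False], [False, True]])
loss = pointmap_l1_loss(pred, target, valid)
print(loss)
```

Masking matters because LiDAR returns are sparse in image space: pixels without a return carry no target and should not contribute to the average.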
If this is right
- VLGA achieves the lowest L2 trajectory error and collision rate among VLA methods without ego status on nuScenes.
- The same model reaches a driving score of 79.08 on closed-loop Bench2Drive, +0.71 over the strongest prior VLA.
- Dense geometry supervision overcomes the limitations of frozen 3D foundation models or sparse box/map losses used in earlier approaches.
- The four-modality architecture (vision, language, geometry, action) keeps Bench2Drive efficiency and comfort comparable to prior VLA methods.
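For readers unfamiliar with the open-loop metric, the sketch below shows one way an average L2 trajectory error can be computed. It assumes waypoints sampled at 2 Hz over 3 seconds and reports the displacement at the 1 s, 2 s, and 3 s waypoints before averaging. Conventions differ between papers (some average over all waypoints up to each horizon instead), and the summary above does not say which one VLGA uses. The function name and the toy trajectory are illustrative.

```python
import numpy as np


def l2_at_horizons(pred, gt, hz=2.0, horizons_s=(1.0, 2.0, 3.0)):
    """L2 displacement (metres) at each horizon for a single trajectory.

    pred, gt: (T, 2) arrays of future bird's-eye-view waypoints sampled at `hz`.
    Assumed convention: error is read off at the horizon waypoint itself.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    dist = np.linalg.norm(pred - gt, axis=-1)  # (T,) per-waypoint error
    idx = [int(round(h * hz)) - 1 for h in horizons_s]  # 1 s -> index 1 at 2 Hz
    return dist[idx]


# Toy example: ground truth drives straight; prediction is offset 0.3 m laterally.
gt = np.stack([np.arange(1, 7, dtype=float), np.zeros(6)], axis=-1)
pred = gt + np.array([0.0, 0.3])
errors = l2_at_horizons(pred, gt)
avg_l2 = float(errors.mean())
print(errors, avg_l2)
```

Because the two conventions can give noticeably different numbers for the same predictions, comparisons across methods only hold when the protocol matches.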
Where Pith is reading between the lines
- An ablation removing only the pointmap loss would directly test whether the geometry signal is load-bearing for the reported gains.
- The dense supervision approach could be tested on other sensor inputs such as radar or camera-only depth to check broader applicability.
- Future work might examine whether the geometry expert transfers to dynamic scene elements beyond static pointmap reconstruction.
Load-bearing premise
The per-pixel pointmap regression loss will force the policy network to incorporate and use the dense 3D geometry signal for action prediction rather than learning to ignore or bypass the geometry expert.
What would settle it
An ablation that removes the geometry expert or its pointmap regression loss and measures whether driving metrics drop back to the levels of prior VLA models that lack dense 3D supervision.
Original abstract
Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments conducted on challenging nuScenes and Bench2Drive datasets for open-loop and closed-loop evaluations, respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VLGA, a vision-language-geometry-action model for autonomous driving that introduces geometry as a fourth modality via a dedicated expert trained with per-pixel pointmap regression loss against LiDAR. It claims this is the first VLA model supervised to reconstruct the dense 3D world, achieving new SOTA results among VLA methods: 0.50 m average L2 error and 0.18% 3-second collision rate on nuScenes (open-loop, without ego status), and 79.08 driving score on Bench2Drive (closed-loop).
Significance. If the geometry supervision is shown to causally improve policy actions rather than being bypassed, the approach could strengthen grounding of VLA models in dense 3D for driving. The use of both open-loop nuScenes and closed-loop Bench2Drive evaluations, plus efficiency/comfort metrics, provides a reasonable testbed; explicit credit is due for attempting closed-loop validation.
Major comments (3)
- [Abstract, §3] Architecture. The claim that VLGA is 'supervised to reconstruct the dense 3D world it drives through' and that this yields the reported action improvements rests on the assumption that the policy network actually incorporates the geometry expert's output. The per-pixel pointmap loss supervises only the geometry branch; no equation, loss term, or training detail is given showing that the policy is penalized for ignoring geometry features (e.g., via an auxiliary action-prediction loss conditioned on geometry or an explicit fusion objective). This is load-bearing for the 'first model supervised to reconstruct dense 3D' framing and the causal attribution of the 0.50 m L2 / 79.08 score gains.
- [§4] Experiments. The superiority claims over prior VLA methods are presented without ablations that isolate the geometry expert's contribution (e.g., VLGA minus geometry expert, or frozen vs. jointly trained geometry). Without these, it is impossible to determine whether the reported metrics arise from the dense 3D signal or from other unstated differences in vision-language-action pathways, training data, or hyperparameters.
- [Table 1, §4.2] nuScenes results. The 0.50 m average L2 and 0.18% collision figures are stated as SOTA among VLA methods without ego status, but the manuscript provides no error bars, multiple seeds, or statistical test against the strongest baseline; the 0.71 driving-score gain on Bench2Drive is similarly reported without quantifying variance or sensitivity to the geometry loss weight.
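The variance concern in the last major comment can be made concrete with a short sketch. It assumes per-seed scores are available for both VLGA and the strongest baseline, which the manuscript does not report; the function names and the toy inputs are illustrative placeholders, not results from the paper.

```python
import numpy as np


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    scores = np.asarray(scores, dtype=np.float64)
    std = scores.std(ddof=1) if scores.size > 1 else 0.0
    return float(scores.mean()), float(std)


def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b) across seeds."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Resample each method's seeds with replacement and compare the means.
    mean_a = rng.choice(a, size=(n_boot, a.size), replace=True).mean(axis=1)
    mean_b = rng.choice(b, size=(n_boot, b.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(mean_a - mean_b, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Placeholder scores only; these are NOT numbers from the paper.
method_a = [1.0, 2.0, 3.0]
method_b = [0.0, 1.0, 2.0]
mean_a, std_a = summarize_runs(method_a)
ci_lo, ci_hi = bootstrap_diff_ci(method_a, method_b)
print(mean_a, std_a, ci_lo, ci_hi)
```

If the interval for the difference includes zero, the reported gap cannot be distinguished from seed noise. With only a handful of seeds the bootstrap is itself coarse, so this is a sanity check rather than a formal test.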
Minor comments (2)
- [§3.2] Notation for the pointmap regression loss is introduced without an explicit equation number or definition of the target LiDAR projection; readers must infer the exact supervision signal.
- [Abstract] The abstract states 'at comparable efficiency and comfort' on Bench2Drive but does not define the comfort metric or report the numerical values alongside the driving score.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: [Abstract, §3] Architecture. The claim that VLGA is 'supervised to reconstruct the dense 3D world it drives through' and that this yields the reported action improvements rests on the assumption that the policy network actually incorporates the geometry expert's output. The per-pixel pointmap loss supervises only the geometry branch; no equation, loss term, or training detail is given showing that the policy is penalized for ignoring geometry features (e.g., via an auxiliary action-prediction loss conditioned on geometry or an explicit fusion objective). This is load-bearing for the 'first model supervised to reconstruct dense 3D' framing and the causal attribution of the 0.50 m L2 / 79.08 score gains.
  Authors: We appreciate the referee's careful reading. The manuscript describes in §3 that the geometry expert's features are integrated into the shared backbone via a fusion module before being passed to the action prediction head. The entire model is trained end-to-end with the combined loss, allowing gradients from the action loss to influence the geometry features. However, we acknowledge that an explicit term penalizing the policy for ignoring geometry is not included. In the revised version, we will add a detailed equation for the fusion objective and an ablation study to demonstrate the contribution of the geometry modality to the policy.
  Revision: partial
- Referee: [§4] Experiments. The superiority claims over prior VLA methods are presented without ablations that isolate the geometry expert's contribution (e.g., VLGA minus geometry expert, or frozen vs. jointly trained geometry). Without these, it is impossible to determine whether the reported metrics arise from the dense 3D signal or from other unstated differences in vision-language-action pathways, training data, or hyperparameters.
  Authors: We agree that isolating the geometry expert's contribution is important for validating our claims. We will include additional ablation experiments in the revised manuscript, specifically comparing VLGA with and without the geometry expert, as well as with the geometry branch frozen during training.
  Revision: yes
- Referee: [Table 1, §4.2] nuScenes results. The 0.50 m average L2 and 0.18% collision figures are stated as SOTA among VLA methods without ego status, but the manuscript provides no error bars, multiple seeds, or statistical test against the strongest baseline; the 0.71 driving-score gain on Bench2Drive is similarly reported without quantifying variance or sensitivity to the geometry loss weight.
  Authors: The reported metrics are based on our primary experimental runs. Due to the high computational cost of training these large models, we did not run multiple random seeds. We will add a discussion of this limitation in the revised paper and note that the gains are consistent across the open-loop and closed-loop benchmarks. If space permits, we can include a sensitivity analysis of the geometry loss weight.
  Revision: partial
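The rebuttal refers to a "combined loss" and a "geometry loss weight" without writing either down. One plausible form, stated here purely as an assumption for discussion, is a weighted sum of an action loss and the pointmap loss. The symbols are hypothetical: \theta_s denotes shared parameters, \theta_a the action head, \theta_g the geometry expert, \Omega the set of pixels with a LiDAR target, and \hat{X}_p, X_p the predicted and LiDAR 3D points at pixel p. Any language-modeling term is omitted.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{action}}(\theta_s, \theta_a)
  + \lambda_{\text{geo}}\, \mathcal{L}_{\text{geo}}(\theta_s, \theta_g),
\qquad
\mathcal{L}_{\text{geo}}
  = \frac{1}{|\Omega|} \sum_{p \in \Omega}
    \bigl\lVert \hat{X}_p - X_p \bigr\rVert_1

% Under this parametrization the action head sees no direct geometry gradient:
\nabla_{\theta_a} \mathcal{L}_{\text{total}}
  = \nabla_{\theta_a} \mathcal{L}_{\text{action}}
```

Under this assumed form, the geometry signal reaches the policy only through the shared parameters and whatever fused features the action head chooses to read. That is the gap the first major comment points at, and why a sweep over \lambda_{\text{geo}} (including \lambda_{\text{geo}} = 0) would be informative.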
Circularity Check
No circularity: geometry supervision from external LiDAR; metrics are empirical
Full rationale
The paper's claimed advance is empirical: a geometry expert is added and trained with per-pixel pointmap regression loss against external LiDAR data, then the full VLGA model is evaluated on nuScenes (open-loop L2/collision) and Bench2Drive (closed-loop driving score). No equations, definitions, or self-citations in the provided text reduce the reported metrics to fitted parameters, self-referential quantities, or prior author results by construction. The supervision signal and benchmarks are independent of the model's internal outputs, satisfying the condition for a self-contained derivation chain.
Appendix A, Architectural Details (excerpt)
VLGA-Base and VLGA-Large use the public Qwen3-VL-2B and Qwen3-VL-8B vision-language backbones with hidden dime...