pith. machine review for the scientific record.

arxiv: 2604.18484 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.MM · cs.RO

Recognition: unknown

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:43 UTC · model grok-4.3

classification 💻 cs.CV · cs.MM · cs.RO
keywords embodied AI · vision-language models · 3D adapters · spatial reasoning · physical cues · VLA models · foundation models · out-of-distribution generalization

The pith

XEmbodied equips vision-language models with 3D geometric awareness and physical cues via adapters for better embodied performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces XEmbodied to close the gap between generic VLMs trained on 2D image-text data and the demands of large-scale embodied environments such as autonomous driving. It adds a structured 3D Adapter to bring in geometric representations such as 3D boxes, and an Efficient Image-Embodied Adapter to distill physical signals like occupancy grids into the model's context tokens. Progressive domain curriculum training followed by reinforcement learning post-training is used to keep the original VLM's broad capabilities intact. The claimed result is gains in spatial reasoning, traffic semantics understanding, embodied affordance, and out-of-distribution generalization, tested across 18 public benchmarks for scenario mining and embodied VQA. If correct, this would let cloud pipelines produce higher-quality annotations for training next-generation Vision-Language-Action models without separate geometry modules.
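Concretely, the injection pattern described here amounts to turning geometric and physical inputs into a few extra tokens that ride along in the backbone's context, leaving the backbone's own forward pass untouched. A minimal sketch of that pattern, with all dimensions, token counts, and module names invented for illustration rather than taken from the paper:

```python
# Hypothetical sketch of adapter-style token injection into a frozen VLM context.
# Dimensions, token counts, and module names are illustrative assumptions.
import torch
import torch.nn as nn

class GeometricAdapter(nn.Module):
    """Maps 3D cues (e.g., flattened 3D boxes) to a small set of context tokens."""
    def __init__(self, cue_dim: int, hidden_dim: int, num_tokens: int = 8):
        super().__init__()
        self.num_tokens = num_tokens
        self.hidden_dim = hidden_dim
        self.proj = nn.Sequential(
            nn.Linear(cue_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_tokens * hidden_dim),
        )

    def forward(self, cues: torch.Tensor) -> torch.Tensor:
        # cues: (batch, cue_dim) -> (batch, num_tokens, hidden_dim)
        out = self.proj(cues)
        return out.view(cues.shape[0], self.num_tokens, self.hidden_dim)

def inject_adapter_tokens(context_tokens: torch.Tensor,
                          adapter_tokens: torch.Tensor) -> torch.Tensor:
    """Prepend adapter tokens to the usual image-text context; only the input
    sequence grows, the backbone itself is not modified."""
    return torch.cat([adapter_tokens, context_tokens], dim=1)

if __name__ == "__main__":
    batch, seq_len, hidden = 2, 128, 1024
    context = torch.randn(batch, seq_len, hidden)   # stand-in for VLM embeddings
    boxes = torch.randn(batch, 7 * 16)              # e.g., 16 boxes x (x, y, z, w, l, h, yaw)
    adapter = GeometricAdapter(cue_dim=7 * 16, hidden_dim=hidden)
    extended = inject_adapter_tokens(context, adapter(boxes))
    print(extended.shape)  # torch.Size([2, 136, 1024])
```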

Core claim

XEmbodied is a cloud-side foundation model that integrates geometric representations through a structured 3D Adapter and distills physical signals through an Efficient Image-Embodied Adapter. Combined with a progressive domain curriculum and reinforcement learning post-training, these components are claimed to endow VLMs with intrinsic 3D geometric awareness and physical cue interaction while preserving general capabilities, achieving robust results on 18 benchmarks spanning spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization in large-scale embodied tasks.

What carries the argument

The structured 3D Adapter that integrates geometric representations such as occupancy grids and 3D boxes, together with the Efficient Image-Embodied Adapter that distills physical signals into context tokens.
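Figure 5 describes an alignment stage that pulls the adapter's latent physical cue tokens toward LLM response embeddings to build a shared semantic space. As a hedged illustration of what such an objective could look like (the cosine loss and mean pooling here are assumptions, not details from the paper):

```python
# Hypothetical alignment objective: pull pooled physical-cue tokens toward
# pooled LLM response embeddings. Loss choice and pooling are assumptions.
import torch
import torch.nn.functional as F

def alignment_loss(cue_tokens: torch.Tensor,
                   response_embeddings: torch.Tensor) -> torch.Tensor:
    """cue_tokens: (batch, num_cue_tokens, dim); response_embeddings: (batch, resp_len, dim)."""
    cue_vec = cue_tokens.mean(dim=1)            # (batch, dim)
    resp_vec = response_embeddings.mean(dim=1)  # (batch, dim)
    return (1.0 - F.cosine_similarity(cue_vec, resp_vec, dim=-1)).mean()

cue = torch.randn(4, 8, 1024, requires_grad=True)
resp = torch.randn(4, 32, 1024)
print(alignment_loss(cue, resp))  # scalar, differentiable w.r.t. the cue tokens
```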

If this is right

  • Large-scale scenario mining pipelines can generate higher-quality embodied VQA annotations directly from complex 3D environments.
  • Vision-Language-Action models trained with XEmbodied should generalize better to out-of-distribution traffic and interaction situations.
  • Spatial reasoning and embodied affordance tasks gain accuracy without requiring separate geometry-processing modules at inference time.
  • The same adapter-based approach can be applied to other VLM backbones while retaining their original language and vision skills.
  • Reinforcement learning post-training on top of curriculum learning stabilizes the transfer of physical cues into the model's token space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The adapter design might reduce reliance on massive labeled 3D datasets by distilling signals from existing 2D images during training.
  • Similar geometric and physical adapters could be tested on non-driving embodied domains such as indoor robotics or manipulation tasks.
  • If the adapters prove modular, they could be inserted into existing open-source VLMs with minimal retraining cost.
  • The progressive curriculum might offer a general recipe for adapting 2D foundation models to any 3D-rich domain without full retraining.

Load-bearing premise

Adding the 3D Adapter, Efficient Image-Embodied Adapter, progressive domain curriculum, and reinforcement learning post-training improves embodied performance without degrading the base VLM's general capabilities.
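Two training ingredients sit inside this premise: a staged data mixture and an answer-reward RL phase (Figure 5b). A toy sketch under assumed stage names, mixture weights, and reward rule, none of which come from the paper:

```python
# Toy sketch of a progressive domain curriculum followed by answer-reward RL.
# Stage names, mixture weights, and the reward rule are invented for illustration.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    mixture: dict  # domain -> sampling weight

CURRICULUM = [
    Stage("general_grounding",        {"common": 0.8, "autodrive": 0.1, "robotic": 0.1}),
    Stage("domain_adaptation",        {"common": 0.4, "autodrive": 0.4, "robotic": 0.2}),
    Stage("embodied_specialization",  {"common": 0.2, "autodrive": 0.5, "robotic": 0.3}),
]

def answer_reward(predicted: str, reference: str) -> float:
    """Final-answer reward used in the RL stage: 1 for an exact match, else 0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0

for stage in CURRICULUM:
    print(stage.name, stage.mixture)
print(answer_reward("turn left", "Turn left"))  # 1.0
```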

What would settle it

A controlled test that shows either no gain in spatial reasoning accuracy on one of the 18 benchmarks, or a measurable drop on standard general VLM tasks such as image captioning or visual question answering, would falsify the central claim.
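In code terms, the test reduces to a paired before/after comparison: the adapted model must improve on an embodied metric while not regressing on a general one. A minimal harness with placeholder metric names, scores, and tolerance:

```python
# Minimal no-regression check: the adapted model should gain on an embodied
# benchmark without losing more than a small margin on a general VLM task.
# Metric names, scores, and the tolerance are placeholders, not paper values.

def claim_survives(base_scores: dict, adapted_scores: dict,
                   embodied_key: str = "spatial_reasoning",
                   general_key: str = "vqa_general",
                   tolerance: float = 0.5) -> bool:
    gained_embodied = adapted_scores[embodied_key] > base_scores[embodied_key]
    kept_general = adapted_scores[general_key] >= base_scores[general_key] - tolerance
    return gained_embodied and kept_general

base    = {"spatial_reasoning": 48.2, "vqa_general": 78.1}  # placeholder numbers
adapted = {"spatial_reasoning": 55.0, "vqa_general": 77.9}  # placeholder numbers
print(claim_survives(base, adapted))  # True: gain with no meaningful regression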

Figures

Figures reproduced from arXiv: 2604.18484 by ChuChu Xie, Diange Yang, Guang Chen, Guanghao Zhang, Hangjun Ye, Hao Ye, Jingrui Pang, Kangan Qian, Kun Jiang, Long Chen, Mengmeng Yang, Sicong Jiang, Siwen Jiao, Yang Zhong, Yunlong Wang, Zilin Huang.

Figure 1: Comparison of three embodied scene understanding paradigms. (a) Traditional rule-based pipelines offer high transparency but poor scalability. (b) Vanilla MLLMs have strong general reasoning but suffer from hallucinations and weak geometric awareness. (c) XEmbodied (ours) integrates geometric priors and embodied evidence using tools like detection, occupancy, and map segmentation with our 3DA and EIEA.
Figure 2: Quantitative comparison across three core dimensions of embodied understanding. Our XEmbodied (SFT) model consistently outperforms state-of-the-art baselines across 18 public benchmarks.
Figure 3: Overall architecture of XEmbodied. Given an image/video and an embodied instruction, XEmbodied first extracts 2D semantic tokens and 3D geometric tokens via dedicated encoders. Our 3DA fuses these tokens to form latent 3D thinking. The EIEA then distills physical cues from embodied tool outputs into compact tokens, which are seamlessly injected into the MLLM context for robust physical-augmented reasoning.
Figure 4: Our four-stage data curation pipeline for embodied closed-loop learning. The pipeline comprises four core modules.
Figure 5: Two-stage training pipeline for our Efficient Image-Embodied Adapter (EIEA). (a) Alignment stage: we align latent physical cue tokens with LLM response embeddings to construct a unified semantic space for physical-augmented reasoning. (b) RL fine-tuning stage: we fine-tune XEmbodied via LoRA, using final answer rewards to optimize the model's reasoning with physical cues.
Figure 6: Overall model architecture of the proposed 3D Adapter (3DA) and Efficient Image-Embodied Adapter (EIEA). The architecture integrates 3D geometric priors and heterogeneous embodied physical information into the VLM backbone for enhanced spatial and embodied reasoning.
Figure 7: Distribution of refined datasets across three primary domains. The pie chart illustrates the proportional allocation of refined data, with the AutoDrive domain constituting the largest share at 41.9%, followed by Common data at 38.3%, and the Robotic domain comprising 19.8%.
Figure 8: T1: Tiered data examples from Bdd
Figure 9: T1: Tiered data examples from VQAv2 and GQA
Figure 10: T2: Tiered data example from Roborefit
Figure 11: T2: Tiered data example from DrivingVQA
Figure 12: T2: Tiered data example from SURDS
Figure 13: T2: Tiered data example from RefCOCO
Figure 14: T3: Tiered data example from DriveLMM
Figure 15: T3: Tiered data example from MapLMv2
Figure 16: T4: Tiered data example from LingoQA
Figure 17: T4: Tiered data example from RoboVQA
Figure 18: Data quality assessment example from Ommidrive
Figure 19: Data quality assessment example from DriveLMM-o1
Figure 20: Data quality assessment example from GQA
Figure 21: Data quality assessment example from Roborefit
Figure 22: SURDS-Back: Spatial positioning judgment and 3D coordinate estimation
Figure 23: DrivingVQA-Front: French traffic rule reasoning for parking and overtaking
Figure 24: VLADBench-Front: Driving efficiency maintenance reason analysis
Figure 25: MapLMv2: Lane attribute description with multi-view images
Figure 26: Ego3D-Bench: Spatial proximity judgment and distance estimation
Figure 27: Where2place & VABench-Point-Box: Robotic manipulation bounding box prediction
Figure 28: Part-Affordance & VABench-Point-Box: Robotic grasping region detection
read the original abstract

Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes XEmbodied, a cloud-side foundation model that augments vision-language models with intrinsic 3D geometric awareness and physical cues (occupancy grids, 3D boxes) via a structured 3D Adapter and an Efficient Image-Embodied Adapter. Progressive domain curriculum and reinforcement learning post-training are used to improve spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization on 18 public benchmarks for large-scale scenario mining and embodied VQA, while claiming to preserve the base VLM's general capabilities.

Significance. If the quantitative results, ablations, and controls confirm the claims without degradation on general VLM tasks, the work would offer a practical route to inject geometric and physical reasoning into existing VLMs, addressing a recognized limitation in current VLA training pipelines for autonomous systems.

major comments (2)
  1. [Abstract] Abstract: the assertion of 'significant improvements' and 'robust performance across 18 benchmarks' is unsupported by any reported metrics, baselines, error bars, or ablation tables, preventing verification of the central empirical claim.
  2. [Abstract] Abstract: the claim that general capabilities are preserved after the 3D Adapter, Efficient Image-Embodied Adapter, domain curriculum, and RL post-training lacks any side-by-side evaluation on standard non-embodied benchmarks (e.g., VQAv2, GQA, or captioning tasks). This no-trade-off condition is load-bearing for the contribution yet remains untested.
minor comments (1)
  1. [Abstract] Abstract: the high-level description of the adapters and training stages would benefit from explicit architectural diagrams or pseudocode to clarify how geometric tokens are integrated without altering the base VLM forward pass.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments regarding the abstract. We address each point below and confirm that revisions will be made to better align the abstract claims with the quantitative evidence in the manuscript body.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of 'significant improvements' and 'robust performance across 18 benchmarks' is unsupported by any reported metrics, baselines, error bars, or ablation tables, preventing verification of the central empirical claim.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. The full manuscript reports detailed metrics, baseline comparisons, ablations, and error bars for all 18 benchmarks in Sections 4 and 5 (Tables 2-5, Figures 3-7). We will revise the abstract to reference these specific results and highlight key improvements (e.g., gains on spatial reasoning and embodied VQA tasks) while preserving brevity. revision: yes

  2. Referee: [Abstract] Abstract: the claim that general capabilities are preserved after the 3D Adapter, Efficient Image-Embodied Adapter, domain curriculum, and RL post-training lacks any side-by-side evaluation on standard non-embodied benchmarks (e.g., VQAv2, GQA, or captioning tasks). This no-trade-off condition is load-bearing for the contribution yet remains untested.

    Authors: The manuscript supports preservation of general capabilities through the lightweight, modular design of the adapters and curriculum (which avoid overwriting base VLM weights), along with internal consistency checks. However, we acknowledge that explicit side-by-side evaluations on VQAv2, GQA, and captioning tasks would provide stronger verification of the no-trade-off claim. We will add these controlled comparisons in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural proposals and empirical claims are independent of inputs

full rationale

The provided abstract and description outline a standard VLM adaptation pipeline: adding a structured 3D Adapter for geometric representations, an Efficient Image-Embodied Adapter for physical cue distillation, progressive domain curriculum, and RL post-training. These are presented as design choices whose effects are measured on 18 external benchmarks. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear. The preservation of general VLM capabilities is asserted but treated as an empirical outcome rather than a definitional tautology. The derivation chain therefore consists of independent engineering steps whose validity rests on reported benchmark deltas, not on reduction to the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, mathematical axioms, or invented physical entities are described; the adapters and training procedures are presented as engineering choices whose internal details and any fitted values remain unknown.

pith-pipeline@v0.9.0 · 5532 in / 1394 out tokens · 39215 ms · 2026-05-10T05:43:41.341404+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

133 extracted references · 68 canonical work pages · 18 internal anchors
