Recognition: unknown
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
Pith reviewed 2026-05-10 05:43 UTC · model grok-4.3
The pith
XEmbodied equips vision-language models with 3D geometric awareness and physical cues via adapters for better embodied performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
XEmbodied is a cloud-side foundation model that integrates geometric representations through a structured 3D Adapter and distills physical signals using an Efficient Image-Embodied Adapter, combined with progressive domain curriculum and reinforcement learning post-training, to endow VLMs with intrinsic 3D geometric awareness and physical cue interaction while preserving general capabilities and achieving robust results on 18 benchmarks for spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization in large-scale embodied tasks.
What carries the argument
The structured 3D Adapter that integrates geometric representations such as occupancy grids and 3D boxes, together with the Efficient Image-Embodied Adapter that distills physical signals into context tokens.
If this is right
- Large-scale scenario mining pipelines can generate higher-quality embodied VQA annotations directly from complex 3D environments.
- Vision-Language-Action models trained with XEmbodied should generalize better to out-of-distribution traffic and interaction situations.
- Spatial reasoning and embodied affordance tasks gain accuracy without requiring separate geometry-processing modules at inference time.
- The same adapter-based approach can be applied to other VLM backbones while retaining their original language and vision skills.
- Reinforcement learning post-training on top of curriculum learning stabilizes the transfer of physical cues into the model's token space.
Where Pith is reading between the lines
- The adapter design might reduce reliance on massive labeled 3D datasets by distilling signals from existing 2D images during training.
- Similar geometric and physical adapters could be tested on non-driving embodied domains such as indoor robotics or manipulation tasks.
- If the adapters prove modular, they could be inserted into existing open-source VLMs with minimal retraining cost.
- The progressive curriculum might offer a general recipe for adapting 2D foundation models to any 3D-rich domain without full retraining.
Load-bearing premise
Adding the 3D Adapter, Efficient Image-Embodied Adapter, progressive domain curriculum, and reinforcement learning post-training improves embodied performance without degrading the base VLM's general capabilities.
What would settle it
A controlled test on one of the 18 benchmarks that shows no gain in spatial reasoning accuracy or a measurable drop in performance on standard general VLM tasks such as image captioning or visual question answering would falsify the central claim.
Figures
read the original abstract
Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes XEmbodied, a cloud-side foundation model that augments vision-language models with intrinsic 3D geometric awareness and physical cues (occupancy grids, 3D boxes) via a structured 3D Adapter and an Efficient Image-Embodied Adapter. Progressive domain curriculum and reinforcement learning post-training are used to improve spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization on 18 public benchmarks for large-scale scenario mining and embodied VQA, while claiming to preserve the base VLM's general capabilities.
Significance. If the quantitative results, ablations, and controls confirm the claims without degradation on general VLM tasks, the work would offer a practical route to inject geometric and physical reasoning into existing VLMs, addressing a recognized limitation in current VLA training pipelines for autonomous systems.
major comments (2)
- [Abstract] Abstract: the assertion of 'significant improvements' and 'robust performance across 18 benchmarks' is unsupported by any reported metrics, baselines, error bars, or ablation tables, preventing verification of the central empirical claim.
- [Abstract] Abstract: the claim that general capabilities are preserved after the 3D Adapter, Efficient Image-Embodied Adapter, domain curriculum, and RL post-training lacks any side-by-side evaluation on standard non-embodied benchmarks (e.g., VQAv2, GQA, or captioning tasks). This no-trade-off condition is load-bearing for the contribution yet remains untested.
minor comments (1)
- [Abstract] Abstract: the high-level description of the adapters and training stages would benefit from explicit architectural diagrams or pseudocode to clarify how geometric tokens are integrated without altering the base VLM forward pass.
Simulated Author's Rebuttal
We thank the referee for the constructive comments regarding the abstract. We address each point below and confirm that revisions will be made to better align the abstract claims with the quantitative evidence in the manuscript body.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion of 'significant improvements' and 'robust performance across 18 benchmarks' is unsupported by any reported metrics, baselines, error bars, or ablation tables, preventing verification of the central empirical claim.
Authors: We agree that the abstract would be strengthened by including concrete quantitative support. The full manuscript reports detailed metrics, baseline comparisons, ablations, and error bars for all 18 benchmarks in Sections 4 and 5 (Tables 2-5, Figures 3-7). We will revise the abstract to reference these specific results and highlight key improvements (e.g., gains on spatial reasoning and embodied VQA tasks) while preserving brevity. revision: yes
-
Referee: [Abstract] Abstract: the claim that general capabilities are preserved after the 3D Adapter, Efficient Image-Embodied Adapter, domain curriculum, and RL post-training lacks any side-by-side evaluation on standard non-embodied benchmarks (e.g., VQAv2, GQA, or captioning tasks). This no-trade-off condition is load-bearing for the contribution yet remains untested.
Authors: The manuscript supports preservation of general capabilities through the lightweight, modular design of the adapters and curriculum (which avoid overwriting base VLM weights), along with internal consistency checks. However, we acknowledge that explicit side-by-side evaluations on VQAv2, GQA, and captioning tasks would provide stronger verification of the no-trade-off claim. We will add these controlled comparisons in the revised manuscript. revision: yes
Circularity Check
No circularity: architectural proposals and empirical claims are independent of inputs
full rationale
The provided abstract and description outline a standard VLM adaptation pipeline: adding a structured 3D Adapter for geometric representations, an Efficient Image-Embodied Adapter for physical cue distillation, progressive domain curriculum, and RL post-training. These are presented as design choices whose effects are measured on 18 external benchmarks. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear. The preservation of general VLM capabilities is asserted but treated as an empirical outcome rather than a definitional tautology. The derivation chain therefore consists of independent engineering steps whose validity rests on reported benchmark deltas, not on reduction to the inputs themselves.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 4, 26
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Azzolini, A., Bai, J., Brandon, H., Cao, J., Chattopadhyay, P., Chen, H., Chu, J., Cui, Y., Diamond, J., Ding, Y., et al.: Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558 (2025) 3, 5, 11, 12, 13, 26
-
[3]
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023) 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao,C.,Ge,C.,etal.:Qwen3-vltechnicalreport.arXivpreprintarXiv:2511.21631 (2025) 6, 8, 10, 11, 12, 13, 27, 61
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (Feb 2025), https://arxiv.org/abs/2502.1...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Bai, S., Li, M., Liu, Y., Tang, J., Zhang, H., Sun, L., Chu, X., Tang, Y.: Univg-r1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231 (2025) 11, 12
-
[7]
Frontiers in Computational Neuroscience14, 63 (2020) 3
Bermudez-Contreras, E., Clark, B.J., Wilber, A.: The neuroscience of spatial nav- igation and the relationship to artificial intelligence. Frontiers in Computational Neuroscience14, 63 (2020) 3
2020
-
[8]
In: Proceedings of the 23rd ACM symposium on virtual reality software and technology
Bhandari, J., Tregillus, S., Folmer, E.: Legomotion: Scalable walking-based vir- tual locomotion. In: Proceedings of the 23rd ACM symposium on virtual reality software and technology. pp. 1–8 (2017) 7
2017
-
[9]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan,Y.,Baldan,G.,Beijbom,O.:nuscenes:Amultimodaldatasetforautonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11621–11631 (2020) 3
2020
-
[10]
In: 2025 IEEE International Conference on Robotics and Automation (ICRA)
Cai, W., Ponomarenko, I., Yuan, J., Li, X., Yang, W., Dong, H., Zhao, B.: Spa- tialbot: Precise spatial understanding with vision language models. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 9490–9498. IEEE (2025) 5, 27 16 K.A. Qian and C.C. Xie et al
2025
-
[11]
Cao, X., Zhou, T., Ma, Y., Ye, W., Cui, C., Tang, K., Cao, Z., Liang, K., Wang, Z., Rehg, J.M., Zheng, C.: Maplm: A real-world large-scale vision-language bench- mark for map and traffic scene understanding. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21819–21830 (2024). https://doi.org/10.1109/CVPR52733.2024.0206110
-
[12]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14455–14465 (2024) 5, 27
2024
-
[13]
In: 2024 IEEE International Conference on Robotics and Automation (ICRA)
Chen, L., Sinavski, O., Hünermann, J., Karnsund, A., Willmott, A.J., Birch, D., Maund, D., Shotton, J.: Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 14093–14100. IEEE (2024) 4, 26
2024
-
[14]
What’s in the image? a deep-dive into the vision of vision language models
Chen, X., Huang, L., Ma, T., Fang, R., Shi, S., Li, H.: Solve: Synergy of language- vision and end-to-end networks for autonomous driving. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 12068– 12077 (2025).https://doi.org/10.1109/CVPR52734.2025.0112759, 60, 61
-
[15]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024) 5, 27
2024
-
[16]
Advances in Neural Information Processing Systems37, 135062–135093 (2024) 5, 27
Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems37, 135062–135093 (2024) 5, 27
2024
- [17]
-
[18]
Corbière, C., Roburin, S., Montariol, S., Bosselut, A., Alahi, A.: Retrieval-based interleaved visual chain-of-thought in real-world driving scenarios (2025) 10
2025
-
[19]
In: Proceedings of the IEEE/CVF winter conference on applications of computer vision
Cui, C., Ma, Y., Cao, X., Ye, W., Zhou, Y., Liang, K., Chen, J., Lu, J., Yang, Z., Liao, K.D., et al.: A survey on multimodal large language models for autonomous driving. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 958–979 (2024) 4, 26
2024
-
[20]
Dong, X., Li, R., Han, X., Wu, Z., Wang, J., Chen, J., Jiang, Q., Yiu, S., Zhu, X., Ma, Y.: Driving with a thousand faces: A benchmark for closed-loop personalized end-to-end autonomous driving. arXiv preprint arXiv:2602.18757 (2026) 3
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[21]
Nature Reviews Neuroscience24(2), 63–79 (2023) 3
Farzanfar, D., Spiers, H.J., Moscovitch, M., Rosenbaum, R.S.: From cognitive maps to spatial schemas. Nature Reviews Neuroscience24(2), 63–79 (2023) 3
2023
-
[22]
In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)
Fu, D., Li, X., Wen, L., Dou, M., Cai, P., Shi, B., Qiao, Y.: Drive like a human: Rethinking autonomous driving with large language models. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW). pp. 910–919. IEEE (2024) 4, 5, 26
2024
-
[23]
In: 2012 IEEE conference on computer vision and pattern recognition
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 3354–3361. IEEE (2012) 3
2012
-
[24]
In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 10 XEmbodied 17
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 10 XEmbodied 17
2017
-
[25]
In: NeurIPS (2025) 10
Guo, X., Zhang, R., Duan, Y., He, Y., Nie, D., Huang, W., Zhang, C., Liu, S., Zhao, H., Chen, L.: Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models. In: NeurIPS (2025) 10
2025
-
[26]
Guo, X., Zhang, R., Duan, Y., He, Y., Zhang, C., Liu, S., Chen, L.: Drivemllm: A benchmark for spatial understanding with multimodal large language models in autonomous driving. arXiv preprint arXiv:2411.13112 (2024) 25
-
[27]
arXiv preprint arXiv:2511.11239 , year=
Guo, Z., Liu, J., Li, Y., Gao, W., Yang, Z., Li, C., Zhang, X., Jian, P.: Beyond flat- lands: Unlocking spatial intelligence by decoupling 3d reasoning from numerical regression. arXiv preprint arXiv:2511.11239 (2025) 3
-
[28]
arXiv preprint arXiv:2207.11514 (2022) 5, 27
Ha, H., Song, S.: Semantic abstraction: Open-world 3d scene understanding from 2d vision-language models. arXiv preprint arXiv:2207.11514 (2022) 5, 27
-
[29]
Training Large Language Models to Reason in a Continuous Latent Space
Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., Tian, Y.: Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769 (2024) 3
work page internal anchor Pith review arXiv 2024
-
[31]
MiMo-Embodied: X-Embodied Foundation Model Technical Report
Hao, X., Zhou, L., Huang, Z., Hou, Z., Tang, Y., Zhang, L., Li, G., Lu, Z., Ren, S., Meng, X., et al.: Mimo-embodied: X-embodied foundation model. arXiv preprint arXiv:2511.16518 (2026) 5, 11, 12, 13, 26
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[32]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni- tion. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016) 7
2016
-
[33]
Gaussian Error Linear Units (GELUs)
Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016) 7
work page Pith review arXiv 2016
-
[34]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17853– 17862 (2023) 61
2023
-
[35]
Journal of Intelligent and Connected Vehicles8(2), 9210059–1 (2025) 3
Hu, Z., Xu, M., Cheng, Q.: Multimodal large-language model empowering next- generation autonomous driving systems. Journal of Intelligent and Connected Vehicles8(2), 9210059–1 (2025) 3
2025
-
[36]
Huang, T., Zhang, Z., Tang, H.: 3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding. arXiv preprint arXiv:2507.23478 (2025) 5, 27
-
[37]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., Fei-Fei, L.: Voxposer: Compos- able 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973 (2023) 4, 26
work page internal anchor Pith review arXiv 2023
-
[38]
arXiv preprint arXiv:2412.07689 (2024)
Huang, Z., Feng, C., Yan, F., Xiao, B., Jie, Z., Zhong, Y., Liang, X., Ma, L.: Drivemm: All-in-one large multimodal model for autonomous driving. arXiv preprint arXiv:2412.076892(3), 8 (2024) 3
-
[39]
Huang, Z., Feng, C., Yan, F., Xiao, B., Jie, Z., Zhong, Y., Liang, X., Ma, L.: Robotron-drive:All-in-onelargemultimodalmodelforautonomousdriving.ICCV (2024) 4, 11, 12, 26, 59, 60, 61
2024
-
[40]
Huang, Z., Sheng, Z., Qu, Y., You, J., Chen, S.: Vlm-rl: A unified vision language models and reinforcement learning framework for safe autonomous driving. arXiv preprint arXiv:2412.15544 (2024) 4, 26
-
[41]
In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition
Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019) 10 18 K.A. Qian and C.C. Xie et al
2019
-
[42]
EMMA: End-to-End Multimodal Model for Autonomous Driving
Hwang, J.J., Xu, R., Lin, H., Hung, W.C., Ji, J., Choi, K., Huang, D., He, T., Cov- ington, P., Sapp, B., et al.: Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262 (2024) 4, 26, 60, 61
work page internal anchor Pith review arXiv 2024
-
[44]
Ishaq, A., Lahoud, J., More, K., Thawakar, O., Thawkar, R., Dissanayake, D., Ahsan, N., Li, Y., Khan, F.S., Cholakkal, H., et al.: Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding. arXiv preprint arXiv:2503.10621 (2025) 4, 25, 26
-
[45]
IEEE Robotics and Automation Letters9(11), 9836–9843 (2024) 3, 7, 30
Jia, P., Wen, T., Luo, Z., Yang, M., Jiang, K., Liu, Z., Tang, X., Lei, Z., Cui, L., Zhang, B., et al.: Diffmap: Enhancing map segmentation with map prior using diffusion model. IEEE Robotics and Automation Letters9(11), 9836–9843 (2024) 3, 7, 30
2024
-
[46]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang, C., Wang, X.: Vad: Vectorized scene representation for efficient au- tonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8340–8350 (2023) 60, 61
2023
-
[47]
Jiang, B., Chen, S., Zhang, Q., Liu, W., Wang, X.: Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning. arXiv preprint arXiv:2503.07608 (2025) 25
-
[48]
arXiv preprint arXiv:2506.24044 (2025) 1, 3
Jiang, S., Huang, Z., Qian, K., Luo, Z., Zhu, T., Zhong, Y., Tang, Y., Kong, M., Wang, Y., Jiao, S., et al.: A survey on vision-language-action models for autonomous driving. arXiv preprint arXiv:2506.24044 (2025) 1, 3
-
[49]
R efer I t G ame: Referring to objects in photographs of natural scenes
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: Referring to objects in photographs of natural scenes. In: Moschitti, A., Pang, B., Daelemans, W. (eds.) Proceedings of the 2014 Conference on Empirical Methods in Natu- ral Language Processing (EMNLP). pp. 787–798. Association for Computational Linguistics, Doha, Qatar (Oct 2014).https://doi...
-
[50]
Proceedings of the European Conference on Computer Vision (ECCV) (2018) 25
Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explanations for self-driving vehicles. Proceedings of the European Conference on Computer Vision (ECCV) (2018) 25
2018
-
[51]
OpenVLA: An Open-Source Vision-Language-Action Model
Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024) 4, 26
work page internal anchor Pith review arXiv 2024
-
[52]
LLaVA-OneVision: Easy Visual Task Transfer
Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 5, 27
work page Pith review arXiv 2024
-
[53]
Li, C., Wu, W., Zhang, H., Xia, Y., Mao, S., Dong, L., Vulić, I., Wei, F.: Imag- ine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542 (2025) 5, 27
-
[54]
Li, L., Shao, W., Dong, W., Tian, Y., Zhang, Q., Yang, K., Zhang, W.: Data- centric evolution in autonomous driving: A comprehensive survey of big data sys- tem, data mining, and closed-loop technologies. arXiv preprint arXiv:2401.12888 (2024) 3
-
[55]
Li, P., Zhang, Z., Holtz, D., Yu, H., Yang, Y., Lai, Y., Song, R., Geiger, A., Zell, A.: Spacedrive: Infusing spatial awareness into vlm-based autonomous driving. arXiv preprint arXiv:2512.107192(2025) 4 XEmbodied 19
-
[56]
Li, S., Tang, H.: Multimodal alignment and fusion: A survey: S. li, h. tang. Inter- national Journal of Computer Vision134(3), 103 (2026) 3
2026
-
[57]
Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023
Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., Liu, H., et al.: Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378 (2023) 4, 26
-
[58]
Li, Z., Yu, Z., Lan, S., Li, J., Kautz, J., Lu, T., Alvarez, J.M.: Is ego status all you need for open-loop end-to-end autonomous driving? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14864–14873 (2024) 61
2024
-
[59]
Advances in neural information processing systems36, 34892–34916 (2023) 3
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 3
2023
-
[60]
IEEE Transactions on Pattern Analysis and Machine Intelligence (2025) 5, 27
Liu, K., Liu, Y.J., Chen, B.: General 3d vision-language model with fast render- ing and pre-training vision-language alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025) 5, 27
2025
-
[61]
Liu, Y., Ma, M., Yu, X., Ding, P., Zhao, H., Sun, M., Huang, S., Wang, D.: Ssr: Enhancing depth perception in vision-language models via rationale-guided spatial reasoning. arXiv preprint arXiv:2505.12448 (2025) 3, 30, 31
-
[62]
Liu,Y.,Zhang,B.,Zang,Y.,Cao,Y.,Xing,L.,Dong,X.,Duan,H.,Lin,D.,Wang, J.: Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning. arXiv preprint arXiv:2510.27606 (2025) 5, 27
-
[63]
Decoupled Weight Decay Regularization
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 10
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[64]
In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Lu, Y., Fan, Y., Deng, B., Liu, F., Li, Y., Wang, S.: Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 976–983. IEEE (2023) 10
2023
-
[65]
Ma, W., Chou, Y.C., Liu, Q., Wang, X., de Melo, C., Xie, J., Yuille, A.: Spatial- reasoner: Towards explicit and generalizable 3d spatial reasoning. arXiv preprint arXiv:2504.20024 (2025) 5, 27
-
[66]
In: European Conference on Computer Vision
Ma, Y., Cao, Y., Sun, J., Pavone, M., Xiao, C.: Dolphins: Multimodal language model for driving. In: European Conference on Computer Vision. pp. 403–420. Springer (2024) 4, 26
2024
-
[67]
Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023a
Mao, J., Qian, Y., Ye, J., Zhao, H., Wang, Y.: Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415 (2023) 4, 26
-
[68]
A language agent for au- tonomous driving
Mao, J., Ye, J., Qian, Y., Pavone, M., Wang, Y.: A language agent for autonomous driving. arXiv preprint arXiv:2311.10813 (2023) 61
-
[69]
arXiv preprint arXiv:2312.14115 , year=
Marcu, A.M., Chen, L., Hünermann, J., Karnsund, A., Hanotte, B., Chi- dananda, P., Nair, S., Badrinarayanan, V., Kendall, A., Shotton, J., Sinavski, O.: Lingoqa: Visual question answering for autonomous driving. arXiv preprint arXiv:2312.14115 (2023) 10
-
[70]
Com- munications biology8(1), 80 (2025) 3
Melchionna, M., Castiglione, S., Girardi, G., Profico, A., Mondanaro, A., Sansa- lone, G., Chatar, N., Pérez Ramos, A., Fernández-Monescillo, M., Serio, C., et al.: Cortical areas associated to higher cognition drove primate brain evolution. Com- munications biology8(1), 80 (2025) 3
2025
-
[71]
In: 2025 IEEE/RSJ International Confer- ence on Intelligent Robots and Systems (IROS)
Miao,J.,Wen,T.,Luo,Z.,Qian,K.,Fu,Z.,Wang,Y.,Jiang,K.,Yang,M.,Huang, J.,Zhong,Z.,etal.:Efficientend-to-endvisuallocalizationforautonomousdriving with decoupled bev neural matching. In: 2025 IEEE/RSJ International Confer- ence on Intelligent Robots and Systems (IROS). pp. 10719–10726. IEEE (2025) 3 20 K.A. Qian and C.C. Xie et al
2025
-
[72]
In: Eu- ropean Conference on Computer Vision
Nie, M., Peng, R., Wang, C., Cai, X., Han, J., Xu, H., Zhang, L.: Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In: Eu- ropean Conference on Computer Vision. pp. 292–308. Springer (2024) 25
2024
-
[73]
OpenAI: Gpt-4o system card (2024),https://arxiv.org/abs/2410.2127611, 12, 13
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[74]
arXiv preprint arXiv:2504.01805 (2025)
Ouyang, K., Liu, Y., Wu, H., Liu, Y., Zhou, H., Zhou, J., Meng, F., Sun, X.: Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805 (2025) 5, 27, 28
-
[75]
Code-r1: Reproducing r1 for code with reliable rewards.arXiv preprint arXiv:2503.18470, 3,
Pan, Z., Liu, H.: Metaspatial: Reinforcing 3d spatial reasoning in vlms for the metaverse. arXiv preprint arXiv:2503.18470 (2025) 5, 27
- [76]
-
[77]
Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221,
Qi, Z., Zhang, Z., Yu, Y., Wang, J., Zhao, H.: Vln-r1: Vision-language navigation via reinforcement fine-tuning. arXiv preprint arXiv:2506.17221 (2025) 4, 26
-
[78]
arXiv preprint arXiv:2505.15298 1,
Qian, K., Jiang, S., Zhong, Y., Luo, Z., Huang, Z., Zhu, T., Jiang, K., Yang, M., Fu, Z., Miao, J., et al.: Agentthink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving. arXiv preprint arXiv:2505.152981(2), 3 (2025) 3, 4, 5, 27, 30
-
[79]
arXiv preprint arXiv:2503.08162 (2025) 60, 61
Qian, K., Luo, Z., Jiang, S., Huang, Z., Miao, J., Ma, Z., Zhu, T., Li, J., He, Y., Fu, Z., et al.: Fasionad++: Integrating high-level instruction and information bottleneck in fat-slow fusion systems for enhanced safety in autonomous driving with adaptive feedback. arXiv preprint arXiv:2503.08162 (2025) 60, 61
-
[80]
arXiv preprint arXiv:2411.18013 (2024) 4, 26, 60, 61
Qian, K., Ma, Z., He, Y., Luo, Z., Shi, T., Zhu, T., Li, J., Wang, J., Chen, Z., He, X., et al.: Fasionad: Fast and slow fusion thinking systems for human- like autonomous driving with adaptive feedback. arXiv preprint arXiv:2411.18013 (2024) 4, 26, 60, 61
-
[81]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Qian, K., Miao, J., Jiao, X., Luo, Z., Fu, Z., Shi, Y., Wang, Y., Jiang, K., Yang, D.: Priormotion: Generative class-agnostic motion prediction with raster-vector motion field priors. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 27284–27294 (2025) 7
2025
-
[82]
In: Pro- ceedings of the AAAI Conference on Artificial Intelligence
Qian, T., Chen, J., Zhuo, L., Jiao, Y., Jiang, Y.G.: Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In: Pro- ceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4542–4550 (2024) 25
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.