arxiv: 2511.16518 · v2 · submitted 2025-11-20 · 💻 cs.RO · cs.CL · cs.CV

MiMo-Embodied: X-Embodied Foundation Model Technical Report

Pith reviewed 2026-05-17 20:37 UTC · model grok-4.3

classification 💻 cs.RO · cs.CL · cs.CV
keywords embodied AI · autonomous driving · foundation model · cross-embodiment transfer · task planning · affordance prediction · reinforcement learning · chain of thought

The pith

MiMo-Embodied is the first foundation model to reach state-of-the-art results in both autonomous driving and embodied AI by training them together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MiMo-Embodied as a single model trained across both autonomous driving and embodied AI domains. It reports new benchmark records on 17 embodied tasks covering planning, affordance, and spatial understanding, plus 12 driving tasks in perception, prediction, and planning. The authors show that multi-stage learning, carefully built datasets, and chain-of-thought plus reinforcement learning fine-tuning produce positive transfer so that gains in one domain improve the other. A reader would care because this suggests one model can handle diverse physical-world interactions instead of needing separate specialized systems for cars and robots.

Core claim

MiMo-Embodied integrates autonomous driving and embodied AI into one foundation model and, through multi-stage learning, curated data construction, and CoT/RL fine-tuning, achieves state-of-the-art performance on 17 embodied AI benchmarks and 12 autonomous driving benchmarks while demonstrating strong positive transfer that lets the domains mutually reinforce each other.

What carries the argument

The cross-embodiment training pipeline of multi-stage learning, curated data construction, and CoT/RL fine-tuning that creates mutual reinforcement between driving and embodied tasks.

If this is right

  • A single model can outperform open-source, closed-source, and specialized baselines in both driving and embodied settings.
  • Performance improves together across task planning, affordance prediction, spatial understanding, environmental perception, status prediction, and driving planning.
  • The two domains exhibit positive transfer so that progress in one directly benefits the other.
  • Open-sourcing the model and training details enables further work on unified physical-world systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-training approach could be tested on additional embodied domains such as dexterous manipulation or multi-robot coordination.
  • If the transfer holds at larger scales, unified models may eventually replace collections of narrow specialists for real-world deployment.
  • Future benchmarks that control strictly for data volume and model size would clarify how much of the gain is truly from cross-embodiment sharing.

Load-bearing premise

The benchmark gains come from genuine cross-domain transfer rather than from simply using more total training data or different model scales than the baselines.

What would settle it

An ablation experiment that trains separate models on matched total data volume and compute and finds no performance difference from the joint model would show the claimed transfer is not the main driver.
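
A matched-budget ablation of this kind can be sketched as follows. This is a minimal illustration, not the authors' protocol: the arm names, the `TOKEN_BUDGET` value, and the `build_arms` and `transfer_gain` helpers are hypothetical, and model size and optimizer schedule are assumed to be held fixed outside the sketch.

```python
from dataclasses import dataclass

# Hypothetical total token budget shared by every arm (an assumption, not from the paper).
TOKEN_BUDGET = 1_000_000


@dataclass(frozen=True)
class Arm:
    """One training arm of the ablation."""
    name: str
    driving_tokens: int
    embodied_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.driving_tokens + self.embodied_tokens


def build_arms(budget: int, joint_driving_fraction: float = 0.5) -> list[Arm]:
    """Build AD-only, embodied-only, and joint arms with identical total tokens."""
    if not 0.0 < joint_driving_fraction < 1.0:
        raise ValueError("joint_driving_fraction must be strictly between 0 and 1")
    joint_driving = round(budget * joint_driving_fraction)
    return [
        Arm("ad_only", budget, 0),
        Arm("embodied_only", 0, budget),
        Arm("joint", joint_driving, budget - joint_driving),
    ]


def transfer_gain(joint_score: float, single_domain_score: float) -> float:
    """Positive values mean the joint arm beat the single-domain arm on a benchmark."""
    return joint_score - single_domain_score


arms = build_arms(TOKEN_BUDGET)
# Every arm must see the same number of tokens; otherwise data volume is a confound.
assert len({arm.total_tokens for arm in arms}) == 1
```

If the joint arm still scores higher on a domain's benchmarks under this constraint, pooled data volume is a less plausible explanation. If `transfer_gain` is near zero across benchmarks, the claimed transfer is not the main driver.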

Original abstract

We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiMo-Embodied, presented as the first cross-embodied foundation model that integrates Autonomous Driving and Embodied AI. It claims new state-of-the-art results across 17 embodied AI benchmarks (Task Planning, Affordance Prediction, Spatial Understanding) and 12 autonomous driving benchmarks (Environmental Perception, Status Prediction, Driving Planning), outperforming open-source, closed-source, and specialized baselines. The authors attribute the gains to multi-stage learning, curated data construction, and CoT/RL fine-tuning, which they argue produce strong positive transfer between the two domains.

Significance. If the empirical claims are substantiated, the work would be significant for demonstrating that joint training across driving and embodied domains can yield mutual reinforcement rather than interference, supporting the development of more generalist foundation models for robotics and autonomous systems. The open-sourcing of code and models would further aid reproducibility and follow-on research.

major comments (2)
  1. [Abstract and Methods] The central claim that the two domains 'exhibit strong positive transfer and mutually reinforce one another' is load-bearing for the paper's contribution yet rests on benchmark comparisons without ablations that hold total token count, parameter count, optimizer schedule, and benchmark selection fixed while varying only the presence of cross-domain data. A controlled comparison (AD-only vs. Embodied-only vs. joint at matched compute) is required to rule out that observed gains arise simply from larger pooled data volume or unstated differences in scale and filtering.
  2. [Results] No error bars, standard deviations, or statistical significance tests are reported for the claimed outperformance across the 29 benchmarks. Without these, it is impossible to determine whether the reported SOTA margins reflect genuine improvements or variability in evaluation.
minor comments (2)
  1. [Abstract] The abstract states 'new records across 17 embodied AI benchmarks' and '12 autonomous driving benchmarks' but does not list the exact benchmark names or provide a summary table in the provided text; including such a table would improve clarity.
  2. [Methods] The paper mentions 'detailed analysis of our model design and training methodologies' but the excerpt does not include explicit data-exclusion rules or hyperparameter tables; adding these would aid assessment of reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing our strongest honest defense of the manuscript while acknowledging where additional clarification or discussion would strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract and Methods] The central claim that the two domains 'exhibit strong positive transfer and mutually reinforce one another' is load-bearing for the paper's contribution yet rests on benchmark comparisons without ablations that hold total token count, parameter count, optimizer schedule, and benchmark selection fixed while varying only the presence of cross-domain data. A controlled comparison (AD-only vs. Embodied-only vs. joint at matched compute) is required to rule out that observed gains arise simply from larger pooled data volume or unstated differences in scale and filtering.

    Authors: We agree that an ideal controlled ablation holding every hyperparameter and compute budget fixed would provide the most direct evidence for positive cross-domain transfer. Our multi-stage pipeline, however, incorporates domain-specific data curation, progressive alignment stages, and CoT/RL fine-tuning that are not trivially separable while preserving identical token counts and schedules. The manuscript already compares MiMo-Embodied against both single-domain foundation models and specialized baselines; the consistent gains across 29 diverse benchmarks (many of which use different evaluation protocols) are difficult to attribute solely to data volume. We will revise the Methods and Discussion sections to more explicitly discuss these design choices, the rationale for our training stages, and the limitations of the current evidence with respect to fully isolated ablations.
    Revision: partial

  2. Referee: [Results] No error bars, standard deviations, or statistical significance tests are reported for the claimed outperformance across the 29 benchmarks. Without these, it is impossible to determine whether the reported SOTA margins reflect genuine improvements or variability in evaluation.

    Authors: We acknowledge that reporting variability would improve interpretability. Large-scale foundation-model training and evaluation on 29 benchmarks incurs prohibitive compute costs for repeated independent runs, which is why we follow the common practice in the field of reporting results from the primary training run. We will add a concise statement in the revised Results section describing the evaluation protocol, noting the single-run nature of the numbers, and discussing why the margins appear robust given the breadth of tasks and the outperformance relative to multiple classes of baselines.
    Revision: partial
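
Some variability can be estimated from a single run by resampling evaluation items rather than retraining. The sketch below is an illustration under stated assumptions, not something the paper or the rebuttal describes: it uses a paired percentile bootstrap over per-item scores, and the function name `paired_bootstrap_ci` and its defaults are hypothetical. It captures evaluation-set sampling noise only, not run-to-run training variance.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Both sequences hold per-item scores on the SAME evaluation items,
    so item indices are resampled jointly (paired).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Draw n item indices with replacement and average the paired differences.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Applied to the per-item correctness of two models on one benchmark, an interval that excludes zero suggests the margin is not explained by which evaluation items happened to be sampled.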

Circularity Check

0 steps flagged

No circularity: empirical benchmark claims are self-contained

full rationale

The paper is a technical report describing model training (multi-stage learning, curated data, CoT/RL fine-tuning) and reporting SOTA results on 17 embodied AI plus 12 autonomous driving benchmarks. No derivation chain, equations, or first-principles predictions are presented whose outputs reduce by construction to fitted inputs, self-citations, or renamed ansatzes. Positive transfer is asserted from observed performance differences rather than any definitional equivalence or load-bearing self-citation. The work is therefore self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim depends on the validity of the chosen benchmarks as proxies for real capability and on the assumption that multi-stage training plus CoT/RL produces genuine transfer rather than dataset-size effects. No machine-checked proofs or parameter-free derivations are present.

free parameters (2)
  • multi-stage training schedule and data mixture ratios
    Specific ordering of stages and proportions of driving versus embodied data are selected to produce the reported transfer gains.
  • CoT/RL fine-tuning hyperparameters
    Reward functions, chain-of-thought prompting templates, and RL hyperparameters are tuned to achieve the final benchmark numbers.
axioms (1)
  • [domain assumption] The 17 embodied and 12 driving benchmarks are fair, comprehensive, and representative of downstream real-world performance.
    SOTA claims rest entirely on these benchmark scores.
invented entities (1)
  • MiMo-Embodied cross-embodied foundation model [no independent evidence]
    purpose: Single model that jointly handles autonomous driving and embodied AI tasks
    New model introduced in this work; no independent prior validation exists outside the paper.



Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

  2. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.

  3. RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    RAG-KT frames cross-platform knowledge tracing as context-constrained LLM inference by building unified multi-source context via Question Group abstractions and retrieving complementary reliable context for grounded p...

  4. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  5. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.

  6. Large Vision-Language Models Get Lost in Attention

    cs.AI 2026-05 unverdicted novelty 6.0

    In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

  7. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  8. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  9. Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...

  10. Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    A routing framework maintains three parallel 3D feature streams for LiDAR, 4D radar, and fusion, with a lightweight router using weather prompts to dynamically weight them and auxiliary supervision to keep branches di...

  11. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 conditional novelty 6.0

    FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.

  12. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 unverdicted novelty 6.0

    FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

  13. PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

    cs.RO 2026-01 unverdicted novelty 6.0

    PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

  14. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 10 Pith papers · 15 internal anchors

  1. [1]

    Claude 3.7 sonnet and claude code

    Anthropic. Claude 3.7 sonnet and claude code. 2025

  2. [2]

    Claude sonnet 4

    Anthropic. Claude sonnet 4. 2025

  3. [3]

    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  6. [6]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

  7. [7]

    Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

    Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024

  8. [8]

    arXiv preprint arXiv:2510.25122 (2025)

    Jiahong Chen, Jing Wang, Long Chen, Chuwei Cai, and Jinghui Lu. Nanovla: Routing decoupled vision-language understanding for nano-sized generalist robotic policies.arXiv preprint arXiv:2510.25122, 2025

  9. [9]

    Automated evaluation of large vision-language models on self-driving corner cases

    Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7817–7826, 2025

  10. [10]

    Egoplan-bench: Benchmarking egocentric embodied planning with multimodal large language models,

    Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning.arXiv preprint arXiv:2312.06722, 2023

  11. [11]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision, pages 720–736, 2018

  12. [12]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024

  13. [13]

    Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

  14. [14]

    Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

    Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

  15. [15]

    Embspatial-bench: Benchmarking spatial un- derstanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models.arXiv preprint arXiv:2406.05756, 2024

  16. [16]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025

  17. [17]

    Gemini 2.5 pro preview: even better coding performance.https://developers.googleblog.com/en/ gemini-2-5-pro-io-improved-coding-performance/, 2025

    Google. Gemini 2.5 pro preview: even better coding performance.https://developers.googleblog.com/en/ gemini-2-5-pro-io-improved-coding-performance/, 2025. Accessed: 2025-05-06

  18. [18]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. 24 In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  19. [19]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  20. [20]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

  21. [21]

    Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

    Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

  22. [22]

    DriveAction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025

    Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, and Xianpeng Lang. Driveaction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025

  23. [23]

    St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549, 2022

  24. [24]

    Robotron-drive: All-in-one large multimodal model for autonomous driving

    Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025

  25. [25]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  26. [26]

    Drivelmm- o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding,

    Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding.arXiv preprint arXiv:2503.10621, 2025

  27. [27]

    Robobrain: A unified brain model for robotic manipulation from abstract to concrete

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1724–1734, 2025

  28. [28]

    Ku, Qian Liu, and Wenhu Chen

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.arXiv preprint arXiv:2405.01483, 2024

  29. [29]

    Adapt: Action-aware driving caption transformer

    Bu Jin and Haotian Liu. Adapt: Action-aware driving caption transformer. InCAAI International Conference on Artificial Intelligence, pages 473–477, 2023

  30. [30]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251, 2016

  31. [31]

    Textual explanations for self-driving vehicles

    Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. InProceedings of the European conference on computer vision, pages 563–578, 2018

  32. [32]

    ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

  33. [33]

    Can lvlms obtain a driver’s license? a benchmark towards reliable agi for autonomous driving

    Yuhang Lu, Yichen Yao, Jiadong Tu, Jiangnan Shao, Yuexin Ma, and Xinge Zhu. Can lvlms obtain a driver’s license? a benchmark towards reliable agi for autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5838–5846, 2025

  34. [34]

    Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes

    Yuhao Lu, Yixuan Fan, Beixing Deng, Fangfu Liu, Yali Li, and Shengjin Wang. Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 976–983, 2023

  35. [35]

    Visual embodied brain: Let multimodal large language models see, think, and control in spaces,

    Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025. 25

  36. [36]

    Sqa3d: Situated question answering in 3d scenes,

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

  37. [37]

    Drama: Joint risk localization and captioning in driving

    Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1043–1052, 2023

  38. [38]

    Lingoqa: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269, 2024

  39. [39]

    Affordance detection of tool parts from geometric features

    Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. InIEEE International Conference on Robotics and Automation, pages 1374–1381, 2015

  40. [40]

    Teaching clip to count to ten

    Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3170–3180, 2023

  41. [41]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

  42. [42]

    EgoPlan-Bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

    Lu Qiu, Yi Chen, Yuying Ge, Yixiao Ge, Ying Shan, and Xihui Liu. Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

  43. [43]

    Vision language models are blind

    Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words. arXiv preprint arXiv:2407.06581, 2024

  44. [44]

    Sat: Spatial aptitude training for multimodal language models

    Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024

  45. [45]

    Robovqa: Multimodal long-horizon reasoning for robotics

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In IEEE International Conference on Robotics and Automation, pages 645–652, 2024

  46. [46]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European Conference on Computer Vision, pages 256–274, 2024

  47. [47]

    Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

  48. [48]

    Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation

    Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation. In Proceedings of ACM International Conference on Multimedia, pages 12706–12713, 2025

  49. [49]

    Robobrain 2.0 technical report

    BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025

  50. [50]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  51. [51]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2024

  52. [52]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736, 2023

  53. [53]

    Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025

  54. [54]

    The all-seeing project v2: Towards general relation comprehension of the open world

    Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. In European Conference on Computer Vision, pages 471–490, 2024

  55. [55]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  56. [56]

    Embodied scene understanding for vision language models via metavqa

    Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, and Bolei Zhou. Embodied scene understanding for vision language models via metavqa. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22453–22464, 2025

  57. [57]

    Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, et al. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20270–20281, 2023

  58. [58]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

  59. [59]

    MiMo-VL technical report

    LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. URL https://arxiv.org/abs/2506.03569

  60. [60]

    Magma: A foundation model for multimodal ai agents

    Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025

  61. [61]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

  62. [62]

    Robopoint: A vision-language model for spatial affordance prediction for robotics

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

  63. [63]

    From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

    Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, and Jianye Hao. From seeing to doing: Bridging reasoning and decision for robotic manipulation. arXiv preprint arXiv:2505.08548, 2025

  64. [64]

    Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of Annual Meeting of the Association for Computational Linguistics, pages 15134–15186, 2025

  65. [65]

    FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685, 2025

  66. [66]

    Minidrive: More efficient vision-language models with multi-level 2d features as text tokens for autonomous driving

    Enming Zhang, Xingyuan Dai, Min Huang, Yisheng Lv, and Qinghai Miao. Minidrive: More efficient vision-language models with multi-level 2d features as text tokens for autonomous driving. arXiv preprint arXiv:2409.07267, 2024

  67. [67]

    NavA3: Understanding any instruction, navigating anywhere, finding anything

    Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Haoxiang Fu, Xinyu Zheng, Pengwei Wang, Zhongyuan Wang, Wenbo Ding, and Shanghang Zhang. Nava3: Understanding any instruction, navigating anywhere, finding anything. arXiv preprint arXiv:2508.04598, 2025

  68. [68]

    MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024

  69. [69]

    RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics

    Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv preprint arXiv:2506.04308, 2025

  70. [70]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

7 Contributions and Acknowledgments

Core Contributors

• Xiaoshuai Hao
• Lei Zhou
• Zhijian Huang
• ...