MiMo-Embodied: X-Embodied Foundation Model Technical Report

Bing Wang; Chaofan Zhang; Chenxu Dang; Chuhong Gong; Diyun Xiang; Fuli Luo; Guang Chen; Guang Li; Haiyang Sun; Hanbing Li

arxiv: 2511.16518 · v2 · submitted 2025-11-20 · 💻 cs.RO · cs.CL· cs.CV

MiMo-Embodied: X-Embodied Foundation Model Technical Report

Xiaoshuai Hao , Lei Zhou , Zhijian Huang , Zhiwen Hou , Yingbo Tang , Lingfeng Zhang , Guang Li , Zheng Lu

show 36 more authors

Shuhuai Ren Xianhui Meng Yuchen Zhang Jing Wu Jinghui Lu Chenxu Dang Jiayi Guan Jianhua Wu Zhiyi Hou Hanbing Li Shumeng Xia Mingliang Zhou Yinan Zheng Zihao Yue Shuhao Gu Hao Tian Yuannan Shen Jianwei Cui Wen Zhang Shaoqing Xu Bing Wang Haiyang Sun Zeyu Zhu Yuncheng Jiang Zibin Guo Chuhong Gong Chaofan Zhang Wenbo Ding Kun Ma Guang Chen Rui Cai Diyun Xiang Heng Qu Fuli Luo Hangjun Ye Long Chen

This is my paper

Pith reviewed 2026-05-17 20:37 UTC · model grok-4.3

classification 💻 cs.RO cs.CLcs.CV

keywords embodied AIautonomous drivingfoundation modelcross-embodiment transfertask planningaffordance predictionreinforcement learningchain of thought

0 comments

The pith

MiMo-Embodied is the first foundation model to reach state-of-the-art results in both autonomous driving and embodied AI by training them together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MiMo-Embodied as a single model trained across both autonomous driving and embodied AI domains. It reports new benchmark records on 17 embodied tasks covering planning, affordance, and spatial understanding, plus 12 driving tasks in perception, prediction, and planning. The authors show that multi-stage learning, carefully built datasets, and chain-of-thought plus reinforcement learning fine-tuning produce positive transfer so that gains in one domain improve the other. A reader would care because this suggests one model can handle diverse physical-world interactions instead of needing separate specialized systems for cars and robots.

Core claim

MiMo-Embodied integrates autonomous driving and embodied AI into one foundation model and, through multi-stage learning, curated data construction, and CoT/RL fine-tuning, achieves state-of-the-art performance on 17 embodied AI benchmarks and 12 autonomous driving benchmarks while demonstrating strong positive transfer that lets the domains mutually reinforce each other.

What carries the argument

The cross-embodiment training pipeline of multi-stage learning, curated data construction, and CoT/RL fine-tuning that creates mutual reinforcement between driving and embodied tasks.

If this is right

A single model can outperform specialized open-source, closed-source, and task-specific baselines in both driving and embodied settings.
Performance in task planning, affordance prediction, spatial understanding, environmental perception, status prediction, and driving planning all improve together.
The two domains exhibit positive transfer so that progress in one directly benefits the other.
Open-sourcing the model and training details enables further work on unified physical-world systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint-training approach could be tested on additional embodied domains such as dexterous manipulation or multi-robot coordination.
If the transfer holds at larger scales, unified models may eventually replace collections of narrow specialists for real-world deployment.
Future benchmarks that control strictly for data volume and model size would clarify how much of the gain is truly from cross-embodiment sharing.

Load-bearing premise

The benchmark gains come from genuine cross-domain transfer rather than from simply using more total training data or different model scales than the baselines.

What would settle it

An ablation experiment that trains separate models on matched total data volume and compute and finds no performance difference from the joint model would show the claimed transfer is not the main driver.

read the original abstract

We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MiMo-Embodied trains one model on both driving and embodied data and claims SOTA across many benchmarks, but the positive transfer story rests on untested assumptions about data mixing.

read the letter

The key takeaway is that this technical report introduces MiMo-Embodied as a single model handling both autonomous driving and embodied AI, claiming superior performance on many benchmarks in each and attributing it to positive transfer from joint training. They have done the work of collecting and curating data from both areas, applying a multi-stage schedule, and using chain-of-thought and reinforcement learning steps in fine-tuning. That combination is what they say allows the mutual reinforcement. Making the model and code public is a concrete step that helps the community check the results. Where it falls short is in pinning down why the results are good. The abstract highlights outperformance over baselines, but without ablations that match total data volume, model size, and training steps while only changing the domain mix, it's difficult to credit the cross-embodiment aspect specifically. The concern about missing controls for transfer versus scale seems to apply here. Details on variance across runs or exact baseline setups would also help. Readers focused on foundation models for real-world robotics and vehicle autonomy will find this relevant. It offers a practical example of mixing datasets that were usually treated apart. The work shows clear effort in implementation and openness, so it makes sense to put it through peer review for closer examination of the empirical support. I recommend accepting it for review and suggesting the authors add the missing ablation studies on the data mixture effects.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiMo-Embodied, presented as the first cross-embodied foundation model that integrates Autonomous Driving and Embodied AI. It claims new state-of-the-art results across 17 embodied AI benchmarks (Task Planning, Affordance Prediction, Spatial Understanding) and 12 autonomous driving benchmarks (Environmental Perception, Status Prediction, Driving Planning), outperforming open-source, closed-source, and specialized baselines. The authors attribute the gains to multi-stage learning, curated data construction, and CoT/RL fine-tuning, which they argue produce strong positive transfer between the two domains.

Significance. If the empirical claims are substantiated, the work would be significant for demonstrating that joint training across driving and embodied domains can yield mutual reinforcement rather than interference, supporting the development of more generalist foundation models for robotics and autonomous systems. The open-sourcing of code and models would further aid reproducibility and follow-on research.

major comments (2)

[Abstract and Methods] Abstract and Methods: The central claim that the two domains 'exhibit strong positive transfer and mutually reinforce one another' is load-bearing for the paper's contribution yet rests on benchmark comparisons without ablations that hold total token count, parameter count, optimizer schedule, and benchmark selection fixed while varying only the presence of cross-domain data. A controlled comparison (AD-only vs. Embodied-only vs. joint at matched compute) is required to rule out that observed gains arise simply from larger pooled data volume or unstated differences in scale and filtering.
[Results] Results: No error bars, standard deviations, or statistical significance tests are reported for the claimed outperformance across the 29 benchmarks. Without these, it is impossible to determine whether the reported SOTA margins reflect genuine improvements or variability in evaluation.

minor comments (2)

[Abstract] The abstract states 'new records across 17 embodied AI benchmarks' and '12 autonomous driving benchmarks' but does not list the exact benchmark names or provide a summary table in the provided text; including such a table would improve clarity.
[Methods] The paper mentions 'detailed analysis of our model design and training methodologies' but the excerpt does not include explicit data-exclusion rules or hyperparameter tables; adding these would aid assessment of reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing our strongest honest defense of the manuscript while acknowledging where additional clarification or discussion would strengthen the presentation.

read point-by-point responses

Referee: [Abstract and Methods] Abstract and Methods: The central claim that the two domains 'exhibit strong positive transfer and mutually reinforce one another' is load-bearing for the paper's contribution yet rests on benchmark comparisons without ablations that hold total token count, parameter count, optimizer schedule, and benchmark selection fixed while varying only the presence of cross-domain data. A controlled comparison (AD-only vs. Embodied-only vs. joint at matched compute) is required to rule out that observed gains arise simply from larger pooled data volume or unstated differences in scale and filtering.

Authors: We agree that an ideal controlled ablation holding every hyperparameter and compute budget fixed would provide the most direct evidence for positive cross-domain transfer. Our multi-stage pipeline, however, incorporates domain-specific data curation, progressive alignment stages, and CoT/RL fine-tuning that are not trivially separable while preserving identical token counts and schedules. The manuscript already compares MiMo-Embodied against both single-domain foundation models and specialized baselines; the consistent gains across 29 diverse benchmarks (many of which use different evaluation protocols) are difficult to attribute solely to data volume. We will revise the Methods and Discussion sections to more explicitly discuss these design choices, the rationale for our training stages, and the limitations of the current evidence with respect to fully isolated ablations. revision: partial
Referee: [Results] Results: No error bars, standard deviations, or statistical significance tests are reported for the claimed outperformance across the 29 benchmarks. Without these, it is impossible to determine whether the reported SOTA margins reflect genuine improvements or variability in evaluation.

Authors: We acknowledge that reporting variability would improve interpretability. Large-scale foundation-model training and evaluation on 29 benchmarks incurs prohibitive compute costs for repeated independent runs, which is why we follow the common practice in the field of reporting results from the primary training run. We will add a concise statement in the revised Results section describing the evaluation protocol, noting the single-run nature of the numbers, and discussing why the margins appear robust given the breadth of tasks and the outperformance relative to multiple classes of baselines. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark claims are self-contained

full rationale

The paper is a technical report describing model training (multi-stage learning, curated data, CoT/RL fine-tuning) and reporting SOTA results on 17 embodied AI plus 12 autonomous driving benchmarks. No derivation chain, equations, or first-principles predictions are presented whose outputs reduce by construction to fitted inputs, self-citations, or renamed ansatzes. Positive transfer is asserted from observed performance differences rather than any definitional equivalence or load-bearing self-citation. The work is therefore self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claim depends on the validity of the chosen benchmarks as proxies for real capability and on the assumption that multi-stage training plus CoT/RL produces genuine transfer rather than dataset-size effects. No machine-checked proofs or parameter-free derivations are present.

free parameters (2)

multi-stage training schedule and data mixture ratios
Specific ordering of stages and proportions of driving versus embodied data are selected to produce the reported transfer gains.
CoT/RL fine-tuning hyperparameters
Reward functions, chain-of-thought prompting templates, and RL hyperparameters are tuned to achieve the final benchmark numbers.

axioms (1)

domain assumption The 17 embodied and 12 driving benchmarks are fair, comprehensive, and representative of downstream real-world performance.
SOTA claims rest entirely on these benchmark scores.

invented entities (1)

MiMo-Embodied cross-embodied foundation model no independent evidence
purpose: Single model that jointly handles autonomous driving and embodied AI tasks
New model introduced in this work; no independent prior validation exists outside the paper.

pith-pipeline@v0.9.0 · 5613 in / 1372 out tokens · 36675 ms · 2026-05-17T20:37:24.065020+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The MiMo-Embodied architecture consists of three main components: (1) a Vision Transformer (ViT) for encoding visual inputs; (2) a projector...; and (3) the LLM... progressive four-stage training strategy... Stage 1: Embodied AI Supervised Fine-tuning... Stage 4: Reinforcement Learning Fine-Tuning
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 7.0

VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 7.0

VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.
RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation
cs.AI 2026-04 unverdicted novelty 7.0

RAG-KT frames cross-platform knowledge tracing as context-constrained LLM inference by building unified multi-source context via Question Group abstractions and retrieving complementary reliable context for grounded p...
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
cs.RO 2026-05 unverdicted novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 6.0

VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.
Large Vision-Language Models Get Lost in Attention
cs.AI 2026-05 unverdicted novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection
cs.CV 2026-04 unverdicted novelty 6.0

A routing framework maintains three parallel 3D feature streams for LiDAR, 4D radar, and fusion, with a lightweight router using weather prompts to dynamically weight them and auxiliary supervision to keep branches di...
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 conditional novelty 6.0

FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 unverdicted novelty 6.0

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
cs.CV 2026-04 unverdicted novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 10 Pith papers · 15 internal anchors

[1]

Claude 3.7 sonnet and claude code

Anthropic. Claude 3.7 sonnet and claude code. 2025

work page 2025
[2]

Claude sonnet 4

Anthropic. Claude sonnet 4. 2025

work page 2025
[3]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024

work page 2024
[8]

arXiv preprint arXiv:2510.25122 (2025)

Jiahong Chen, Jing Wang, Long Chen, Chuwei Cai, and Jinghui Lu. Nanovla: Routing decoupled vision-language understanding for nano-sized generalist robotic policies.arXiv preprint arXiv:2510.25122, 2025

work page arXiv 2025
[9]

Automated evaluation of large vision-language models on self-driving corner cases

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7817–7826, 2025

work page 2025
[10]

Egoplan-bench: Benchmarking egocentric embodied planning with multimodal large language models,

Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning.arXiv preprint arXiv:2312.06722, 2023

work page arXiv 2023
[11]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision, pages 720–736, 2018

work page 2018
[12]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024

work page 2024
[13]

Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

work page 2025
[14]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

work page 2024
[15]

Embspatial-bench: Benchmarking spatial un- derstanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models.arXiv preprint arXiv:2406.05756, 2024

work page arXiv 2024
[16]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Gemini 2.5 pro preview: even better coding performance.https://developers.googleblog.com/en/ gemini-2-5-pro-io-improved-coding-performance/, 2025

Google. Gemini 2.5 pro preview: even better coding performance.https://developers.googleblog.com/en/ gemini-2-5-pro-io-improved-coding-performance/, 2025. Accessed: 2025-05-06

work page 2025
[18]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. 24 In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

work page 2022
[19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

work page arXiv 2025
[22]

DriveAction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025

Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, and Xianpeng Lang. Driveaction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025

work page arXiv 2025
[23]

St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549, 2022

work page 2022
[24]

Robotron-drive: All-in-one large multimodal model for autonomous driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025

work page 2025
[25]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Drivelmm- o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding,

Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding.arXiv preprint arXiv:2503.10621, 2025

work page arXiv 2025
[27]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1724–1734, 2025

work page 2025
[28]

Ku, Qian Liu, and Wenhu Chen

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.arXiv preprint arXiv:2405.01483, 2024

work page arXiv 2024
[29]

Adapt: Action-aware driving caption transformer

Bu Jin and Haotian Liu. Adapt: Action-aware driving caption transformer. InCAAI International Conference on Artificial Intelligence, pages 473–477, 2023

work page 2023
[30]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251, 2016

work page 2016
[31]

Textual explanations for self-driving vehicles

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. InProceedings of the European conference on computer vision, pages 563–578, 2018

work page 2018
[32]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Can lvlms obtain a driver’s license? a benchmark towards reliable agi for autonomous driving

Yuhang Lu, Yichen Yao, Jiadong Tu, Jiangnan Shao, Yuexin Ma, and Xinge Zhu. Can lvlms obtain a driver’s license? a benchmark towards reliable agi for autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5838–5846, 2025

work page 2025
[34]

Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes

Yuhao Lu, Yixuan Fan, Beixing Deng, Fangfu Liu, Yali Li, and Shengjin Wang. Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 976–983, 2023

work page 2023
[35]

Visual embodied brain: Let multimodal large language models see, think, and control in spaces,

Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025. 25

work page arXiv 2025
[36]

Sqa3d: Situated question answering in 3d scenes,

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

work page arXiv 2022
[37]

Drama: Joint risk localization and captioning in driving

Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1043–1052, 2023

work page 2023
[38]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269, 2024

work page 2024
[39]

Affordance detection of tool parts from geometric features

Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. InIEEE International Conference on Robotics and Automation, pages 1374–1381, 2015

work page 2015
[40]

Teaching clip to count to ten

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3170–3180, 2023

work page 2023
[41]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

work page 2024
[42]

EgoPlan-Bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

Lu Qiu, Yi Chen, Yuying Ge, Yixiao Ge, Ying Shan, and Xihui Liu. Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

work page arXiv 2024
[43]

Vision language models are blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words.arXiv preprint arXiv:2407.06581, 2024

work page arXiv 2024
[44]

Sat: Spa- tial aptitude training for multimodal language models

Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024

work page arXiv 2024
[45]

Robovqa: Multimodal long-horizon reasoning for robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. InIEEE International Conference on Robotics and Automation, pages 645–652, 2024

work page 2024
[46]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274, 2024

work page 2024
[47]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

work page 2025
[48]

Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation

Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation. InProceedings of ACM International Conference on Multimedia, pages 12706–12713, 2025

work page 2025
[49]

Robobrain 2.0 technical report

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025

work page arXiv 2025
[50]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advancesin Neural Information Processing Systems, 37:87310–87356, 2024

Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advancesin Neural Information Processing Systems, 37:87310–87356, 2024

work page 2024
[52]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736, 2023

work page 2023
[53]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025. 26

work page 2025
[54]

The all-seeing project v2: Towards general relation comprehension of the open world

Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. In European Conference on Computer Vision, pages 471–490, 2024

work page 2024
[55]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Embodied scene understanding for vision language models via metavqa

Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, and Bolei Zhou. Embodied scene understanding for vision language models via metavqa. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22453–22464, 2025

work page 2025
[57]

Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, et al. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20270–20281, 2023

work page 2023
[58]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

work page 2024
[59]

MiMo-VL technical report

LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. URLhttps://arxiv.org/abs/2506.03569

work page arXiv 2025
[60]

Magma: A foundation model for multimodal ai agents

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025

work page 2025
[61]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

work page 2025
[62]

Robopoint: A vision-language model for spatial affordance prediction for robotics,

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

work page arXiv 2024
[63]

From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, and Jianye Hao. From seeing to doing: Bridging reasoning and decision for robotic manipulation.arXiv preprint arXiv:2505.08548, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[64]

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of Annual Meeting of the Association for Computational Linguistics, pages 15134–15186, 2025

work page 2025
[65]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Future- sightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Minidrive: More efficient vision-language models with multi-level 2d features as text tokens for autonomous driving.arXiv preprint arXiv:2409.07267, 2024

Enming Zhang, Xingyuan Dai, Min Huang, Yisheng Lv, and Qinghai Miao. Minidrive: More efficient vision-language models with multi-level 2d features as text tokens for autonomous driving.arXiv preprint arXiv:2409.07267, 2024

work page arXiv 2024
[67]

arXiv preprint arXiv:2508.04598, 2025

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Haoxiang Fu, Xinyu Zheng, Pengwei Wang, Zhongyuan Wang, Wenbo Ding, and Shanghang Zhang.nava3: Understanding any instruction, navigating anywhere, finding anything. arXiv preprint arXiv:2508.04598, 2025

work page arXiv 2025
[68]

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans?arXiv preprint arXiv:2408.13257, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[69]

RoboRefer: Towards spatial referring with rea- soning in vision-language models for robotics,

Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv preprint arXiv:2506.04308, 2025

work page arXiv 2025
[70]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 27 7 Contributions and Acknowledgments Core Contributors •Xiaoshuai Hao •Lei Zhou •Zhijian Huang •...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

Claude 3.7 sonnet and claude code

Anthropic. Claude 3.7 sonnet and claude code. 2025

work page 2025

[2] [2]

Claude sonnet 4

Anthropic. Claude sonnet 4. 2025

work page 2025

[3] [3]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024

work page 2024

[8] [8]

arXiv preprint arXiv:2510.25122 (2025)

Jiahong Chen, Jing Wang, Long Chen, Chuwei Cai, and Jinghui Lu. Nanovla: Routing decoupled vision-language understanding for nano-sized generalist robotic policies.arXiv preprint arXiv:2510.25122, 2025

work page arXiv 2025

[9] [9]

Automated evaluation of large vision-language models on self-driving corner cases

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7817–7826, 2025

work page 2025

[10] [10]

Egoplan-bench: Benchmarking egocentric embodied planning with multimodal large language models,

Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning.arXiv preprint arXiv:2312.06722, 2023

work page arXiv 2023

[11] [11]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision, pages 720–736, 2018

work page 2018

[12] [12]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024

work page 2024

[13] [13]

Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

work page 2025

[14] [14]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

work page 2024

[15] [15]

Embspatial-bench: Benchmarking spatial un- derstanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models.arXiv preprint arXiv:2406.05756, 2024

work page arXiv 2024

[16] [16]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Gemini 2.5 pro preview: even better coding performance.https://developers.googleblog.com/en/ gemini-2-5-pro-io-improved-coding-performance/, 2025

Google. Gemini 2.5 pro preview: even better coding performance.https://developers.googleblog.com/en/ gemini-2-5-pro-io-improved-coding-performance/, 2025. Accessed: 2025-05-06

work page 2025

[18] [18]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. 24 In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

work page 2022

[19] [19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

work page arXiv 2025

[22] [22]

DriveAction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025

Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, and Xianpeng Lang. Driveaction: A benchmark for exploring human-like driving decisions in vla models.arXiv preprint arXiv:2506.05667, 2025

work page arXiv 2025

[23] [23]

St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549, 2022

work page 2022

[24] [24]

Robotron-drive: All-in-one large multimodal model for autonomous driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025

work page 2025

[25] [25]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

Drivelmm- o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding,

Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding.arXiv preprint arXiv:2503.10621, 2025

work page arXiv 2025

[27] [27]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1724–1734, 2025

work page 2025

[28] [28]

Ku, Qian Liu, and Wenhu Chen

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.arXiv preprint arXiv:2405.01483, 2024

work page arXiv 2024

[29] [29]

Adapt: Action-aware driving caption transformer

Bu Jin and Haotian Liu. Adapt: Action-aware driving caption transformer. InCAAI International Conference on Artificial Intelligence, pages 473–477, 2023

work page 2023

[30] [30]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251, 2016

work page 2016

[31] [31]

Textual explanations for self-driving vehicles

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. InProceedings of the European conference on computer vision, pages 563–578, 2018

work page 2018

[32] [32]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Can lvlms obtain a driver’s license? a benchmark towards reliable agi for autonomous driving

Yuhang Lu, Yichen Yao, Jiadong Tu, Jiangnan Shao, Yuexin Ma, and Xinge Zhu. Can lvlms obtain a driver’s license? a benchmark towards reliable agi for autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5838–5846, 2025

work page 2025

[34] [34]

Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes

Yuhao Lu, Yixuan Fan, Beixing Deng, Fangfu Liu, Yali Li, and Shengjin Wang. Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 976–983, 2023

work page 2023

[35] [35]

Visual embodied brain: Let multimodal large language models see, think, and control in spaces,

Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025. 25

work page arXiv 2025

[36] [36]

Sqa3d: Situated question answering in 3d scenes,

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

work page arXiv 2022

[37] [37]

Drama: Joint risk localization and captioning in driving

Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1043–1052, 2023

work page 2023

[38] [38]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269, 2024

work page 2024

[39] [39]

Affordance detection of tool parts from geometric features

Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. InIEEE International Conference on Robotics and Automation, pages 1374–1381, 2015

work page 2015

[40] [40]

Teaching clip to count to ten

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3170–3180, 2023

work page 2023

[41] [41]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

work page 2024

[42] [42]

EgoPlan-Bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

Lu Qiu, Yi Chen, Yuying Ge, Yixiao Ge, Ying Shan, and Xihui Liu. Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

work page arXiv 2024

[43] [43]

Vision language models are blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words.arXiv preprint arXiv:2407.06581, 2024

work page arXiv 2024

[44] [44]

Sat: Spa- tial aptitude training for multimodal language models

Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024

work page arXiv 2024

[45] [45]

Robovqa: Multimodal long-horizon reasoning for robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. InIEEE International Conference on Robotics and Automation, pages 645–652, 2024

work page 2024

[46] [46]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274, 2024

work page 2024

[47] [47]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

work page 2025

[48] [48]

Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation

Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation. InProceedings of ACM International Conference on Multimedia, pages 12706–12713, 2025

work page 2025

[49] [49]

Robobrain 2.0 technical report

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025

work page arXiv 2025

[50] [50]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advancesin Neural Information Processing Systems, 37:87310–87356, 2024

Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advancesin Neural Information Processing Systems, 37:87310–87356, 2024

work page 2024

[52] [52]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736, 2023

work page 2023

[53] [53]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025. 26

work page 2025

[54] [54]

The all-seeing project v2: Towards general relation comprehension of the open world

Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. In European Conference on Computer Vision, pages 471–490, 2024

work page 2024

[55] [55]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

Embodied scene understanding for vision language models via metavqa

Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, and Bolei Zhou. Embodied scene understanding for vision language models via metavqa. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22453–22464, 2025

work page 2025

[57] [57]

Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, et al. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20270–20281, 2023

work page 2023

[58] [58]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

work page 2024

[59] [59]

MiMo-VL technical report

LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. URLhttps://arxiv.org/abs/2506.03569

work page arXiv 2025

[60] [60]

Magma: A foundation model for multimodal ai agents

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025

work page 2025

[61] [61]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

work page 2025

[62] [62]

Robopoint: A vision-language model for spatial affordance prediction for robotics,

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

work page arXiv 2024

[63] [63]

From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, and Jianye Hao. From seeing to doing: Bridging reasoning and decision for robotic manipulation.arXiv preprint arXiv:2505.08548, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[64] [64]

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of Annual Meeting of the Association for Computational Linguistics, pages 15134–15186, 2025

work page 2025

[65] [65]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Future- sightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[66] [66]

Minidrive: More efficient vision-language models with multi-level 2d features as text tokens for autonomous driving.arXiv preprint arXiv:2409.07267, 2024

Enming Zhang, Xingyuan Dai, Min Huang, Yisheng Lv, and Qinghai Miao. Minidrive: More efficient vision-language models with multi-level 2d features as text tokens for autonomous driving.arXiv preprint arXiv:2409.07267, 2024

work page arXiv 2024

[67] [67]

arXiv preprint arXiv:2508.04598, 2025

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Haoxiang Fu, Xinyu Zheng, Pengwei Wang, Zhongyuan Wang, Wenbo Ding, and Shanghang Zhang.nava3: Understanding any instruction, navigating anywhere, finding anything. arXiv preprint arXiv:2508.04598, 2025

work page arXiv 2025

[68] [68]

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans?arXiv preprint arXiv:2408.13257, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[69] [69]

RoboRefer: Towards spatial referring with rea- soning in vision-language models for robotics,

Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv preprint arXiv:2506.04308, 2025

work page arXiv 2025

[70] [70]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 27 7 Contributions and Acknowledgments Core Contributors •Xiaoshuai Hao •Lei Zhou •Zhijian Huang •...

work page internal anchor Pith review Pith/arXiv arXiv 2025