pith. machine review for the scientific record. sign in

arxiv: 2511.17411 · v2 · submitted 2025-11-21 · 💻 cs.RO · cs.LG

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Pith reviewed 2026-05-17 20:17 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords robotic foundation models3D perceptionvision language modelsembodied controldata efficiencygeneralizationrobot learning
0
0 comments X

The pith

A robot foundation model matches leaders using 20 times fewer demonstrations by adding 3D perception from ordinary images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robotic foundation models struggle to generalize because they are fine-tuned from vision-language models that only understand flat images. The authors show that first teaching a model to reason about three-dimensional space using labeled non-robot photos creates a stronger base. They build SPEAR-VLM to predict object locations in 3D from one picture, then use it inside SPEAR-1 for controlling robots with words. This lets the system reach or exceed the performance of current best models while training on much less robot-specific data from many sources.

Core claim

The paper presents SPEAR-1, a robotic foundation model that combines grounded 3D perception with language-instructed embodied control. It starts by training SPEAR-VLM on non-robotic images augmented with 3D annotations to infer object coordinates in 3D space. Then SPEAR-1 is trained on approximately 45 million frames from 24 Open X-Embodiment datasets, achieving performance that matches or surpasses state-of-the-art models while requiring 20 times fewer robot demonstrations.

What carries the argument

SPEAR-VLM, a 3D-aware vision-language model that infers object coordinates in 3D space from a single 2D image and serves as the foundation for the robot controller.

If this is right

  • Embodied control becomes more reliable when the underlying perception includes explicit 3D spatial understanding.
  • Scaling robot learning no longer requires proportional increases in expensive robot demonstration data.
  • Models can leverage abundant public image datasets to bootstrap capabilities needed for physical tasks.
  • Performance on language-instructed tasks improves across different environments and robot types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether the same 3D enrichment helps in non-language tasks like pure visual navigation.
  • Combining this with other data sources such as video might further reduce the need for robot trials.
  • Deployment on real robots could reveal whether the 3D reasoning helps in dynamic or cluttered scenes specifically.

Load-bearing premise

3D spatial reasoning learned from annotated everyday images will transfer to make language-guided robot actions more reliable and general.

What would settle it

Running SPEAR-1 and a version without the 3D VLM component on the same set of new robot tasks and seeing if the 3D version shows no advantage in success rate or generalization.

Figures

Figures reproduced from arXiv: 2511.17411 by Aleksandar Yanev, Danda Pani Paudel, Giuliano Albanese, Jan-Nico Zaech, Luc Van Gool, Nikolay Nikolov, Sombit Dey.

Figure 1
Figure 1. Figure 1: (a) Most VLAs fail to show zero-shot performance on the challenging Franka (DROID) setup in unseen environments, without [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SPEAR-1 stages of training. Stage 0: General VLM pretraining on web scale data, e.g. PaliGemma. Stage 1: Integrate a mono depth vision encoder to build SPEAR-VLM and train it on embodied-inspired VQA tasks, e.g. 3D bounding box or object￾to-object distance estimation. We use 2D images from non-robotic data, enriched with 3D annotations. Stage 2: Add an action expert to train SPEAR-1 on robot demonstration … view at source ↗
Figure 3
Figure 3. Figure 3: SPEAR-VLM overview. Left: Training data mixture, auto computed spatial annotations and example question-answer pairs from each category. Right: High-level architecture with fusion between SigLIP and MoGe encoders and PaliGemma embeddings expansion with 3D tokens. This design equips SPEAR-VLM with explicit 3D understanding that serves as a strong foundation for SPEAR-VLA. nents of the distance between objec… view at source ↗
Figure 4
Figure 4. Figure 4: Real world evaluation on WidowX. SPEAR-1 is able to achieve 10% higher average task progress across all tasks than OpenVLA, a strong baseline in this setting. Bottom images correspond to the real-world tasks, whose performances are reported above. Average Apple in drawer and close Carrot on plate Carrot on plate (w/ elevation) Cover the pot Marker in cup 0 20 40 60 80 100 Task progress (%) 14 7 15 17 30 64… view at source ↗
Figure 5
Figure 5. Figure 5: Real world evaluation on Franka. We find that without any fine-tuning on the target environment, SPEAR-1 noticeably outperforms π0-FAST, and matches π0.5, even though both baselines are trained on 20× more robotic data from significantly more diverse environments. The bottom row shows challenging, varied Franka environments where SPEAR-1 maintains strong zero-shot performance. 4.1. Implementation details V… view at source ↗
Figure 6
Figure 6. Figure 6: 3D ablation environments on WidowX. (a) Training data subset from Bridge V2 [45]. (b) - (e) SIMPLER evaluation environ￾ments. A.2.2. VLA training details Reference Frames. In this work we focus on learning po￾sition control of fixed-base single-arm manipulators. Each control in the sequence At = [at, at+1, . . . at+H−1] is de￾fined as a delta with respect to the current end-effector carte￾sian pose ∆EE = [… view at source ↗
read the original abstract

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, $~\textbf{SPEAR-1}$: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on $\sim$45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $\pi_0$-FAST and $\pi_{0.5}$, while it uses 20$\times$ fewer robot demonstrations. This carefully-engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available at https://spear.insait.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SPEAR-1, a robotic foundation model that first trains SPEAR-VLM on 3D-annotated non-robotic images to acquire spatial reasoning, then integrates this into a policy trained on approximately 45M frames from 24 Open X-Embodiment datasets. The central claim is that this yields performance matching or exceeding baselines such as π₀-FAST and π₀.₅ while requiring 20× fewer robot demonstrations, by addressing the 2D limitation of standard VLMs for embodied 3D control.

Significance. If the performance gains are shown to stem specifically from the 3D transfer rather than confounding factors, the approach would offer a scalable path to reduce reliance on expensive robot data for generalist policies, with potential impact on data-efficient robot learning.

major comments (2)
  1. [Experimental section] Experimental section (and associated tables/figures): the headline result that SPEAR-1 matches or exceeds π₀-FAST and π₀.₅ with 20× fewer demonstrations is not supported by a controlled ablation that holds robot data volume, architecture, and training recipe fixed while varying only the VLM backbone (SPEAR-VLM vs. a standard VLM). Without this isolation, the observed delta cannot be attributed to 3D spatial reasoning acquired from non-robotic annotations.
  2. [§3 and §4] §3 and §4: the integration of SPEAR-VLM into the robot policy is described at a high level, but no quantitative metrics, baseline details, ablation studies, or error analysis are referenced in the provided abstract or summary; the full experimental results must be examined to verify whether the data support the 20× data-efficiency claim.
minor comments (2)
  1. [Abstract] Abstract: the notation $~textbf{SPEAR-1}$ contains a stray tilde and inconsistent bolding; standardize to SPEAR-1 throughout.
  2. [Abstract] The link to model weights and datasets is provided, which is a positive for reproducibility; ensure the 3D-annotated datasets are clearly documented with annotation protocol and coverage statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of experimental rigor. We respond to each major comment below and outline planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Experimental section] Experimental section (and associated tables/figures): the headline result that SPEAR-1 matches or exceeds π₀-FAST and π₀.₅ with 20× fewer demonstrations is not supported by a controlled ablation that holds robot data volume, architecture, and training recipe fixed while varying only the VLM backbone (SPEAR-VLM vs. a standard VLM). Without this isolation, the observed delta cannot be attributed to 3D spatial reasoning acquired from non-robotic annotations.

    Authors: We agree that isolating the VLM backbone while holding robot data volume, architecture, and training recipe fixed would strengthen attribution of gains to the 3D spatial reasoning. Our existing results compare against models trained on substantially larger robot datasets, supporting the overall data-efficiency claim. To directly address the concern, we will add a controlled ablation in the revised manuscript training an otherwise identical policy with a standard VLM backbone on the same ~45M frames. revision: yes

  2. Referee: [§3 and §4] §3 and §4: the integration of SPEAR-VLM into the robot policy is described at a high level, but no quantitative metrics, baseline details, ablation studies, or error analysis are referenced in the provided abstract or summary; the full experimental results must be examined to verify whether the data support the 20× data-efficiency claim.

    Authors: Sections 3 and 4 of the full manuscript contain the detailed integration architecture, quantitative success-rate metrics across tasks, baseline comparisons (including π₀-FAST and π₀.₅), ablation studies on the 3D components, and error analysis. The abstract summarizes the headline result; the 20× data-efficiency claim is supported by explicit data-volume comparisons in the experiments. We will add cross-references and a brief summary table in the revision to make these elements more immediately visible. revision: partial

Circularity Check

0 steps flagged

No circularity in the derivation chain

full rationale

The paper trains SPEAR-VLM on independently 3D-annotated non-robotic images to acquire spatial reasoning, then integrates the resulting backbone into SPEAR-1, which is trained on the public Open X-Embodiment robot datasets (~45 M frames). The headline performance claim (matching or exceeding π₀-FAST and π₀.₅ with 20× fewer demonstrations) is an empirical comparison against externally published baselines. No equation, training objective, or cited premise reduces the claimed 3D-transfer benefit or data-efficiency gain to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain. The derivation therefore remains self-contained against external data and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the unverified transfer of 3D perception learned from annotated images to robotic control; no free parameters or invented physical entities are described, but the approach introduces two new model names whose internal mechanisms are not detailed in the abstract.

axioms (1)
  • domain assumption 3D coordinate annotations can be reliably generated and added to large-scale non-robotic image collections
    This premise underpins the creation of SPEAR-VLM and is invoked when the authors describe enriching easy-to-collect image data.
invented entities (2)
  • SPEAR-VLM no independent evidence
    purpose: A 3D-aware vision-language model that infers object coordinates from single 2D images
    New intermediate model introduced to bridge 2D VLM limitations and robotic control
  • SPEAR-1 no independent evidence
    purpose: Robotic foundation model that integrates the 3D perception with language-instructed control
    Main contribution whose performance is claimed to exceed prior models with less data

pith-pipeline@v0.9.0 · 5642 in / 1355 out tokens · 31716 ms · 2026-05-17T20:17:08.962505+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 22 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andy Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Binkow...

  2. [2]

    Roboarena: Distributed real- world evaluation of generalist robot policies

    Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Ed- ward Hu, Fabio Ramos, et al. Roboarena: Distributed real- world evaluation of generalist robot policies. InProceedings of the Conference on Robot Learning (CoRL 2025), 2025. 2, 16

  3. [3]

    A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331,

    Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331,

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, Andr ´e Susano Pinto, Alexan- der Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisensch- los, Rishabh Kabra, Matthi...

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.pi 0: A vision-language- action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 3, 4, 8, 14, 15

  6. [6]

    Kevin Black, Noah Brown, James Darpinian, Karan Dha- balia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Re...

  7. [7]

    Spatialbot: Precise spatial understanding with vision lan- guage models,

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. arXiv preprint arXiv:2406.13642, 2024. 2

  8. [8]

    Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 14455–14465,

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4

  10. [10]

    Revla: Reverting vi- sual domain limitation of robotic foundation models.arXiv preprint arXiv:2409.15250, 2024

    Sombit Dey, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, and Danda Pani Paudel. Revla: Reverting vi- sual domain limitation of robotic foundation models.arXiv preprint arXiv:2409.15250, 2024. 3, 5

  11. [11]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...

  12. [12]

    Learning with 3d rotations, a hitch- hiker’s guide to SO(3)

    Andreas Ren ´e Geist, Jonas Frey, Mikel Zhobro, Anna Lev- ina, and Georg Martius. Learning with 3d rotations, a hitch- hiker’s guide to SO(3). InForty-first International Confer- ence on Machine Learning, 2024. 5

  13. [13]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

  14. [14]

    Rotation averaging.International journal of computer vision, 103(3):267–305, 2013

    Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging.International journal of computer vision, 103(3):267–305, 2013. 5

  15. [15]

    Initializing new word embeddings for pretrained language models.https : / / www

    John Hewitt. Initializing new word embeddings for pretrained language models.https : / / www . cs . columbia . edu /˜johnhew / vocab - expansion . html. 13

  16. [16]

    Huggingface transformers documen- tation.https : / / huggingface

    HuggingFace. Huggingface transformers documen- tation.https : / / huggingface . co / docs / transformers/en/main_classes/model. 13

  17. [17]

    Karamcheti, S

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic 9 vlms: Investigating the design space of visually-conditioned language models.arXiv preprint arXiv:2402.07865, 2024. 2, 3

  18. [18]

    Droid: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Bal- akrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. InRobotics: Science and Sys- tems, 2024. 6, 7, 8, 15, 16

  19. [19]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 2, 3, 5, 7, 8, 15, 16

  20. [20]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025. 3

  21. [21]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650,

  22. [22]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024. 5, 6, 7, 15

  23. [23]

    Flow matching for genera- tive modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for genera- tive modeling. InThe Eleventh International Conference on Learning Representations, 2023. 4

  24. [24]

    Flow Matching Guide and Code

    Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez- Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024. 4

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023. 3, 4

  26. [26]

    Visual instruction tuning.Advances in neural information processing systems, 36, 2024

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024. 2

  27. [27]

    Visual instruction tuning.Advances in neural information processing systems, 36, 2024

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024. 3

  28. [28]

    Rectified Flow: A Marginal Preserving Approach to Optimal Transport

    Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport.arXiv preprint arXiv:2209.14577, 2022. 4

  29. [29]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 13

  30. [30]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 12

  31. [31]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and...

  32. [32]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bew- ley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open x- embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023. 3, 6, 8, 15

  33. [33]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision- language-action models.arXiv preprint arXiv:2501.09747,

  34. [34]

    Introducing gemini 2.0: our new ai model for the agentic era

    Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. Introducing gemini 2.0: our new ai model for the agentic era. https : / / blog . google / technology / google - deepmind / google - gemini - ai - update - december-2024/. 3

  35. [35]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial represen- tations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025. 2, 3, 7, 12, 15, 16

  36. [36]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 4, 13

  37. [37]

    Grounded sam: Assembling open-world models for diverse visual tasks,

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks,

  38. [38]

    FLOWER: Democratizing generalist robot policies with efficient vision- language-action flow policies

    Moritz Reuss, Hongyi Zhou, Marcel R ¨uhle, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Otto, and Rudolf Lioutikov. FLOWER: Democratizing generalist robot policies with efficient vision- language-action flow policies. In7th Robot Learning Work- shop: Towards Robots with Human-Level Abilities, 2025. 15, 16

  39. [39]

    Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 15768–15780, 2025. 2

  40. [40]

    Generalist robot ma- 10 nipulation beyond action labeled data

    Alexander Spiridonov, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, and Danda Pani Paudel. Generalist robot ma- 10 nipulation beyond action labeled data. In9th Annual Con- ference on Robot Learning, 2025. 2, 15

  41. [41]

    PaliGemma 2: A Family of Versatile VLMs for Transfer

    Andreas Steiner, Andr ´e Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Grit- senko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer.arXiv preprint arXiv:2412.03555, 2024. 2

  42. [42]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi `ere, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and tech- nology.arXiv preprint arXiv:2403.08295, 2024. 3, 14

  43. [43]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Ta- tiana Matejovicova, Alexandre Ram ´e, Morgane Rivi `ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025. 2

  44. [44]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Are- nas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025. 3

  45. [45]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen- Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. InProceedings of the Conference on Robot Learning (CoRL), 2023. 4, 6, 7, 8, 13, 14, 15

  46. [46]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 2

  47. [47]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision.arXiv preprint arXiv:2410.19115, 2024

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision.arXiv preprint arXiv:2410.19115, 2024. 2, 3, 4, 13

  48. [48]

    Depth any- thing v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any- thing v2. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 3

  49. [49]

    Robotic control via em- bodied chain-of-thought reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via em- bodied chain-of-thought reasoning. In8th Annual Confer- ence on Robot Learning, 2024. 15

  50. [50]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 3

  51. [51]

    Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought rea- soning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025. 15, 16

  52. [52]

    Open3D: A Modern Library for 3D Data Processing

    Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3D: A modern library for 3D data processing.arXiv:1801.09847,

  53. [53]

    Sanketi, Grecia Salazar, Michael S

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski...

  54. [54]

    Appendix The appendix is organized as follows: • In Sec

    3 11 A. Appendix The appendix is organized as follows: • In Sec. A.1 we provide more details on the VLM pre- training including VQA tasks, encoder fusion strategies, 3D tokenization and data annotation pipeline. • In Sec. A.2 we provide more details on the VLA training including data mixture, architecture, flow matching and design decision ablation result...

  55. [55]

    Concatenating the visual features predicted by both en- coders and projecting them via a linear layer to the LLM embedding space. In particular, for SigLIP we take only the tokens at the last layer of the vision encoder, while for MoGe we take the tokens at the last 4 layers of the encoder, following the approach used by MoGe architec- ture to decode the ...

  56. [56]

    Using MoGe’s predicted 3D point cloudPin the camera ego pose (in an affine-invariant space) and adding them to the SigLIP encoder features, similar to SpatialVLA [35]. In particular, MoGe’s 3D point cloud outputP∈R H×W×3 is embedded toP ′ ∈R h×w×d through a projectorψ(·), composed of normaliza- tion, convolution, sinusoidal embeddingγ(x) = (x,sin(2 0πx),c...

  57. [57]

    Finally, the featuresF ′ =F+P ′ are fed to PaliGemma’s SigLIP linear projector, where F∈R h×w×d denotes the features at the SigLIP encoder output

    and an MLP. Finally, the featuresF ′ =F+P ′ are fed to PaliGemma’s SigLIP linear projector, where F∈R h×w×d denotes the features at the SigLIP encoder output. During our preliminary VLM evaluations we found the first strategy to demonstrate qualitatively better perfor- mance on bounding box prediction tasks. In particular, models trained with the second a...

  58. [58]

    Bridge V2, however, is not very diverse in the number of environments, objects, 15 and camera viewpoints

    and deploying on the WidowX robot in environments close to the training distribution. Bridge V2, however, is not very diverse in the number of environments, objects, 15 and camera viewpoints. As a result, we observe that models pre-trained on Bridge V2 only perform well on WidowX environments when the deployment scenario is similar to what is seen in the ...