pith. machine review for the scientific record.

arxiv: 2605.10485 · v1 · submitted 2026-05-11 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

Chengyu Bai, Chun-Kai Fan, Hao Wang, Jiajun Cao, Jian Tang, Jingyang He, Jintao Chen, Ming Lu, Shanghang Zhang, Shanyu Rong, Xiaobao Wei, Xiaozhu Ju, Ying Li

Pith reviewed 2026-05-12 05:03 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action · spatial grounding · visual encoder alignment · robotic manipulation · 3D awareness · implicit grounding · VLA models · DINOv2 features

The pith

VEGA aligns VLA visual encoders directly with 3D features before language mixing to improve spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language-action models lack accurate spatial awareness because their visual backbones are pretrained only on 2D images. It introduces VEGA to align the visual encoder output straight to features from a 3D-supervised DINOv2 model via a simple projector and cosine loss, before any language processing occurs. This early grounding produces clearer spatial representations than prior methods that align after features reach the language model. Sympathetic readers would care because precise 3D understanding is essential for robots to succeed at manipulation tasks like grasping and assembly in both simulation and the real world.

Core claim

VEGA is a framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. The alignment uses a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective and is discarded at inference time. By performing this alignment at the visual encoder level, before linguistic entanglement, VEGA provides a more interpretable and principled spatial target than existing implicit methods that operate on LLM-level tokens. Extensive experiments on simulation benchmarks and real-world manipulation tasks show that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state of the art among implicit grounding methods for VLA models.
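To make the machinery concrete, a minimal sketch of this training objective is given below, assuming patch-level features from both encoders and a single linear projector; the module names, tensor shapes, and weighting schedule are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a VEGA-style alignment objective (illustrative, not the paper's code).
# Assumes `vla_patch_feats` are VLA visual-encoder patch features of shape (B, N, D_vla)
# and `fit3d_patch_feats` are frozen DINOv2-FiT3D patch features of shape (B, N, D_fit3d).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentProjector(nn.Module):
    """Lightweight projector used only during training and discarded at inference."""
    def __init__(self, d_vla: int, d_fit3d: int):
        super().__init__()
        self.proj = nn.Linear(d_vla, d_fit3d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def vega_style_loss(action_loss: torch.Tensor,
                    vla_patch_feats: torch.Tensor,
                    fit3d_patch_feats: torch.Tensor,
                    projector: AlignmentProjector,
                    lam: float = 0.1) -> torch.Tensor:
    """Standard action-prediction loss plus a cosine-distance alignment term.

    lam = 0.1 mirrors the best coefficient reported in Figure 5; the exact
    weighting and projector architecture are assumptions here.
    """
    projected = projector(vla_patch_feats)                               # (B, N, D_fit3d)
    cos_sim = F.cosine_similarity(projected, fit3d_patch_feats.detach(), dim=-1)
    align_loss = (1.0 - cos_sim).mean()                                  # cosine distance
    return action_loss + lam * align_loss
```

Because the projector and the frozen FiT3D teacher enter only through this loss, dropping them at inference leaves the deployed policy unchanged, which is what the zero-overhead claim rests on.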

What carries the argument

A lightweight projector trained with cosine similarity loss to map VLA visual encoder outputs to DINOv2-FiT3D features before language model processing.

Load-bearing premise

That aligning visual encoder outputs directly with DINOv2-FiT3D features grounds spatial awareness before linguistic entanglement in a way that improves downstream action prediction more effectively than prior LLM-level alignments.

What would settle it

A head-to-head experiment on the same VLA backbone in which VEGA-aligned models showed no improvement over, or lower success rates than, LLM-level alignment baselines on spatial manipulation benchmarks would falsify the claimed advantage.
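A hedged sketch of how such a head-to-head comparison could be scored, assuming per-seed success rates for both methods on the same backbone and tasks; the numbers and the choice of a paired t-test are illustrative, not drawn from the paper.

```python
# Hypothetical scoring of encoder-level (VEGA-style) vs. LLM-level alignment on the
# same VLA backbone. Per-seed success rates below are placeholders, not reported results.
import numpy as np
from scipy import stats

vega_success = np.array([0.82, 0.79, 0.85, 0.81])        # illustrative per-seed success rates
llm_align_success = np.array([0.74, 0.77, 0.72, 0.75])   # matched baseline, same seeds/tasks

diff = vega_success - llm_align_success
t_stat, p_value = stats.ttest_rel(vega_success, llm_align_success)

print(f"mean improvement: {diff.mean():+.3f} ± {diff.std(ddof=1):.3f}")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
# The claimed advantage would be falsified if the mean improvement were near zero
# or negative, i.e., no significant gain over the LLM-level alignment baseline.
```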

Figures

Figures reproduced from arXiv: 2605.10485 by Chengyu Bai, Chun-Kai Fan, Hao Wang, Jiajun Cao, Jian Tang, Jingyang He, Jintao Chen, Ming Lu, Shanghang Zhang, Shanyu Rong, Xiaobao Wei, Xiaozhu Ju, Ying Li.

Figure 1
Figure 1: Comparison of spatial grounding paradigms for VLA models. (1) Explicit Spatial Grounding augments VLA inputs with estimated depth maps from monocular depth estimators (e.g., Depth Anything), introducing additional inference overhead and error propagation. (2) Implicit Spatial Grounding via LLM-level Token Alignment aligns visual features at the LLM token level, where spatial structure is entangled with lin… view at source ↗
Figure 2
Figure 2: DINOv2-FiT3D features exhibit stronger spatial structure and improve VLA manipulation performance. (a) PCA visualizations comparing DINOv2 and DINOv2-FiT3D patch features, where FiT3D produces more spatially consistent representations with cleaner object boundaries. (b) Success rate on Move Playingcard Away for four encoder variants, showing that DINOv2-FiT3D substantially outperforms DINOv2 regardless of… view at source ↗
Figure 3
Figure 3: Overview of the VEGA training framework. The frozen DINOv2-FiT3D encoder serves as a spatial teacher, supervising the DINOv2 branch of the VLA visual backbone via a lightweight projector. The alignment loss is computed as the cosine distance between projected DINOv2 features and FiT3D features, and is combined with the action prediction loss during training. At inference time, the projector and teacher enc… view at source ↗
Figure 4
Figure 4: VEGA improves both training and data efficiency over OpenVLA-OFT. (a) Success rate curves across training steps on Move Playingcard Away (Easy), showing that VEGA converges faster and reaches a higher performance ceiling. (b) Success rate under varying demonstration data fractions, demonstrating that VEGA consistently outperforms the base model across all data regimes. view at source ↗
Figure 5
Figure 5: Effect of alignment loss coefficient λ. Success rate (%) on RoboTwin 2.0 across six tasks under Easy and Hard settings with λ ∈ {0.05, 0.1, 0.2}, where λ = 0.1 achieves the best overall balance between spatial alignment and task learning. view at source ↗
Figure 6
Figure 6: Qualitative visualization of VEGA on real-world manipulation tasks. Keyframe sequences illustrating the execution of four tasks: Close Laptop, Handover Cucumber, Pick Dual Carrots into Dual Bowls, and Pick Dual Flowers into Vase. Each row shows the progression from the initial state to task completion. view at source ↗
Figure 7
Figure 7: Feature representation analysis across encoder variants. Each row shows a scene from the robotic manipulation domain. Columns 2–4 show PCA visualizations of patch-level features from DINOv2, DINOv2-FiT3D, and DINOv2-OpenVLA-7B, respectively. Column 5 shows the pairwise ARI matrix computed from KMeans clustering of patch features, measuring the consistency of spatial groupings across encoders. view at source ↗
Figure 8
Figure 8: Pretraining convergence curves. Training loss and action token accuracy for OpenVLA-OFT and the FiT3D variant pretrained on Bridge Dataset v2. Both variants converge at a similar rate and to comparable final values, indicating that initializing the DINOv2 branch from a FiT3D checkpoint introduces no instability or degradation to the pretraining process. view at source ↗
Figure 9
Figure 9: Qualitative visualization of VEGA on RoboTwin 2.0. Keyframe sequences illustrating the execution of all six bimanual manipulation tasks: Move Playingcard Away, Turn Switch, Click Bell, Beat Block, Lift Pot, and Place Shoes. Each row shows the progression from the initial state to task completion. view at source ↗
Figure 10
Figure 10: Representative failure cases in real-world manipulation tasks. (a) Spatial localization error. (b) Grasp-pose-induced placement failure. (c) Workspace boundary collision. view at source ↗
read the original abstract

Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and is discarded at inference time, introducing no additional computational overhead. Extensive experiments on simulation benchmark and real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state-of-the-art among implicit spatial grounding methods for VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes VEGA, a framework that directly aligns the output of a VLA model's visual encoder with spatially-aware features from DINOv2-FiT3D (a DINOv2 variant fine-tuned via multi-view 3D Gaussian Splatting) using a cosine similarity loss on a lightweight projector, trained jointly with the standard action prediction objective. Alignment occurs before linguistic tokens are formed to avoid entanglement with semantics; the projector is discarded at inference with no added cost. The paper claims this yields more interpretable spatial grounding than prior LLM-level implicit alignment methods and reports consistent outperformance on simulation benchmarks and real-world manipulation tasks, establishing a new SOTA among implicit spatial grounding approaches for VLA models.

Significance. If the superiority holds under rigorous controls, VEGA offers a lightweight, inference-free way to inject explicit 3D geometric supervision into VLA visual backbones, which could meaningfully advance spatial reasoning for robotic manipulation. The choice of a 3D-supervised target and the emphasis on pre-entanglement alignment are conceptually clean; the absence of inference overhead is a practical strength.

major comments (3)
  1. [Abstract] Abstract: the claim of 'consistent outperformance' and 'new state-of-the-art' among implicit spatial grounding methods is presented without any mention of statistical significance, error bars, data splits, or exact baseline implementations, rendering the central empirical claim unverifiable from the provided information.
  2. [Method] Method section (alignment procedure): the paper asserts that performing alignment at the visual-encoder output (before linguistic entanglement) is the decisive factor for improved spatial awareness, yet no ablation applies the identical DINOv2-FiT3D target and cosine-similarity loss at the LLM visual-token stage. Without this matched control, performance gains cannot be attributed specifically to the timing of alignment rather than to the quality of the 3D-supervised features or the projector acting as a regularizer.
  3. [Experiments] Experiments section: the absence of an ablation isolating the pre-entanglement property (as opposed to the specific target or training setup) is load-bearing for the main contribution, because the skeptic's concern that gains may arise from non-timing factors is not addressed by the reported comparisons.
minor comments (2)
  1. [Method] Clarify the precise architecture of the lightweight projector (number of layers, hidden dimension) and whether it is frozen or jointly optimized with the visual encoder.
  2. [Experiments] Ensure all tables reporting quantitative results include standard deviations or confidence intervals and specify the number of random seeds or trials.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have revised the manuscript to improve the verifiability of our empirical claims and to provide additional justification for the design choices regarding alignment timing. Our point-by-point responses to the major comments are provided below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'consistent outperformance' and 'new state-of-the-art' among implicit spatial grounding methods is presented without any mention of statistical significance, error bars, data splits, or exact baseline implementations, rendering the central empirical claim unverifiable from the provided information.

    Authors: We agree that the original abstract lacked sufficient context for the empirical claims. In the revised manuscript, we have updated the abstract to reference the evaluation protocol (results averaged over multiple seeds with standard deviations reported in the main text) and direct readers to the experiments section for data splits and baseline details. Error bars have been added to all relevant figures, and we now report statistical significance where applicable. revision: yes

  2. Referee: [Method] Method section (alignment procedure): the paper asserts that performing alignment at the visual-encoder output (before linguistic entanglement) is the decisive factor for improved spatial awareness, yet no ablation applies the identical DINOv2-FiT3D target and cosine-similarity loss at the LLM visual-token stage. Without this matched control, performance gains cannot be attributed specifically to the timing of alignment rather than to the quality of the 3D-supervised features or the projector acting as a regularizer.

    Authors: This is a valid concern regarding attribution. We have expanded the method section in the revision with a new discussion clarifying the conceptual motivation for pre-entanglement alignment (preserving spatial structure prior to semantic mixing) and explaining why a post-tokenization application of the identical target would not serve as a clean control due to intervening language model layers. We compare against existing post-entanglement baselines and acknowledge that a perfectly matched new ablation would require substantial additional compute; this is noted as a limitation and direction for future work. revision: partial

  3. Referee: [Experiments] Experiments section: the absence of an ablation isolating the pre-entanglement property (as opposed to the specific target or training setup) is load-bearing for the main contribution, because the skeptic's concern that gains may arise from non-timing factors is not addressed by the reported comparisons.

    Authors: We recognize that stronger isolation of the timing effect would bolster the core contribution. The revised experiments section now includes additional feature analysis (e.g., spatial similarity metrics at encoder vs. token stages) and visualizations to demonstrate that geometric information is better retained before linguistic entanglement. We have also clarified how the reported baselines function as controls for post-entanglement methods, while discussing potential confounding factors. A fully isolated ablation with identical target and setup is computationally demanding and is flagged for future investigation. revision: partial
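For readers who want to reproduce the kind of feature-consistency analysis mentioned in these responses (and shown as pairwise ARI matrices in Figure 7), a minimal sketch follows; the feature arrays, cluster count, and clustering settings are assumptions, not the paper's exact protocol.

```python
# Sketch of a feature-consistency check: cluster patch features from two encoders with
# KMeans and compare the groupings via adjusted Rand index (ARI). Inputs are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def patch_ari(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 8, seed: int = 0) -> float:
    """feats_a, feats_b: (num_patches, dim) features of the same image from two encoders."""
    labels_a = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(feats_a)
    labels_b = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(feats_b)
    return adjusted_rand_score(labels_a, labels_b)

# Placeholder arrays standing in for, e.g., DINOv2 vs. DINOv2-FiT3D patch features.
rng = np.random.default_rng(0)
ari = patch_ari(rng.normal(size=(256, 768)), rng.normal(size=(256, 768)))
print(f"ARI between encoder clusterings: {ari:.3f}")
```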

Circularity Check

0 steps flagged

No circularity in the alignment training procedure

full rationale

The paper proposes VEGA as a training framework that adds a cosine similarity loss between the VLA visual encoder outputs and external DINOv2-FiT3D features, optimized jointly with the standard action prediction objective. This is a conventional multi-task loss setup with no self-referential definitions, no fitted parameters renamed as predictions, and no load-bearing self-citations that collapse the central claim. The timing of alignment (pre-linguistic entanglement) is an explicit design choice justified by the method description and empirical results on benchmarks, not by construction from the inputs. The derivation chain is self-contained as an empirical method proposal.
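In symbols, the joint objective described in this rationale takes roughly the following form (notation ours, assuming a simple weighted sum of the action loss and a patch-wise cosine distance; the paper's exact formulation may differ):

```latex
% f_i: VLA visual-encoder patch feature   g_i: frozen DINOv2-FiT3D patch feature
% P: lightweight projector                \lambda: alignment coefficient (e.g., 0.1 per Figure 5)
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{action}}
  + \lambda \cdot \frac{1}{N} \sum_{i=1}^{N}
    \left( 1 - \frac{P(f_i)^{\top} g_i}{\lVert P(f_i) \rVert \, \lVert g_i \rVert} \right)
```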

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the quality of the external DINOv2-FiT3D model and the benefit of early alignment; no new entities are postulated and no ad-hoc fitted parameters beyond standard training are introduced.

axioms (2)
  • domain assumption Cosine similarity loss is suitable for aligning visual feature representations from different models
    Used as the alignment objective alongside the action prediction loss.
  • domain assumption DINOv2 fine-tuned with multi-view consistent 3D Gaussian Splatting supervision yields spatially-aware features superior for grounding
    This is the chosen alignment target and the basis for claiming improved geometric interpretability.

pith-pipeline@v0.9.0 · 5593 in / 1397 out tokens · 65289 ms · 2026-05-12T05:03:44.739803+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 10 internal anchors

  1. [1]

    3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks

    Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, and Farshad Khorrami. 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800, 2025

  2. [2]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  3. [3]

    π0.5: A vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: A vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025

  4. [4]

    π0: A vision-language-action flow model for general robot control, 2026

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  5. [5]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  6. [6]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  7. [7]

    Knowledge distillation with the reused teacher classifier

    Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, and Chun Chen. Knowledge distillation with the reused teacher classifier. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11933–11942, 2022

  8. [8]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  9. [9]

    Pali-3 vision language models: Smaller, faster, stronger

    Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  11. [11]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023

  12. [12]

    Rvt: Robotic view transformer for 3d object manipulation

    Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. InConference on Robot Learning, pages 694–710. PMLR, 2023

  13. [13]

    Glad: Geometric latent distillation for vision-language-action models

    Minghao Guo, Meng Cao, Jiachen Tao, Rongtao Xu, Yan Yan, Xiaodan Liang, Ivan Laptev, and Xiaojun Chang. Glad: Geometric latent distillation for vision-language-action models. arXiv preprint arXiv:2512.09619, 2025

  14. [14]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  15. [15]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. arXiv preprint arXiv:2311.12871, 2023

  16. [16]

    Mllms need 3d-aware representation supervision for scene understanding

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. arXiv e-prints, pages arXiv–2506, 2025

  17. [17]

    What’s “up” with vision-language models? investigating their struggle with spatial reasoning

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9161–9175, 2023

  18. [18]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. InForty-first International Conference on Machine Learning, 2024

  19. [19]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  20. [20]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  21. [21]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  22. [22]

    A review of robot learning for manipulation: Challenges, representations, and algorithms

    Oliver Kroemer, Scott Niekum, and George Konidaris. A review of robot learning for manipulation: Challenges, representations, and algorithms. Journal of machine learning research, 22(30):1–82, 2021

  23. [23]

    A review of spatial reasoning and interaction for real-world robotics.Advanced Robotics, 31(5):222–242, 2017

    Christian Landsiedel, Verena Rieser, Matthew Walter, and Dirk Wollherr. A review of spatial reasoning and interaction for real-world robotics.Advanced Robotics, 31(5):222–242, 2017

  24. [24]

    Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026

    Chengmeng Li, Junjie Wen, Yaxin Peng, Yan Peng, and Yichen Zhu. Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026

  25. [25]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

  26. [26]

    Evo-0: Vision-language-action model with implicit spatial understanding.arXiv preprint arXiv:2507.00416, 2025

    Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. Evo-0: Vision-language-action model with implicit spatial understanding.arXiv preprint arXiv:2507.00416, 2025

  27. [27]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  28. [28]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  29. [29]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  30. [30]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  31. [31]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  32. [32]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  33. [33]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning, pages 785–799. PMLR, 2023

  34. [34]

    Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15768– 15780, 2025

  35. [35]

    Rocket: Residual-oriented multi-layer alignment for spatially-aware vision-language-action models

    Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, and Ang Li. Rocket: Residual-oriented multi-layer alignment for spatially-aware vision-language-action models. arXiv preprint arXiv:2602.17951, 2026

  36. [36]

    Geovla: Empowering 3d representations in vision-language-action models

    Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, and Jiale Cao. Geovla: Empowering 3d representations in vision-language-action models. arXiv preprint arXiv:2508.09071, 2025

  37. [37]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  38. [38]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

  39. [39]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  40. [40]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10371–10381, 2024

  41. [41]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024

  42. [42]

    Scannet++: A high- fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high- fidelity dataset of 3d indoor scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  43. [43]

    Robopoint: A vision-language model for spatial affordance prediction for robotics

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

  44. [44]

    Improving 2d feature representations by 3d-aware fine-tuning

    Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2d feature representations by 3d-aware fine-tuning. InEuropean Conference on Computer Vision, pages 57–74. Springer, 2024

  45. [45]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  46. [46]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

  47. [47]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  48. [48]

    This task requires precise spatial perception to locate the screen and hinge, as well as smooth and controlled motion to avoid damaging the articulated structure during contact

    Close Laptop.The robot uses a single arm to close an open laptop screen. This task requires precise spatial perception to locate the screen and hinge, as well as smooth and controlled motion to avoid damaging the articulated structure during contact

  49. [49]

    This task requires accurate object localization and a smooth transfer trajectory to ensure stable grasping and precise placement without dropping the object

    Handover Cucumber.The robot grasps a cucumber from one plate and places it onto another using a single arm. This task requires accurate object localization and a smooth transfer trajectory to ensure stable grasping and precise placement without dropping the object

  50. [50]

    Pick Dual Carrots into Dual Bowls.Each arm simultaneously grasps a carrot and places it into the nearest corresponding bowl. This bimanual task requires synchronized motion planning and spatial reasoning to correctly associate each carrot with its target bowl, while executing two independent manipulation sequences in parallel without inter-arm interference

  51. [51]

    Pick Dual Flowers into Vase. Each arm independently grasps a flower and inserts it into its corresponding vase. This bimanual task requires coordinated motion planning and spatial reasoning, as the robot must simultaneously manage two independent manipulation sequences while avoiding inter-arm interference.