pith. machine review for the scientific record.

arxiv: 2605.10485 · v1 · submitted 2026-05-11 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

Chengyu Bai, Chun-Kai Fan, Hao Wang, Jiajun Cao, Jian Tang, Jingyang He, Jintao Chen, Ming Lu, Shanghang Zhang, Shanyu Rong, Xiaobao Wei, Xiaozhu Ju, Ying Li

Pith reviewed 2026-05-12 05:03 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action · spatial grounding · visual encoder alignment · robotic manipulation · 3D awareness · implicit grounding · VLA models · DINOv2 features

The pith

VEGA aligns VLA visual encoders directly with 3D features before language mixing to improve spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language-action models lack accurate spatial awareness because their visual backbones are pretrained only on 2D images. It introduces VEGA to align the visual encoder output straight to features from a 3D-supervised DINOv2 model via a simple projector and cosine loss, before any language processing occurs. This early grounding produces clearer spatial representations than prior methods that align after features reach the language model. Sympathetic readers would care because precise 3D understanding is essential for robots to succeed at manipulation tasks like grasping and assembly in both simulation and the real world.

Core claim

VEGA is a framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. The alignment uses a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective and is discarded at inference time. By performing this alignment at the visual encoder level, before linguistic entanglement, VEGA provides a more interpretable and principled spatial target than existing implicit methods that operate on LLM-level tokens. Extensive experiments on simulation benchmarks and real-world manipulation tasks show that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state of the art among implicit grounding methods for VLA models.
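To make the machinery concrete, a minimal sketch of this training objective is given below, assuming patch-level features from both encoders and a single linear projector; the module names, tensor shapes, and weighting schedule are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a VEGA-style alignment objective (illustrative, not the paper's code).
# Assumes `vla_patch_feats` are VLA visual-encoder patch features of shape (B, N, D_vla)
# and `fit3d_patch_feats` are frozen DINOv2-FiT3D patch features of shape (B, N, D_fit3d).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentProjector(nn.Module):
    """Lightweight projector used only during training and discarded at inference."""
    def __init__(self, d_vla: int, d_fit3d: int):
        super().__init__()
        self.proj = nn.Linear(d_vla, d_fit3d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def vega_style_loss(action_loss: torch.Tensor,
                    vla_patch_feats: torch.Tensor,
                    fit3d_patch_feats: torch.Tensor,
                    projector: AlignmentProjector,
                    lam: float = 0.1) -> torch.Tensor:
    """Standard action-prediction loss plus a cosine-distance alignment term.

    lam = 0.1 mirrors the best coefficient reported in Figure 5; the exact
    weighting and projector architecture are assumptions here.
    """
    projected = projector(vla_patch_feats)                               # (B, N, D_fit3d)
    cos_sim = F.cosine_similarity(projected, fit3d_patch_feats.detach(), dim=-1)
    align_loss = (1.0 - cos_sim).mean()                                  # cosine distance
    return action_loss + lam * align_loss
```

Because the projector and the frozen FiT3D teacher enter only through this loss, dropping them at inference leaves the deployed policy unchanged, which is what the zero-overhead claim rests on.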

What carries the argument

A lightweight projector trained with cosine similarity loss to map VLA visual encoder outputs to DINOv2-FiT3D features before language model processing.

Load-bearing premise

That aligning visual encoder outputs directly with DINOv2-FiT3D features grounds spatial awareness before linguistic entanglement in a way that improves downstream action prediction more effectively than prior LLM-level alignments.

What would settle it

A head-to-head experiment on the same VLA backbone in which VEGA-aligned models showed no improvement over, or lower success rates than, LLM-level alignment baselines on spatial manipulation benchmarks would falsify the claimed advantage.
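A hedged sketch of how such a head-to-head comparison could be scored, assuming per-seed success rates for both methods on the same backbone and tasks; the numbers and the choice of a paired t-test are illustrative, not drawn from the paper.

```python
# Hypothetical scoring of encoder-level (VEGA-style) vs. LLM-level alignment on the
# same VLA backbone. Per-seed success rates below are placeholders, not reported results.
import numpy as np
from scipy import stats

vega_success = np.array([0.82, 0.79, 0.85, 0.81])        # illustrative per-seed success rates
llm_align_success = np.array([0.74, 0.77, 0.72, 0.75])   # matched baseline, same seeds/tasks

diff = vega_success - llm_align_success
t_stat, p_value = stats.ttest_rel(vega_success, llm_align_success)

print(f"mean improvement: {diff.mean():+.3f} ± {diff.std(ddof=1):.3f}")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
# The claimed advantage would be falsified if the mean improvement were near zero
# or negative, i.e., no significant gain over the LLM-level alignment baseline.
```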

Figures

Figures reproduced from arXiv: 2605.10485 by Chengyu Bai, Chun-Kai Fan, Hao Wang, Jiajun Cao, Jian Tang, Jingyang He, Jintao Chen, Ming Lu, Shanghang Zhang, Shanyu Rong, Xiaobao Wei, Xiaozhu Ju, Ying Li.

Figure 1
Figure 1: Comparison of spatial grounding paradigms for VLA models. (1) Explicit Spatial Grounding augments VLA inputs with estimated depth maps from monocular depth estimators (e.g., Depth Anything), introducing additional inference overhead and error propagation. (2) Implicit Spatial Grounding via LLM-level Token Alignment aligns visual features at the LLM token level, where spatial structure is entangled with lin… view at source ↗
Figure 2
Figure 2: DINOv2-FiT3D features exhibit stronger spatial structure and improve VLA manipulation performance. (a) PCA visualizations comparing DINOv2 and DINOv2-FiT3D patch features, where FiT3D produces more spatially consistent representations with cleaner object boundaries. (b) Success rate on Move Playingcard Away for four encoder variants, showing that DINOv2-FiT3D substantially outperforms DINOv2 regardless of… view at source ↗
Figure 3
Figure 3: Overview of the VEGA training framework. The frozen DINOv2-FiT3D encoder serves as a spatial teacher, supervising the DINOv2 branch of the VLA visual backbone via a lightweight projector. The alignment loss is computed as the cosine distance between projected DINOv2 features and FiT3D features, and is combined with the action prediction loss during training. At inference time, the projector and teacher enc… view at source ↗
Figure 4
Figure 4: VEGA improves both training and data efficiency over OpenVLA-OFT. (a) Success rate curves across training steps on Move Playingcard Away (Easy), showing that VEGA converges faster and reaches a higher performance ceiling. (b) Success rate under varying demonstration data fractions, demonstrating that VEGA consistently outperforms the base model across all data regimes. view at source ↗
Figure 5
Figure 5: Effect of alignment loss coefficient λ. Success rate (%) on RoboTwin 2.0 across six tasks under Easy and Hard settings with λ ∈ {0.05, 0.1, 0.2}, where λ = 0.1 achieves the best overall balance between spatial alignment and task learning. view at source ↗
Figure 6
Figure 6: Qualitative visualization of VEGA on real-world manipulation tasks. Keyframe sequences illustrating the execution of four tasks: Close Laptop, Handover Cucumber, Pick Dual Carrots into Dual Bowls, and Pick Dual Flowers into Vase. Each row shows the progression from the initial state to task completion. view at source ↗
Figure 7
Figure 7: Feature representation analysis across encoder variants. Each row shows a scene from the robotic manipulation domain. Columns 2–4 show PCA visualizations of patch-level features from DINOv2, DINOv2-FiT3D, and DINOv2-OpenVLA-7B, respectively. Column 5 shows the pairwise ARI matrix computed from KMeans clustering of patch features, measuring the consistency of spatial groupings across encoders. view at source ↗
Figure 8
Figure 8: Pretraining convergence curves. Training loss and action token accuracy for OpenVLA-OFT and the FiT3D variant pretrained on Bridge Dataset v2. Both variants converge at a similar rate and to comparable final values, indicating that initializing the DINOv2 branch from a FiT3D checkpoint introduces no instability or degradation to the pretraining process. view at source ↗
Figure 9
Figure 9: Qualitative visualization of VEGA on RoboTwin 2.0. Keyframe sequences illustrating the execution of all six bimanual manipulation tasks: Move Playingcard Away, Turn Switch, Click Bell, Beat Block, Lift Pot, and Place Shoes. Each row shows the progression from the initial state to task completion. view at source ↗
Figure 10
Figure 10: Representative failure cases in real-world manipulation tasks. (a) Spatial localization error. (b) Grasp-pose-induced placement failure. (c) Workspace boundary collision. view at source ↗
read the original abstract

Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and is discarded at inference time, introducing no additional computational overhead. Extensive experiments on simulation benchmark and real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state-of-the-art among implicit spatial grounding methods for VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes VEGA, a framework that directly aligns the output of a VLA model's visual encoder with spatially-aware features from DINOv2-FiT3D (a DINOv2 variant fine-tuned via multi-view 3D Gaussian Splatting) using a cosine similarity loss on a lightweight projector, trained jointly with the standard action prediction objective. Alignment occurs before linguistic tokens are formed to avoid entanglement with semantics; the projector is discarded at inference with no added cost. The paper claims this yields more interpretable spatial grounding than prior LLM-level implicit alignment methods and reports consistent outperformance on simulation benchmarks and real-world manipulation tasks, establishing a new SOTA among implicit spatial grounding approaches for VLA models.

Significance. If the superiority holds under rigorous controls, VEGA offers a lightweight, inference-free way to inject explicit 3D geometric supervision into VLA visual backbones, which could meaningfully advance spatial reasoning for robotic manipulation. The choice of a 3D-supervised target and the emphasis on pre-entanglement alignment are conceptually clean; the absence of inference overhead is a practical strength.

major comments (3)
  1. [Abstract] Abstract: the claim of 'consistent outperformance' and 'new state-of-the-art' among implicit spatial grounding methods is presented without any mention of statistical significance, error bars, data splits, or exact baseline implementations, rendering the central empirical claim unverifiable from the provided information.
  2. [Method] Method section (alignment procedure): the paper asserts that performing alignment at the visual-encoder output (before linguistic entanglement) is the decisive factor for improved spatial awareness, yet no ablation applies the identical DINOv2-FiT3D target and cosine-similarity loss at the LLM visual-token stage. Without this matched control, performance gains cannot be attributed specifically to the timing of alignment rather than to the quality of the 3D-supervised features or the projector acting as a regularizer.
  3. [Experiments] Experiments section: the absence of an ablation isolating the pre-entanglement property (as opposed to the specific target or training setup) is load-bearing for the main contribution, because the skeptic's concern that gains may arise from non-timing factors is not addressed by the reported comparisons.
minor comments (2)
  1. [Method] Clarify the precise architecture of the lightweight projector (number of layers, hidden dimension) and whether it is frozen or jointly optimized with the visual encoder.
  2. [Experiments] Ensure all tables reporting quantitative results include standard deviations or confidence intervals and specify the number of random seeds or trials.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have revised the manuscript to improve the verifiability of our empirical claims and to provide additional justification for the design choices regarding alignment timing. Our point-by-point responses to the major comments are provided below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'consistent outperformance' and 'new state-of-the-art' among implicit spatial grounding methods is presented without any mention of statistical significance, error bars, data splits, or exact baseline implementations, rendering the central empirical claim unverifiable from the provided information.

    Authors: We agree that the original abstract lacked sufficient context for the empirical claims. In the revised manuscript, we have updated the abstract to reference the evaluation protocol (results averaged over multiple seeds with standard deviations reported in the main text) and direct readers to the experiments section for data splits and baseline details. Error bars have been added to all relevant figures, and we now report statistical significance where applicable. revision: yes

  2. Referee: [Method] Method section (alignment procedure): the paper asserts that performing alignment at the visual-encoder output (before linguistic entanglement) is the decisive factor for improved spatial awareness, yet no ablation applies the identical DINOv2-FiT3D target and cosine-similarity loss at the LLM visual-token stage. Without this matched control, performance gains cannot be attributed specifically to the timing of alignment rather than to the quality of the 3D-supervised features or the projector acting as a regularizer.

    Authors: This is a valid concern regarding attribution. We have expanded the method section in the revision with a new discussion clarifying the conceptual motivation for pre-entanglement alignment (preserving spatial structure prior to semantic mixing) and explaining why a post-tokenization application of the identical target would not serve as a clean control due to intervening language model layers. We compare against existing post-entanglement baselines and acknowledge that a perfectly matched new ablation would require substantial additional compute; this is noted as a limitation and direction for future work. revision: partial

  3. Referee: [Experiments] Experiments section: the absence of an ablation isolating the pre-entanglement property (as opposed to the specific target or training setup) is load-bearing for the main contribution, because the skeptic's concern that gains may arise from non-timing factors is not addressed by the reported comparisons.

    Authors: We recognize that stronger isolation of the timing effect would bolster the core contribution. The revised experiments section now includes additional feature analysis (e.g., spatial similarity metrics at encoder vs. token stages) and visualizations to demonstrate that geometric information is better retained before linguistic entanglement. We have also clarified how the reported baselines function as controls for post-entanglement methods, while discussing potential confounding factors. A fully isolated ablation with identical target and setup is computationally demanding and is flagged for future investigation. revision: partial
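For readers who want to reproduce the kind of feature-consistency analysis mentioned in these responses (and shown as pairwise ARI matrices in Figure 7), a minimal sketch follows; the feature arrays, cluster count, and clustering settings are assumptions, not the paper's exact protocol.

```python
# Sketch of a feature-consistency check: cluster patch features from two encoders with
# KMeans and compare the groupings via adjusted Rand index (ARI). Inputs are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def patch_ari(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 8, seed: int = 0) -> float:
    """feats_a, feats_b: (num_patches, dim) features of the same image from two encoders."""
    labels_a = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(feats_a)
    labels_b = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(feats_b)
    return adjusted_rand_score(labels_a, labels_b)

# Placeholder arrays standing in for, e.g., DINOv2 vs. DINOv2-FiT3D patch features.
rng = np.random.default_rng(0)
ari = patch_ari(rng.normal(size=(256, 768)), rng.normal(size=(256, 768)))
print(f"ARI between encoder clusterings: {ari:.3f}")
```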

Circularity Check

0 steps flagged

No circularity in the alignment training procedure

full rationale

The paper proposes VEGA as a training framework that adds a cosine similarity loss between the VLA visual encoder outputs and external DINOv2-FiT3D features, optimized jointly with the standard action prediction objective. This is a conventional multi-task loss setup with no self-referential definitions, no fitted parameters renamed as predictions, and no load-bearing self-citations that collapse the central claim. The timing of alignment (pre-linguistic entanglement) is an explicit design choice justified by the method description and empirical results on benchmarks, not by construction from the inputs. The derivation chain is self-contained as an empirical method proposal.
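In symbols, the joint objective described in this rationale takes roughly the following form (notation ours, assuming a simple weighted sum of the action loss and a patch-wise cosine distance; the paper's exact formulation may differ):

```latex
% f_i: VLA visual-encoder patch feature   g_i: frozen DINOv2-FiT3D patch feature
% P: lightweight projector                \lambda: alignment coefficient (e.g., 0.1 per Figure 5)
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{action}}
  + \lambda \cdot \frac{1}{N} \sum_{i=1}^{N}
    \left( 1 - \frac{P(f_i)^{\top} g_i}{\lVert P(f_i) \rVert \, \lVert g_i \rVert} \right)
```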

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the quality of the external DINOv2-FiT3D model and the benefit of early alignment; no new entities are postulated and no ad-hoc fitted parameters beyond standard training are introduced.

axioms (2)
  • domain assumption Cosine similarity loss is suitable for aligning visual feature representations from different models
    Used as the alignment objective alongside the action prediction loss.
  • domain assumption DINOv2 fine-tuned with multi-view consistent 3D Gaussian Splatting supervision yields spatially-aware features superior for grounding
    This is the chosen alignment target and the basis for claiming improved geometric interpretability.

pith-pipeline@v0.9.0 · 5593 in / 1397 out tokens · 65289 ms · 2026-05-12T05:03:44.739803+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 10 internal anchors

  1. [1]

    3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks

    Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, and Farshad Khorrami. 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800, 2025

  2. [2]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  3. [3]

    π0.5: A vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: A vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025

  4. [4]

    π0: A vision-language-action flow model for general robot control, 2026

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  5. [5]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  6. [6]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  7. [7]

    Knowledge distillation with the reused teacher classifier

    Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, and Chun Chen. Knowledge distillation with the reused teacher classifier. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11933–11942, 2022

  8. [8]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  9. [9]

    Pali-3 vision language models: Smaller, faster, stronger

    Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  11. [11]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023

  12. [12]

    Rvt: Robotic view transformer for 3d object manipulation

    Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. InConference on Robot Learning, pages 694–710. PMLR, 2023

  13. [13]

    Glad: Geometric latent distillation for vision-language-action models

    Minghao Guo, Meng Cao, Jiachen Tao, Rongtao Xu, Yan Yan, Xiaodan Liang, Ivan Laptev, and Xiaojun Chang. Glad: Geometric latent distillation for vision-language-action models. arXiv preprint arXiv:2512.09619, 2025

  14. [14]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  15. [15]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. arXiv preprint arXiv:2311.12871, 2023

  16. [16]

    Mllms need 3d-aware representation supervision for scene understanding

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. arXiv e-prints, pages arXiv–2506, 2025

  17. [17]

    What’s “up” with vision-language models? investigating their struggle with spatial reasoning

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9161–9175, 2023

  18. [18]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. InForty-first International Conference on Machine Learning, 2024

  19. [19]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  20. [20]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  21. [21]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  22. [22]

    A review of robot learning for manipulation: Challenges, representations, and algorithms

    Oliver Kroemer, Scott Niekum, and George Konidaris. A review of robot learning for manipulation: Challenges, representations, and algorithms. Journal of machine learning research, 22(30):1–82, 2021

  23. [23]

    A review of spatial reasoning and interaction for real-world robotics.Advanced Robotics, 31(5):222–242, 2017

    Christian Landsiedel, Verena Rieser, Matthew Walter, and Dirk Wollherr. A review of spatial reasoning and interaction for real-world robotics.Advanced Robotics, 31(5):222–242, 2017

  24. [24]

    Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026

    Chengmeng Li, Junjie Wen, Yaxin Peng, Yan Peng, and Yichen Zhu. Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026

  25. [25]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

  26. [26]

    Evo-0: Vision-language-action model with implicit spatial understanding.arXiv preprint arXiv:2507.00416, 2025

    Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. Evo-0: Vision-language-action model with implicit spatial understanding.arXiv preprint arXiv:2507.00416, 2025

  27. [27]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  28. [28]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  29. [29]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  30. [30]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  31. [31]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  32. [32]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  33. [33]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning, pages 785–799. PMLR, 2023

  34. [34]

    Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15768– 15780, 2025

  35. [35]

    Rocket: Residual-oriented multi-layer alignment for spatially-aware vision-language-action models

    Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, and Ang Li. Rocket: Residual-oriented multi-layer alignment for spatially-aware vision-language-action models. arXiv preprint arXiv:2602.17951, 2026

  36. [36]

    Geovla: Empowering 3d representations in vision-language-action models

    Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, and Jiale Cao. Geovla: Empowering 3d representations in vision-language-action models. arXiv preprint arXiv:2508.09071, 2025

  37. [37]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  38. [38]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

  39. [39]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  40. [40]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10371–10381, 2024

  41. [41]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024

  42. [42]

    Scannet++: A high- fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high- fidelity dataset of 3d indoor scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  43. [43]

    Robopoint: A vision-language model for spatial affordance prediction for robotics

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

  44. [44]

    Improving 2d feature representations by 3d-aware fine-tuning

    Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2d feature representations by 3d-aware fine-tuning. InEuropean Conference on Computer Vision, pages 57–74. Springer, 2024

  45. [45]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  46. [46]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

  47. [47]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  48. [48]

    This task requires precise spatial perception to locate the screen and hinge, as well as smooth and controlled motion to avoid damaging the articulated structure during contact

    Close Laptop.The robot uses a single arm to close an open laptop screen. This task requires precise spatial perception to locate the screen and hinge, as well as smooth and controlled motion to avoid damaging the articulated structure during contact

  49. [49]

    This task requires accurate object localization and a smooth transfer trajectory to ensure stable grasping and precise placement without dropping the object

    Handover Cucumber.The robot grasps a cucumber from one plate and places it onto another using a single arm. This task requires accurate object localization and a smooth transfer trajectory to ensure stable grasping and precise placement without dropping the object

  50. [50]

    Pick Dual Carrots into Dual Bowls.Each arm simultaneously grasps a carrot and places it into the nearest corresponding bowl. This bimanual task requires synchronized motion planning and spatial reasoning to correctly associate each carrot with its target bowl, while executing two independent manipulation sequences in parallel without inter-arm interference

  51. [51]

    Pick Dual Flowers into Vase. Each arm independently grasps a flower and inserts it into its corresponding vase. This bimanual task requires coordinated motion planning and spatial reasoning, as the robot must simultaneously manage two independent manipulation sequences while avoiding inter-arm interference.