pith. sign in

arxiv: 2606.08530 · v2 · pith:CX2PNCG3new · submitted 2026-06-07 · 💻 cs.RO · cs.AI

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Pith reviewed 2026-06-27 18:27 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords actiongear-vlaembodimentgeometry-awarelearningmanipulationrepresentationbenchmark
0
0 comments X

The pith

GEAR-VLA learns unified geometry-aware action representations through coarse-to-fine learning, 3D alignment, and embodiment canonicalization to generalize across unseen objects and robot bodies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models achieve strong benchmark results but fail when objects, backgrounds, or robot embodiments change because they lack a shared geometry-aware representation of manipulation actions. GEAR-VLA addresses this by first pretraining a VLM on multi-source embodied data for discrete action understanding, then connecting latent action tokens to a continuous DiT expert, while aligning a trainable 3D spatial backbone to the frozen VLM visual features and confining robot differences to low-level states through embodiment canonicalization. If these steps succeed, the same higher-level action representation can transfer to new robots and novel items without retraining the core model. A sympathetic reader would care because this targets the specific gap between controlled benchmarks and practical deployment where current VLAs remain brittle. The framework keeps geometry at the center so that actions become embodiment-invariant while still respecting spatial structure.

Core claim

GEAR-VLA shows that a unified geometry-aware action representation arises when coarse-to-fine action learning equips the VLM with embodied reasoning before latent tokens link semantics to a gradient-decoupled DiT continuous expert, when semantic-aligned 3D integration trains a 3D backbone while freezing the original VLM visual pathway, and when embodiment canonicalization uses embodiment-aware states together with embodiment-invariant actions; these elements together produce representations that achieve state-of-the-art results on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, plus 85.9 percent success on AgileX, 81.0 percent on the unseen LDT-01 embodiment, and 90.1 percent on a 6,360-tri

What carries the argument

The GEAR-VLA framework, which combines coarse-to-fine action learning, semantic-aligned 3D integration, and embodiment canonicalization to produce unified geometry-aware action representations.

If this is right

  • The same representation transfers to zero-shot evaluation on LIBERO-Plus without further training.
  • Real-world success reaches 85.9 percent on AgileX and 81.0 percent on a previously unseen robot embodiment.
  • A single model attains 90.1 percent success across 212 novel objects in a large-scale grasping test.
  • Simulation benchmarks such as LIBERO and RoboTwin 2.0 reach state-of-the-art levels under the new representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of invariant actions from embodiment-specific states may allow pooling of demonstration data across heterogeneous robot fleets more efficiently than current approaches.
  • Freezing the original VLM visual pathway while training only the 3D backbone suggests a route to add spatial awareness to existing large models without full retraining.
  • If the geometry-aware tokens remain effective, the same coarse-to-fine structure could extend to manipulation tasks that require contact-rich or deformable-object handling.
  • The canonicalization step may reduce the need for embodiment-specific fine-tuning when deploying the model on new hardware.
  • keywords:[

Load-bearing premise

The central performance gains come from the geometry-aware components rather than from pretraining data volume or base VLM choice.

What would settle it

An ablation that removes either the semantic-aligned 3D integration or the embodiment canonicalization and shows unchanged success rates on the 6,360-trial universal grasping benchmark with 212 unseen objects would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.08530 by Jiajia Wu, Jiajun Deng, Jianmin Ji, Jia Pan, Shiqi Zhang, Shuai Dong, Xingyi Zhang, Xin Nie, Xin Zhang, Yanyong Zhang, Yedong Shen, Yuan Zhang, Yuxuan Gao, Zhiyuan Cheng.

Figure 1
Figure 1. Figure 1: GEAR-VLA learns geometry-aware action representations through three key designs: coarse-to-fine action learning, semantic-aligned 3D integration, and embodiment canonicalization. We evaluate these capa￾bilities across real-world appearance generalization, cross-embodiment transfer, and universal object grasping. reasoning. Second, VLAs need 3D spatial understanding, but existing depth [5], 3D position en￾c… view at source ↗
Figure 2
Figure 2. Figure 2: Framework of GEAR-VLA.GEAR-VLA combines coarse-to-fine action learning, semantic-aligned 3D integration, and embodiment canonicalization to learn transferable geometry-aware action representations. Here, State Projectore denotes the lightweight state projector for robot embodiment e. • Extensive evaluations on simulation benchmarks, real-world bimanual manipulation, cross￾embodiment transfer, and universal… view at source ↗
Figure 3
Figure 3. Figure 3: Real-World Manipulation Results. (a) Three evaluation tasks: T-shirt folding, shorts folding, and parcel-label flipping. (b) Results on AgileX, where each task is trained with 200 demonstrations using one fixed object color and tested on three unseen colors. (c) Results on LDT-01, a pretraining-unseen 16-DoF bimanual robot. GEAR-VLA achieves the best performance on both platforms, showing strong real-world… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on the real-world universal grasping benchmark. Each column shows the ob￾servation (top) and execution result (bottom). With only a first-frame target mask, GEAR-VLA localizes and manipulates diverse unseen objects in sparse and cluttered scenes. Real-World Generalization. We first evaluate GEAR-VLA on AgileX, a 14-DoF dual-arm robot. For the three tasks, we collect 200 bimanual teleope… view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the causal VQ-VAE latent action tokenizer. The tokenizer compresses action-induced visual changes into discrete latent action IDs, which are then used as supervision for Embodied Pretraining on videos without robot action labels. For f j i , the subscript i denotes the frame ID and the superscript j denotes the image-feature patch ID. For z j i , the subscript i denotes the frame ID and the sup… view at source ↗
Figure 6
Figure 6. Figure 6: Masked cross-attention used in the causal VQ-VAE latent action tokenizer. Each query corresponds to an image frame, visual features are arranged in temporal order, and the query for a given frame can only attend to current and previous visual features. Gray cells denote future positions masked by the causal attention rule. The video frames are sampled at 5 Hz, while robot actions are represented at 30 Hz. … view at source ↗
Figure 7
Figure 7. Figure 7: (c) freezes the original ViT and injects VGGT features through the zero-initialized connec￾tor, yielding clearer inter-class separation and more structured intra-class distributions. These results support our semantic-aligned 3D integration design. Original 2D ViT Trainable ViT + Zero-init 3D Connector Frozen ViT + Zero-init 3D Connector [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Mask-guided universal object grasping pipeline. The same VLM is used for target grounding and mask-guided reasoning. Given the first-frame target mask and current multi-view observation, the DiT-based action expert predicts the robot action for manipulating the specified object. Object split [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
read the original abstract

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes GEAR-VLA, a VLA framework that learns unified geometry-aware action representations via coarse-to-fine action learning (multi-source pretraining followed by latent tokens and gradient-decoupled DiT), semantic-aligned 3D integration (trainable 3D backbone aligned to frozen VLM visual pathway), and embodiment canonicalization (embodiment-aware states with invariant actions). It reports SOTA results on LIBERO, zero-shot LIBERO-Plus, RoboTwin 2.0, 85.9% success on AgileX, 81.0% on unseen LDT-01 embodiment, and 90.1% on a 6,360-trial grasping benchmark with 212 unseen objects, with code/models to be released.

Significance. If the performance gains are shown to stem specifically from the geometry-aware components rather than pretraining scale or base VLM choice, the work would offer a concrete path to improving VLA robustness to unseen objects, backgrounds, and embodiments. The multi-benchmark evaluation including real-robot and cross-embodiment tests, together with the planned code release, would strengthen the contribution by supporting reproducibility and further research.

major comments (1)
  1. [Abstract and Experiments] Abstract and Experiments section: The central claim attributes SOTA performance to the three geometry-aware components and states that lack of unified geometry-aware representation is the root cause of prior VLA failures. However, no controlled ablations are described that hold pretraining data volume, VLM backbone, and training recipe fixed while removing or varying each component (coarse-to-fine tokens+DiT, 3D alignment, canonicalization). Without these, the reported success rates on LIBERO, RoboTwin, AgileX, LDT-01, and the grasping benchmark cannot substantiate the causal attribution over alternative explanations such as multi-source pretraining scale.
minor comments (2)
  1. [Method] Notation for latent action tokens and the DiT expert could be introduced with explicit equations in the method section to improve clarity of the coarse-to-fine pipeline.
  2. [Experiments] Table captions for benchmark results should explicitly state the number of trials and whether statistical significance or variance across seeds is reported.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that isolating the contributions of the proposed geometry-aware components through controlled ablations is essential to substantiate the central claims, and we will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The central claim attributes SOTA performance to the three geometry-aware components and states that lack of unified geometry-aware representation is the root cause of prior VLA failures. However, no controlled ablations are described that hold pretraining data volume, VLM backbone, and training recipe fixed while removing or varying each component (coarse-to-fine tokens+DiT, 3D alignment, canonicalization). Without these, the reported success rates on LIBERO, RoboTwin, AgileX, LDT-01, and the grasping benchmark cannot substantiate the causal attribution over alternative explanations such as multi-source pretraining scale.

    Authors: We acknowledge the validity of this point. Our current experiments compare GEAR-VLA against prior VLAs and include some component evaluations, but they do not include fully controlled ablations that hold pretraining data volume, VLM backbone, and training recipe fixed while ablating each of the three components individually. In the revised manuscript, we will add these controlled ablations (e.g., variants with/without the gradient-decoupled DiT, 3D alignment, and canonicalization under matched conditions) to directly address potential alternative explanations such as pretraining scale and to strengthen the causal attribution to the geometry-aware design. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper is an empirical VLA framework proposal whose central claims rest on reported benchmark results (LIBERO, RoboTwin, real-robot trials) rather than any mathematical derivation chain. The abstract and provided text contain no equations, no fitted parameters renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatzes smuggled via prior work. The three proposed components are motivated by stated limitations of prior VLAs and evaluated via experiments; the performance attribution is therefore not forced by construction or self-reference but remains open to external verification through ablations and replication.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, training details, or explicit assumptions, so no free parameters, axioms, or invented entities can be identified; full methods would be required to populate the ledger.

pith-pipeline@v0.9.1-grok · 5876 in / 1388 out tokens · 28124 ms · 2026-06-27T18:27:28.054333+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. InProceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learni...

  2. [2]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, L. Smith, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control. InProceeding...

  3. [3]

    Black, N

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke...

  4. [4]

    Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

    Z. Liang, Y . Li, T. Yang, C. Wu, S. Mao, T. Nian, L. Pei, S. Zhou, X. Yang, J. Pang, Y . Mu, and P. Luo. Discrete diffusion VLA: Bringing discrete diffusion to action decoding in vision- language-action policies.arXiv preprint arXiv:2508.20072, 2025

  5. [5]

    T. Yuan, Y . Liu, C. Lu, Z. Chen, T. Jiang, and H. Zhao. DepthVLA: Enhancing vision- language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375, 2025

  6. [6]

    D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang, and X. Li. SpatialVLA: Exploring spatial representations for visual-language-action model. In Proceedings of Robotics: Science and Systems, 2025

  7. [7]

    Y . Yang, S. Zeng, T. Lin, X. Chang, D. Qi, J. Xiao, H. Liu, R. Chen, Y . Chen, D. Huo, F. Xiong, X. Wei, Z. Ma, and M. Xu. ABot-M0: VLA foundation model for robotic manipulation with action manifold learning.arXiv preprint arXiv:2602.11236, 2026

  8. [8]

    H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu. H-RDT: Human manipulation enhanced bimanual robotic manipulation.Proceedings of the AAAI Conference on Artificial Intelligence, 40(22):18135–18143, 2026. doi:10.1609/aaai.v40i22.38875

  9. [9]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, Y .-Q. Zhang, J. Pang, J. Liu, T. Wang, and X. Zhan. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  10. [10]

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. InAdvances in Neural Information Processing Systems, 2023

  11. [11]

    S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu. LIBERO-Plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025. 9

  12. [12]

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, W. Deng, Y . Guo, T. Nian, X. Xie, Q. Chen, K. Su, T. Xu, G. Liu, M. Hu, H.-a. Gao, K. Wang, Z. Liang, Y . Qin, X. Yang, P. Luo, and Y . Mu. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXi...

  13. [13]

    Zhong, X

    Y . Zhong, X. Huang, R. Li, C. Zhang, Z. Chen, T. Guan, F. Zeng, K. N. Lui, Y . Ye, Y . Liang, Y . Yang, and Y . Chen. DexGraspVLA: A vision-language-action framework towards general dexterous grasping.Proceedings of the AAAI Conference on Artificial Intelligence, 40(22): 18836–18844, 2026. doi:10.1609/aaai.v40i22.38953

  14. [14]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polo- sukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017

  15. [15]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021

  16. [16]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the International Conference on Machine Learning, 2021

  17. [17]

    H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning. InAdvances in Neural Informa- tion Processing Systems, 2023

  18. [18]

    Ichter, A

    B. Ichter, A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y . Lu, C. Parada, K. Rao, P. Sermanet, A. T. To- shev, V . Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Lu...

  19. [19]

    Liang, W

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. InProceedings of the IEEE International Conference on Robotics and Automation, pages 9493–9500, 2023. doi:10.1109/ ICRA48891.2023.10160591

  20. [20]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, B. Ichter, A. Irpan, D. Kalashnikov, S. Levine, I. Mordatch, C. Parada, M. Ryoo, P. Ser- manet, V . Vanhoucke, F. Xia, T. Xiao, T. Yu, and B. Zitkovich. RT-1: Robotics transformer for real-world control at scale. InProceedings of Robotics: Science and Systems, 2023

  21. [21]

    Zitkovich, T

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

  22. [22]

    Driess, F

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. InProceedings of the International Conference o...

  23. [23]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic learning datasets and RT-X models. InProceedings of the IEEE International Conference on Robotics and Automation, 2024

  24. [24]

    O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. CALVIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  25. [25]

    Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan. 3D-LLM: Injecting the 3D world into large language models. InAdvances in Neural Information Processing Systems, 2023

  26. [26]

    J. L. Sch ¨onberger and J.-M. Frahm. Structure-from-motion revisited. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

  27. [27]

    J. L. Sch ¨onberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for unstructured multi-view stereo. InEuropean Conference on Computer Vision, 2016

  28. [28]

    J. Wang, N. Karaev, C. Rupprecht, and D. Novotny. VGGSfM: Visual geometry grounded deep structure from motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21686–21697, 2024

  29. [29]

    L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  30. [30]

    R. Wang, S. Xu, C. Dai, J. Xiang, Y . Deng, X. Tong, and J. Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5261–5271, 2025

  31. [31]

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud. DUSt3R: Geometric 3D vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024

  32. [32]

    Leroy, Y

    V . Leroy, Y . Cabon, and J. Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, 2024

  33. [33]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. VGGT: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5294–5306, 2025

  34. [34]

    S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos. InInternational Conference on Learning Representations, 2025

  35. [35]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. FAST: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  36. [36]

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. C. M. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems, 2023. 11

  37. [37]

    Z. Ren, Y . Wei, X. Yu, G. Luo, Y . Zhao, B. Kang, J. Feng, and X. Jin. Videoworld 2: Learning transferable knowledge from real-world videos, 2026

  38. [38]

    Y . Wang, H. Zhu, M. Liu, J. Yang, H.-S. Fang, and T. He. Vq-vla: Improving vision-language- action models via scaling vector-quantized action tokenizers. InICCV, 2025

  39. [39]

    S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, et al. Latent action pretraining from videos. InInternational Conference on Learning Representations, volume 2025, pages 28213–28239, 2025

  40. [40]

    J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, D. Zhao, and H. Chen. WorldVLA: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  41. [41]

    Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. UniVLA: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  42. [42]

    put the apple into the plate

    L. Zhong, Y . Liu, Y . Wei, Z. Xiong, S. Liu, and G. Ren. ACoT-VLA: Action chain-of-thought for vision-language-action models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8152–8162, 2026. 12 Appendix A Training Data and Implementation Details Training data.Tables 4 and 5 summarize the two data pools used in C...