pith. machine review for the scientific record.

arxiv: 2605.13403 · v1 · submitted 2026-05-13 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links · Lean Theorem

RotVLA: Rotational Latent Action for Vision-Language-Action Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:44 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords Rotational Latent Action · Vision-Language-Action · SO(n) · Latent Action Models · Flow Matching · Robot Manipulation · Cross-embodiment Pretraining

The pith

RotVLA replaces discrete action codes with continuous rotations in SO(n) for vision-language-action models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing latent action models for VLA pretraining rely on discrete quantization that often collapses into trivial frame copying and lacks geometric structure. RotVLA instead represents latent actions as elements of the rotation group SO(n), which supplies continuity and compositionality aligned with physical robot motion. A triplet-frame objective enforces temporal consistency across frames without allowing degenerate solutions. The resulting 1.7 billion parameter model is pretrained on more than 1700 hours of cross-embodiment robot and human video data, then fine-tuned with a flow-matching head that treats latent rotations as high-level planners for actual robot actions. On standard benchmarks this yields 98.2 percent success on LIBERO and 89.6/88.5 percent on RoboTwin2.0 clean and randomized splits, plus strong real-world manipulation results.

Core claim

Latent actions modeled as elements of SO(n) together with a triplet-frame objective replace discrete quantization pipelines, delivering continuity, compositionality, and physically grounded structure while avoiding trivial reconstruction. The pretrained VLM backbone plus flow-matching action head uses these latent rotations as planners that condition unified denoising of robot actions, achieving the reported benchmark numbers with 1.7B parameters and 1700+ hours of data.
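
As a concrete anchor for the flow-matching half of that claim, here is a minimal, self-contained sketch of a conditional flow-matching training step in PyTorch: a velocity field is regressed toward straight-line paths from noise to action chunks, with a conditioning vector standing in for the VLM features and latent rotation. The module, dimensions, and variable names are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class VelocityHead(nn.Module):
    """Toy stand-in for a flow-matching action head (hypothetical architecture)."""
    def __init__(self, action_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, action_dim) noisy action, t: (B, 1) time, cond: (B, cond_dim) context
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_step(head, actions, cond):
    """One conditional flow-matching training step (rectified-flow style)."""
    noise = torch.randn_like(actions)        # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1)      # t ~ U(0, 1)
    x_t = (1.0 - t) * noise + t * actions    # point on the straight path from noise to action
    target_v = actions - noise               # constant target velocity along that path
    pred_v = head(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

# Example: a 7-DoF action chunk conditioned on a 64-d context vector (both sizes assumed).
head = VelocityHead(action_dim=7, cond_dim=64)
loss = flow_matching_step(head, torch.randn(8, 7), torch.randn(8, 64))
loss.backward()
```

At inference time the learned velocity field would be integrated from noise to an action chunk, with the latent rotation entering only through the conditioning vector, consistent with the paper's description of the latent as a high-level planner.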

What carries the argument

Continuous rotational latent actions represented as elements of SO(n), learned under a triplet-frame objective that supplies temporal dynamics without collapse.
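
The continuity and compositionality attributed to SO(n) here are ordinary properties of the rotation group: rotations compose by matrix multiplication and interpolate along geodesics via the matrix exponential and logarithm. A minimal NumPy/SciPy illustration of both operations, independent of the paper's actual encoder:

```python
import numpy as np
from scipy.linalg import expm, logm
from scipy.stats import special_ortho_group

def geodesic_interp(r0: np.ndarray, r1: np.ndarray, t: float) -> np.ndarray:
    """Interpolate between two rotations along the SO(n) geodesic; the result stays on the manifold."""
    return r0 @ expm(t * logm(r0.T @ r1))

n = 3
r_a = special_ortho_group.rvs(n)   # random rotation standing in for one latent action
r_b = special_ortho_group.rvs(n)   # another latent action

r_ab = r_b @ r_a                                # composition of two actions is again a rotation
r_half = geodesic_interp(np.eye(n), r_a, 0.5)   # "half" of action r_a

# Both results remain valid elements of SO(n): orthogonal with determinant +1.
print(np.allclose(r_ab @ r_ab.T, np.eye(n)), np.isclose(np.linalg.det(r_half), 1.0))
```

A discrete codebook supports neither operation natively, which is the contrast the argument rests on.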

If this is right

  • Latent rotations can be composed and interpolated without discretization artifacts during planning.
  • The same latent space serves as a unified planner across embodiments once the flow-matching head is trained.
  • Performance remains high under both clean and randomized visual conditions on multiple manipulation suites.
  • Real-world deployment shows consistent gains over existing VLA baselines without extra embodiment-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometric prior may transfer to other sequence tasks that require physically plausible interpolation, such as video prediction.
  • If SO(n) structure proves robust, similar rotational embeddings could replace codebooks in non-robotics domains like motion capture or physics simulation.
  • Further scaling of the pretraining corpus could raise success rates on longer-horizon or multi-step tasks.

Load-bearing premise

Representing latent actions as rotations in SO(n) plus a triplet-frame loss automatically gives continuity, compositionality, and physical meaning without trivial solutions.
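
Stated a little more concretely, here is a hedged sketch of what a triplet-frame objective of this kind could look like, with an encoder E mapping a frame pair to a latent rotation and a decoder D rolling a frame forward under a latent; the symbols and the composition term are assumptions for illustration, not the paper's stated loss.

```latex
% One plausible triplet-frame objective over frames (I_t, I_{t+1}, I_{t+2}); illustrative only.
\begin{aligned}
z_{t \to t+1} &= \mathcal{E}(I_t, I_{t+1}) \in SO(n), \qquad
z_{t+1 \to t+2} = \mathcal{E}(I_{t+1}, I_{t+2}) \in SO(n), \\
\hat{I}_{t+1} &= \mathcal{D}\bigl(I_t,\; z_{t \to t+1}\bigr), \qquad\quad
\hat{I}'_{t+2} = \mathcal{D}\bigl(I_t,\; z_{t+1 \to t+2}\, z_{t \to t+1}\bigr), \\
\mathcal{L}_{\mathrm{triplet}} &= \bigl\| \hat{I}_{t+1} - I_{t+1} \bigr\|^2
  + \bigl\| \hat{I}'_{t+2} - I_{t+2} \bigr\|^2 .
\end{aligned}
```

Because the second reconstruction must be reached from I_t through a composition of two latent rotations, simply copying the conditioning frame no longer minimizes the loss, which is the degenerate shortcut the triplet construction is meant to block.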

What would settle it

An ablation that removes the SO(n) constraint or the triplet loss and measures whether benchmark success falls to the level of prior discrete quantization methods on identical pretraining data.

Figures

Figures reproduced from arXiv: 2605.13403 by Hangjun Ye, Jiahuan Zhou, Peiyan Li, Qiwei Li, Quanyun Zhou, Xicheng Gong, Xinghang Li, Yadong Mu.

Figure 1. We introduce RotVLA, a Vision-Language-Action framework pretrained with a continuous […]
Figure 2. Illustration of existing LAMs (a) and RotVLA (b). Existing LAMs follow an encode–decode […]
Figure 3. Visualization of real-world tasks and generalization setting.
Figure 4. Illustration that the latent action extracted by one dataset can generalize to other seen and […]
Figure 5. Visualization results of Î_{t+1} and Î′_{t+2}.
Figure 6. The robotic platform used in real-world experiments.
Figure 8. Statistics of the pretraining data used by RotVLA, grouped by dataset and embodiment.
Figure 9. Impact of pretraining data scale on performance.
Original abstract

Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RotVLA, a Vision-Language-Action (VLA) framework that models latent actions as elements of the special orthogonal group SO(n) for continuous and compositional representations. It employs a triplet frame learning objective to enforce temporal dynamics and uses a VLM backbone with a flow-matching action head. Pretrained on large-scale cross-embodiment datasets totaling over 1700 hours, RotVLA reports state-of-the-art success rates of 98.2% on the LIBERO benchmark and 89.6%/88.5% on RoboTwin2.0 under clean and randomized conditions, along with strong real-world manipulation performance.

Significance. If the reported performance gains are attributable to the rotational latent action representation and triplet objective rather than scale or data alone, this work could significantly advance VLA models by introducing a geometrically structured continuous latent space that better aligns with physical action dynamics, potentially improving generalization across embodiments.

major comments (2)
  1. Abstract: The abstract claims that modeling latent actions as SO(n) elements provides continuity, compositionality, and structured geometry while the triplet frame objective avoids degeneration, yet no supporting experiments (e.g., latent interpolation, group composition tests, or ablations against Euclidean latents) are referenced to demonstrate these properties are realized or responsible for the benchmark results.
  2. Results section: The headline performance numbers (98.2% LIBERO, 89.6%/88.5% RoboTwin) are stated without experimental details, baseline comparisons, ablation studies, or error analysis, preventing verification of the contribution of the proposed SO(n) representation over prior discrete LAMs or the flow-matching head.
minor comments (1)
  1. Abstract: The parameter count is given as 1.7B but no breakdown of the VLM backbone versus action head is provided.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have revised the paper to directly address the concerns about supporting evidence for the abstract claims and the level of detail in the results section. All changes are highlighted in the revised version.

Point-by-point responses
  1. Referee: Abstract: The abstract claims that modeling latent actions as SO(n) elements provides continuity, compositionality, and structured geometry while the triplet frame objective avoids degeneration, yet no supporting experiments (e.g., latent interpolation, group composition tests, or ablations against Euclidean latents) are referenced to demonstrate these properties are realized or responsible for the benchmark results.

    Authors: We agree that the abstract would benefit from explicit references to supporting experiments. In the revised manuscript we have added a dedicated latent-space analysis subsection (Section 4.3) that includes: (i) linear interpolation between latent actions demonstrating continuity on the manifold, (ii) explicit SO(n) group composition tests showing that composing two latent actions yields a valid third action that matches the observed transition, and (iii) an ablation replacing the SO(n) representation with an unconstrained Euclidean latent space of identical dimensionality. These experiments are now cited in the abstract and demonstrate that the geometric structure contributes measurably to the reported performance gains beyond scale alone. A minimal sketch of such a composition test is given after this exchange. Revision: yes.

  2. Referee: Results section: The headline performance numbers (98.2% LIBERO, 89.6%/88.5% RoboTwin) are stated without experimental details, baseline comparisons, ablation studies, or error analysis, preventing verification of the contribution of the proposed SO(n) representation over prior discrete LAMs or the flow-matching head.

    Authors: We acknowledge that the original results section was too concise. The revised version now contains: (i) a full experimental protocol subsection detailing training hyperparameters, data splits, and evaluation protocols for both LIBERO and RoboTwin2.0; (ii) expanded baseline tables comparing against all prior discrete LAM-based VLAs and recent flow-matching methods; (iii) systematic ablations that isolate the SO(n) representation, the triplet-frame objective, and the flow-matching head; and (iv) per-task error analysis with failure-mode categorization. These additions allow readers to verify the specific contribution of the rotational latent action design. Revision: yes.
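
A minimal sketch of the kind of group-composition test described in the first response: extract latent rotations for two consecutive transitions and for the combined two-step transition, then check that the composed rotation lands near the directly extracted one under the geodesic distance on SO(n). The encode callable is a hypothetical placeholder for a latent-action encoder; only the distance computation is standard.

```python
import numpy as np
from scipy.linalg import logm

def so_n_distance(r_a: np.ndarray, r_b: np.ndarray) -> float:
    """Geodesic distance on SO(n): Frobenius norm of log(r_a^T r_b)."""
    return float(np.linalg.norm(logm(r_a.T @ r_b), "fro"))

def composition_error(encode, frame_t, frame_t1, frame_t2) -> float:
    """encode(frame_a, frame_b) -> rotation matrix is a hypothetical latent-action encoder.

    Compares the composed two-step latent against the directly extracted one;
    a small value means the latent space behaves like a group under composition.
    """
    z_01 = encode(frame_t, frame_t1)    # latent for t   -> t+1
    z_12 = encode(frame_t1, frame_t2)   # latent for t+1 -> t+2
    z_02 = encode(frame_t, frame_t2)    # latent for t   -> t+2, extracted directly
    return so_n_distance(z_12 @ z_01, z_02)   # convention: rotations act on the left
```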

Circularity Check

0 steps flagged

No circularity: performance claims are empirical outcomes of pretraining and evaluation

Full rationale

The paper reports success rates (98.2% LIBERO, 89.6/88.5% RoboTwin2.0) as results of pretraining a 1.7B VLA model on 1700+ hours of data using a flow-matching head conditioned on SO(n) latent actions and triplet-frame supervision. No equations, derivations, or self-citations are shown that reduce these metrics to fitted parameters, self-defined quantities, or tautological inputs. The geometric properties of SO(n) and the triplet objective are stated as design motivations for continuity and non-degeneracy, without any reduction that makes the reported numbers follow by construction from the modeling choices themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that SO(n) geometry supplies physically meaningful action structure and that the triplet objective prevents collapse; no free parameters or invented entities beyond the representation itself are quantified in the abstract.

axioms (1)
  • Domain assumption: elements of SO(n) provide continuity, compositionality, and geometry aligned with real-world action dynamics (the structure assumed here is stated formally below).
    Invoked to justify the choice of rotational latent actions over discrete codes.
invented entities (1)
  • Rotational latent action in SO(n) (no independent evidence)
    Purpose: to serve as a continuous, composable, and physically structured latent planner.
    New representation introduced to replace discrete quantization pipelines.
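
For reference, the structure this assumption leans on is standard: SO(n) is a compact matrix Lie group, so elements compose, invert, and interpolate without leaving the set.

```latex
% Definition and the group/manifold facts behind the "continuity and compositionality" assumption.
SO(n) \;=\; \bigl\{\, R \in \mathbb{R}^{n \times n} \;:\; R^{\top} R = I,\ \det R = 1 \,\bigr\},
\qquad
R_1 R_2 \in SO(n), \quad R^{-1} = R^{\top} \in SO(n),
\qquad
R(t) \;=\; R_0 \exp\!\bigl(t \log(R_0^{\top} R_1)\bigr) \in SO(n) \ \ \text{for } t \in [0,1].
```

Whether this geometry is also physically meaningful for robot actions is the empirical part of the assumption, and it is what the ablation proposed above would test.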

pith-pipeline@v0.9.0 · 5587 in / 1279 out tokens · 24582 ms · 2026-05-14T17:44:25.288826+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 49 canonical work pages · 24 internal anchors

  1. [1]

    A Survey on Vision-Language-Action Models for Embodied AI

    Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024

  2. [2]

    arXiv preprint arXiv:2505.04769 (2025)

    Ranjan Sapkota, Yang Cao, Konstantinos I Roumeliotis, and Manoj Karkee. Vision-language-action models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025

  3. [3]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  4. [4]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  6. [6]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024

  7. [7]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  8. [8]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  9. [9]

    Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

  10. [10]

    Clam: Continuous latent action models for robot learning from unlabeled demonstrations.arXiv preprint arXiv:2505.04999, 2025

    Anthony Liang, Pavel Czempin, Matthew Hong, Yutai Zhou, Erdem Biyik, and Stephen Tu. Clam: Continuous latent action models for robot learning from unlabeled demonstrations.arXiv preprint arXiv:2505.04999, 2025

  11. [11]

    Softvq-vae: Efficient 1-dimensional continuous tokenizer

    Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28358–28370, 2025

  12. [12]

    Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  13. [13]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  14. [14]

    Towards generalist robot policies: What matters in building vision-language-action models.arXiv preprint arXiv:2412.14058, 2024

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models.arXiv preprint arXiv:2412.14058, 2024

  15. [15]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  16. [16]

    Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models

    Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models. arXiv preprint arXiv:2506.07961, 2025

  17. [17]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  18. [18]

    Vq-vla: Improving vision-language- action models via scaling vector-quantized action tokenizers

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509.09372, 2025

  19. [19]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  20. [20]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π*0.6: a VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025

  21. [21]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization, 2025. URL https://arxiv.org/abs/2504.16054

  22. [22]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  23. [23]

    arXiv preprint arXiv:2511.04555 (2025)

    Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, et al. Evo-1: Lightweight vision-language-action model with preserved semantic alignment.arXiv preprint arXiv:2511.04555, 2025

  24. [24]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  25. [25]

    H-rdt: Human manipulation enhanced bimanual robotic manipulation.arXiv preprint arXiv:2507.23523,

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation.arXiv preprint arXiv:2507.23523, 2025

  26. [26]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  27. [27]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  28. [28]

    Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

  29. [29]

    Moto: Latent motion token as the bridging language for robot manipulation.arXiv preprint arXiv:2412.04445, 8, 2024

    Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for robot manipulation.arXiv preprint arXiv:2412.04445, 8, 2024

  30. [30]

    Latbot: Distilling universal latent actions for vision-language-action models.arXiv preprint arXiv:2511.23034, 2025

    Zuolei Li, Xingyu Gao, Xiaofan Wang, and Jianlong Fu. Latbot: Distilling universal latent actions for vision-language-action models.arXiv preprint arXiv:2511.23034, 2025

  31. [31]

    Como: Learning continuous latent motion from internet videos for scalable robot learning.arXiv preprint arXiv:2505.17006, 2025

    Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Como: Learning continuous latent motion from internet videos for scalable robot learning.arXiv preprint arXiv:2505.17006, 2025

  32. [32]

    Villa-x: enhancing latent action modeling in vision-language-action models,

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025

  33. [33]

    Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos.arXiv preprint arXiv:2601.04061, 2026

    Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, and Yansong Tang. Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos.arXiv preprint arXiv:2601.04061, 2026

  34. [34]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  35. [35]

    Seeing space and motion: Enhancing latent actions with spatial and dynamic awareness for vla.arXiv preprint arXiv:2509.26251, 2025

    Zhejia Cai, Yandan Yang, Xinyuan Chang, Shiyi Liang, Ronghan Chen, Feng Xiong, Mu Xu, and Ruqi Huang. Seeing space and motion: Enhancing latent actions with spatial and dynamic awareness for vla.arXiv preprint arXiv:2509.26251, 2025

  36. [36]

    Laof: Robust latent action learning with optical flow constraints.arXiv preprint arXiv:2511.16407, 2025

    Xizhou Bu, Jiexi Lyu, Fulei Sun, Ruichen Yang, Zhiqiang Ma, and Wei Li. Laof: Robust latent action learning with optical flow constraints.arXiv preprint arXiv:2511.16407, 2025

  37. [37]

    UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    Manish Kumar Govind, Dominick Reilly, Pu Wang, and Srijan Das. Unilact: Depth-aware rgb latent action learning for vision-language-action models.arXiv preprint arXiv:2602.20231, 2026

  38. [38]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  39. [39]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  40. [40]

    Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation.arXiv preprint arXiv:2412.13877, 2024

  41. [41]

    RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

    Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation.arXiv preprint arXiv:2511.17441, 2025

  42. [42]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  43. [43]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  44. [44]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

  45. [45]

    arXiv preprint arXiv:2601.18692 (2026)

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

  46. [46]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  47. [47]

    Spatial-temporal transformer networks for traffic flow forecasting.arXiv preprint arXiv:2001.02908, 2020

    Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-temporal transformer networks for traffic flow forecasting.arXiv preprint arXiv:2001.02908, 2020

  48. [48]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  49. [49]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  50. [50]

    On the continuity of rotation representations in neural networks

    Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019

  51. [51]

    LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

    Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, and Xunliang Cai. Lary: A latent action representation yielding benchmark for generalizable vision-to-action alignment. arXiv preprint arXiv:2604.11689, 2026

  52. [52]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. Inconference on Robot Learning, pages 991–1002. PMLR, 2022

  53. [53]

    URL https://realmanrobotics.com/

    Realman Robotics. URL https://realmanrobotics.com/

  54. [54]

    URL https://global.agilex.ai/products/cobot-magic

    Agilex Robotics. URL https://global.agilex.ai/products/cobot-magic

  55. [55]

    URL https://galaxea-ai.com/products/R1-Lite

    Galaxea. URL https://galaxea-ai.com/products/R1-Lite

  56. [56]

    URL https://www.agibot.com/products/G1

    Agibot. URL https://www.agibot.com/products/G1

  57. [57]

    URL https://airbots.online/mmk2

    Airbot. URL https://airbots.online/mmk2

  58. [58]

    URL https://www.unitree.com/g1/

    Unitree Robotics. URL https://www.unitree.com/g1/

  59. [59]

    URL https://www.tqartisan.com/productDetails?type=A2

    TQ-Artisan. URL https://www.tqartisan.com/productDetails?type=A2

  60. [60]

    URL https://www.universal-robots.com/products/ur5e/

    UR5e Robotics. URL https://www.universal-robots.com/products/ur5e/

  61. [61]

    URL https://franka.de/

    Franka Emika Panda Robotics. URL https://franka.de/

  62. [62]

    URL https://x-humanoid.com/

    Tien Kung Robotics. URL https://x-humanoid.com/

  63. [63]

    Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation.The International Journal of Robotics Research, 44(10-11):1863–1891, 2025

    Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation.The International Journal of Robotics Research, 44(10-11):1863–1891, 2025

  64. [64]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  65. [65]

    Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. InConference on robot learning, pages 651–673. PMLR, 2018

  66. [66]

    On bringing robots home

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home.arXiv preprint arXiv:2311.16098, 2023

  67. [67]

    Fmb: a functional manipulation benchmark for generalizable robotic learning

    Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 44(4):592–606, 2025

  68. [68]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen- Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

  69. [69]

    Mutex: Learning unified policies from multimodal task specifications.arXiv preprint arXiv:2309.14320, 2023

    Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. Mutex: Learning unified policies from multimodal task specifications.arXiv preprint arXiv:2309.14320, 2023

  70. [70]

    Hydra: Hybrid robot actions for imitation learning

    Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Conference on Robot Learning, pages 2113–2133. PMLR, 2023

  71. [71]

    Learning and retrieval from prior data for skill-based imitation learning.arXiv preprint arXiv:2210.11435, 2022

    Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning.arXiv preprint arXiv:2210.11435, 2022

  72. [72]

    Robot learning on the job: Human-in-the-loop autonomy and learning during deployment.The International Journal of Robotics Research, 44(10-11):1727–1742, 2025

    Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment.The International Journal of Robotics Research, 44(10-11):1727–1742, 2025

  73. [73]

    Train offline, test online: A real robot learning benchmark.arXiv preprint arXiv:2306.00942, 2023

    Gaoyue Zhou, Victoria Dean, Mohan Kumar Srirama, Aravind Rajeswaran, Jyothish Pari, Kyle Hatch, Aryan Jain, Tianhe Yu, Pieter Abbeel, Lerrel Pinto, et al. Train offline, test online: A real robot learning benchmark.arXiv preprint arXiv:2306.00942, 2023

  74. [74]

    Grounding language with visual affordances over unstructured data. arXiv preprint arXiv:2210.01911, 2022

    Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. arXiv preprint arXiv:2210.01911, 2022

  75. [75]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. InConference on Robot Learning, pages 879–893. PMLR, 2018

  76. [76]

    Multi-resolution sensing for real-time control with vision-language models

    Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In2nd Workshop on Language and Robot Learning: Language as Grounding, 2023

  77. [77]

    Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

    Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

  78. [78]

    Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. URL https://github.com/clvrai/clvr_ jaco_play_dataset

  79. [79]

    Viola: Imitation learning for vision- based manipulation with object proposal priors

    Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision- based manipulation with object proposal priors. InConference on Robot Learning, pages 1199–1210. PMLR, 2023

  80. [80]

    Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot, 2023

    Xinghao Zhu, Ran Tian, Chenfeng Xu, Mingxiao Huo, Wei Zhan, Masayoshi Tomizuka, and Mingyu Ding. Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot, 2023

Showing first 80 references.