pith. machine review for the scientific record.

arxiv: 2605.13403 · v1 · submitted 2026-05-13 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links · Lean Theorem

RotVLA: Rotational Latent Action for Vision-Language-Action Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:44 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords Rotational Latent Action · Vision-Language-Action · SO(n) · Latent Action Models · Flow Matching · Robot Manipulation · Cross-embodiment Pretraining

The pith

RotVLA replaces discrete action codes with continuous rotations in SO(n) for vision-language-action models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing latent action models for VLA pretraining rely on discrete quantization that often collapses into trivial frame copying and lacks geometric structure. RotVLA instead represents latent actions as elements of the rotation group SO(n), which supplies continuity and compositionality aligned with physical robot motion. A triplet-frame objective enforces temporal consistency across frames without allowing degenerate solutions. The resulting 1.7 billion parameter model is pretrained on more than 1700 hours of cross-embodiment robot and human video data, then fine-tuned with a flow-matching head that treats latent rotations as high-level planners for actual robot actions. On standard benchmarks this yields 98.2 percent success on LIBERO and 89.6/88.5 percent on RoboTwin2.0 clean and randomized splits, plus strong real-world manipulation results.

Core claim

Latent actions modeled as elements of SO(n) together with a triplet-frame objective replace discrete quantization pipelines, delivering continuity, compositionality, and physically grounded structure while avoiding trivial reconstruction. The pretrained VLM backbone plus flow-matching action head uses these latent rotations as planners that condition unified denoising of robot actions, achieving the reported benchmark numbers with 1.7B parameters and 1700+ hours of data.
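
As a concrete anchor for the flow-matching half of that claim, here is a minimal, self-contained sketch of a conditional flow-matching training step in PyTorch: a velocity field is regressed toward straight-line paths from noise to action chunks, with a conditioning vector standing in for the VLM features and latent rotation. The module, dimensions, and variable names are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class VelocityHead(nn.Module):
    """Toy stand-in for a flow-matching action head (hypothetical architecture)."""
    def __init__(self, action_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, action_dim) noisy action, t: (B, 1) time, cond: (B, cond_dim) context
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_step(head, actions, cond):
    """One conditional flow-matching training step (rectified-flow style)."""
    noise = torch.randn_like(actions)        # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1)      # t ~ U(0, 1)
    x_t = (1.0 - t) * noise + t * actions    # point on the straight path from noise to action
    target_v = actions - noise               # constant target velocity along that path
    pred_v = head(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

# Example: a 7-DoF action chunk conditioned on a 64-d context vector (both sizes assumed).
head = VelocityHead(action_dim=7, cond_dim=64)
loss = flow_matching_step(head, torch.randn(8, 7), torch.randn(8, 64))
loss.backward()
```

At inference time the learned velocity field would be integrated from noise to an action chunk, with the latent rotation entering only through the conditioning vector, consistent with the paper's description of the latent as a high-level planner.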

What carries the argument

Continuous rotational latent actions represented as elements of SO(n), learned under a triplet-frame objective that supplies temporal dynamics without collapse.
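
The continuity and compositionality attributed to SO(n) here are ordinary properties of the rotation group: rotations compose by matrix multiplication and interpolate along geodesics via the matrix exponential and logarithm. A minimal NumPy/SciPy illustration of both operations, independent of the paper's actual encoder:

```python
import numpy as np
from scipy.linalg import expm, logm
from scipy.stats import special_ortho_group

def geodesic_interp(r0: np.ndarray, r1: np.ndarray, t: float) -> np.ndarray:
    """Interpolate between two rotations along the SO(n) geodesic; the result stays on the manifold."""
    return r0 @ expm(t * logm(r0.T @ r1))

n = 3
r_a = special_ortho_group.rvs(n)   # random rotation standing in for one latent action
r_b = special_ortho_group.rvs(n)   # another latent action

r_ab = r_b @ r_a                                # composition of two actions is again a rotation
r_half = geodesic_interp(np.eye(n), r_a, 0.5)   # "half" of action r_a

# Both results remain valid elements of SO(n): orthogonal with determinant +1.
print(np.allclose(r_ab @ r_ab.T, np.eye(n)), np.isclose(np.linalg.det(r_half), 1.0))
```

A discrete codebook supports neither operation natively, which is the contrast the argument rests on.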

If this is right

  • Latent rotations can be composed and interpolated without discretization artifacts during planning.
  • The same latent space serves as a unified planner across embodiments once the flow-matching head is trained.
  • Performance remains high under both clean and randomized visual conditions on multiple manipulation suites.
  • Real-world deployment shows consistent gains over existing VLA baselines without extra embodiment-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometric prior may transfer to other sequence tasks that require physically plausible interpolation, such as video prediction.
  • If SO(n) structure proves robust, similar rotational embeddings could replace codebooks in non-robotics domains like motion capture or physics simulation.
  • Further scaling of the pretraining corpus could raise success rates on longer-horizon or multi-step tasks.

Load-bearing premise

Representing latent actions as rotations in SO(n) plus a triplet-frame loss automatically gives continuity, compositionality, and physical meaning without trivial solutions.
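
Stated a little more concretely, here is a hedged sketch of what a triplet-frame objective of this kind could look like, with an encoder E mapping a frame pair to a latent rotation and a decoder D rolling a frame forward under a latent; the symbols and the composition term are assumptions for illustration, not the paper's stated loss.

```latex
% One plausible triplet-frame objective over frames (I_t, I_{t+1}, I_{t+2}); illustrative only.
\begin{aligned}
z_{t \to t+1} &= \mathcal{E}(I_t, I_{t+1}) \in SO(n), \qquad
z_{t+1 \to t+2} = \mathcal{E}(I_{t+1}, I_{t+2}) \in SO(n), \\
\hat{I}_{t+1} &= \mathcal{D}\bigl(I_t,\; z_{t \to t+1}\bigr), \qquad\quad
\hat{I}'_{t+2} = \mathcal{D}\bigl(I_t,\; z_{t+1 \to t+2}\, z_{t \to t+1}\bigr), \\
\mathcal{L}_{\mathrm{triplet}} &= \bigl\| \hat{I}_{t+1} - I_{t+1} \bigr\|^2
  + \bigl\| \hat{I}'_{t+2} - I_{t+2} \bigr\|^2 .
\end{aligned}
```

Because the second reconstruction must be reached from I_t through a composition of two latent rotations, simply copying the conditioning frame no longer minimizes the loss, which is the degenerate shortcut the triplet construction is meant to block.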

What would settle it

An ablation that removes the SO(n) constraint or the triplet loss and measures whether benchmark success falls to the level of prior discrete quantization methods on identical pretraining data.

Figures

Figures reproduced from arXiv: 2605.13403 by Hangjun Ye, Jiahuan Zhou, Peiyan Li, Qiwei Li, Quanyun Zhou, Xicheng Gong, Xinghang Li, Yadong Mu.

Figure 1. We introduce RotVLA, a Vision-Language-Action framework pretrained with a continuous […]
Figure 2. Illustration of existing LAMs (a) and RotVLA (b). Existing LAMs follow an encode–decode […]
Figure 3. Visualization of real-world tasks and generalization setting.
Figure 4. Illustration that the latent action extracted by one dataset can generalize to other seen and […]
Figure 5. Visualization results of Î_{t+1} and Î′_{t+2}.
Figure 6. The robotic platform used in real-world experiments.
Figure 8. Statistics of the pretraining data used by RotVLA, grouped by dataset and embodiment.
Figure 9. Impact of pretraining data scale on performance.
Original abstract

Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RotVLA, a Vision-Language-Action (VLA) framework that models latent actions as elements of the special orthogonal group SO(n) for continuous and compositional representations. It employs a triplet frame learning objective to enforce temporal dynamics and uses a VLM backbone with a flow-matching action head. Pretrained on large-scale cross-embodiment datasets totaling over 1700 hours, RotVLA reports state-of-the-art success rates of 98.2% on the LIBERO benchmark and 89.6%/88.5% on RoboTwin2.0 under clean and randomized conditions, along with strong real-world manipulation performance.

Significance. If the reported performance gains are attributable to the rotational latent action representation and triplet objective rather than scale or data alone, this work could significantly advance VLA models by introducing a geometrically structured continuous latent space that better aligns with physical action dynamics, potentially improving generalization across embodiments.

major comments (2)
  1. Abstract: The abstract claims that modeling latent actions as SO(n) elements provides continuity, compositionality, and structured geometry while the triplet frame objective avoids degeneration, yet no supporting experiments (e.g., latent interpolation, group composition tests, or ablations against Euclidean latents) are referenced to demonstrate these properties are realized or responsible for the benchmark results.
  2. Results section: The headline performance numbers (98.2% LIBERO, 89.6%/88.5% RoboTwin) are stated without experimental details, baseline comparisons, ablation studies, or error analysis, preventing verification of the contribution of the proposed SO(n) representation over prior discrete LAMs or the flow-matching head.
minor comments (1)
  1. Abstract: The parameter count is given as 1.7B but no breakdown of the VLM backbone versus action head is provided.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have revised the paper to directly address the concerns about supporting evidence for the abstract claims and the level of detail in the results section. All changes are highlighted in the revised version.

Point-by-point responses
  1. Referee: Abstract: The abstract claims that modeling latent actions as SO(n) elements provides continuity, compositionality, and structured geometry while the triplet frame objective avoids degeneration, yet no supporting experiments (e.g., latent interpolation, group composition tests, or ablations against Euclidean latents) are referenced to demonstrate these properties are realized or responsible for the benchmark results.

    Authors: We agree that the abstract would benefit from explicit references to supporting experiments. In the revised manuscript we have added a dedicated latent-space analysis subsection (Section 4.3) that includes: (i) linear interpolation between latent actions demonstrating continuity on the manifold, (ii) explicit SO(n) group composition tests showing that composing two latent actions yields a valid third action that matches the observed transition, and (iii) an ablation replacing the SO(n) representation with an unconstrained Euclidean latent space of identical dimensionality. These experiments are now cited in the abstract and demonstrate that the geometric structure contributes measurably to the reported performance gains beyond scale alone. A minimal sketch of such a composition test is given after this exchange. Revision: yes.

  2. Referee: Results section: The headline performance numbers (98.2% LIBERO, 89.6%/88.5% RoboTwin) are stated without experimental details, baseline comparisons, ablation studies, or error analysis, preventing verification of the contribution of the proposed SO(n) representation over prior discrete LAMs or the flow-matching head.

    Authors: We acknowledge that the original results section was too concise. The revised version now contains: (i) a full experimental protocol subsection detailing training hyperparameters, data splits, and evaluation protocols for both LIBERO and RoboTwin2.0; (ii) expanded baseline tables comparing against all prior discrete LAM-based VLAs and recent flow-matching methods; (iii) systematic ablations that isolate the SO(n) representation, the triplet-frame objective, and the flow-matching head; and (iv) per-task error analysis with failure-mode categorization. These additions allow readers to verify the specific contribution of the rotational latent action design. Revision: yes.
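
A minimal sketch of the kind of group-composition test described in the first response: extract latent rotations for two consecutive transitions and for the combined two-step transition, then check that the composed rotation lands near the directly extracted one under the geodesic distance on SO(n). The encode callable is a hypothetical placeholder for a latent-action encoder; only the distance computation is standard.

```python
import numpy as np
from scipy.linalg import logm

def so_n_distance(r_a: np.ndarray, r_b: np.ndarray) -> float:
    """Geodesic distance on SO(n): Frobenius norm of log(r_a^T r_b)."""
    return float(np.linalg.norm(logm(r_a.T @ r_b), "fro"))

def composition_error(encode, frame_t, frame_t1, frame_t2) -> float:
    """encode(frame_a, frame_b) -> rotation matrix is a hypothetical latent-action encoder.

    Compares the composed two-step latent against the directly extracted one;
    a small value means the latent space behaves like a group under composition.
    """
    z_01 = encode(frame_t, frame_t1)    # latent for t   -> t+1
    z_12 = encode(frame_t1, frame_t2)   # latent for t+1 -> t+2
    z_02 = encode(frame_t, frame_t2)    # latent for t   -> t+2, extracted directly
    return so_n_distance(z_12 @ z_01, z_02)   # convention: rotations act on the left
```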

Circularity Check

0 steps flagged

No circularity: performance claims are empirical outcomes of pretraining and evaluation

Full rationale

The paper reports success rates (98.2% LIBERO, 89.6/88.5% RoboTwin2.0) as results of pretraining a 1.7B VLA model on 1700+ hours of data using a flow-matching head conditioned on SO(n) latent actions and triplet-frame supervision. No equations, derivations, or self-citations are shown that reduce these metrics to fitted parameters, self-defined quantities, or tautological inputs. The geometric properties of SO(n) and the triplet objective are stated as design motivations for continuity and non-degeneracy, without any reduction that makes the reported numbers follow by construction from the modeling choices themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that SO(n) geometry supplies physically meaningful action structure and that the triplet objective prevents collapse; no free parameters or invented entities beyond the representation itself are quantified in the abstract.

axioms (1)
  • Domain assumption: elements of SO(n) provide continuity, compositionality, and geometry aligned with real-world action dynamics (the structure assumed here is stated formally below).
    Invoked to justify the choice of rotational latent actions over discrete codes.
invented entities (1)
  • Rotational latent action in SO(n) (no independent evidence)
    Purpose: to serve as a continuous, composable, and physically structured latent planner.
    New representation introduced to replace discrete quantization pipelines.
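
For reference, the structure this assumption leans on is standard: SO(n) is a compact matrix Lie group, so elements compose, invert, and interpolate without leaving the set.

```latex
% Definition and the group/manifold facts behind the "continuity and compositionality" assumption.
SO(n) \;=\; \bigl\{\, R \in \mathbb{R}^{n \times n} \;:\; R^{\top} R = I,\ \det R = 1 \,\bigr\},
\qquad
R_1 R_2 \in SO(n), \quad R^{-1} = R^{\top} \in SO(n),
\qquad
R(t) \;=\; R_0 \exp\!\bigl(t \log(R_0^{\top} R_1)\bigr) \in SO(n) \ \ \text{for } t \in [0,1].
```

Whether this geometry is also physically meaningful for robot actions is the empirical part of the assumption, and it is what the ablation proposed above would test.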

pith-pipeline@v0.9.0 · 5587 in / 1279 out tokens · 24582 ms · 2026-05-14T17:44:25.288826+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 49 canonical work pages · 24 internal anchors

  1. [1]

    A Survey on Vision-Language-Action Models for Embodied AI

    Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024

  2. [2]

    arXiv preprint arXiv:2505.04769 (2025)

    Ranjan Sapkota, Yang Cao, Konstantinos I Roumeliotis, and Manoj Karkee. Vision-language-action models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025

  3. [3]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  4. [4]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  6. [6]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024

  7. [7]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  8. [8]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  9. [9]

    Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

  10. [10]

    Clam: Continuous latent action models for robot learning from unlabeled demonstrations.arXiv preprint arXiv:2505.04999, 2025

    Anthony Liang, Pavel Czempin, Matthew Hong, Yutai Zhou, Erdem Biyik, and Stephen Tu. Clam: Continuous latent action models for robot learning from unlabeled demonstrations.arXiv preprint arXiv:2505.04999, 2025

  11. [11]

    Softvq-vae: Efficient 1-dimensional continuous tokenizer

    Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28358–28370, 2025

  12. [12]

    Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  13. [13]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  14. [14]

    Towards generalist robot policies: What matters in building vision-language-action models.arXiv preprint arXiv:2412.14058, 2024

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models.arXiv preprint arXiv:2412.14058, 2024

  15. [15]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  16. [16]

    Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models

    Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models. arXiv preprint arXiv:2506.07961, 2025

  17. [17]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  18. [18]

    Vq-vla: Improving vision-language- action models via scaling vector-quantized action tokenizers

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509.09372, 2025

  19. [19]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  20. [20]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π*0.6: a VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025

  21. [21]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization, 2025. URL https://arxiv.org/abs/2504.16054

  22. [22]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  23. [23]

    arXiv preprint arXiv:2511.04555 (2025)

    Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, et al. Evo-1: Lightweight vision-language-action model with preserved semantic alignment.arXiv preprint arXiv:2511.04555, 2025

  24. [24]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  25. [25]

    H-rdt: Human manipulation enhanced bimanual robotic manipulation.arXiv preprint arXiv:2507.23523,

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation.arXiv preprint arXiv:2507.23523, 2025

  26. [26]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  27. [27]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  28. [28]

    Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

  29. [29]

    Moto: Latent motion token as the bridging language for robot manipulation.arXiv preprint arXiv:2412.04445, 8, 2024

    Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for robot manipulation.arXiv preprint arXiv:2412.04445, 8, 2024

  30. [30]

    Latbot: Distilling universal latent actions for vision-language-action models.arXiv preprint arXiv:2511.23034, 2025

    Zuolei Li, Xingyu Gao, Xiaofan Wang, and Jianlong Fu. Latbot: Distilling universal latent actions for vision-language-action models.arXiv preprint arXiv:2511.23034, 2025

  31. [31]

    Como: Learning continuous latent motion from internet videos for scalable robot learning.arXiv preprint arXiv:2505.17006, 2025

    Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Como: Learning continuous latent motion from internet videos for scalable robot learning.arXiv preprint arXiv:2505.17006, 2025

  32. [32]

    Villa-x: enhancing latent action modeling in vision-language-action models,

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025

  33. [33]

    Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos.arXiv preprint arXiv:2601.04061, 2026

    Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, and Yansong Tang. Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos.arXiv preprint arXiv:2601.04061, 2026

  34. [34]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  35. [35]

    Seeing space and motion: Enhancing latent actions with spatial and dynamic awareness for vla.arXiv preprint arXiv:2509.26251, 2025

    Zhejia Cai, Yandan Yang, Xinyuan Chang, Shiyi Liang, Ronghan Chen, Feng Xiong, Mu Xu, and Ruqi Huang. Seeing space and motion: Enhancing latent actions with spatial and dynamic awareness for vla.arXiv preprint arXiv:2509.26251, 2025

  36. [36]

    Laof: Robust latent action learning with optical flow constraints.arXiv preprint arXiv:2511.16407, 2025

    Xizhou Bu, Jiexi Lyu, Fulei Sun, Ruichen Yang, Zhiqiang Ma, and Wei Li. Laof: Robust latent action learning with optical flow constraints.arXiv preprint arXiv:2511.16407, 2025

  37. [37]

    UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    Manish Kumar Govind, Dominick Reilly, Pu Wang, and Srijan Das. Unilact: Depth-aware rgb latent action learning for vision-language-action models.arXiv preprint arXiv:2602.20231, 2026

  38. [38]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  39. [39]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  40. [40]

    Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation.arXiv preprint arXiv:2412.13877, 2024

  41. [41]

    RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

    Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation.arXiv preprint arXiv:2511.17441, 2025

  42. [42]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  43. [43]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  44. [44]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

  45. [45]

    arXiv preprint arXiv:2601.18692 (2026)

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

  46. [46]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  47. [47]

    Spatial-temporal transformer networks for traffic flow forecasting.arXiv preprint arXiv:2001.02908, 2020

    Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-temporal transformer networks for traffic flow forecasting.arXiv preprint arXiv:2001.02908, 2020

  48. [48]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  49. [49]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  50. [50]

    On the continuity of rotation representations in neural networks

    Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019

  51. [51]

    LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

    Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, and Xunliang Cai. Lary: A latent action representation yielding benchmark for generalizable vision-to-action alignment. arXiv preprint arXiv:2604.11689, 2026

  52. [52]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. Inconference on Robot Learning, pages 991–1002. PMLR, 2022

  53. [53]

    URL https://realmanrobotics.com/

    Realman Robotics. URL https://realmanrobotics.com/

  54. [54]

    URL https://global.agilex.ai/products/cobot-magic

    Agilex Robotics. URL https://global.agilex.ai/products/cobot-magic

  55. [55]

    URL https://galaxea-ai.com/products/R1-Lite

    Galaxea. URL https://galaxea-ai.com/products/R1-Lite

  56. [56]

    URL https://www.agibot.com/products/G1

    Agibot. URL https://www.agibot.com/products/G1

  57. [57]

    URL https://airbots.online/mmk2

    Airbot. URL https://airbots.online/mmk2

  58. [58]

    URL https://www.unitree.com/g1/

    Unitree Robotics. URL https://www.unitree.com/g1/

  59. [59]

    URL https://www.tqartisan.com/productDetails?type=A2

    TQ-Artisan. URL https://www.tqartisan.com/productDetails?type=A2

  60. [60]

    URL https://www.universal-robots.com/products/ur5e/

    UR5e Robotics. URL https://www.universal-robots.com/products/ur5e/

  61. [61]

    URL https://franka.de/

    Franka Emika Panda Robotics. URL https://franka.de/

  62. [62]

    URL https://x-humanoid.com/

    Tien Kung Robotics. URL https://x-humanoid.com/

  63. [63]

    Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation.The International Journal of Robotics Research, 44(10-11):1863–1891, 2025

    Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation.The International Journal of Robotics Research, 44(10-11):1863–1891, 2025

  64. [64]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  65. [65]

    Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. InConference on robot learning, pages 651–673. PMLR, 2018

  66. [66]

    On bringing robots home

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home.arXiv preprint arXiv:2311.16098, 2023

  67. [67]

    Fmb: a functional manipulation benchmark for generalizable robotic learning

    Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 44(4):592–606, 2025

  68. [68]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen- Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

  69. [69]

    Mutex: Learning unified policies from multimodal task specifications.arXiv preprint arXiv:2309.14320, 2023

    Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. Mutex: Learning unified policies from multimodal task specifications.arXiv preprint arXiv:2309.14320, 2023

  70. [70]

    Hydra: Hybrid robot actions for imitation learning

    Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Conference on Robot Learning, pages 2113–2133. PMLR, 2023

  71. [71]

    Learning and retrieval from prior data for skill-based imitation learning.arXiv preprint arXiv:2210.11435, 2022

    Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning.arXiv preprint arXiv:2210.11435, 2022

  72. [72]

    Robot learning on the job: Human-in-the-loop autonomy and learning during deployment.The International Journal of Robotics Research, 44(10-11):1727–1742, 2025

    Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment.The International Journal of Robotics Research, 44(10-11):1727–1742, 2025

  73. [73]

    Train offline, test online: A real robot learning benchmark.arXiv preprint arXiv:2306.00942, 2023

    Gaoyue Zhou, Victoria Dean, Mohan Kumar Srirama, Aravind Rajeswaran, Jyothish Pari, Kyle Hatch, Aryan Jain, Tianhe Yu, Pieter Abbeel, Lerrel Pinto, et al. Train offline, test online: A real robot learning benchmark.arXiv preprint arXiv:2306.00942, 2023

  74. [74]

    Grounding language with visual affordances over unstructured data. arXiv preprint arXiv:2210.01911, 2022

    Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. arXiv preprint arXiv:2210.01911, 2022

  75. [75]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. InConference on Robot Learning, pages 879–893. PMLR, 2018

  76. [76]

    Multi-resolution sensing for real-time control with vision-language models

    Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In2nd Workshop on Language and Robot Learning: Language as Grounding, 2023

  77. [77]

    Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

    Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

  78. [78]

    Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. URL https://github.com/clvrai/clvr_ jaco_play_dataset

  79. [79]

    Viola: Imitation learning for vision- based manipulation with object proposal priors

    Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision- based manipulation with object proposal priors. InConference on Robot Learning, pages 1199–1210. PMLR, 2023

  80. [80]

    Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot, 2023

    Xinghao Zhu, Ran Tian, Chenfeng Xu, Mingxiao Huo, Wei Zhan, Masayoshi Tomizuka, and Mingyu Ding. Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot, 2023

Showing first 80 references.