pith. machine review for the scientific record.

arxiv: 2605.10819 · v2 · submitted 2026-05-11 · 💻 cs.RO · cs.AI · cs.CV

Recognition: unknown

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:12 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.CV
keywords latent action models · vision-language-action · flow matching · robot manipulation · algebraic consistency · video priors · policy learning · representation learning

The pith

ALAM turns action-free videos into algebraically consistent latent transitions that flow-matching policies can use directly to raise manipulation success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ALAM to address the shortage of action-labeled robot data by extracting structured priors from abundant action-free videos. It trains on frame triplets to produce latent transitions that remain consistent under composition and reversal, forming a locally additive space grounded by reconstruction. These transitions serve as auxiliary generative targets that are co-generated with actual robot actions inside a joint flow-matching objective. The resulting policies exploit the transition geometry without any need to decode latents back into explicit actions. Experiments show large gains on standard benchmarks and real-world tasks when the pretrained ALAM encoder is frozen and reused.

Core claim

ALAM learns latent transitions from video frame triplets that satisfy reconstruction while obeying composition and reversal consistency, creating a locally additive transition space; when these structured sequences are supplied as auxiliary targets in a joint flow-matching loss with robot actions, the policy can directly exploit the latent geometry to improve generation without latent-to-action decoding.

What carries the argument

Algebraically consistent latent transitions from frame triplets, regularized for composition and reversal to enforce local additivity while remaining reconstruction-grounded.
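
A minimal sketch of what such triplet-based regularization could look like, assuming a relational encoder E(o_i, o_j) and a decoder D(o_i, z) as in Figure 1; the squared-error penalties and the loss weights are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: hypothetical interfaces E (frame pair -> latent transition) and
# D (source frame + transition -> reconstructed frame); loss forms and weights are assumed.
import torch.nn.functional as F

def alam_pretrain_loss(E, D, o_a, o_b, o_c, w_add=1.0, w_rev=1.0):
    # Pairwise latent transitions from the sampled frame triplet (o_a, o_b, o_c).
    z_ab, z_bc, z_ac = E(o_a, o_b), E(o_b, o_c), E(o_a, o_c)
    z_ba = E(o_b, o_a)

    # Reconstruction grounding: decode each target frame from a source frame plus a transition.
    recon = (F.mse_loss(D(o_a, z_ab), o_b)
             + F.mse_loss(D(o_b, z_bc), o_c)
             + F.mse_loss(D(o_a, z_ac), o_c))

    # Composition consistency: composing a->b and b->c should match the direct a->c transition.
    add_consistency = F.mse_loss(z_ab + z_bc, z_ac)

    # Reversal consistency: the backward transition should cancel the forward one.
    rev_consistency = F.mse_loss(z_ba, -z_ab)

    return recon + w_add * add_consistency + w_rev * rev_consistency
```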

If this is right

  • Average success on MetaWorld MT50 rises from 47.9% to 85.0% and on LIBERO from 94.1% to 98.1% when ALAM latents are used as auxiliary targets.
  • Additivity and reversibility errors drop by factors of 25-85 compared with unstructured latent-action models.
  • Long-horizon cumulative reconstruction improves because the transition space is locally additive.
  • Real-world manipulation tasks show consistent gains once the frozen ALAM encoder supplies the structured sequences.
  • The strongest performance lift occurs only when algebraic consistency is combined with joint flow matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency regularizers could be applied to other generative policies that operate in latent spaces, such as diffusion-based planners.
  • Because the transitions are locally additive, they may support zero-shot composition of novel long-horizon behaviors from short video clips (see the sketch after this list).
  • If the additive property holds across domains, the approach might transfer to navigation or multi-agent interaction settings where only passive video is available.
  • Removing the need for explicit latent decoding simplifies the training pipeline and could reduce compounding errors in long sequences.
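
The second bullet can be made concrete with a small sketch of compositional reuse, assuming the same hypothetical E and D interfaces as above; whether this works zero-shot is a conjecture extrapolated from the additivity property, not a result reported in the paper.

```python
# Hypothetical illustration: sum locally additive transitions from short clips, then decode
# long-horizon predictions from a single anchor frame (E and D are assumed interfaces).
import torch

def compose_long_horizon(E, D, clips, anchor):
    """clips: list of (o_start, o_end) frame pairs from short videos; anchor: starting frame."""
    z_total = torch.zeros_like(E(*clips[0]))
    predictions = []
    for o_start, o_end in clips:
        z_total = z_total + E(o_start, o_end)   # local additivity lets transitions accumulate
        predictions.append(D(anchor, z_total))  # predicted frame after the composed transition
    return predictions
```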

Load-bearing premise

The algebraically consistent latent transitions learned from videos supply useful structure that a flow-matching policy can exploit directly without decoding the latents back to actions.
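
One plausible shape for such a joint objective is sketched below: a single velocity network predicts the flow for concatenated latent-transition and action streams along a straight-line path. The network interface, conditioning, and stream weighting are assumptions; the paper's exact equations are not reproduced here.

```python
# Illustrative joint flow-matching step (assumed interfaces and linear interpolation path;
# not the paper's stated formulation).
import torch

def joint_flow_matching_loss(v_theta, context, z_latents, actions, w_z=1.0, w_a=1.0):
    """z_latents: frozen ALAM transition tokens (B, Hz, Dz); actions: robot action chunk (B, Ha, Da)."""
    x1 = torch.cat([z_latents.flatten(1), actions.flatten(1)], dim=-1)  # joint generative target
    x0 = torch.randn_like(x1)                                           # noise endpoint
    t = torch.rand(x1.shape[0], 1, device=x1.device)                    # flow time per sample
    x_t = (1 - t) * x0 + t * x1                                         # straight-line interpolant
    v_target = x1 - x0                                                  # constant target velocity

    v_pred = v_theta(x_t, t, context)            # shared expert conditioned on vision + language
    dz = z_latents.flatten(1).shape[-1]
    loss_z = ((v_pred[:, :dz] - v_target[:, :dz]) ** 2).mean()  # latent-transition stream
    loss_a = ((v_pred[:, dz:] - v_target[:, dz:]) ** 2).mean()  # action stream (the one executed)
    return w_z * loss_z + w_a * loss_a
```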

What would settle it

Train the same flow-matching VLA policy on MetaWorld MT50 using ALAM latents versus an unstructured latent-action baseline; if success rates remain statistically indistinguishable or the consistency ablations produce no drop, the central claim fails.
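
A hedged sketch of how that comparison could be scored: pool per-trial binary outcomes for each policy and run a one-sided two-proportion test. The trial counts in the example call are placeholders, not numbers from the paper.

```python
# Two-proportion z-test for the decisive comparison (placeholder inputs, normal approximation).
import math

def two_proportion_test(k_alam, n_alam, k_base, n_base):
    """k = pooled successes, n = pooled trials across MT50 tasks and seeds."""
    p1, p0 = k_alam / n_alam, k_base / n_base
    p = (k_alam + k_base) / (n_alam + n_base)                 # pooled rate under H0
    se = math.sqrt(p * (1 - p) * (1 / n_alam + 1 / n_base))   # standard error under H0
    z = (p1 - p0) / se
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))          # one-sided upper tail
    return p1, p0, z, p_value

# e.g. two_proportion_test(425, 500, 240, 500)  # 50 tasks x 10 trials per policy (hypothetical)
```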

Figures

Figures reproduced from arXiv: 2605.10819 by Bin Liu, Changjie Wu, De Ma, Dongjie Huo, Feng Xiong, Gang Pan, Haoyun Liu, Jiachen Luo, Mu Xu, Xinyuan Chang, Yandan Yang, Zhejia Cai, Zhiheng Ma, Zuojin Tang.

Figure 1
Figure 1: Left: Pretraining pipeline. From a sampled frame triplet (o_a, o_b, o_c), a relational encoder maps each pair (o_i, o_j) to a continuous latent transition z_i^j, and a decoder reconstructs ô_j from o_i and z_i^j. Beyond conventional pair-based latent action models, ALAM further regularizes the latent space with additivity and reversal consistency losses. Right: Algebraically structured latent action space. U… view at source ↗
Figure 2
Figure 2: Downstream transfer of ALAM. The frozen ALAM encoder turns frame pairs over horizon T:T+H into latent transition tokens, interleaved with action tokens. Conditioned on visual and language context, a shared expert co-generates both streams via K-step joint flow matching with the Gemma backbone; only the action stream is executed on the robot. … view at source ↗
Figure 3
Figure 3: Real-world results on a Piper 6-DoF manipulator. ALAM is compared with π0 and π0.5 on four tasks: insert cylinder, insert cube, stack cup, and fold towel. … view at source ↗
Figure 4
Figure 4: LAM vs. ALAM on the 5% held-out split. (A) Algebraic probes (log y, shaded t≥3k not seen during training): (A.1) additivity error and (A.2) reversibility error, with LAM/ALAM ratios at t=5k shown in red. (B) Per-horizon ∆ from each model’s own k-step score, for direct (top, (a)–(c)) and cumulative (bottom, (d)–(f)) reconstruction. See Sec. 4.1 for definitions and k-step scores. z_a^b = E(o_a, o_b) for the la… view at source ↗
Figure 5
Figure 5: Key frames of real-world rollouts from ALAM. We evaluate on four tasks. Insert Cylinder. The robot grasps a cylinder and inserts it into a matching socket base, with one object and its base on the tabletop per trial; we run 20 trials, and a trial counts as successful if the cylinder is correctly inserted and remains stably seated after release. The success rate is reported on a 0–100 scale, equivalent to 5… view at source ↗
Figure 6
Figure 6: Cross-domain additivity (ground-truth target reference). Each panel has two rows. Source is a trajectory in the source domain; Target is the reconstruction in a different target domain obtained by transferring the inferred latent transitions. Columns: anchor o_a; forward state ô_a^b; forward state ô_a^c; and additive composition ô_a^b + ô_b^c, which should align with ô_a^c if the latent transitions … view at source ↗
Figure 7
Figure 7: Cross-domain additivity (transferred target). Same layout as Figure 6. view at source ↗
read the original abstract

Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ALAM, an Algebraically Consistent Latent Action Model that extracts structured latent transitions from action-free video frame triplets by combining reconstruction with explicit composition and reversal consistency constraints to produce a locally additive latent space. These frozen latents serve as auxiliary targets in a joint flow-matching objective for training vision-language-action policies, allowing the policy to exploit the transition geometry without explicit latent-to-action decoding. Representation probes demonstrate 25-85x reductions in additivity and reversibility errors relative to unstructured baselines, and downstream transfer yields large gains: average success rising from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with further real-world improvements. Ablations attribute the gains to the interaction between algebraic structure and joint flow matching.

Significance. If the gains are reproducible, the work offers a practical route to leverage abundant unlabeled video for VLA improvement by imposing algebraic structure on latent transitions rather than relying solely on reconstruction. The avoidance of latent decoding and the joint flow-matching formulation are pragmatic strengths that could improve data efficiency in robotics. The reported synergy between consistency regularization and policy training, if confirmed, would strengthen the case for structured latent priors in sequential decision-making.

major comments (2)
  1. [Methods (joint flow-matching objective)] The central performance claims rest on the joint flow-matching objective that co-generates latent transitions and robot actions; the exact form of this objective, the weighting between the two generative targets, and how the frozen encoder outputs are injected into the flow network must be specified with equations in the methods section to permit verification of the coupling mechanism.
  2. [Representation probes and ablations] Representation probes report 25-85x error reductions in additivity and reversibility; the precise definitions of these error metrics, the baseline latent-action models, and the long-horizon cumulative reconstruction protocol should be detailed (including any statistical tests) because these numbers are used to validate that the consistency constraints produce usable structure.
minor comments (2)
  1. Ensure that the consistency regularization weights are reported with their chosen values and any sensitivity analysis, as they are listed among the free parameters.
  2. Figure captions and table footnotes should explicitly state the number of evaluation episodes or seeds used for the MetaWorld and LIBERO success rates to support the reported averages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The comments highlight opportunities to improve clarity in the methods and evaluation sections, which we will address by adding the requested details and equations to the revised manuscript.

read point-by-point responses
  1. Referee: [Methods (joint flow-matching objective)] The central performance claims rest on the joint flow-matching objective that co-generates latent transitions and robot actions; the exact form of this objective, the weighting between the two generative targets, and how the frozen encoder outputs are injected into the flow network must be specified with equations in the methods section to permit verification of the coupling mechanism.

    Authors: We agree that the joint flow-matching objective requires explicit mathematical specification for full reproducibility and verification. In the revised manuscript, we will expand the Methods section with a new subsection that presents the complete equations for the joint objective, including the precise weighting coefficients between the latent transition and robot action generation terms, as well as the conditioning mechanism that injects the frozen ALAM encoder outputs into the flow network. This will directly address the coupling mechanism and allow independent verification of the reported performance gains. revision: yes

  2. Referee: [Representation probes and ablations] Representation probes report 25-85x error reductions in additivity and reversibility; the precise definitions of these error metrics, the baseline latent-action models, and the long-horizon cumulative reconstruction protocol should be detailed (including any statistical tests) because these numbers are used to validate that the consistency constraints produce usable structure.

    Authors: We concur that precise definitions and protocols are necessary to substantiate the representation probe results. In the revision, we will augment the Representation Probes and Ablations sections to include: the exact mathematical formulations of the additivity and reversibility error metrics; descriptions of all baseline latent-action models; the full protocol for long-horizon cumulative reconstruction (including sequence lengths and evaluation procedure); and any statistical tests or variance measures (e.g., standard deviations across seeds). These additions will strengthen the evidence that the algebraic consistency constraints yield usable structure. revision: yes
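
For readers who want a concrete picture before the revision appears, one plausible instantiation of the probe metrics is sketched below; the relative-L2 normalization and the encoder interface are assumptions, and the paper's exact definitions may differ.

```python
# Assumed probe definitions: relative L2 errors of composition and reversal in latent space.
import torch

@torch.no_grad()
def algebraic_probes(E, o_a, o_b, o_c, eps=1e-8):
    z_ab, z_bc, z_ac = E(o_a, o_b), E(o_b, o_c), E(o_a, o_c)
    z_ba = E(o_b, o_a)

    # Additivity error: distance of the composed transition from the direct one.
    add_err = torch.norm(z_ab + z_bc - z_ac, dim=-1) / (torch.norm(z_ac, dim=-1) + eps)
    # Reversibility error: distance of the forward+backward round trip from zero.
    rev_err = torch.norm(z_ab + z_ba, dim=-1) / (torch.norm(z_ab, dim=-1) + eps)

    return add_err.mean().item(), rev_err.mean().item()
```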

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper explicitly adds composition and reversal consistency terms as regularization to the reconstruction loss on frame triplets, then freezes the encoder and uses the resulting latent transitions as auxiliary targets in a joint flow-matching objective for downstream VLA policies. Performance gains (47.9% to 85.0% on MetaWorld MT50) are measured on independent held-out robot tasks and real-world manipulation, with separate representation probes and ablations isolating the contribution of the algebraic constraints. No equation reduces a claimed prediction to a fitted input by construction, no load-bearing premise rests solely on self-citation, and no ansatz is smuggled in; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on the assumption that video-derived latent transitions with algebraic structure can serve as effective auxiliary targets for robot policy learning.

free parameters (1)
  • regularization weights for consistency losses
    Weights balancing reconstruction, composition consistency, and reversal consistency in the latent action model training.
axioms (1)
  • domain assumption: Enforcing composition and reversal consistency on latent transitions creates a locally additive transition space suitable for policy generation.
    This is the core assumption that the algebraic properties will transfer to better action generation in VLA models.

pith-pipeline@v0.9.0 · 5640 in / 1315 out tokens · 60387 ms · 2026-05-14T21:12:26.211102+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 31 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  5. [5]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  7. [7]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  8. [8]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  9. [9]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  10. [10]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  11. [11]

    LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning

    Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Jianhao Wang, Alex Yuan Gao, Wenzhe Li, Liang Bin, Chelsea Finn, and Chongjie Zhang. Lapo: Latent-variable advantage-weighted policy optimization for offline reinforcement learning.Advances in Neural Information Processing Systems, 35:36902–36913, 2022

  12. [12]

    IGOR: Image-Goal Representations Are the Atomic Control Units for Foundation Models in Embodied AI

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

  13. [13]

    villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models.arXiv preprint arXiv:2507.23682, 2025

  14. [14]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  15. [15]

    DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

    Zichen J Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, and Lerrel Pinto. Dynamo: In-domain dynamics pretraining for visuo-motor control. Advances in Neural Information Processing Systems, 37:33933–33961, 2024

  16. [16]

    Gpt-3: Its nature, scope, limits, and consequences

    Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and machines, 30(4):681–694, 2020

  17. [17]

    AdaWorld: Learning Adaptable World Models with Latent Actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

  18. [18]

    VLA-0: Building State-of-the-Art VLAs with Zero Modification

    Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

  19. [19]

    Prediction with Action: Visual Policy Learning via Joint Denoising Process

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. Advances in Neural Information Processing Systems, 37:112386–112410, 2024

  20. [20]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

  21. [21]

    π0.5: A Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  22. [22]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

  23. [23]

    Hierarchical Latent Action Model

    Hanjung Kim, Lerrel Pinto, and Seon Joo Kim. Hierarchical latent action model.arXiv preprint arXiv:2603.05815, 2026

  24. [24]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  25. [25]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  26. [26]

    Behavior Generation with Latent Actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions.arXiv preprint arXiv:2403.03181, 2024

  27. [27]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  28. [28]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  29. [29]

    Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

    Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, et al. Evo-1: Lightweight vision-language-action model with preserved semantic alignment.arXiv preprint arXiv:2511.04555, 2025

  30. [30]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  31. [31]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023

  32. [32]

    Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models

    Haoyun Liu, Jianzhuang Zhao, Xinyuan Chang, Tianle Shi, Chuanzhang Meng, Jiayuan Tan, Feng Xiong, Tong Lin, Dongjie Huo, Mu Xu, SongLin Dong, Zhiheng Ma, Yihong Gong, and Sheng Zhong. Neural implicit action fields: From discrete waypoints to continuous functions for vision-language-action models, 2026

  33. [33]

    Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

    Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, and Zongqing Lu. Joint-aligned latent action: Towards scalable vla pretraining in the wild, 2026

  34. [34]

    Meta-World+: An Improved, Standardized, RL Benchmark

    Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, and Pablo Samuel Castro. Meta-world+: An improved, standardized, RL benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  35. [35]

    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  36. [36]

    SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

    Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, et al. Swiftvla: Unlocking spatiotemporal dynamics for lightweight vla models at minimal overhead.arXiv preprint arXiv:2512.00903, 2025

  37. [37]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  38. [38]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  39. [39]

    Masked world models for visual control

    Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. InConference on Robot Learning, pages 1332–1344. PMLR, 2023

  40. [40]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

  41. [41]

    VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026

  42. [42]

    VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making

    Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, and Bin Liu. Vlascd: A visual language action model for simultaneous chatting and decision making. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9223–9243, 2025

  43. [43]

    One Token per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jin, De Ma, Gang Pan, and Bin Liu. One token per frame: Reconsidering visual bandwidth in world models for vla policy, 2026

  44. [44]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  45. [45]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  46. [46]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  47. [47]

    Neural Discrete Representation Learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

  48. [48]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  49. [49]

    Learning Additively Compositional Latent Actions for Embodied AI

    Hangxing Wei, Xiaoyu Chen, Chuheng Zhang, Tim Pearce, Jianyu Chen, Alex Lamb, Li Zhao, and Jiang Bian. Learning additively compositional latent actions for embodied ai.arXiv preprint arXiv:2604.03340, 2026

  50. [50]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023

  51. [51]

    World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

    Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training.arXiv preprint arXiv:2509.24948, 2025

  52. [52]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  53. [53]

    ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

    Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, et al. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning.arXiv preprint arXiv:2602.11236, 2026

  54. [54]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  55. [55]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024

  56. [56]

    What Do Latent Action Models Actually Learn?

    Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn?arXiv preprint arXiv:2506.15691, 2025

  57. [57]

    UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

    Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

  58. [58]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention.arXiv preprint arXiv:2303.16199, 2023

  59. [59]

    The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018

  60. [60]

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint arXiv:2507.04447, 2025

  61. [61]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

  62. [62]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  63. [63]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023