pith. machine review for the scientific record.

arxiv: 2605.10819 · v2 · submitted 2026-05-11 · 💻 cs.RO · cs.AI · cs.CV

Recognition: unknown

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:12 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.CV
keywords latent action models · vision-language-action · flow matching · robot manipulation · algebraic consistency · video priors · policy learning · representation learning

The pith

ALAM turns action-free videos into algebraically consistent latent transitions that flow-matching policies can use directly to raise manipulation success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ALAM to address the shortage of action-labeled robot data by extracting structured priors from abundant action-free videos. It trains on frame triplets to produce latent transitions that remain consistent under composition and reversal, forming a locally additive space grounded by reconstruction. These transitions serve as auxiliary generative targets that are co-generated with actual robot actions inside a joint flow-matching objective. The resulting policies exploit the transition geometry without any need to decode latents back into explicit actions. Experiments show large gains on standard benchmarks and real-world tasks when the pretrained ALAM encoder is frozen and reused.

Core claim

ALAM learns latent transitions from video frame triplets that satisfy reconstruction while obeying composition and reversal consistency, creating a locally additive transition space; when these structured sequences are supplied as auxiliary targets in a joint flow-matching loss with robot actions, the policy can directly exploit the latent geometry to improve generation without latent-to-action decoding.

What carries the argument

Algebraically consistent latent transitions from frame triplets, regularized for composition and reversal to enforce local additivity while remaining reconstruction-grounded.
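
A minimal sketch of what such triplet-based regularization could look like, assuming a relational encoder E(o_i, o_j) and a decoder D(o_i, z) as in Figure 1; the squared-error penalties and the loss weights are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: hypothetical interfaces E (frame pair -> latent transition) and
# D (source frame + transition -> reconstructed frame); loss forms and weights are assumed.
import torch.nn.functional as F

def alam_pretrain_loss(E, D, o_a, o_b, o_c, w_add=1.0, w_rev=1.0):
    # Pairwise latent transitions from the sampled frame triplet (o_a, o_b, o_c).
    z_ab, z_bc, z_ac = E(o_a, o_b), E(o_b, o_c), E(o_a, o_c)
    z_ba = E(o_b, o_a)

    # Reconstruction grounding: decode each target frame from a source frame plus a transition.
    recon = (F.mse_loss(D(o_a, z_ab), o_b)
             + F.mse_loss(D(o_b, z_bc), o_c)
             + F.mse_loss(D(o_a, z_ac), o_c))

    # Composition consistency: composing a->b and b->c should match the direct a->c transition.
    add_consistency = F.mse_loss(z_ab + z_bc, z_ac)

    # Reversal consistency: the backward transition should cancel the forward one.
    rev_consistency = F.mse_loss(z_ba, -z_ab)

    return recon + w_add * add_consistency + w_rev * rev_consistency
```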

If this is right

  • Average success on MetaWorld MT50 rises from 47.9% to 85.0% and on LIBERO from 94.1% to 98.1% when ALAM latents are used as auxiliary targets.
  • Additivity and reversibility errors drop by factors of 25-85 compared with unstructured latent-action models.
  • Long-horizon cumulative reconstruction improves because the transition space is locally additive.
  • Real-world manipulation tasks show consistent gains once the frozen ALAM encoder supplies the structured sequences.
  • The strongest performance lift occurs only when algebraic consistency is combined with joint flow matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency regularizers could be applied to other generative policies that operate in latent spaces, such as diffusion-based planners.
  • Because the transitions are locally additive, they may support zero-shot composition of novel long-horizon behaviors from short video clips (see the sketch after this list).
  • If the additive property holds across domains, the approach might transfer to navigation or multi-agent interaction settings where only passive video is available.
  • Removing the need for explicit latent decoding simplifies the training pipeline and could reduce compounding errors in long sequences.
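
The second bullet can be made concrete with a small sketch of compositional reuse, assuming the same hypothetical E and D interfaces as above; whether this works zero-shot is a conjecture extrapolated from the additivity property, not a result reported in the paper.

```python
# Hypothetical illustration: sum locally additive transitions from short clips, then decode
# long-horizon predictions from a single anchor frame (E and D are assumed interfaces).
import torch

def compose_long_horizon(E, D, clips, anchor):
    """clips: list of (o_start, o_end) frame pairs from short videos; anchor: starting frame."""
    z_total = torch.zeros_like(E(*clips[0]))
    predictions = []
    for o_start, o_end in clips:
        z_total = z_total + E(o_start, o_end)   # local additivity lets transitions accumulate
        predictions.append(D(anchor, z_total))  # predicted frame after the composed transition
    return predictions
```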

Load-bearing premise

The algebraically consistent latent transitions learned from videos supply useful structure that a flow-matching policy can exploit directly without decoding the latents back to actions.
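
One plausible shape for such a joint objective is sketched below: a single velocity network predicts the flow for concatenated latent-transition and action streams along a straight-line path. The network interface, conditioning, and stream weighting are assumptions; the paper's exact equations are not reproduced here.

```python
# Illustrative joint flow-matching step (assumed interfaces and linear interpolation path;
# not the paper's stated formulation).
import torch

def joint_flow_matching_loss(v_theta, context, z_latents, actions, w_z=1.0, w_a=1.0):
    """z_latents: frozen ALAM transition tokens (B, Hz, Dz); actions: robot action chunk (B, Ha, Da)."""
    x1 = torch.cat([z_latents.flatten(1), actions.flatten(1)], dim=-1)  # joint generative target
    x0 = torch.randn_like(x1)                                           # noise endpoint
    t = torch.rand(x1.shape[0], 1, device=x1.device)                    # flow time per sample
    x_t = (1 - t) * x0 + t * x1                                         # straight-line interpolant
    v_target = x1 - x0                                                  # constant target velocity

    v_pred = v_theta(x_t, t, context)            # shared expert conditioned on vision + language
    dz = z_latents.flatten(1).shape[-1]
    loss_z = ((v_pred[:, :dz] - v_target[:, :dz]) ** 2).mean()  # latent-transition stream
    loss_a = ((v_pred[:, dz:] - v_target[:, dz:]) ** 2).mean()  # action stream (the one executed)
    return w_z * loss_z + w_a * loss_a
```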

What would settle it

Train the same flow-matching VLA policy on MetaWorld MT50 using ALAM latents versus an unstructured latent-action baseline; if success rates remain statistically indistinguishable or the consistency ablations produce no drop, the central claim fails.
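
A hedged sketch of how that comparison could be scored: pool per-trial binary outcomes for each policy and run a one-sided two-proportion test. The trial counts in the example call are placeholders, not numbers from the paper.

```python
# Two-proportion z-test for the decisive comparison (placeholder inputs, normal approximation).
import math

def two_proportion_test(k_alam, n_alam, k_base, n_base):
    """k = pooled successes, n = pooled trials across MT50 tasks and seeds."""
    p1, p0 = k_alam / n_alam, k_base / n_base
    p = (k_alam + k_base) / (n_alam + n_base)                 # pooled rate under H0
    se = math.sqrt(p * (1 - p) * (1 / n_alam + 1 / n_base))   # standard error under H0
    z = (p1 - p0) / se
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))          # one-sided upper tail
    return p1, p0, z, p_value

# e.g. two_proportion_test(425, 500, 240, 500)  # 50 tasks x 10 trials per policy (hypothetical)
```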

Figures

Figures reproduced from arXiv: 2605.10819 by Bin Liu, Changjie Wu, De Ma, Dongjie Huo, Feng Xiong, Gang Pan, Haoyun Liu, Jiachen Luo, Mu Xu, Xinyuan Chang, Yandan Yang, Zhejia Cai, Zhiheng Ma, Zuojin Tang.

Figure 1
Figure 1: Left: Pretraining pipeline. From a sampled frame triplet (o_a, o_b, o_c), a relational encoder maps each pair (o_i, o_j) to a continuous latent transition z_i^j, and a decoder reconstructs ô_j from o_i and z_i^j. Beyond conventional pair-based latent action models, ALAM further regularizes the latent space with additivity and reversal consistency losses. Right: Algebraically structured latent action space. U… view at source ↗
Figure 2
Figure 2: Downstream transfer of ALAM. The frozen ALAM encoder turns frame pairs over horizon T:T+H into latent transition tokens, interleaved with action tokens. Conditioned on visual and language context, a shared expert co-generates both streams via K-step joint flow matching with the Gemma backbone; only the action stream is executed on the robot. … view at source ↗
Figure 3
Figure 3: Real-world results on a Piper 6-DoF manipulator. ALAM is compared with π0 and π0.5 on four tasks: insert cylinder, insert cube, stack cup, and fold towel. … view at source ↗
Figure 4
Figure 4: LAM vs. ALAM on the 5% held-out split. (A) Algebraic probes (log y, shaded t≥3k not seen during training): (A.1) additivity error and (A.2) reversibility error, with LAM/ALAM ratios at t=5k shown in red. (B) Per-horizon ∆ from each model’s own k-step score, for direct (top, (a)–(c)) and cumulative (bottom, (d)–(f)) reconstruction. See Sec. 4.1 for definitions and k-step scores. z_a^b = E(o_a, o_b) for the la… view at source ↗
Figure 5
Figure 5: Key frames of real-world rollouts from ALAM. We evaluate on four tasks. Insert Cylinder. The robot grasps a cylinder and inserts it into a matching socket base, with one object and its base on the tabletop per trial; we run 20 trials, and a trial counts as successful if the cylinder is correctly inserted and remains stably seated after release. The success rate is reported on a 0–100 scale, equivalent to 5… view at source ↗
Figure 6
Figure 6: Cross-domain additivity (ground-truth target reference). Each panel has two rows. Source is a trajectory in the source domain; Target is the reconstruction in a different target domain obtained by transferring the inferred latent transitions. Columns: anchor o_a; forward state ô_a^b; forward state ô_a^c; and additive composition ô_a^b + ô_b^c, which should align with ô_a^c if the latent transitions … view at source ↗
Figure 7
Figure 7: Cross-domain additivity (transferred target). Same layout as Figure 6. view at source ↗
read the original abstract

Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ALAM, an Algebraically Consistent Latent Action Model that extracts structured latent transitions from action-free video frame triplets by combining reconstruction with explicit composition and reversal consistency constraints to produce a locally additive latent space. These frozen latents serve as auxiliary targets in a joint flow-matching objective for training vision-language-action policies, allowing the policy to exploit the transition geometry without explicit latent-to-action decoding. Representation probes demonstrate 25-85x reductions in additivity and reversibility errors relative to unstructured baselines, and downstream transfer yields large gains: average success rising from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with further real-world improvements. Ablations attribute the gains to the interaction between algebraic structure and joint flow matching.

Significance. If the gains are reproducible, the work offers a practical route to leverage abundant unlabeled video for VLA improvement by imposing algebraic structure on latent transitions rather than relying solely on reconstruction. The avoidance of latent decoding and the joint flow-matching formulation are pragmatic strengths that could improve data efficiency in robotics. The reported synergy between consistency regularization and policy training, if confirmed, would strengthen the case for structured latent priors in sequential decision-making.

major comments (2)
  1. [Methods (joint flow-matching objective)] The central performance claims rest on the joint flow-matching objective that co-generates latent transitions and robot actions; the exact form of this objective, the weighting between the two generative targets, and how the frozen encoder outputs are injected into the flow network must be specified with equations in the methods section to permit verification of the coupling mechanism.
  2. [Representation probes and ablations] Representation probes report 25-85x error reductions in additivity and reversibility; the precise definitions of these error metrics, the baseline latent-action models, and the long-horizon cumulative reconstruction protocol should be detailed (including any statistical tests) because these numbers are used to validate that the consistency constraints produce usable structure.
minor comments (2)
  1. Ensure that the consistency regularization weights are reported with their chosen values and any sensitivity analysis, as they are listed among the free parameters.
  2. Figure captions and table footnotes should explicitly state the number of evaluation episodes or seeds used for the MetaWorld and LIBERO success rates to support the reported averages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The comments highlight opportunities to improve clarity in the methods and evaluation sections, which we will address by adding the requested details and equations to the revised manuscript.

read point-by-point responses
  1. Referee: [Methods (joint flow-matching objective)] The central performance claims rest on the joint flow-matching objective that co-generates latent transitions and robot actions; the exact form of this objective, the weighting between the two generative targets, and how the frozen encoder outputs are injected into the flow network must be specified with equations in the methods section to permit verification of the coupling mechanism.

    Authors: We agree that the joint flow-matching objective requires explicit mathematical specification for full reproducibility and verification. In the revised manuscript, we will expand the Methods section with a new subsection that presents the complete equations for the joint objective, including the precise weighting coefficients between the latent transition and robot action generation terms, as well as the conditioning mechanism that injects the frozen ALAM encoder outputs into the flow network. This will directly address the coupling mechanism and allow independent verification of the reported performance gains. revision: yes

  2. Referee: [Representation probes and ablations] Representation probes report 25-85x error reductions in additivity and reversibility; the precise definitions of these error metrics, the baseline latent-action models, and the long-horizon cumulative reconstruction protocol should be detailed (including any statistical tests) because these numbers are used to validate that the consistency constraints produce usable structure.

    Authors: We concur that precise definitions and protocols are necessary to substantiate the representation probe results. In the revision, we will augment the Representation Probes and Ablations sections to include: the exact mathematical formulations of the additivity and reversibility error metrics; descriptions of all baseline latent-action models; the full protocol for long-horizon cumulative reconstruction (including sequence lengths and evaluation procedure); and any statistical tests or variance measures (e.g., standard deviations across seeds). These additions will strengthen the evidence that the algebraic consistency constraints yield usable structure. revision: yes
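
For readers who want a concrete picture before the revision appears, one plausible instantiation of the probe metrics is sketched below; the relative-L2 normalization and the encoder interface are assumptions, and the paper's exact definitions may differ.

```python
# Assumed probe definitions: relative L2 errors of composition and reversal in latent space.
import torch

@torch.no_grad()
def algebraic_probes(E, o_a, o_b, o_c, eps=1e-8):
    z_ab, z_bc, z_ac = E(o_a, o_b), E(o_b, o_c), E(o_a, o_c)
    z_ba = E(o_b, o_a)

    # Additivity error: distance of the composed transition from the direct one.
    add_err = torch.norm(z_ab + z_bc - z_ac, dim=-1) / (torch.norm(z_ac, dim=-1) + eps)
    # Reversibility error: distance of the forward+backward round trip from zero.
    rev_err = torch.norm(z_ab + z_ba, dim=-1) / (torch.norm(z_ab, dim=-1) + eps)

    return add_err.mean().item(), rev_err.mean().item()
```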

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper explicitly adds composition and reversal consistency terms as regularization to the reconstruction loss on frame triplets, then freezes the encoder and uses the resulting latent transitions as auxiliary targets in a joint flow-matching objective for downstream VLA policies. Performance gains (47.9% to 85.0% on MetaWorld MT50) are measured on independent held-out robot tasks and real-world manipulation, with separate representation probes and ablations isolating the contribution of the algebraic constraints. No equation reduces a claimed prediction to a fitted input by construction, no load-bearing premise rests solely on self-citation, and no ansatz is smuggled in; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on the assumption that video-derived latent transitions with algebraic structure can serve as effective auxiliary targets for robot policy learning.

free parameters (1)
  • regularization weights for consistency losses
    Weights balancing reconstruction, composition consistency, and reversal consistency in the latent action model training.
axioms (1)
  • domain assumption: Enforcing composition and reversal consistency on latent transitions creates a locally additive transition space suitable for policy generation.
    This is the core assumption that the algebraic properties will transfer to better action generation in VLA models.

pith-pipeline@v0.9.0 · 5640 in / 1315 out tokens · 60387 ms · 2026-05-14T21:12:26.211102+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 31 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  5. [5]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  7. [7]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  8. [8]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  9. [9]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  10. [10]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  11. [11]

    LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning

    Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Jianhao Wang, Alex Yuan Gao, Wenzhe Li, Liang Bin, Chelsea Finn, and Chongjie Zhang. Lapo: Latent-variable advantage-weighted policy optimization for offline reinforcement learning.Advances in Neural Information Processing Systems, 35:36902–36913, 2022

  12. [12]

    IGOR: Image-Goal Representations Are the Atomic Control Units for Foundation Models in Embodied AI

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

  13. [13]

    villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models.arXiv preprint arXiv:2507.23682, 2025

  14. [14]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  15. [15]

    DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

    Zichen J Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, and Lerrel Pinto. Dynamo: In-domain dynamics pretraining for visuo-motor control. Advances in Neural Information Processing Systems, 37:33933–33961, 2024

  16. [16]

    Gpt-3: Its nature, scope, limits, and consequences

    Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and machines, 30(4):681–694, 2020

  17. [17]

    AdaWorld: Learning Adaptable World Models with Latent Actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

  18. [18]

    VLA-0: Building State-of-the-Art VLAs with Zero Modification

    Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

  19. [19]

    Prediction with Action: Visual Policy Learning via Joint Denoising Process

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. Advances in Neural Information Processing Systems, 37:112386–112410, 2024

  20. [20]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

  21. [21]

    π0.5: A Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  22. [22]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

  23. [23]

    Hierarchical Latent Action Model

    Hanjung Kim, Lerrel Pinto, and Seon Joo Kim. Hierarchical latent action model.arXiv preprint arXiv:2603.05815, 2026

  24. [24]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  25. [25]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  26. [26]

    Behavior Generation with Latent Actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions.arXiv preprint arXiv:2403.03181, 2024

  27. [27]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  28. [28]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  29. [29]

    Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

    Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, et al. Evo-1: Lightweight vision-language-action model with preserved semantic alignment.arXiv preprint arXiv:2511.04555, 2025

  30. [30]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  31. [31]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023

  32. [32]

    Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models

    Haoyun Liu, Jianzhuang Zhao, Xinyuan Chang, Tianle Shi, Chuanzhang Meng, Jiayuan Tan, Feng Xiong, Tong Lin, Dongjie Huo, Mu Xu, SongLin Dong, Zhiheng Ma, Yihong Gong, and Sheng Zhong. Neural implicit action fields: From discrete waypoints to continuous functions for vision-language-action models, 2026

  33. [33]

    Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

    Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, and Zongqing Lu. Joint-aligned latent action: Towards scalable vla pretraining in the wild, 2026

  34. [34]

    Meta-World+: An Improved, Standardized, RL Benchmark

    Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, and Pablo Samuel Castro. Meta-world+: An improved, standardized, RL benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  35. [35]

    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  36. [36]

    SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

    Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, et al. Swiftvla: Unlocking spatiotemporal dynamics for lightweight vla models at minimal overhead.arXiv preprint arXiv:2512.00903, 2025

  37. [37]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  38. [38]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  39. [39]

    Masked world models for visual control

    Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. InConference on Robot Learning, pages 1332–1344. PMLR, 2023

  40. [40]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

  41. [41]

    VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026

  42. [42]

    VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making

    Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, and Bin Liu. Vlascd: A visual language action model for simultaneous chatting and decision making. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9223–9243, 2025

  43. [43]

    One Token per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jin, De Ma, Gang Pan, and Bin Liu. One token per frame: Reconsidering visual bandwidth in world models for vla policy, 2026

  44. [44]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  45. [45]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  46. [46]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  47. [47]

    Neural Discrete Representation Learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

  48. [48]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  49. [49]

    Learning Additively Compositional Latent Actions for Embodied AI

    Hangxing Wei, Xiaoyu Chen, Chuheng Zhang, Tim Pearce, Jianyu Chen, Alex Lamb, Li Zhao, and Jiang Bian. Learning additively compositional latent actions for embodied ai.arXiv preprint arXiv:2604.03340, 2026

  50. [50]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023

  51. [51]

    World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

    Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training.arXiv preprint arXiv:2509.24948, 2025

  52. [52]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  53. [53]

    ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

    Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, et al. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning.arXiv preprint arXiv:2602.11236, 2026

  54. [54]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  55. [55]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024

  56. [56]

    What Do Latent Action Models Actually Learn?

    Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn?arXiv preprint arXiv:2506.15691, 2025

  57. [57]

    UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

    Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

  58. [58]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention.arXiv preprint arXiv:2303.16199, 2023

  59. [59]

    The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018

  60. [60]

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint arXiv:2507.04447, 2025

  61. [61]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

  62. [62]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  63. [63]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023