Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

Jiajun Li, Tiecheng Guo, Yifan Ye, Rongyu Zhang, Xiaowei Chi, Qianpu Sun, Ying Li, Yunfan Lou, Yan Huang, Zhihe Lu, Meng Guo, Shanghang Zhang

arxiv: 2606.10040 · v2 · pith:UBZGHTPBnew · submitted 2026-06-08 · 💻 cs.RO

Pith reviewed 2026-06-27 16:02 UTC · model grok-4.3

classification 💻 cs.RO

keywords: world-action models · efficient inference · robot manipulation · future prediction · embodied control · video guidance · real-time deployment · 1B parameters


The pith

Efficient-WAM shows a 1B-parameter world-action model can reach 100 ms latency and 30x speedup by treating coarse video predictions as action guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Efficient-WAM to address the high inference cost of world-action models that couple future visual prediction with action generation. It achieves efficiency through a compact transferred video expert, token-sparse video latents, and asymmetric denoising that spends fewer steps on video than on actions. The design treats future video output strictly as a compact guidance signal rather than a high-fidelity target. Experiments on RoboTwin 2.0 and real manipulation tasks confirm that competitive control performance is retained even with visibly coarse predictions. The resulting 1B-parameter model reaches approximately 100 ms per-chunk latency in physical deployment.

Core claim

Efficient-WAM reduces the cost of future imagination while preserving its control benefit. It improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, the 1B-parameter model reduces per-chunk latency to around 100 ms during physical deployment, a 30x speedup over existing WAMs.

What carries the argument

Asymmetric video-action denoising that allocates fewer sampling steps to video than to actions, combined with token-sparse latents and a compact transferred video expert, to produce low-cost future predictions used only as guidance for actions.
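The step allocation described above can be illustrated with a minimal sketch, assuming a flow-matching style Euler sampler. Everything named here (`euler_integrate`, `sample_asymmetric`, the toy velocity fields, and the step counts 2 and 10) is a hypothetical stand-in, not the paper's implementation. The two-stage arrangement, in which the video latent is denoised first and then conditions the action branch, is also a simplifying assumption; the page only states that video receives fewer sampling steps than actions.

```python
import random


def euler_integrate(x, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with num_steps Euler steps."""
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def sample_asymmetric(video_velocity, action_velocity,
                      video_dim, action_dim, t_v, t_a, seed=0):
    """Denoise the video latent with t_v steps and the actions with t_a steps."""
    if t_v >= t_a:
        raise ValueError("expected fewer video steps than action steps")
    rng = random.Random(seed)
    video = [rng.gauss(0.0, 1.0) for _ in range(video_dim)]
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    # Cheap branch: a few steps yield a coarse future latent.
    video = euler_integrate(video, video_velocity, t_v)
    # Costlier branch: more steps, conditioned on the coarse future latent.
    action = euler_integrate(
        action, lambda a, t: action_velocity(a, t, video), t_a)
    return video, action


# Toy velocity fields (assumptions): each pulls its state toward a target.
def toy_video_velocity(x, t):
    return [0.5 - xi for xi in x]


def toy_action_velocity(a, t, video):
    guide = sum(video) / len(video)  # guidance derived from the coarse video
    return [guide - ai for ai in a]


video, action = sample_asymmetric(toy_video_velocity, toy_action_velocity,
                                  video_dim=8, action_dim=4, t_v=2, t_a=10)
```

In this sketch the video branch costs `t_v` network evaluations and the action branch `t_a`, so shrinking `t_v` cuts the cost of the (typically much larger) video expert without touching the action sampling budget.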

If this is right

The model sustains competitive control performance on RoboTwin 2.0 simulation and real manipulation tasks.
Per-chunk latency drops to around 100 ms in physical robot deployment.
Inference achieves a 30x speedup relative to existing world-action models.
Visible coarseness in predicted futures does not prevent retention of control capability.
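The latency figures above imply some simple arithmetic, sketched here under stated assumptions. The function names are illustrative, and `actions_per_chunk` is a hypothetical value because the page does not give the chunk length. The 100 ms figure is itself approximate in the source.

```python
def implied_baseline_latency_ms(latency_ms, speedup):
    """Latency a baseline would need for the stated speedup to hold."""
    return latency_ms * speedup


def chunk_rate_hz(latency_ms):
    """How many action chunks per second one inference pass can supply."""
    return 1000.0 / latency_ms


def max_action_rate_hz(latency_ms, actions_per_chunk):
    """Upper bound on executed actions per second if every chunk is fully used."""
    return actions_per_chunk * chunk_rate_hz(latency_ms)


baseline_ms = implied_baseline_latency_ms(100, 30)   # 3000 ms per chunk
chunks_per_second = chunk_rate_hz(100)               # 10 chunks per second
# actions_per_chunk=16 is an assumed value, not taken from the paper.
action_rate = max_action_rate_hz(100, actions_per_chunk=16)
```

Read this way, a 30x speedup at roughly 100 ms per chunk corresponds to a baseline near 3 seconds per chunk, which is the gap between open-loop replay and something usable in closed-loop control.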

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The guidance-signal treatment of video prediction may extend to other sensor streams where utility for control matters more than reconstruction quality.
Further reductions in video fidelity could enable still smaller models or higher control frequencies on the same hardware.
Testing on tasks with greater dynamics or longer horizons would check whether the coarse-prediction assumption continues to hold.

Load-bearing premise

Coarse low-fidelity future video predictions remain sufficient to preserve control benefits on the tested tasks.

What would settle it

Ablating or further degrading the video prediction branch on the real-world manipulation tasks and measuring whether action success rates fall below those of non-WAM baselines.
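The comparison described above could be tallied with a small helper like the following sketch. The function names and the outcome lists are hypothetical placeholders for illustration only; they are not results from the paper.

```python
def success_rate(outcomes):
    """Fraction of successful trials in a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


def compare_ablation(full, ablated, baseline):
    """Summarize the full model, the video-ablated model, and a non-WAM baseline."""
    r_full = success_rate(full)
    r_ablated = success_rate(ablated)
    r_baseline = success_rate(baseline)
    return {
        "full": r_full,
        "ablated": r_ablated,
        "baseline": r_baseline,
        # How much success is lost when the video branch is removed.
        "drop_from_ablation": r_full - r_ablated,
        # The settling condition: does the ablated model fall below the baseline?
        "ablated_below_baseline": r_ablated < r_baseline,
    }


# Made-up outcomes for 20 trials each, purely to exercise the helper.
full_model = [True] * 16 + [False] * 4
video_ablated = [True] * 10 + [False] * 10
non_wam_baseline = [True] * 12 + [False] * 8
summary = compare_ablation(full_model, video_ablated, non_wam_baseline)
```

A large `drop_from_ablation` together with `ablated_below_baseline` being true would support the premise that coarse video guidance matters; a near-zero drop would suggest the guidance is incidental.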

Figures

Figures reproduced from arXiv: 2606.10040 by Jiajun Li, Meng Guo, Qianpu Sun, Rongyu Zhang, Shanghang Zhang, Tiecheng Guo, Xiaowei Chi, Yan Huang, Yifan Ye, Ying Li, Yunfan Lou, Zhihe Lu.

**Figure 1.** Overview of Efficient-WAM. Efficient-WAM uses low-cost future imagination to capture task-relevant object and robot dynamics without photorealistic video generation. Compared with prior WAMs, it achieves lower latency and strong task success in simulation and real-world settings. Source: arXiv:2606.10040v2 [cs.RO], 10 Jun 2026.

**Figure 2.** Efficient-WAM architecture. The model utilizes a multiscale video-latent layout where high-resolution current observations and low-resolution future latents are concatenated. A compact video expert and an action expert interact via layer-wise MoT to predict optimal action chunks.

Adjacent paper text, Section 3.2 (Compact Architecture with World-Knowledge Transfer): Large-scale video generation backbones encode rich world priors, yet thei…

**Figure 3.** Real-world manipulation tasks. Evaluation on the Astribot S1 robot covers precise grasping, object transfer, semantic sorting, and bimanual coordination.

Adjacent paper text, Section 4.4 (Ablation Studies): We isolate the contribution of our three video-side efficiency designs on RoboTwin 2.0. Specifically, the resolution and denoising ablations are evaluated progressively on top of our compact structural baseline. To balance computation…

**Figure 4.** Latency–success trade-off of asymmetric denoising on RoboTwin 2.0. Bars denote latency, the line denotes success rate, and the shaded region marks our selected configuration.

Adjacent paper text (Asymmetric Video-Action Denoising): We vary the video denoising budget while keeping the action denoising budget fixed. We denote configurations as [Tv, Ta], where Tv and Ta represent video and action denoising steps, respectively.…

**Figure 5.** Real-world robot setup. Evaluation protocol: Policies are evaluated on the Astribot S1 using 100 training demonstrations and 20 trials per task. Inputs include three RGB views (left/right wrists, head) alongside 31-dimensional joint states. Objects are manually reset to randomized, feasible poses before each run. Performance is assessed via strict binary task-specific criteria (detailed below). Additional…

**Figure 6.** Representative real-world failure cases.

**Figure 7.** Full WAN future prediction examples.

**Figure 8.** Efficient-WAM-RT future prediction examples.

Original abstract

World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model can reduce per-chunk latency to around 100 ms during physical deployment, achieving a 30x speedup over existing WAMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Efficient-WAM delivers a practical 30x latency cut for WAMs via a transferred compact expert, sparse latents, and asymmetric denoising, but the experiments leave open whether the video guidance actually drives the control results.


The main takeaway is that this paper shows how to make world-action models fast enough for physical robots by deliberately making future video prediction cheap and low-fidelity, then using it only as guidance for the action branch. The 1B-parameter model reaches roughly 100 ms per chunk and 30x speedup while claiming competitive control on RoboTwin 2.0 and real manipulation tasks.

What is new is the specific combination: transfer of a compact video expert from WAN-2.2-5B, token-sparse video latents, and asymmetric denoising that gives video fewer steps than actions. Treating prediction as compact guidance rather than photorealistic output is a clear design choice that directly targets the inference barrier.
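To make the token-sparsity idea concrete, here is a minimal sketch assuming that future latents are kept at a lower spatial resolution than the current observation. Average pooling, the grid sizes, and the names `avg_pool_2d` and `token_budget` are illustrative assumptions; the page does not say how the paper produces its low-resolution future latents.

```python
def avg_pool_2d(grid, factor):
    """Downsample a 2D grid of floats by averaging factor x factor blocks."""
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid size must be divisible by the pooling factor")
    pooled = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            block = [grid[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def token_budget(height, width, num_future, factor):
    """Return (dense, sparse) token counts for one current frame plus futures.

    The current observation always keeps full resolution; only the future
    frames are downsampled in the sparse layout.
    """
    current = height * width
    dense = current + num_future * height * width
    sparse = current + num_future * (height // factor) * (width // factor)
    return dense, sparse


coarse = avg_pool_2d([[1.0, 2.0], [3.0, 4.0]], factor=2)
# Assumed sizes: a 16x16 latent grid, 4 future frames, 4x spatial downsampling.
dense_tokens, sparse_tokens = token_budget(16, 16, num_future=4, factor=4)
```

With these assumed sizes the sequence shrinks from 1280 to 320 tokens, and since attention cost grows faster than linearly in sequence length, the saving in compute would be larger than the 4x token reduction suggests.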

The paper does well at focusing on deployment constraints that matter for embodied work. The efficiency numbers, if they hold, address a real limit on using predictive models in closed-loop control.

The soft spot is the experimental support. The abstract states that performance holds despite coarse predictions, but gives no baselines, metrics, statistical details, or error bars. More importantly, there is no indication of ablations that disable or replace the video branch to test whether the guidance signal is required or incidental. If those controls are absent from the full paper, the claim that control benefits are preserved because of the future prediction rests on an assumption rather than direct evidence.

This is for researchers working on real-time robot learning and embodied AI who need models that run on hardware. A reader looking for concrete efficiency tricks in the WAM setting would find the design choices useful even if the results need tighter verification.

It deserves a serious referee because the latency problem is concrete and the proposed changes are specific enough to check. I would send it for review and ask for the missing ablations and full metrics.

Referee Report

2 major / 0 minor

Summary. The paper introduces Efficient-WAM, a 1B-parameter World-Action Model for embodied robot control. It reduces inference cost of future visual prediction via a compact expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric denoising (fewer steps for video than actions), treating predictions as compact guidance signals rather than high-fidelity outputs. Experiments on RoboTwin 2.0 and real-world manipulation tasks are claimed to show maintained competitive control performance despite visibly coarse predictions, with per-chunk latency reduced to ~100 ms (30x speedup over prior WAMs).

Significance. If the empirical claims hold with proper validation, the result would be significant for robotics: it directly addresses the deployment barrier of high-latency WAMs by showing that low-fidelity future prediction can still deliver control benefits at real-time speeds. The approach of asymmetric compute allocation and guidance-only video modeling is a practical contribution to efficient embodied models.

major comments (2)

[Abstract / Experiments] Abstract and experimental results section: the claim that 'experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions' is unsupported by any reported baselines, metrics (e.g., success rate, trajectory error), statistical significance, error bars, or number of trials. This renders the central efficiency claim (30x speedup while preserving control) unverifiable from the provided information.
[Experiments] Experiments / ablation studies: no ablations are described that disable the video prediction branch or replace it with null/constant input to test whether the coarse future predictions contribute measurable guidance signal to action generation versus the action-denoising pathway alone. Given the design choice to deliberately de-emphasize visual fidelity (compact expert, token-sparse latents, fewer denoising steps), this test is load-bearing for the claim that control benefits of future visual prediction are preserved.
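The request for error bars can be made concrete with a sketch of a Wilson score interval for a binomial success rate. This is a standard statistical formula rather than anything taken from the paper; the function name and the example count of 16 successes are assumptions, while the 20 trials per task come from the Figure 5 caption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 16 successes out of the 20 trials used per task.
low, high = wilson_interval(16, 20)
```

For this hypothetical 16/20 result the interval spans roughly 0.58 to 0.92, which shows how wide the uncertainty is at 20 trials and why differences of a few trials between methods are hard to interpret without it.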

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important gaps in experimental reporting that we will address in revision. Below we respond point-by-point to the major comments.


Referee: [Abstract / Experiments] Abstract and experimental results section: the claim that 'experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions' is unsupported by any reported baselines, metrics (e.g., success rate, trajectory error), statistical significance, error bars, or number of trials. This renders the central efficiency claim (30x speedup while preserving control) unverifiable from the provided information.

Authors: We agree that the current manuscript version does not report explicit numerical metrics, baselines, success rates, trajectory errors, trial counts, or statistical details to support the performance claim. The abstract and experiments section rely on qualitative statements and the latency figure without accompanying quantitative tables. To make the central claim verifiable, we will add a results table in the revised manuscript that includes success rates on RoboTwin 2.0, real-world task success percentages, trajectory error metrics, number of trials, error bars, and direct comparisons to prior WAM baselines. This will directly substantiate the statement that competitive control performance is maintained at ~100 ms latency.

Revision: yes

Referee: [Experiments] Experiments / ablation studies: no ablations are described that disable the video prediction branch or replace it with null/constant input to test whether the coarse future predictions contribute measurable guidance signal to action generation versus the action-denoising pathway alone. Given the design choice to deliberately de-emphasize visual fidelity (compact expert, token-sparse latents, fewer denoising steps), this test is load-bearing for the claim that control benefits of future visual prediction are preserved.

Authors: We acknowledge that the manuscript does not contain an ablation that isolates the video prediction branch (e.g., by disabling it or replacing predictions with constant/null input). Such an experiment would directly test whether the coarse video guidance provides measurable benefit beyond the action-denoising pathway alone. We will add this ablation study to the revised experiments section, reporting action performance with and without the video branch on the same tasks to quantify its contribution.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments, not self-referential derivations


The paper introduces Efficient-WAM via architectural choices (compact expert from WAN-2.2-5B, token-sparse latents, asymmetric denoising) and reports empirical results on latency (~100 ms, 30x speedup) and control performance on RoboTwin 2.0 and real tasks. No equations, first-principles derivations, or 'predictions' are presented that reduce by construction to fitted parameters or prior outputs. The central claim that coarse video predictions preserve control benefits is supported by direct experimental comparison rather than any self-definitional or fitted-input mechanism. No load-bearing self-citations or uniqueness theorems appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central efficiency claim rests on the domain assumption that future visual prediction improves control even when low-fidelity; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Future visual prediction provides control benefits in embodied tasks even when predictions are coarse.
Basis: Stated as the motivation for keeping a future branch while reducing its cost.



[1] [1]

S. Wang, J. Shi, Z. Fu, X. He, F. Liu, C. Yang, Y . Zhou, Z. Fei, J. Gong, J. Fu, M. Z. Shou, X. Huang, X. Qiu, and Y .-G. Jiang. World action models: The next frontier in embodied ai. arXiv preprint arXiv:2605.12090, 2026

Pith/arXiv arXiv 2026

[2] [2]

B. Hou, G. Li, J. Jia, T. An, X. Guo, S. Leng, H. Geng, Y . Ze, T. Harada, P. Torr, O. Mees, M. Pollefeys, Z. Liu, J. Wu, P. Abbeel, J. Malik, Y . Du, and J. Yang. World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080, 2026

Pith/arXiv arXiv 2026

[3] [3]

Finn and S

C. Finn and S. Levine. Deep visual foresight for planning robot motion. InProceedings of the IEEE International Conference on Robotics and Automation, 2017

2017

[4] [4]

Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations. In Proceedings of the International Conference on Machine Learning, 2025

2025

[5] [5]

H. Wu, Y . Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InProceedings of the International Conference on Learning Representations, 2024

2024

[6] [6]

Y . Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. InAdvances in Neural Informa- tion Processing Systems, 2023

2023

[7] [7]

S. Li, Y . Gao, D. Sadigh, and S. Song. Unified video action model. InProceedings of Robotics: Science and Systems, 2025

2025

[8] [8]

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026

[9] [9]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026

[10] [10]

T. Yuan, Z. Dong, Y . Liu, and H. Zhao. Fast-W AM: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Pith/arXiv arXiv 2026

[11] [11]

H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y . Feng, C. Xiang, Y . Rong, H. Zhao, H. Liu, Z. Su, L. Ma, H. Su, and J. Zhu. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025

[12] [12]

L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, Y . Shen, and Y . Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026

[13] [13]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Wan Team. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025

[14] [14]

Y . Yue, Y . Wang, B. Kang, Y . Han, S. Wang, S. Song, J. Feng, and G. Huang. DeeR-VLA: Dynamic inference of multimodal large language models for efficient robot execution. In Advances in Neural Information Processing Systems, volume 37, 2024

2024

[15] [15]

Z. Yang, Y . Qi, T. Xie, B. Yu, S. Liu, and M. Li. DySL-VLA: Efficient vision-language-action model inference via dynamic-static layer-skipping for robot manipulation.arXiv preprint arXiv:2602.22896, 2026. 9

arXiv 2026

[16] [16]

Y . Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 32211–32252, 2023

2023

[17] [17]

Prasad, K

A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation. InProceedings of Robotics: Science and Systems, 2024. arXiv:2405.07503

arXiv 2024

[18] [18]

Z. Wang, Z. Li, A. Mandlekar, Z. Xu, J. Fan, Y . Narang, L. Fan, Y . Zhu, Y . Balaji, M. Zhou, M.-Y . Liu, and Y . Zeng. One-step diffusion policy: Fast visuomotor policies via diffusion dis- tillation. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 59770–59791, 2025

2025

[19] [19]

C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. InProceedings of Robotics: Science and Systems, 2025. arXiv:2504.02792

Pith/arXiv arXiv 2025

[20] [20]

H. Luo, W. Zhang, Y . Feng, S. Zheng, H. Xu, C. Xu, Z. Xi, Y . Fu, and Z. Lu. Being-H0.7: A latent World-Action model from egocentric videos.arXiv preprint arXiv:2605.00078, 2026

Pith/arXiv arXiv 2026

[21] [21]

A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, H. Li, J. Li, J. Lv, J. Liu, M. Cao, P. Li, Q. Deng, W. Mei, X. Wang, X. Chen, X. Zhou, Y . Wang, Y . Chang, Y . Li, Y . Zhou, Y . Ye, Z. Liu, and Z. Zhu. GigaWorld-Policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

[22]

J. Guo, Q. Li, P. Li, Z. Chen, N. Sun, Y. Su, H. Wang, Y. Zhang, X. Li, and H. Liu. Unified 4D world action modeling from video priors with asynchronous denoising. arXiv preprint arXiv:2604.26694, 2026.

[23]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In Proceedings of the 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learni...

[24]

Y. Ye, J. Ma, J. Cen, and Z. Lu. Token expand-merge: Training-free token compression for vision-language-action models. arXiv preprint arXiv:2512.09927, 2025.

[25]

W. Guan, Q. Hu, A. Li, and J. Cheng. Efficient vision-language-action models for embodied manipulation: A systematic survey. arXiv preprint arXiv:2510.17111, 2025.

[26]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems, 2023. arXiv:2303.04137.

[27]

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[28]

X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, 2020.

[29]

A. Fan, E. Grave, and A. Joulin. Reducing transformer depth on demand with structured dropout. In Proceedings of the International Conference on Learning Representations, 2020.

[30]

P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. In Proceedings of the International Conference on Learning Representations, 2017.

[31]

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In Proceedings of the International Conference on Learning Representations, 2023.

[32]

X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In Proceedings of the International Conference on Learning Representations, 2023.

[33]

T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems, 2023.

[34]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu, W. Deng, Y. Guo, T. Nian, X. Xie, Q. Chen, K. Su, T. Xu, G. Liu, M. Hu, H.-a. Gao, K. Wang, Z. Liang, Y. Qin, X. Yang, P. Luo, and Y. Mu. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXi...

[35]

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control. In Proceedings of Robotics...

[36]

J. Ye, N. Gao, S. Yang, J. Zheng, Z. Wang, Y. Chen, P. Chen, Y. Chen, S. Liu, and J. Jia. StarVLA-α: Reducing complexity in vision-language-action systems. arXiv preprint arXiv:2604.11757, 2026.

[37]

Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner...

[38]

Y. Yang, S. Zeng, T. Lin, X. Chang, D. Qi, J. Xiao, H. Liu, R. Chen, Y. Chen, D. Huo, F. Xiong, X. Wei, Z. Ma, and M. Xu. ABot-M0: VLA foundation model for robotic manipulation with action manifold learning. arXiv preprint arXiv:2602.11236, 2026.

[39]

W. Wu, F. Lu, Y. Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y. Wang, S. Ma, Y. Ren, K. Zhang, H. Yu, J. Zhao, S. Zhou, Z. Qiu, H. Xiong, Z. Wang, Z. Wang, R. Cheng, Y.-L. Li, Y. Huang, X. Zhu, Y. Shen, and K. Zheng. A pragmatic VLA foundation model. arXiv preprint arXiv:2601.18692, 2026.

[40]

Q. Sun, X. Chi, Y. Rui, Y. Li, K. Ge, J. Li, S. Han, and S. Zhang. Labshield: A multimodal benchmark for safety-critical reasoning and planning in scientific laboratories. arXiv preprint arXiv:2603.11987, 2026.

Appendix A Training Details

Table 4: Training stages and main optimization settings.

Stage Trainable Batch Size LR Video Loss Wt. Action Loss ...
