arxiv: 2603.22078 · v3 · submitted 2026-03-23 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

Do World Action Models Generalize Better than VLAs? A Robustness Study

Zhanguang Zhang , Zhiyuan Li , Behnam Rahmati , Rui Heng Yang , Yintao Ma , Amir Rasouli , Sajjad Pakdamansavoji , Yangzheng Wu

show 6 more authors

Lingfeng Zhang Tongtong Cao Feng Wen Xinyu Wang Xingyue Quan Yingxue Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:36 UTC · model grok-4.3

classification 💻 cs.RO

keywords world action modelsvision language actionrobot generalizationrobustnessvideo pretrainingperturbation evaluationLIBERORoboTwin

0 comments

The pith

World action models reach higher success rates than most VLAs under visual and language perturbations in robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether world action models (WAMs), which adapt video-trained world models to predict robot actions, generalize better than vision-language-action (VLA) models when facing changes in visuals or language. Through tests on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks with added perturbations, it finds that WAMs like Cosmos-Policy achieve 82.2% success on LIBERO-Plus and LingBot-VA reaches 74.2% on RoboTwin 2.0-Plus. VLAs can approach this level only after training on diverse robotic data with multiple objectives, while hybrids fall in between. The findings point to the value of explicit dynamic prediction from large-scale video pretraining for handling real-world variability in robotics.

Core claim

World action models, constructed by training on large video datasets to forecast future states and then decoding their latent states into actions, demonstrate greater robustness to visual and language perturbations than standard VLAs on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks. Specific results include Cosmos-Policy attaining an 82.2% success rate on LIBERO-Plus and LingBot-VA reaching 74.2% on RoboTwin 2.0-Plus. Although certain VLAs such as π0.5 can match this robustness on selected tasks, they generally demand substantial training with varied robotic datasets and learning objectives, whereas hybrid methods that blend video dynamics show intermediate levels of performance.

What carries the argument

World action models (WAMs) that leverage spatiotemporal priors from web-scale video pretraining to predict future states, with their latent representations decoded into robot actions.

Load-bearing premise

The selected visual and language perturbations on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks represent the generalization challenges that are most relevant for real-world robot deployment.

What would settle it

Observing that a VLA trained with additional diverse data surpasses WAM success rates on the same perturbed benchmarks, or that WAM performance drops sharply on a novel perturbation type not tested in the study.

read the original abstract

Robot action planning in the real world is challenging as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA), which repurpose large-scale vision-language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data, exhibiting limited generalization to unseen scenarios and vulnerability to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models that are trained on large corpora of video data to predict future states. With minor adaptations, their latent representation can be decoded into robot actions. It has been suggested that their explicit dynamic prediction capacity, combined with spatiotemporal priors acquired from web-scale video pretraining, enables WAMs to generalize more effectively than VLAs. In this paper, we conduct a comparative study of prominent state-of-the-art VLA policies and recently released WAMs. We evaluate their performance on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations. Our results show that WAMs achieve strong robustness, with LingBot-VA reaching 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy achieving 82.2% on LIBERO-Plus. While VLAs such as $\pi_{0.5}$ can achieve comparable robustness on certain tasks, they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video-based dynamic learning exhibit intermediate robustness, highlighting the importance of how video priors are integrated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WAMs show better robustness than most VLAs on these perturbed benchmarks, but the perturbations themselves are the part that needs more justification.

read the letter

The main thing to know is that this paper runs a head-to-head robustness test on LIBERO-Plus and RoboTwin 2.0-Plus with visual and language perturbations, and the world action models come out ahead on average. LingBot-VA hits 74.2% on RoboTwin 2.0-Plus while Cosmos-Policy reaches 82.2% on LIBERO-Plus; most VLAs drop further unless they were trained on large and varied robot datasets. Hybrids land in the middle, which lines up with the idea that video priors help but only when integrated properly.

Referee Report

2 major / 1 minor

Summary. The manuscript conducts a comparative robustness study of World Action Models (WAMs) versus Vision-Language-Action (VLA) policies on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under visual and language perturbations. It claims that WAMs exhibit stronger generalization than most VLAs due to their video-based dynamic prediction and spatiotemporal priors, reporting specific success rates such as 74.2% for LingBot-VA on RoboTwin 2.0-Plus and 82.2% for Cosmos-Policy on LIBERO-Plus, while noting that certain VLAs (e.g., π_{0.5}) can achieve comparable robustness only after extensive training on diverse robotic data.

Significance. If the central empirical findings hold under representative perturbations, the work would provide evidence favoring world-model-based approaches over standard VLAs for robotic generalization, with implications for designing more robust action policies that leverage large-scale video pretraining.

major comments (2)

[Abstract] Abstract: the reported success rates (LingBot-VA 74.2% on RoboTwin 2.0-Plus; Cosmos-Policy 82.2% on LIBERO-Plus) are stated without error bars, statistical significance tests, or details on perturbation generation and training protocols, which is load-bearing for assessing whether the observed robustness gap is reliable.
[Evaluation] Evaluation (perturbation design): the paper does not demonstrate that the chosen visual and language perturbations fairly represent the distribution of real-world variations (e.g., sensor noise, novel dynamics, lighting shifts, or out-of-vocabulary instructions), which directly undermines the generalization claim since the abstract acknowledges some VLAs reach comparable robustness on certain tasks.

minor comments (1)

[Abstract] Abstract: clarify the exact meaning of the subscript in π_{0.5} and ensure consistent model naming across the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript comparing the robustness of World Action Models and Vision-Language-Action policies. We address each major comment in detail below and indicate the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the reported success rates (LingBot-VA 74.2% on RoboTwin 2.0-Plus; Cosmos-Policy 82.2% on LIBERO-Plus) are stated without error bars, statistical significance tests, or details on perturbation generation and training protocols, which is load-bearing for assessing whether the observed robustness gap is reliable.

Authors: We agree with this observation and will revise the abstract and evaluation sections to include error bars (standard deviations across 3-5 random seeds), details on how perturbations were generated (e.g., specific noise levels for visual perturbations and synonym replacements for language), and summaries of training protocols for each model. We will also add statistical significance tests, such as paired t-tests, to support the reported differences in robustness. revision: yes
Referee: [Evaluation] Evaluation (perturbation design): the paper does not demonstrate that the chosen visual and language perturbations fairly represent the distribution of real-world variations (e.g., sensor noise, novel dynamics, lighting shifts, or out-of-vocabulary instructions), which directly undermines the generalization claim since the abstract acknowledges some VLAs reach comparable robustness on certain tasks.

Authors: While we cannot exhaustively prove that our perturbations cover the entire real-world distribution, we selected them based on established benchmarks and prior robustness studies in robotics to simulate key variations like lighting changes, camera noise, and instruction paraphrasing. In the revision, we will add a dedicated subsection justifying the perturbation choices with references to real-world robotic challenges and include additional experiments with novel dynamics if feasible. We maintain that the study provides valuable comparative insights, particularly noting that VLAs achieving similar robustness require significantly more diverse training data. revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark results

full rationale

The paper conducts a comparative robustness study by measuring success rates of WAMs and VLAs on LIBERO-Plus and RoboTwin 2.0-Plus under fixed visual and language perturbations. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain; claims rest on reported experimental outcomes (e.g., 74.2% and 82.2% success rates) rather than any reduction to inputs by construction. This matches the reader's assessment of score 1.0 and qualifies as a self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical comparative study; no new free parameters, mathematical axioms, or invented entities are introduced beyond standard robotics evaluation assumptions.

axioms (1)

domain assumption Success rate under the applied perturbations measures meaningful generalization
The paper's conclusions rest on this metric reflecting real-world robustness.

pith-pipeline@v0.9.0 · 5678 in / 1122 out tokens · 42038 ms · 2026-05-15T00:36:25.060531+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results show that WAMs achieve strong robustness, with LingBot-VA reaching 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy achieving 82.2% on LIBERO-Plus.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

WAMs... built upon world models that are trained on large corpora of video data to predict future states... explicit dynamic prediction capacity, combined with spatiotemporal priors

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.