World Action Models: The Next Frontier in Embodied AI

2026 · cs.RO · arXiv 2605.12090

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
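
The distinction the abstract draws between reactive policies and WAMs can be written out as a short sketch. The notation here is an assumption made for illustration and is not taken from the survey: $o_{\le t}$ denotes past observations, $\ell$ a language instruction, $a_{t:t+H}$ an action chunk over a horizon $H$, and $s_{t+1:t+H}$ the predicted future states (for example, video frames or latents).

```latex
% Reactive VLA policy: a distribution over actions alone.
\pi_\theta\!\left(a_{t:t+H} \mid o_{\le t}, \ell\right)

% World Action Model: a joint distribution over future states and actions.
p_\theta\!\left(s_{t+1:t+H},\, a_{t:t+H} \mid o_{\le t}, \ell\right)

% One possible reading of a Cascaded WAM: predict the future first,
% then decode actions from the predicted future.
p_\theta\!\left(s_{t+1:t+H},\, a_{t:t+H} \mid o_{\le t}, \ell\right)
  = \underbrace{p_\phi\!\left(s_{t+1:t+H} \mid o_{\le t}, \ell\right)}_{\text{world model}}
    \;\underbrace{p_\psi\!\left(a_{t:t+H} \mid s_{t+1:t+H},\, o_{\le t}, \ell\right)}_{\text{action decoder}}
```

The chain rule allows any joint distribution to be factorized this way, so the cascaded form is best read as a statement about architecture (two modules applied in sequence) rather than about the distribution itself. Under the same reading, a Joint WAM would model the left-hand side directly without committing to that ordering.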

representative citing papers

ω-EVA: Envision, Verify, and Act with Latent Interactive World Models

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.

See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Introduces the SII framework and PIWM, which use AIDA and BDI models to predict intent transitions and select from five intervention classes, reporting 0.641 macro F1 with ground-truth state on a new benchmark.

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

cs.RO · 2026-06-08 · unverdicted · novelty 5.0

Efficient-WAM delivers 30x lower latency than prior WAMs at 100 ms per chunk while keeping competitive manipulation performance by treating coarse future video as guidance rather than high-fidelity output.

Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

cs.RO · 2026-06-05 · unverdicted · novelty 5.0

AdaWAM introduces an adaptive router that triggers textual or visual reasoning as needed in world action models, claiming better efficiency and performance than prior embodied policies on simulated and real tasks.

ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning

cs.RO · 2026-05-31 · unverdicted · novelty 5.0

ImagineUAV is a 1.3B-parameter cascaded world-action framework that generates instruction-conditioned future observations via latent video diffusion, infers motions, and applies kinodynamic planning to outperform VLN/VLA baselines in aerial navigation.

SANTS: A State-Adaptive Scheduler for World Action Models

cs.RO · 2026-05-27 · unverdicted · novelty 5.0

SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.

Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

cs.LG · 2026-05-22 · unverdicted · novelty 5.0

A VAE world model trained on embodied exploration develops latent representations aligned with physical geometry, with metrics improving together and collapsing together under high KL regularization.

citing papers explorer

ω-EVA: Envision, Verify, and Act with Latent Interactive World Models

cs.RO · 2026-06-08 · unverdicted · none · ref 33 · internal anchor

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

cs.RO · 2026-06-08 · unverdicted · none · ref 1 · internal anchor

Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

cs.RO · 2026-06-05 · unverdicted · none · ref 1 · internal anchor

ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning

cs.RO · 2026-05-31 · unverdicted · none · ref 15 · internal anchor

SANTS: A State-Adaptive Scheduler for World Action Models

cs.RO · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
