REGEN uses recurrent generative replays from World Action Models to cut catastrophic forgetting by up to 50% in continual imitation learning compared to sequential fine-tuning.
hub
MotuBrain: An Advanced World Action Model for Robot Control
16 Pith papers cite this work. Polarity classification is still indexing.
abstract
Vision-Language-Action (VLA) models generalize semantically well but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving over 50x speedup over a naive baseline and up to 11 Hz inference. Experimentally, MotuBrain achieves 95.8% and 96.1% average success on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in our WorldArena comparison, and adapts to new humanoid embodiments with only 50 to 100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability.
years
2026: 16
roles
background: 2
polarities
background: 2
representative citing papers
VLMs excel at semantic and grouping tasks while VGMs are stronger on dense geometry and camera motion, with naive fusion yielding balanced representations.
Wh0 generates scalable egocentric human manipulation videos with world models and converts them to boost pretrained VLA models' zero-shot dexterous task success from 8.3% to 38.9% on 18 real-world tasks.
RepWAM introduces representation visual-action tokenizers to pretrain world action models that jointly model future visual states and latent actions under instructions for improved robot manipulation.
ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.
LIBERO and CALVIN fail multiple proposed diagnostics for shortcut solvability, statistical significance, overfitting, and data dependence, while a tiny 0.09B probe reaches near-SOTA on LIBERO.
World Value Model (WVM) integrates world models with value estimation to achieve SOTA Value-Order Correlation on expert and suboptimal robotic data and improves downstream policy performance.
PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.
Kairos is a native world model stack using cross-embodiment pretraining, hybrid linear temporal attention with theoretical error bounds, and deployment-aware co-design, reporting top performance on embodied benchmarks.
AGRA is an Action-Grounded Representation Alignment objective that aligns intermediate video diffusion features with semantic representations to make world action model hidden states more useful for low-level robot control, improving localization, affordance, and robustness.
HiMem-WAM integrates hierarchical latent actions and boundary-aware memory gates into world action models to enhance robustness and performance on memory-dependent long-horizon robotic tasks.
SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.
MemoryWAM is a world action model with a hybrid memory design using recent frames, anchor frames, and gist tokens for efficient long-horizon robotic manipulation.
WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.