hub

MotuBrain: An Advanced World Action Model for Robot Control

· 2026 · cs.RO · arXiv 2604.27792

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

open full Pith review browse 11 citing papers arXiv PDF

abstract

Vision-Language-Action (VLA) models generalize semantically well but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving over 50x speedup over a naive baseline and up to 11 Hz inference. Experimentally, MotuBrain achieves 95.8% and 96.1% average success on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in our WorldArena comparison, and adapts to new humanoid embodiments with only 50--100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

VLMs excel at semantic and grouping tasks while VGMs are stronger on dense geometry and camera motion, with naive fusion yielding balanced representations.

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

RepWAM introduces representation visual-action tokenizers to pretrain world action models that jointly model future visual states and latent actions under instructions for improved robot manipulation.

$\omega$-EVA: Envision, Verify, and Act with Latent Interactive World Models

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.

What Are We Actually Benchmarking in Robot Manipulation?

cs.RO · 2026-06-02 · conditional · novelty 6.0

LIBERO and CALVIN fail multiple proposed diagnostics for shortcut solvability, statistical significance, overfitting, and data dependence, while a tiny 0.09B probe reaches near-SOTA on LIBERO.

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 5.0

PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.

Kairos: A Native World Model Stack for Physical AI

cs.AI · 2026-06-15 · unverdicted · novelty 5.0

Kairos is a native world model stack using cross-embodiment pretraining, hybrid linear temporal attention with theoretical error bounds, and deployment-aware co-design, reporting top performance on embodied benchmarks.

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

cs.CV · 2026-06-10 · unverdicted · novelty 5.0

AGRA is an Action-Grounded Representation Alignment objective that aligns intermediate video diffusion features with semantic representations to make world action model hidden states more useful for low-level robot control, improving localization, affordance, and robustness.

HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation

cs.RO · 2026-06-09 · unverdicted · novelty 5.0

HiMem-WAM integrates hierarchical latent actions and boundary-aware memory gates into world action models to enhance robustness and performance on memory-dependent long-horizon robotic tasks.

SANTS: A State-Adaptive Scheduler for World Action Models

cs.RO · 2026-05-27 · unverdicted · novelty 5.0

SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.

WALL-WM: Carving World Action Modeling at the Event Joints

cs.RO · 2026-06-01 · unverdicted · novelty 4.0

WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

citing papers explorer

Showing 6 of 6 citing papers after filters.

$\omega$-EVA: Envision, Verify, and Act with Latent Interactive World Models cs.RO · 2026-06-08 · unverdicted · none · ref 32 · internal anchor
ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 60 · internal anchor
PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.
HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation cs.RO · 2026-06-09 · unverdicted · none · ref 18 · internal anchor
HiMem-WAM integrates hierarchical latent actions and boundary-aware memory gates into world action models to enhance robustness and performance on memory-dependent long-horizon robotic tasks.
SANTS: A State-Adaptive Scheduler for World Action Models cs.RO · 2026-05-27 · unverdicted · none · ref 30 · internal anchor
SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.
WALL-WM: Carving World Action Modeling at the Event Joints cs.RO · 2026-06-01 · unverdicted · none · ref 69 · internal anchor
WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 111 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

MotuBrain: An Advanced World Action Model for Robot Control

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer