MotuBrain: An Advanced World Action Model for Robot Control

11 Pith papers cite this work. Polarity classification is still indexing.

abstract

Vision-Language-Action (VLA) models generalize semantically well but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving over 50x speedup over a naive baseline and up to 11 Hz inference. Experimentally, MotuBrain achieves 95.8% and 96.1% average success on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in our WorldArena comparison, and adapts to new humanoid embodiments with only 50 to 100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability.
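The abstract says one model covers five inference modes under a UniDiffuser formulation. In the general UniDiffuser scheme, each modality gets its own diffusion timestep: a modality being generated follows the sampler's timestep, a modality given as conditioning is held at timestep 0 (clean), and a modality to be marginalized out is held at the maximum timestep (pure noise). The sketch below illustrates how that scheme could map onto the five modes. The mode names, the `Role` enum, the `timesteps` helper, and the mode-to-role table are all hypothetical and are not taken from the MotuBrain paper or code; in particular, the paper's "V2A-style action-only inference" may handle the policy case differently.

```python
from enum import Enum

T_MAX = 1.0  # assumption: normalized time, 1.0 = pure noise, 0.0 = clean


class Role(Enum):
    GENERATE = "generate"        # denoised along with the sampler timestep t
    CONDITION = "condition"      # provided clean, timestep held at 0
    MARGINALIZE = "marginalize"  # unknown and ignored, held at pure noise


# Hypothetical mapping: mode -> (role of future video, role of action chunk).
MODES = {
    "policy": (Role.MARGINALIZE, Role.GENERATE),
    "world_model": (Role.GENERATE, Role.CONDITION),
    "video_generation": (Role.GENERATE, Role.MARGINALIZE),
    "inverse_dynamics": (Role.CONDITION, Role.GENERATE),
    "joint": (Role.GENERATE, Role.GENERATE),
}


def timesteps(mode: str, t: float) -> tuple[float, float]:
    """Return (t_video, t_action) to feed a joint denoiser at sampler time t."""

    def pick(role: Role) -> float:
        if role is Role.GENERATE:
            return t
        if role is Role.CONDITION:
            return 0.0
        return T_MAX

    video_role, action_role = MODES[mode]
    return pick(video_role), pick(action_role)


if __name__ == "__main__":
    for name in MODES:
        print(name, timesteps(name, 0.5))
```

Under these assumptions the same network weights serve every mode; only the pair of timesteps passed in at each sampling step changes.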

citation-role summary

background 1

representative citing papers

What Are We Actually Benchmarking in Robot Manipulation?

cs.RO · 2026-06-02 · conditional · novelty 6.0

LIBERO and CALVIN fail multiple proposed diagnostics for shortcut solvability, statistical significance, overfitting, and data dependence, while a tiny 0.09B probe reaches near-SOTA on LIBERO.

Kairos: A Native World Model Stack for Physical AI

cs.AI · 2026-06-15 · unverdicted · novelty 5.0

Kairos is a native world model stack using cross-embodiment pretraining, hybrid linear temporal attention with theoretical error bounds, and deployment-aware co-design, reporting top performance on embodied benchmarks.

SANTS: A State-Adaptive Scheduler for World Action Models

cs.RO · 2026-05-27 · unverdicted · novelty 5.0

SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.

WALL-WM: Carving World Action Modeling at the Event Joints

cs.RO · 2026-06-01 · unverdicted · novelty 4.0

WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

citing papers explorer

  • ω-EVA: Envision, Verify, and Act with Latent Interactive World Models cs.RO · 2026-06-08 · unverdicted · none · ref 32 · internal anchor

    ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.

  • PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 60 · internal anchor

    PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.

  • HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation cs.RO · 2026-06-09 · unverdicted · none · ref 18 · internal anchor

    HiMem-WAM integrates hierarchical latent actions and boundary-aware memory gates into world action models to enhance robustness and performance on memory-dependent long-horizon robotic tasks.

  • SANTS: A State-Adaptive Scheduler for World Action Models cs.RO · 2026-05-27 · unverdicted · none · ref 30 · internal anchor

    SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.

  • WALL-WM: Carving World Action Modeling at the Event Joints cs.RO · 2026-06-01 · unverdicted · none · ref 69 · internal anchor

    WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.

  • World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 111 · internal anchor

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.