pith. machine review for the scientific record.

arxiv: 2512.13030 · v2 · submitted 2025-12-15 · 💻 cs.CV · cs.LG · cs.RO

Recognition: 3 theorem links

Motus: A Unified Latent Action World Model

Chendong Xiang, Haitian Liu, Hang Su, Hanyu Liu, Hengkai Tan, Hongyan Zhao, Hongzhe Bi, Jun Zhu, Lei Ma, Ruowen Zhao, Shenghao Xie, Shuhe Huang, Yao Feng, Yinze Rong, Zeyuan Wang, Zhizhong Su

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 18:39 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO
keywords unified world model · latent action · robotic tasks · mixture of transformers · optical flow · embodied AI · vision-language-action · world modeling

The pith

A unified latent action world model combines understanding, generation, and control to enhance robotic task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues for building embodied agents as a single unified system rather than relying on separate models for different functions. Motus achieves this by using a Mixture-of-Transformer architecture that incorporates experts for understanding, video generation, and action, along with a scheduler that permits switching between various modeling modes. It further learns latent actions from optical flow in videos and applies a three-phase training process with a layered data structure to support large-scale pretraining on motion data. The paper reports that this unified approach yields improved results on both simulated and physical robot tasks, and reads this as evidence that shared modeling of capabilities and priors benefits downstream applications.

Core claim

The authors propose Motus as a unified latent action world model that integrates understanding, world modeling, and control capabilities. It employs a Mixture-of-Transformer architecture with three experts and a flexible scheduler to handle multiple modes, extracts latent actions using optical flow, and trains via a three-phase pipeline on a six-layer data pyramid. This lets a single model operate as a world model, a vision-language-action model, or other variants, while achieving better performance on robotic tasks than fragmented approaches.
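To make the mode switching concrete: in a UniDiffuser-style scheme each modality carries its own diffusion timestep, and a "mode" is just a pattern of which streams are kept clean (conditions) and which are noised (generation targets). The sketch below illustrates that idea under stated assumptions; the mode table, the step count `T`, and `joint_denoiser` are hypothetical names, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of per-modality timestep scheduling.
import torch

T = 1000  # assumed number of diffusion steps

# Conditioning pattern per mode: "clean" streams are observed inputs (timestep 0),
# "noisy" streams are generated by denoising, "absent" streams are dropped.
MODES = {
    "world_model":      dict(video="noisy", action="clean"),   # predict future video given actions
    "vla":              dict(video="clean", action="noisy"),   # predict actions given observations
    "inverse_dynamics": dict(video="clean", action="noisy"),   # condition on frame pairs, recover the action
    "video_generation": dict(video="noisy", action="absent"),  # text-conditioned video only
    "joint_prediction": dict(video="noisy", action="noisy"),   # generate video and actions together
}

def sample_timesteps(mode: str, batch: int):
    """Per-modality diffusion timesteps implementing the mode's conditioning pattern."""
    rule = MODES[mode]
    def t_for(kind):
        if kind == "clean":
            return torch.zeros(batch, dtype=torch.long)  # given as condition: no noise added
        if kind == "absent":
            return None                                  # stream left out of the batch in this mode
        return torch.randint(1, T, (batch,))             # to be generated: random noise level
    return t_for(rule["video"]), t_for(rule["action"])

# Training would mix modes, e.g. (hypothetical call):
# t_video, t_action = sample_timesteps("joint_prediction", batch_size)
# loss = joint_denoiser(noisy_video, noisy_action, t_video, t_action, text_tokens)
```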

What carries the argument

Mixture-of-Transformer experts for understanding, video generation, and action, paired with optical-flow-derived latent actions and a three-phase training pipeline built on a six-layer data pyramid.
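One plausible reading of the optical-flow latent action, sketched under assumptions rather than taken from the paper: the frame-to-frame flow field is treated as the "pixel-level delta action" and compressed by a small encoder into a latent that stands in for an action label during pretraining. `flow_estimator`, the encoder widths, and `latent_dim` are hypothetical choices.

```python
# Hedged sketch: compress an optical flow field into a latent "delta action".
import torch
import torch.nn as nn

class FlowToLatentAction(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        # Small conv encoder that compresses a 2-channel (dx, dy) flow field
        # into one latent vector used as a pseudo action label.
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        # flow: (B, 2, H, W) optical flow from frame t to frame t+1
        return self.encoder(flow)

# Usage under these assumptions:
# flow = flow_estimator(frame_t, frame_t_plus_1)   # any off-the-shelf flow estimator
# z_action = FlowToLatentAction()(flow)            # latent "delta action" for pretraining
```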

Load-bearing premise

The gains observed are due to the unified architecture and training rather than differences in model size, data quantity, or implementation details.

What would settle it

Running the same benchmarks with a version that uses separate models for each expert or mode but matches the total compute and data used would show whether unification is necessary for the reported benefits.
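A minimal sketch of such a control, with every name and number a placeholder rather than a value from the paper: two arms, one unified and one modular, constrained to roughly equal parameters, data, and compute before running the same benchmark suite.

```python
# Two arms of a matched-resource comparison; all names and numbers are placeholders.
ABLATION_ARMS = [
    {"name": "unified",  # one MoT handling understanding, video generation, and action
     "models": ["single unified model"],
     "params_b": 7.0, "train_tokens_b": 500, "gpu_hours": 10_000},
    {"name": "modular",  # separate models per capability, combined at inference
     "models": ["understanding", "video_generation", "action_policy"],
     "params_b": 7.0, "train_tokens_b": 500, "gpu_hours": 10_000},
]

def resources_match(a: dict, b: dict, tol: float = 0.05) -> bool:
    """Arms are comparable only if the confounders (size, data, compute) agree."""
    keys = ("params_b", "train_tokens_b", "gpu_hours")
    return all(abs(a[k] - b[k]) <= tol * max(a[k], b[k]) for k in keys)

assert resources_match(*ABLATION_ARMS)  # both arms then run the identical benchmark suite
```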

read the original abstract

While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Motus, a unified latent action world model for embodied agents. It introduces a Mixture-of-Transformer (MoT) architecture integrating three experts (understanding, video generation, action), a UniDiffuser-style scheduler for switching between modes (world models, VLA, inverse dynamics, video generation, joint prediction), optical-flow-based latent actions, and a three-phase training pipeline with a six-layer data pyramid for large-scale action pretraining. The central empirical claim is that this unified approach yields superior performance over SOTA baselines: +15% over X-VLA and +45% over Pi0.5 in simulation, and +11–48% in real-world scenarios.

Significance. If the reported gains are shown to stem from the unified MoT + latent-action + multi-phase design rather than unmatched data scale or pretraining volume, the work would demonstrate a practical path toward consolidating fragmented embodied capabilities into a single model that can leverage heterogeneous motion data, with potential downstream benefits for robotic task learning.

major comments (3)
  1. [§4 Experiments] §4 Experiments (and abstract): the headline performance claims (+15% over X-VLA, +45% over Pi0.5 in simulation; +11–48% real-world) are presented without matched-scale or matched-data controls against the cited baselines, without error bars, and without ablations that isolate the MoT experts, UniDiffuser scheduler, or optical-flow latent actions from capacity or data-volume effects; this directly undermines attribution of gains to the unification.
  2. [§3.2 MoT Architecture] §3.2 MoT Architecture: the Mixture-of-Transformer expert routing is described as integrating the three modalities, yet the manuscript supplies no analysis of how routing weights are optimized or whether they introduce task-specific free parameters that could trade off performance across modes (understanding vs. generation vs. action), leaving the 'unified without trade-offs' claim untested (one possible reading of the routing is sketched after these comments).
  3. [§3.3 Training Pipeline] §3.3 Training Pipeline and §3.4 Latent Actions: the three-phase recipe and optical-flow 'pixel-level delta action' extraction are central to the large-scale pretraining claim, but no ablation or sensitivity analysis is provided showing that removing the data pyramid or the optical-flow prior measurably harms downstream task performance; without these, the necessity of the full pipeline cannot be evaluated.
minor comments (2)
  1. [Abstract] Abstract: missing space before parenthesis in 'real-world scenarios(improved by +11~48%)'.
  2. [§3.2] Notation: 'UniDiffuser-style scheduler' is referenced repeatedly but never given an explicit equation or pseudocode; a short formal definition would improve reproducibility.
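For readers unfamiliar with the architecture the comments refer to, the simplest reading of a Mixture-of-Transformers block is sketched below: tokens from all modalities attend jointly while each modality owns its own feed-forward expert, with tokens assigned to experts deterministically by modality id. Whether Motus uses this deterministic assignment or learned routing weights is exactly what major comment 2 asks the authors to document; this is an assumption-laden sketch, not the paper's architecture.

```python
# Sketch of a modality-partitioned transformer block (MoT-style), under assumptions.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, n_experts: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality: 0=understanding, 1=video, 2=action.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token sequence; modality_ids: (B, N) integer expert index per token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)           # global attention: experts still see each other
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for i, expert in enumerate(self.experts):  # deterministic routing by modality id
            mask = modality_ids == i
            out[mask] = expert(h[mask])
        return x + out
```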

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that stronger empirical validation is needed to support the claims of unification benefits. We will make revisions to address the concerns about experimental rigor, including adding error bars, ablations, and analysis of the routing mechanism. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [§4 Experiments] §4 Experiments (and abstract): the headline performance claims (+15% over X-VLA, +45% over Pi0.5 in simulation; +11–48% real-world) are presented without matched-scale or matched-data controls against the cited baselines, without error bars, and without ablations that isolate the MoT experts, UniDiffuser scheduler, or optical-flow latent actions from capacity or data-volume effects; this directly undermines attribution of gains to the unification.

    Authors: We acknowledge the validity of this concern. In the revised manuscript, we will report error bars based on at least three independent runs for all key metrics. We will also include ablation experiments that isolate the contributions of the MoT architecture, the UniDiffuser-style scheduler, and the optical-flow-based latent actions by comparing against variants without these components. For matched data controls, we will add a detailed comparison of the training datasets and scales used in our work versus the baselines, noting that our six-layer data pyramid enables leveraging a broader set of motion data. However, fully retraining the baselines on our exact data distribution is beyond our current computational resources, so we will explicitly discuss this as a limitation while providing the available controls. revision: partial

  2. Referee: [§3.2 MoT Architecture] §3.2 MoT Architecture: the Mixture-of-Transformer expert routing is described as integrating the three modalities, yet the manuscript supplies no analysis of how routing weights are optimized or whether they introduce task-specific free parameters that could trade off performance across modes (understanding vs. generation vs. action), leaving the 'unified without trade-offs' claim untested.

    Authors: We agree that empirical analysis of the routing is essential. We will augment Section 3.2 with details on the routing optimization process, including the loss terms that encourage balanced expert utilization. Additionally, we will provide new experiments showing the distribution of routing weights for different tasks and modes, as well as performance comparisons when using learned routing versus fixed or uniform routing. These results will demonstrate that the MoT does not incur significant trade-offs across understanding, generation, and action capabilities. revision: yes

  3. Referee: [§3.3 Training Pipeline] §3.3 Training Pipeline and §3.4 Latent Actions: the three-phase recipe and optical-flow 'pixel-level delta action' extraction are central to the large-scale pretraining claim, but no ablation or sensitivity analysis is provided showing that removing the data pyramid or the optical-flow prior measurably harms downstream task performance; without these, the necessity of the full pipeline cannot be evaluated.

    Authors: We recognize that ablations are required to validate the pipeline design. In the revised version, we will add sensitivity analyses and ablations in Section 4: specifically, results from training without the full data pyramid (using only subsets of the layers) and without the optical-flow prior (using raw action labels instead). These will be evaluated on the simulation and real-world benchmarks to quantify the performance degradation, thereby supporting the necessity of the proposed three-phase training and latent action extraction. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by external benchmarks

full rationale

The paper proposes Motus as an engineering combination of MoT experts, UniDiffuser scheduler, optical-flow latent actions, and a three-phase training pipeline with data pyramid. All load-bearing claims are performance numbers obtained from held-out simulation and real-robot evaluations against external baselines (X-VLA, Pi0.5). No equations, uniqueness theorems, or first-principles derivations appear; nothing reduces by construction to a fitted parameter or self-citation. Self-citations, if present, are not invoked to justify the central result. The reported gains may or may not be attributable to unification versus scale, but that is a question of experimental controls, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that optical flow supplies usable latent actions and that the MoT plus scheduler can be trained to switch modes without destructive interference; these are domain assumptions rather than derived results.

free parameters (2)
  • expert routing weights in MoT
    Learned parameters that decide how much each expert contributes at each step; fitted during the three-phase training.
  • UniDiffuser-style scheduler parameters
    Control the flexible switching between modeling modes; chosen or fitted as part of the training recipe.
axioms (2)
  • domain assumption Optical flow provides a sufficient pixel-level representation of latent actions for downstream control
    Invoked when the paper states it adopts optical flow to learn latent actions and extract delta actions.
  • domain assumption Pretrained general models can be integrated via MoT without losing their individual capabilities
    Stated when the paper says it leverages existing general pretrained models.
invented entities (2)
  • Mixture-of-Transformer (MoT) no independent evidence
    purpose: Integrate understanding, video generation, and action experts inside one transformer
    New architecture introduced to enable the unified model.
  • latent action from optical flow no independent evidence
    purpose: Provide sharable motion information that replaces explicit action labels for large-scale pretraining
    Core mechanism for extracting pixel-level delta actions.

pith-pipeline@v0.9.0 · 5578 in / 1695 out tokens · 47897 ms · 2026-05-12T18:39:05.774409+00:00 · methodology

discussion (0)


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  4. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  5. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  6. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  7. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  8. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  9. JailWAM: Jailbreaking World Action Models in Robot Control

    cs.RO 2026-04 unverdicted novelty 7.0

    JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

  10. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  11. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO 2026-05 unverdicted novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  12. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  13. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  14. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  15. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  16. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  17. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  18. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  19. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  20. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  21. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  22. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

    cs.RO 2026-04 unverdicted novelty 6.0

    AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.

  23. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  24. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  25. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  26. AttenA+: Rectifying Action Inequality in Robotic Foundation Models

    cs.RO 2026-05 unverdicted novelty 5.0

    AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.

  27. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  28. STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...

  29. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  30. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

  31. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  32. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

  33. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 30 Pith papers · 16 internal anchors

  1. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 5

  2. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. a...

  3. [4]

    Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.CoRR, abs/2409.16283, 2024. 1

  4. [5]

    H-rdt: Human manipulation enhanced bimanual robotic manipulation, 2025

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation, 2025. 1

  5. [6]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025. 3

  6. [7]

    Zero-shot robotic manipulation with pretrained image-editing diffusion models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. CoRR, abs/2310.10639, 2023. 1

  7. [8]

    π0.5: A vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: a vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025. 1, 3, 4, 6

  8. [9]

    Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, and Tim Rocktäschel

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C.Y . Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando...

  9. [10]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025. 3

  10. [11]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025. 1, 3

  11. [12]

    VideoJam: Joint appearance-motion representations for enhanced motion generation in video models.arXiv preprint arXiv:2502.02492, 2025

    Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models. arXiv preprint arXiv:2502.02492, 2025. 3

  12. [13]

    Deep compression autoencoder for efficient high-resolution diffusion models

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 5

  13. [14]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025. 6, 5

  14. [15]

    Moto: Latent motion token as the bridging language for robot manipulation.arXiv preprint arXiv:2412.04445, 8, 2024

    Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos.arXiv preprint arXiv:2412.04445,

  15. [16]

    Action-free reasoning for policy generalization

    Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. In ICRA 2025 Workshop on Foundation Models and Neuro-Symbolic AI for Robotics, 2025. 3

  16. [17]

    Amplify: Actionless motion priors for robot learning from videos

    Jeremy A Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, and Animesh Garg. Amplify: Actionless motion priors for robot learning from videos. arXiv preprint arXiv:2506.14198, 2025. 3

  17. [18]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 3

  18. [19]

    Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023. 1, 3

  19. [20]

    Imitating latent policies from observation

    Ashley Edwards, Himanshu Sahni, Yannick Schroecker, and Charles Isbell. Imitating latent policies from observation. In International conference on machine learning, pages 1755–

  20. [21]

    Vidar: Embodied video diffusion model for generalist manipulation, 2025

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation. arXiv preprint arXiv:2507.12898, 2025. 1, 3

  21. [22]

    Adaworld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. In Forty-second International Conference on Machine Learning, 2025. 3

  22. [23]

    beta-VAE: Learning basic visual concepts with a constrained variational framework

    Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. 3

  23. [24]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video,

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025. 6, 5

  24. [25]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.CoRR, abs/2412.14803, 2024. 1

  25. [26]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning. 1

  26. [27]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learn...

  27. [28]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.CoRR, abs/2503.00200, 2025. 1

  28. [29]

    Dual diffusion for unified image generation and understanding

    Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025. 3

  29. [30]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

    Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. In ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025. 3

  30. [31]

    Rdt-1b: A diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: A diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations. 1, 4, 6, 5

  31. [32]

    F1: A vision-language-action model bridging understanding and generation to actions

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions. arXiv preprint arXiv:2509.06951, 2025. 1, 3

  32. [33]

    Dpflow: Adaptive optical flow estimation with a dual-pyramid framework

    Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar, Xiangyang Ji, and Xu-Cheng Yin. Dpflow: Adaptive optical flow estimation with a dual-pyramid framework. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 17810–17820. Computer Vision Foundation / IEEE, 2025. 5

  33. [34]

    Latent action learning requires supervision in the presence of distractors.arXiv preprint arXiv:2502.00379,

    Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, and Vladislav Kurenkov. Latent action learning requires supervision in the presence of distractors.arXiv preprint arXiv:2502.00379,

  34. [35]

    Unimedvl: Unifying medical multimodal understanding and generation through observation-knowledge-analysis, 2025

    Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, and Junjun He. Unimedvl: Unifying medical multimodal...

  35. [36]

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Poo- ley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alexander Herzog, Alex Irpan, Alexan- der Khazatsky, Anant Rai, Anchit Gupta, Andrew E. Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, An- nie Xie, Anthony Brohan, Ant...

  36. [37]

    Learning what you can do before doing anything

    Oleh Rybkin, Karl Pertsch, Andrew Jaegle, Konstantinos G. Derpanis, and Kostas Daniilidis. Learning what you can do before doing anything. In International Conference on Learning Representations, 2019. 3

  37. [38]

    Learning to act without actions

    Dominik Schmidt and Minqi Jiang. Learning to act without actions. In The Twelfth International Conference on Learning Representations, 2024. 3

  38. [39]

    Anypos: Automated task-agnostic actions for bimanual manipulation, 2025

    Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation, 2025. 1, 5

  39. [40]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 3

  40. [41]

    Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024

    Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.CoRR, abs/2412.15109, 2024. 1

  41. [42]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

  42. [43]

    Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

    Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. 4

  43. [44]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,

  44. [45]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 3

  45. [46]

    Latent policy steering with embodiment-agnostic pretrained world models.arXiv preprint arXiv:2507.13340, 2025

    Yiqi Wang, Mrinal Verghese, and Jeff Schneider. Latent policy steering with embodiment-agnostic pretrained world models.arXiv preprint arXiv:2507.13340, 2025. 3

  46. [47]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025. 3

  47. [48]

    RoboMIND: A multi-embodiment dataset with cross-robot failure demonstrations. https://arxiv.org/abs/2412.13877, December 2024

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024. 6, 5

  48. [49]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025. 3

  49. [50]

    Como: Learning continuous latent motion from internet videos for scalable robot learning.arXiv preprint arXiv:2505.17006, 2025

    Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Como: Learning continuous latent motion from internet videos for scalable robot learning. arXiv preprint arXiv:2505.17006, 2025. 3

  50. [51]

    Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy conditioning

    Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy conditioning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6960–6970, 2025. 3

  51. [52]

    Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809,

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809,

  52. [53]

    Learning interactive real-world simulators

    Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 1

  53. [54]

    Shapellm-omni: A native multimodal llm for 3d generation and understanding.arXiv preprint arXiv:2506.01853, 2025

    Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, and Jun Zhu. Shapellm-omni: A native multimodal llm for 3d generation and understanding.arXiv preprint arXiv:2506.01853, 2025. 3

  54. [55]

    Latent action pretraining from videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations, 2025. 3

  55. [56]

    Video2policy: Scaling up manipulation tasks in simulation through internet videos. CoRR, abs/2502.09886, 2025

    Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, and Pieter Abbeel. Video2policy: Scaling up manipulation tasks in simulation through internet videos. CoRR, abs/2502.09886, 2025. 1

  56. [57]

    Motiontrans: Human VR data enable motion-level learning for robotic manipulation policies

    Chengbo Yuan, Rui Zhou, Mengzhen Liu, Yingdong Hu, Shengjie Wang, Li Yi, Shanghang Zhang, Chuan Wen, and Yang Gao. Motiontrans: Human VR data enable motion-level learning for robotic manipulation policies. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025. 3

  57. [58]

    What do latent action models actually learn?arXiv preprint arXiv:2506.15691, 2025

    Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn?arXiv preprint arXiv:2506.15691, 2025. 3

  58. [59]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023, 2023. 3

  59. [60]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. 1, 3, 4, 6

  60. [61]

    Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models

    Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, et al. Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models. arXiv preprint arXiv:2508.18269, 2025. 3

  61. [62]

    Robodreamer: Learning compositional world models for robot imagination

    Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination. In International Conference on Machine Learning, pages 61885–61896. PMLR, 2024. 1, 3

  62. [63]

    Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation.arXiv preprint arXiv:2501.14729, 2025

    Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation.arXiv preprint arXiv:2501.14729, 2025. 3

  63. [64]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025. 1, 3, 4

  64. [65]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 1
