pith. machine review for the scientific record.

arxiv: 2512.13030 · v2 · submitted 2025-12-15 · 💻 cs.CV · cs.LG · cs.RO

Recognition: 3 theorem links

Motus: A Unified Latent Action World Model

Chendong Xiang, Haitian Liu, Hang Su, Hanyu Liu, Hengkai Tan, Hongyan Zhao, Hongzhe Bi, Jun Zhu, Lei Ma, Ruowen Zhao, Shenghao Xie, Shuhe Huang, Yao Feng, Yinze Rong, Zeyuan Wang, Zhizhong Su

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 18:39 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO
keywords unified world model · latent action · robotic tasks · mixture of transformers · optical flow · embodied AI · vision-language-action · world modeling

The pith

A unified latent action world model combines understanding, generation, and control to enhance robotic task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues for building embodied agents as a single unified system rather than relying on separate models for different functions. Motus achieves this by using a Mixture-of-Transformer architecture that incorporates experts for understanding, video generation, and action, along with a scheduler that permits switching between various modeling modes. It further learns latent actions from optical flow in videos and applies a three-phase training process with a layered data structure to support large-scale pretraining on motion data. The paper reports that this unified approach yields improved results on both simulated and physical robot tasks, and reads this as evidence that shared modeling of capabilities and priors benefits downstream applications.

Core claim

The authors propose Motus as a unified latent action world model that integrates understanding, world modeling, and control capabilities. It employs a Mixture-of-Transformer architecture with three experts and a flexible scheduler to handle multiple modes, extracts latent actions using optical flow, and trains via a three-phase pipeline on a six-layer data pyramid. This lets a single model operate as a world model, a vision-language-action model, or other variants, while achieving better performance on robotic tasks than fragmented approaches.
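To make the mode switching concrete: in a UniDiffuser-style scheme each modality carries its own diffusion timestep, and a "mode" is just a pattern of which streams are kept clean (conditions) and which are noised (generation targets). The sketch below illustrates that idea under stated assumptions; the mode table, the step count `T`, and `joint_denoiser` are hypothetical names, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of per-modality timestep scheduling.
import torch

T = 1000  # assumed number of diffusion steps

# Conditioning pattern per mode: "clean" streams are observed inputs (timestep 0),
# "noisy" streams are generated by denoising, "absent" streams are dropped.
MODES = {
    "world_model":      dict(video="noisy", action="clean"),   # predict future video given actions
    "vla":              dict(video="clean", action="noisy"),   # predict actions given observations
    "inverse_dynamics": dict(video="clean", action="noisy"),   # condition on frame pairs, recover the action
    "video_generation": dict(video="noisy", action="absent"),  # text-conditioned video only
    "joint_prediction": dict(video="noisy", action="noisy"),   # generate video and actions together
}

def sample_timesteps(mode: str, batch: int):
    """Per-modality diffusion timesteps implementing the mode's conditioning pattern."""
    rule = MODES[mode]
    def t_for(kind):
        if kind == "clean":
            return torch.zeros(batch, dtype=torch.long)  # given as condition: no noise added
        if kind == "absent":
            return None                                  # stream left out of the batch in this mode
        return torch.randint(1, T, (batch,))             # to be generated: random noise level
    return t_for(rule["video"]), t_for(rule["action"])

# Training would mix modes, e.g. (hypothetical call):
# t_video, t_action = sample_timesteps("joint_prediction", batch_size)
# loss = joint_denoiser(noisy_video, noisy_action, t_video, t_action, text_tokens)
```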

What carries the argument

Mixture-of-Transformer experts for understanding, video generation, and action, paired with optical-flow-derived latent actions and a three-phase training pipeline built on a six-layer data pyramid.
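One plausible reading of the optical-flow latent action, sketched under assumptions rather than taken from the paper: the frame-to-frame flow field is treated as the "pixel-level delta action" and compressed by a small encoder into a latent that stands in for an action label during pretraining. `flow_estimator`, the encoder widths, and `latent_dim` are hypothetical choices.

```python
# Hedged sketch: compress an optical flow field into a latent "delta action".
import torch
import torch.nn as nn

class FlowToLatentAction(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        # Small conv encoder that compresses a 2-channel (dx, dy) flow field
        # into one latent vector used as a pseudo action label.
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        # flow: (B, 2, H, W) optical flow from frame t to frame t+1
        return self.encoder(flow)

# Usage under these assumptions:
# flow = flow_estimator(frame_t, frame_t_plus_1)   # any off-the-shelf flow estimator
# z_action = FlowToLatentAction()(flow)            # latent "delta action" for pretraining
```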

Load-bearing premise

The gains observed are due to the unified architecture and training rather than differences in model size, data quantity, or implementation details.

What would settle it

Running the same benchmarks with a version that uses separate models for each expert or mode but matches the total compute and data used would show whether unification is necessary for the reported benefits.
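A minimal sketch of such a control, with every name and number a placeholder rather than a value from the paper: two arms, one unified and one modular, constrained to roughly equal parameters, data, and compute before running the same benchmark suite.

```python
# Two arms of a matched-resource comparison; all names and numbers are placeholders.
ABLATION_ARMS = [
    {"name": "unified",  # one MoT handling understanding, video generation, and action
     "models": ["single unified model"],
     "params_b": 7.0, "train_tokens_b": 500, "gpu_hours": 10_000},
    {"name": "modular",  # separate models per capability, combined at inference
     "models": ["understanding", "video_generation", "action_policy"],
     "params_b": 7.0, "train_tokens_b": 500, "gpu_hours": 10_000},
]

def resources_match(a: dict, b: dict, tol: float = 0.05) -> bool:
    """Arms are comparable only if the confounders (size, data, compute) agree."""
    keys = ("params_b", "train_tokens_b", "gpu_hours")
    return all(abs(a[k] - b[k]) <= tol * max(a[k], b[k]) for k in keys)

assert resources_match(*ABLATION_ARMS)  # both arms then run the identical benchmark suite
```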

read the original abstract

While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Motus, a unified latent action world model for embodied agents. It introduces a Mixture-of-Transformer (MoT) architecture integrating three experts (understanding, video generation, action), a UniDiffuser-style scheduler for switching between modes (world models, VLA, inverse dynamics, video generation, joint prediction), optical-flow-based latent actions, and a three-phase training pipeline with a six-layer data pyramid for large-scale action pretraining. The central empirical claim is that this unified approach yields superior performance over SOTA baselines: +15% over X-VLA and +45% over Pi0.5 in simulation, and +11–48% in real-world scenarios.

Significance. If the reported gains are shown to stem from the unified MoT + latent-action + multi-phase design rather than unmatched data scale or pretraining volume, the work would demonstrate a practical path toward consolidating fragmented embodied capabilities into a single model that can leverage heterogeneous motion data, with potential downstream benefits for robotic task learning.

major comments (3)
  1. [§4 Experiments] §4 Experiments (and abstract): the headline performance claims (+15% over X-VLA, +45% over Pi0.5 in simulation; +11–48% real-world) are presented without matched-scale or matched-data controls against the cited baselines, without error bars, and without ablations that isolate the MoT experts, UniDiffuser scheduler, or optical-flow latent actions from capacity or data-volume effects; this directly undermines attribution of gains to the unification.
  2. [§3.2 MoT Architecture] §3.2 MoT Architecture: the Mixture-of-Transformer expert routing is described as integrating the three modalities, yet the manuscript supplies no analysis of how routing weights are optimized or whether they introduce task-specific free parameters that could trade off performance across modes (understanding vs. generation vs. action), leaving the 'unified without trade-offs' claim untested (one possible reading of the routing is sketched after these comments).
  3. [§3.3 Training Pipeline] §3.3 Training Pipeline and §3.4 Latent Actions: the three-phase recipe and optical-flow 'pixel-level delta action' extraction are central to the large-scale pretraining claim, but no ablation or sensitivity analysis is provided showing that removing the data pyramid or the optical-flow prior measurably harms downstream task performance; without these, the necessity of the full pipeline cannot be evaluated.
minor comments (2)
  1. [Abstract] Abstract: missing space before parenthesis in 'real-world scenarios(improved by +11~48%)'.
  2. [§3.2] Notation: 'UniDiffuser-style scheduler' is referenced repeatedly but never given an explicit equation or pseudocode; a short formal definition would improve reproducibility.
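For readers unfamiliar with the architecture the comments refer to, the simplest reading of a Mixture-of-Transformers block is sketched below: tokens from all modalities attend jointly while each modality owns its own feed-forward expert, with tokens assigned to experts deterministically by modality id. Whether Motus uses this deterministic assignment or learned routing weights is exactly what major comment 2 asks the authors to document; this is an assumption-laden sketch, not the paper's architecture.

```python
# Sketch of a modality-partitioned transformer block (MoT-style), under assumptions.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, n_experts: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality: 0=understanding, 1=video, 2=action.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token sequence; modality_ids: (B, N) integer expert index per token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)           # global attention: experts still see each other
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for i, expert in enumerate(self.experts):  # deterministic routing by modality id
            mask = modality_ids == i
            out[mask] = expert(h[mask])
        return x + out
```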

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that stronger empirical validation is needed to support the claims of unification benefits. We will make revisions to address the concerns about experimental rigor, including adding error bars, ablations, and analysis of the routing mechanism. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [§4 Experiments] §4 Experiments (and abstract): the headline performance claims (+15% over X-VLA, +45% over Pi0.5 in simulation; +11–48% real-world) are presented without matched-scale or matched-data controls against the cited baselines, without error bars, and without ablations that isolate the MoT experts, UniDiffuser scheduler, or optical-flow latent actions from capacity or data-volume effects; this directly undermines attribution of gains to the unification.

    Authors: We acknowledge the validity of this concern. In the revised manuscript, we will report error bars based on at least three independent runs for all key metrics. We will also include ablation experiments that isolate the contributions of the MoT architecture, the UniDiffuser-style scheduler, and the optical-flow-based latent actions by comparing against variants without these components. For matched data controls, we will add a detailed comparison of the training datasets and scales used in our work versus the baselines, noting that our six-layer data pyramid enables leveraging a broader set of motion data. However, fully retraining the baselines on our exact data distribution is beyond our current computational resources, so we will explicitly discuss this as a limitation while providing the available controls. revision: partial

  2. Referee: [§3.2 MoT Architecture] §3.2 MoT Architecture: the Mixture-of-Transformer expert routing is described as integrating the three modalities, yet the manuscript supplies no analysis of how routing weights are optimized or whether they introduce task-specific free parameters that could trade off performance across modes (understanding vs. generation vs. action), leaving the 'unified without trade-offs' claim untested.

    Authors: We agree that empirical analysis of the routing is essential. We will augment Section 3.2 with details on the routing optimization process, including the loss terms that encourage balanced expert utilization. Additionally, we will provide new experiments showing the distribution of routing weights for different tasks and modes, as well as performance comparisons when using learned routing versus fixed or uniform routing. These results will demonstrate that the MoT does not incur significant trade-offs across understanding, generation, and action capabilities. revision: yes

  3. Referee: [§3.3 Training Pipeline] §3.3 Training Pipeline and §3.4 Latent Actions: the three-phase recipe and optical-flow 'pixel-level delta action' extraction are central to the large-scale pretraining claim, but no ablation or sensitivity analysis is provided showing that removing the data pyramid or the optical-flow prior measurably harms downstream task performance; without these, the necessity of the full pipeline cannot be evaluated.

    Authors: We recognize that ablations are required to validate the pipeline design. In the revised version, we will add sensitivity analyses and ablations in Section 4: specifically, results from training without the full data pyramid (using only subsets of the layers) and without the optical-flow prior (using raw action labels instead). These will be evaluated on the simulation and real-world benchmarks to quantify the performance degradation, thereby supporting the necessity of the proposed three-phase training and latent action extraction. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by external benchmarks

full rationale

The paper proposes Motus as an engineering combination of MoT experts, UniDiffuser scheduler, optical-flow latent actions, and a three-phase training pipeline with data pyramid. All load-bearing claims are performance numbers obtained from held-out simulation and real-robot evaluations against external baselines (X-VLA, Pi0.5). No equations, uniqueness theorems, or first-principles derivations appear; nothing reduces by construction to a fitted parameter or self-citation. Self-citations, if present, are not invoked to justify the central result. The reported gains may or may not be attributable to unification versus scale, but that is a question of experimental controls, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that optical flow supplies usable latent actions and that the MoT plus scheduler can be trained to switch modes without destructive interference; these are domain assumptions rather than derived results.

free parameters (2)
  • expert routing weights in MoT
    Learned parameters that decide how much each expert contributes at each step; fitted during the three-phase training.
  • UniDiffuser-style scheduler parameters
    Control the flexible switching between modeling modes; chosen or fitted as part of the training recipe.
axioms (2)
  • domain assumption Optical flow provides a sufficient pixel-level representation of latent actions for downstream control
    Invoked when the paper states it adopts optical flow to learn latent actions and extract delta actions.
  • domain assumption Pretrained general models can be integrated via MoT without losing their individual capabilities
    Stated when the paper says it leverages existing general pretrained models.
invented entities (2)
  • Mixture-of-Transformer (MoT) no independent evidence
    purpose: Integrate understanding, video generation, and action experts inside one transformer
    New architecture introduced to enable the unified model.
  • latent action from optical flow no independent evidence
    purpose: Provide sharable motion information that replaces explicit action labels for large-scale pretraining
    Core mechanism for extracting pixel-level delta actions.

pith-pipeline@v0.9.0 · 5578 in / 1695 out tokens · 47897 ms · 2026-05-12T18:39:05.774409+00:00 · methodology

discussion (0)


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  4. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  5. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  6. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  7. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  8. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  9. JailWAM: Jailbreaking World Action Models in Robot Control

    cs.RO 2026-04 unverdicted novelty 7.0

    JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

  10. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  11. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO 2026-05 unverdicted novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  12. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  13. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  14. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  15. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  16. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  17. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  18. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  19. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  20. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  21. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  22. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

    cs.RO 2026-04 unverdicted novelty 6.0

    AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.

  23. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  24. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  25. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  26. AttenA+: Rectifying Action Inequality in Robotic Foundation Models

    cs.RO 2026-05 unverdicted novelty 5.0

    AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.

  27. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  28. STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...

  29. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  30. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

  31. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  32. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

  33. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 30 Pith papers · 16 internal anchors

  1. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 5

  2. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. a...

  3. [4]

    Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.CoRR, abs/2409.16283, 2024. 1

  4. [5]

    H-rdt: Human manipulation enhanced bimanual robotic manipulation, 2025

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation, 2025. 1

  5. [6]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025. 3

  6. [7]

    Zero-shot robotic manipulation with pretrained image-editing diffusion models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. CoRR, abs/2310.10639, 2023. 1

  7. [8]

    π0.5: A vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: a vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025. 1, 3, 4, 6

  8. [9]

    Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, and Tim Rocktäschel

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C.Y . Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando...

  9. [10]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025. 3

  10. [11]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025. 1, 3

  11. [12]

    VideoJam: Joint appearance-motion representations for enhanced motion generation in video models.arXiv preprint arXiv:2502.02492, 2025

    Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models. arXiv preprint arXiv:2502.02492, 2025. 3

  12. [13]

    Deep compression autoencoder for efficient high-resolution diffusion models

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 5

  13. [14]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025. 6, 5

  14. [15]

    Moto: Latent motion token as the bridging language for robot manipulation.arXiv preprint arXiv:2412.04445, 8, 2024

    Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos.arXiv preprint arXiv:2412.04445,

  15. [16]

    Action-free reasoning for policy generalization

    Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. In ICRA 2025 Workshop on Foundation Models and Neuro-Symbolic AI for Robotics, 2025. 3

  16. [17]

    Amplify: Actionless motion priors for robot learning from videos

    Jeremy A Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, and Animesh Garg. Amplify: Actionless motion priors for robot learning from videos. arXiv preprint arXiv:2506.14198, 2025. 3

  17. [18]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 3

  18. [19]

    Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023. 1, 3

  19. [20]

    Imitating latent policies from observation

    Ashley Edwards, Himanshu Sahni, Yannick Schroecker, and Charles Isbell. Imitating latent policies from observation. In International conference on machine learning, pages 1755–

  20. [21]

    Vidar: Embodied video diffusion model for generalist manipulation, 2025

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation. arXiv preprint arXiv:2507.12898, 2025. 1, 3

  21. [22]

    Adaworld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. In Forty-second International Conference on Machine Learning, 2025. 3

  22. [23]

    beta-VAE: Learning basic visual concepts with a constrained variational framework

    Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. 3

  23. [24]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video,

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025. 6, 5

  24. [25]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.CoRR, abs/2412.14803, 2024. 1

  25. [26]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning. 1

  26. [27]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learn...

  27. [28]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.CoRR, abs/2503.00200, 2025. 1

  28. [29]

    Dual diffusion for unified image generation and understanding

    Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025. 3

  29. [30]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

    Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. In ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025. 3

  30. [31]

    Rdt-1b: A diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: A diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations. 1, 4, 6, 5

  31. [32]

    F1: A vision-language-action model bridging understanding and generation to actions

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions. arXiv preprint arXiv:2509.06951, 2025. 1, 3

  32. [33]

    Dpflow: Adaptive optical flow estimation with a dual-pyramid framework

    Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar, Xiangyang Ji, and Xu-Cheng Yin. Dpflow: Adaptive optical flow estimation with a dual-pyramid framework. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 17810–17820. Computer Vision Foundation / IEEE, 2025. 5

  33. [34]

    Latent action learning requires supervision in the presence of distractors.arXiv preprint arXiv:2502.00379,

    Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, and Vladislav Kurenkov. Latent action learning requires supervision in the presence of distractors.arXiv preprint arXiv:2502.00379,

  34. [35]

    Unimedvl: Unifying medical multimodal understanding and generation through observation-knowledge-analysis, 2025

    Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, and Junjun He. Unimedvl: Unifying medical multimodal...

  35. [36]

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Poo- ley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alexander Herzog, Alex Irpan, Alexan- der Khazatsky, Anant Rai, Anchit Gupta, Andrew E. Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, An- nie Xie, Anthony Brohan, Ant...

  36. [37]

    Learning what you can do before doing anything

    Oleh Rybkin, Karl Pertsch, Andrew Jaegle, Konstantinos G. Derpanis, and Kostas Daniilidis. Learning what you can do before doing anything. In International Conference on Learning Representations, 2019. 3

  37. [38]

    Learning to act without actions

    Dominik Schmidt and Minqi Jiang. Learning to act without actions. In The Twelfth International Conference on Learning Representations, 2024. 3

  38. [39]

    Anypos: Automated task-agnostic actions for bimanual manipulation, 2025

    Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation, 2025. 1, 5

  39. [40]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 3

  40. [41]

    Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024

    Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.CoRR, abs/2412.15109, 2024. 1

  41. [42]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

  42. [43]

    Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

    Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. 4

  43. [44]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,

  44. [45]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 3

  45. [46]

    Latent policy steering with embodiment-agnostic pretrained world models.arXiv preprint arXiv:2507.13340, 2025

    Yiqi Wang, Mrinal Verghese, and Jeff Schneider. Latent policy steering with embodiment-agnostic pretrained world models.arXiv preprint arXiv:2507.13340, 2025. 3

  46. [47]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025. 3

  47. [48]

    RoboMIND: A multi-embodiment dataset with cross-robot failure demonstrations. https://arxiv.org/abs/2412.13877, December 2024

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024. 6, 5

  48. [49]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025. 3

  49. [50]

    Como: Learning continuous latent motion from internet videos for scalable robot learning.arXiv preprint arXiv:2505.17006, 2025

    Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Como: Learning continuous latent motion from internet videos for scalable robot learning. arXiv preprint arXiv:2505.17006, 2025. 3

  50. [51]

    Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy conditioning

    Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy conditioning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6960–6970, 2025. 3

  51. [52]

    Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809,

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809,

  52. [53]

    Learning interactive real-world simulators

    Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 1

  53. [54]

    Shapellm-omni: A native multimodal llm for 3d generation and understanding.arXiv preprint arXiv:2506.01853, 2025

    Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, and Jun Zhu. Shapellm-omni: A native multimodal llm for 3d generation and understanding.arXiv preprint arXiv:2506.01853, 2025. 3

  54. [55]

    Latent action pretraining from videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations, 2025. 3

  55. [56]

    Video2policy: Scaling up manipulation tasks in simulation through internet videos. CoRR, abs/2502.09886, 2025

    Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, and Pieter Abbeel. Video2policy: Scaling up manipulation tasks in simulation through internet videos. CoRR, abs/2502.09886, 2025. 1

  56. [57]

    Motiontrans: Human VR data enable motion-level learning for robotic manipulation policies

    Chengbo Yuan, Rui Zhou, Mengzhen Liu, Yingdong Hu, Shengjie Wang, Li Yi, Shanghang Zhang, Chuan Wen, and Yang Gao. Motiontrans: Human VR data enable motion-level learning for robotic manipulation policies. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025. 3

  57. [58]

    What do latent action models actually learn?arXiv preprint arXiv:2506.15691, 2025

    Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn?arXiv preprint arXiv:2506.15691, 2025. 3

  58. [59]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023, 2023. 3

  59. [60]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. 1, 3, 4, 6

  60. [61]

    Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models

    Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, et al. Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models. arXiv preprint arXiv:2508.18269, 2025. 3

  61. [62]

    Robodreamer: Learning compositional world models for robot imagination

    Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination. In International Conference on Machine Learning, pages 61885–61896. PMLR, 2024. 1, 3

  62. [63]

    Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation.arXiv preprint arXiv:2501.14729, 2025

    Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation.arXiv preprint arXiv:2501.14729, 2025. 3

  63. [64]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025. 1, 3, 4

  64. [65]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 1
