arxiv: 2605.19957 · v1 · pith:UYDOA34Qnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.RO

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Pith reviewed 2026-05-20 06:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords world models · embodied intelligence · long-horizon planning · hybrid navigation-manipulation · disentanglement · diffusion models · world-ego modeling

The pith

Decomposing future embodied predictions into separate world and ego streams improves long-horizon hybrid task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard world models degrade on long-horizon embodied scenarios because they entangle persistent scene regularities with robot-centric, instruction-conditioned dynamics inside a single prediction stream. It introduces World-Ego Modeling as a paradigm that explicitly decomposes future evolution along a boundary defined from motion, semantic, and intention perspectives, then tests post-, pre-, and full disentanglement strategies. The authors instantiate the idea in the World-Ego Model, which pairs an implicit planner with a cascade-parallel mixture-of-experts diffusion generator. A sympathetic reader would care because hybrid tasks that interleave navigation and manipulation are common in real robotics yet remain difficult for current single-stream models. Experiments on a newly constructed benchmark with 125K video clips and 300 multi-turn trajectories show the separated approach reaches state-of-the-art results while holding competitive performance on existing manipulation-only suites.

Core claim

World-Ego Modeling decomposes future evolution into distinct world and ego components by defining their boundary from motion-, semantic-, and intention-based views and by analyzing post-, pre-, and full disentanglement. The paradigm is realized as the World-Ego Model, which couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts diffusion generator. On the introduced HTEWorld benchmark of long-horizon hybrid navigation-manipulation tasks the resulting model outperforms prior world models while remaining competitive on manipulation-only benchmarks.

What carries the argument

The world-ego boundary, defined from motion-, semantic-, and intention-based views, which enables post-, pre-, and full disentanglement of future evolutions inside the World-Ego Model architecture.
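
To make the idea of a world-ego boundary concrete, the toy sketch below splits a frame into a world stream and an ego stream with a binary mask and then recomposes them. This is only an illustration under stated assumptions: the pixel-level mask, the frame representation, and the helper names `split_streams` and `compose` are hypothetical and do not come from the paper, whose WEM uses an implicit planner and a diffusion generator not detailed in this summary.

```python
from typing import List, Tuple

Frame = List[List[float]]  # toy grayscale frame: rows of pixel values
Mask = List[List[int]]     # 1 = ego (robot) pixel, 0 = world (scene) pixel


def split_streams(frame: Frame, ego_mask: Mask) -> Tuple[Frame, Frame]:
    """Split a frame into world-only and ego-only streams (hypothetical helper)."""
    world = [[p * (1 - m) for p, m in zip(row, mrow)]
             for row, mrow in zip(frame, ego_mask)]
    ego = [[p * m for p, m in zip(row, mrow)]
           for row, mrow in zip(frame, ego_mask)]
    return world, ego


def compose(world: Frame, ego: Frame, ego_mask: Mask) -> Frame:
    """Recombine the streams: ego pixels where the mask is 1, world pixels elsewhere."""
    return [
        [e if m else w for w, e, m in zip(wrow, erow, mrow)]
        for wrow, erow, mrow in zip(world, ego, ego_mask)
    ]


# Assumed toy input: a 2x2 frame in which one pixel belongs to the robot.
frame = [[0.1, 0.2], [0.3, 0.4]]
ego_mask = [[0, 1], [0, 0]]

world, ego = split_streams(frame, ego_mask)
reconstructed = compose(world, ego, ego_mask)
print(reconstructed == frame)
```

The point of the sketch is that each stream can be predicted separately and merged afterwards; how the boundary is actually obtained (motion, semantic, or intention cues) is left open here.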

If this is right

  • WEM reaches state-of-the-art results on HTEWorld, evaluated over its 300 multi-turn hybrid trajectories.
  • The same model remains competitive with prior methods on existing manipulation-only benchmarks.
  • Explicit separation of persistent scene regularities from instruction-conditioned robot dynamics reduces performance degradation over long interleaved navigation-manipulation sequences.
  • The three disentanglement strategies (post, pre, full) provide measurable trade-offs for modeling hybrid task evolution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boundary definitions could be tested on purely navigation or purely manipulation datasets to check whether hybrid-task gains generalize.
  • Isolating world from ego dynamics may reduce error accumulation in downstream planners that rely on multi-step rollouts.
  • Real-robot experiments with sensor noise and physical dynamics would reveal whether the video-clip improvements transfer outside simulation.

Load-bearing premise

The world-ego boundary defined from motion-, semantic-, and intention-based views permits effective post-, pre-, and full disentanglement that improves long-horizon hybrid task modeling.

What would settle it

A controlled ablation on HTEWorld in which removing the motion-semantic-intention boundary definitions and the associated disentanglement strategies produces equal or higher performance on long-horizon hybrid trajectories would falsify the central claim.
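
One way to organize such a controlled ablation is to enumerate every boundary-view and strategy pair while keeping the rest of the configuration constant, plus a single-stream control with no boundary at all. The sketch below is a hypothetical experiment grid, not the authors' protocol; the configuration keys and the `ablation_grid` helper are assumptions introduced for illustration.

```python
from itertools import product

# The three boundary views and three disentanglement strategies named in the paper.
BOUNDARY_VIEWS = ("motion", "semantic", "intention")
STRATEGIES = ("post", "pre", "full")

# Hypothetical settings shared by every run, so that only the two factors vary.
FIXED = {"generator": "CP-MoE", "planner": "implicit"}


def ablation_grid():
    """Enumerate one entangled control run plus every view/strategy combination."""
    runs = [{"boundary": None, "strategy": None, **FIXED}]  # single-stream control
    for view, strategy in product(BOUNDARY_VIEWS, STRATEGIES):
        runs.append({"boundary": view, "strategy": strategy, **FIXED})
    return runs


runs = ablation_grid()
for run in runs:
    print(run)
```

If the control run matched or beat the nine disentangled runs on long-horizon trajectories, the falsification condition described above would be met.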

Figures

Figures reproduced from arXiv: 2605.19957 by Jianhui Zhang, Peidong Jia, Shanghang Zhang, Xiaoguang Zhao, Xingyu Chen, Zuyao Lin.

Figure 1: Conceptual illustration of World-Ego Modeling. Embodied evolution involves two heterogeneous dynamics: the world ...
Figure 2: Three perspectives of the world-ego definition. The motion-based view separates the world and ego by the source of visual ...
Figure 3: General framework of World-Ego Modeling. Our framework contains two stages. (a) The prediction stage uses a vision ...
Figure 4: Overview of the World-Ego Model (WEM). WEM instantiates World-Ego Modeling with a semantic-based world-ego view ...
Figure 5: Qualitative comparison on HTEWorld. Given the same initial observation and five-step instruction sequence, each model ...
Figure 6: Statistics of the proposed HTEWorld benchmark. HTEWorld provides large-scale training clips and multi-turn evaluation ...
Figure 7: Mask-guided visualization of role expert specialization. We visualize the world expert and ego expert outputs in WEM ...

Original abstract

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.
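
The benchmark figures in the abstract imply rough per-item averages, which the short calculation below works out. It uses only the numbers quoted above; because the frame and instruction counts are given as "over", the results are best read as lower bounds rather than exact statistics.

```python
# Counts quoted in the abstract. Frames and instructions are stated as "over",
# so the derived averages are lower bounds, not exact values.
clips = 125_000
frames = 4_500_000
trajectories = 300
instructions = 2_000

frames_per_clip = frames / clips                          # average clip length
instructions_per_trajectory = instructions / trajectories  # average turns per trajectory

print(f"at least ~{frames_per_clip:.0f} frames per clip")
print(f"at least ~{instructions_per_trajectory:.1f} instructions per trajectory")
```

So a typical training clip spans at least about 36 frames, and an evaluation trajectory carries at least about 6.7 instructions on average.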

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes World-Ego Modeling, a new paradigm that decomposes future evolution in embodied tasks into separate world (persistent, instruction-agnostic scene regularities) and ego (robot-centric, instruction-conditioned dynamics) components to mitigate entanglement in long-horizon hybrid navigation-manipulation scenarios. It defines the world-ego boundary from motion-, semantic-, and intention-based views and analyzes post-, pre-, and full disentanglement strategies. The paradigm is instantiated as the World-Ego Model (WEM), which couples an implicit planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. The authors also introduce the HTEWorld benchmark with 125K video clips (over 4.5M frames) and 300 multi-turn evaluation trajectories. The experiments report state-of-the-art performance on HTEWorld and competitive results on existing manipulation-only benchmarks.

Significance. If the reported gains are causally attributable to the world-ego boundary definitions and disentanglement strategies rather than the CP-MoE generator capacity alone, the work would provide a meaningful conceptual and technical advance for long-horizon embodied world modeling in hybrid tasks. The HTEWorld benchmark addresses a clear gap in evaluating interleaved navigation-manipulation behaviors and could serve as a community resource. The three-view boundary analysis offers a structured way to separate persistent vs. dynamic elements that may generalize beyond the specific instantiation.

major comments (2)
  1. [Experiments] Experiments section: No ablations are described that hold the CP-MoE diffusion generator and planner architecture fixed while toggling only the disentanglement strategies (post-/pre-/full) or the boundary definitions (motion-/semantic-/intention-based). This is load-bearing for the central claim, as the abstract and paradigm definition locate the long-horizon improvements in the world-ego separation, yet the new generator's training dynamics and capacity could independently explain SOTA on the author-constructed HTEWorld benchmark.
  2. [Method / Benchmark] §3 (Method) and HTEWorld construction: The paper does not report controls or statistics showing that the 300 multi-turn evaluation trajectories were selected independently of the proposed WEM (e.g., via held-out human demonstrations or cross-validation against baseline generators). Without this, it remains possible that benchmark construction interacts with the model design, weakening the claim that WEM's disentanglement drives the hybrid-task gains.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'fine-grained action annotations' is used without specifying the action vocabulary size, annotation protocol, or inter-annotator agreement for the 125K clips; adding one sentence would improve reproducibility.
  2. [Method] Notation: The three boundary views and the mapping from disentanglement strategy to the implicit planner outputs are introduced conceptually but lack a compact equation or diagram reference early in the method; a small table summarizing the three strategies would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential value of the world-ego paradigm and HTEWorld benchmark. We address each major comment below.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: No ablations are described that hold the CP-MoE diffusion generator and planner architecture fixed while toggling only the disentanglement strategies (post-/pre-/full) or the boundary definitions (motion-/semantic-/intention-based). This is load-bearing for the central claim, as the abstract and paradigm definition locate the long-horizon improvements in the world-ego separation, yet the new generator's training dynamics and capacity could independently explain SOTA on the author-constructed HTEWorld benchmark.

    Authors: We agree that stronger isolation of the disentanglement strategies is needed to support the central claim. The current manuscript reports performance across post-, pre-, and full disentanglement as well as the three boundary definitions, but these comparisons do not hold the CP-MoE generator and planner completely fixed across all variants. We will add dedicated ablation experiments in the revision that fix the generator architecture and planner while varying only the disentanglement strategies and boundary views. This will more directly attribute gains to the world-ego separation. revision: yes

  2. Referee: [Method / Benchmark] §3 (Method) and HTEWorld construction: The paper does not report controls or statistics showing that the 300 multi-turn evaluation trajectories were selected independently of the proposed WEM (e.g., via held-out human demonstrations or cross-validation against baseline generators). Without this, it remains possible that benchmark construction interacts with the model design, weakening the claim that WEM's disentanglement drives the hybrid-task gains.

    Authors: The 300 multi-turn trajectories were drawn from a held-out portion of the HTEWorld data that was not used to train WEM or any baseline models. However, the original manuscript does not include explicit statistics, cross-validation details, or controls demonstrating independence from the WEM design. We will revise the benchmark construction section to add these details, including data splits and selection criteria. revision: yes
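
A minimal version of the split control requested here is a disjointness check between training and evaluation identifiers. The sketch below is hypothetical: the identifier scheme and the `check_split_independence` helper are assumptions, and the paper's actual split procedure is not described in this summary.

```python
def check_split_independence(train_ids, eval_ids):
    """Return the IDs present in both splits; an empty set means no direct overlap."""
    return set(train_ids) & set(eval_ids)


# Hypothetical scene identifiers, purely for illustration.
train_ids = ["scene_001", "scene_002", "scene_003"]
eval_ids = ["scene_101", "scene_102"]

overlap = check_split_independence(train_ids, eval_ids)
print("overlapping ids:", sorted(overlap))
```

A check like this only rules out direct reuse of the same items. It does not address the referee's broader concern that trajectories might have been selected in a way that favors the proposed model.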

Circularity Check

0 steps flagged

No circularity: paradigm, disentanglement strategies, and HTEWorld benchmark introduced as independent contributions with external evaluation.

full rationale

The paper defines a new World-Ego Modeling paradigm by decomposing future evolution into world and ego components using motion-, semantic-, and intention-based boundary views, then analyzes post-/pre-/full disentanglement strategies before instantiating WEM as an implicit planner coupled to a CP-MoE diffusion generator. It further constructs the HTEWorld benchmark with 125K clips and 300 trajectories for evaluation. No step reduces a claimed prediction or first-principles result to its own inputs by construction; the performance claims rest on empirical results on the new benchmark and existing manipulation suites rather than self-referential fitting or self-citation chains. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only view limits visibility; main reliance is on the domain assumption that world-ego disentanglement yields measurable gains in hybrid scenarios.

axioms (1)
  • Domain assumption: Future evolution in embodied tasks can be usefully decomposed into instruction-agnostic world regularities and instruction-conditioned ego dynamics.
    Foundational premise stated in the opening of the abstract.
invented entities (1)
  • World-Ego Model (WEM) with CP-MoE diffusion generator (no independent evidence)
    Purpose: Unified model coupling implicit planner and cascade-parallel mixture-of-experts for disentangled generation.
    New architecture introduced to instantiate the paradigm.

pith-pipeline@v0.9.0 · 5801 in / 1245 out tokens · 33058 ms · 2026-05-20T06:32:41.502464+00:00 · methodology

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 19 internal anchors

  1. [1]

    Recurrent world models facilitate policy evolution

    David Ha and J ¨urgen Schmidhuber. Recurrent world models facilitate policy evolution. InNeurIPS, 2018

  2. [2]

    Mastering diverse control tasks through world models.Nature, 2025

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 2025

  3. [3]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  4. [4]

    Dreamgen: Unlocking generalization in robot learning through video world models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. In CoRL, 2025

  5. [5]

    Learning interactive real-world simulators

    Mengjiao Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. InICLR, 2024

  6. [6]

    Daydreamer: World models for physical robot learning

    Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. InCoRL, 2023

  7. [7]

    FLARE: Robot learning with implicit world modeling

    Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Lo¨ıc Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. FLARE: Robot learning with implicit world modeling. InCoRL, 2025

  8. [8]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. In ICLR, 2026

  9. [9]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602...

  10. [10]

    Any-point trajectory modeling for policy learning

    Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning. InRSS, 2024. 11 World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

  11. [11]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  12. [12]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062, 2025

  13. [13]

    Owl-1: Omni world model for consistent long video generation.arXiv preprint arXiv:2412.09600, 2024

    Yuanhui Huang, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie Zhou, and Jiwen Lu. Owl-1: Omni world model for consistent long video generation.arXiv preprint arXiv:2412.09600, 2024

  14. [14]

    WORLDMEM: Long-term consistent world simulation with memory

    Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. WORLDMEM: Long-term consistent world simulation with memory. InNeurIPS, 2025

  15. [15]

    StableWorld: Towards stable and consistent long interactive video generation.arXiv preprint arXiv:2601.15281, 2026

    Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Ziwei Liu, and Chenyang Si. StableWorld: Towards stable and consistent long interactive video generation.arXiv preprint arXiv:2601.15281, 2026

  16. [16]

    Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory.arXiv preprint arXiv:2604.08995, 2026

  17. [17]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InICML, 2024

  18. [18]

    Robodreamer: Learning composi- tional world models for robot imagination

    Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning composi- tional world models for robot imagination. InICML, 2024

  19. [19]

    PAN: A world model for general, interactable, and long-horizon world simulation.arXiv preprint arXiv:2511.09057, 2025

    Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, et al. PAN: A world model for general, interactable, and long-horizon world simulation.arXiv preprint arXiv:2511.09057, 2025

  20. [20]

    Tesseract: Learning 4d embodied world models

    Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4d embodied world models. InICCV, 2025

  21. [21]

    Learning world models for interactive video generation

    Taiye Chen, Xun Hu, Zihan Ding, and Chi Jin. Learning world models for interactive video generation. InNeurIPS, 2025

  22. [22]

    VideoREPA: Learning physics for video generation through relational alignment with foundation models

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. VideoREPA: Learning physics for video generation through relational alignment with foundation models. InNeurIPS, 2025

  23. [23]

    A path towards autonomous machine intelligence, 2022

    Yann LeCun. A path towards autonomous machine intelligence, 2022. URL https://openreview.net/forum? id=BZ5a1r-kVsf. OpenReview position paper, version 0.9.2

  24. [24]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  25. [25]

    Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control

    Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Br ¨uggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. InCVPR, 2025

  26. [26]

    Longscape: Advancing long-horizon embodied world models with context-aware moe.arXiv preprint arXiv:2509.21790, 2025

    Yu Shang, Lei Jin, Yiding Ma, Xin Zhang, Chen Gao, Wei Wu, and Yong Li. Longscape: Advancing long-horizon embodied world models with context-aware moe.arXiv preprint arXiv:2509.21790, 2025

  27. [27]

    Dexterous world models

    Byungjun Kim, Taeksoo Kim, Junyoung Lee, and Hanbyul Joo. Dexterous world models. InCVPR, 2026. 12 World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

  28. [28]

    Towards long-horizon vision- language navigation: Platform, benchmark and method

    Xinshuai Song, Weixing Chen, Yang Liu, Vincent Chan, Guanbin Li, and Liang Lin. Towards long-horizon vision- language navigation: Platform, benchmark and method. InCVPR, 2025

  29. [29]

    Mobility VLA: Multimodal instruction navigation with long-context VLMs and topological graphs

    Zhuo Xu, Hao-Tien Lewis Chiang, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, and Jie Tan. Mobility VLA: Multimodal instructi...

  30. [30]

    MoManipVLA: Transferring vision-language- action models for general mobile manipulation

    Zhenyu Wu, Yuheng Zhou, Xiuwei Xu, Ziwei Wang, and Haibin Yan. MoManipVLA: Transferring vision-language- action models for general mobile manipulation. InCVPR, 2025

  31. [31]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

  32. [32]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Mart´ı Mons´o, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024

  33. [33]

    Rolling forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. InICLR, 2026

  34. [34]

    Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion.arXiv preprint arXiv:2602.07775, 2026

  35. [35]

    Infinity-RoPE: Action- controllable infinite video generation emerges from autoregressive self-rollout.arXiv preprint arXiv:2511.20649, 2025

    Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-RoPE: Action- controllable infinite video generation emerges from autoregressive self-rollout.arXiv preprint arXiv:2511.20649, 2025

  36. [36]

    VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE TPAMI, 2025

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE TPAMI, 2025

  37. [37]

    Worldmodelbench: Judging video generation models as world models

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. InNeurIPS, 2025

  38. [38]

    Rethinking video generation model for the embodied world

    Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world. InICML, 2026

  39. [39]

    Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

    Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971, 2026

  40. [40]

    BEHA VIOR-1K: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Mart´ın-Mart´ın, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHA VIOR-1K: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InCoRL, 2023

  41. [41]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InICLR, 2020

  42. [42]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

  43. [43]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

  44. [44]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024. 13 World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

  45. [45]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  46. [46]

    CogVideoX: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. InICLR, 2025

  47. [47]

    Genie 2: A large-scale foundation world model, 2024

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Mou- farek, Guy Scully, Jeremy Shar, Jimmy Shi, et al. Genie 2: A large-scale foundation world model, 2024. URL https: //deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/ . Google Deep- Mind blog

  48. [48]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024

  49. [49]

    Vid2World: Crafting video diffusion models to interactive world models

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models. InICLR, 2026

  50. [50]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

  51. [51]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  52. [52]

    MotionCtrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A unified and flexible motion controller for video generation. InSIGGRAPH, 2024

  53. [53]

    FloVD: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis

    Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. FloVD: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis. In CVPR, 2025

  54. [54]

    Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, and Tao Xiang. Rays as pixels: Learning a joint distribution of videos and camera trajectories. arXiv preprint arXiv:2604.09429, 2026

  55. [55]

    Motion attribution for video generation

    Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, and Jonathan Lorraine. Motion attribution for video generation. In ICML, 2026

  56. [56]

    The matrix: Infinite-horizon world generation with real-time moving control

    Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024

  57. [57]

    Yume: An interactive world generation model

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025

  58. [58]

    LIVE: Long-horizon interactive video world modeling

    Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, and Li Jiang. LIVE: Long-horizon interactive video world modeling. arXiv preprint arXiv:2602.03747, 2026

  59. [59]

    Roboscape: Physics-informed embodied world model

    Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, and Yong Li. Roboscape: Physics-informed embodied world model. In NeurIPS, 2025

  60. [60]

    BridgeData V2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. In CoRL, 2023

  61. [61]

    Astra: Toward general-purpose mobile robots via hierarchical multimodal learning

    Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, et al. Astra: Toward general-purpose mobile robots via hierarchical multimodal learning. arXiv preprint arXiv:2506.06205, 2025

  62. [62]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, and Jun Zhu. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

  63. [63]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840, 2025

  64. [64]

    Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou...

  65. [65]

    Ctrl-world: A controllable generative world model for robot manipulation

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation. In ICLR, 2026

  66. [66]

    VLA-JEPA: Enhancing vision-language-action model with latent world model

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. VLA-JEPA: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

  67. [67]

    JanusVLN: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation

    Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. JanusVLN: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. In ICLR, 2026

  68. [68]

    4D-VLA: Spatiotemporal vision-language-action pretraining with cross-scene calibration

    Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. 4D-VLA: Spatiotemporal vision-language-action pretraining with cross-scene calibration. In NeurIPS, 2025

  69. [69]

    Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. In RSS, 2025

  70. [70]

    Iso-Dream: Isolating and leveraging noncontrollable visual dynamics in world models

    Minting Pan, Xiangming Zhu, Yunbo Wang, and Xiaokang Yang. Iso-Dream: Isolating and leveraging noncontrollable visual dynamics in world models. In NeurIPS, 2022

  71. [71]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  72. [72]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 2022

  73. [73]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  74. [74]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021

  75. [75]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023

  76. [76]

    Dice loss for data-imbalanced NLP tasks

    Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. Dice loss for data-imbalanced NLP tasks. In ACL, 2020

  77. [77]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024

  78. [78]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

  79. [79]

    IRASim: A fine-grained world model for robot manipulation

    Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipulation. In ICCV, 2025

  80. [80]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020

Showing first 80 references.