World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Jianhui Zhang; Peidong Jia; Shanghang Zhang; Xiaoguang Zhao; Xingyu Chen; Zuyao Lin

arxiv: 2605.19957 · v1 · pith:UYDOA34Qnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.RO

Pith reviewed 2026-05-20 06:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO

keywords world models · embodied intelligence · long-horizon planning · hybrid navigation-manipulation · disentanglement · diffusion models · world-ego modeling

The pith

Decomposing future embodied predictions into separate world and ego streams improves long-horizon hybrid task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard world models degrade on long-horizon embodied scenarios because they entangle persistent scene regularities with robot-centric, instruction-conditioned dynamics inside a single prediction stream. It introduces World-Ego Modeling as a paradigm that explicitly decomposes future evolution along a boundary defined from motion, semantic, and intention perspectives, then tests post-, pre-, and full disentanglement strategies. The authors instantiate the idea in the World-Ego Model, which pairs an implicit planner with a cascade-parallel mixture-of-experts diffusion generator. A sympathetic reader would care because hybrid tasks that interleave navigation and manipulation are common in real robotics yet remain difficult for current single-stream models. Experiments on a newly constructed benchmark with 125K video clips and 300 multi-turn trajectories show the separated approach reaches state-of-the-art results while holding competitive performance on existing manipulation-only suites.

Core claim

World-Ego Modeling decomposes future evolution into distinct world and ego components by defining their boundary from motion-, semantic-, and intention-based views and by analyzing post-, pre-, and full disentanglement. The paradigm is realized as the World-Ego Model, which couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts diffusion generator. On the introduced HTEWorld benchmark of long-horizon hybrid navigation-manipulation tasks the resulting model outperforms prior world models while remaining competitive on manipulation-only benchmarks.

What carries the argument

The world-ego boundary, defined from motion-, semantic-, and intention-based views, which enables post-, pre-, and full disentanglement of future evolutions inside the World-Ego Model architecture.
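
To make the boundary idea concrete, here is a minimal sketch of one way a per-pixel mask could recombine a world-stream prediction and an ego-stream prediction into a single frame. The mask-based blend, the function name `compose_frame`, and the toy 2x2 frames are illustrative assumptions. The abstract does not describe the CP-MoE generator at this level of detail, so this should not be read as the paper's method.

```python
from typing import List

Frame = List[List[float]]  # a tiny grayscale image: rows of pixel values


def compose_frame(world: Frame, ego: Frame, ego_mask: Frame) -> Frame:
    """Blend world and ego predictions per pixel (hypothetical helper).

    ego_mask values lie in [0, 1]: 1 marks a robot (ego) pixel,
    0 marks a persistent-scene (world) pixel.
    """
    if not (len(world) == len(ego) == len(ego_mask)):
        raise ValueError("world, ego, and ego_mask must have the same height")
    composed: Frame = []
    for w_row, e_row, m_row in zip(world, ego, ego_mask):
        if not (len(w_row) == len(e_row) == len(m_row)):
            raise ValueError("world, ego, and ego_mask must have the same width")
        # Convex combination: ego where the mask is 1, world where it is 0.
        composed.append(
            [m * e + (1.0 - m) * w for w, e, m in zip(w_row, e_row, m_row)]
        )
    return composed


# Toy inputs, assumed for illustration only.
world_pred = [[0.2, 0.2], [0.2, 0.2]]
ego_pred = [[0.9, 0.9], [0.9, 0.9]]
mask = [[1.0, 0.0], [0.0, 0.0]]  # only the top-left pixel belongs to the robot
frame = compose_frame(world_pred, ego_pred, mask)
```

In this toy case only the top-left pixel takes the ego value, which is the intuition behind keeping instruction-conditioned robot dynamics out of the persistent scene stream.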

If this is right

WEM reaches state-of-the-art results on the HTEWorld benchmark of 300 multi-turn hybrid trajectories.
The same model remains competitive with prior methods on existing manipulation-only benchmarks.
Explicit separation of persistent scene regularities from instruction-conditioned robot dynamics reduces performance degradation over long interleaved navigation-manipulation sequences.
The three disentanglement strategies (post, pre, full) provide measurable trade-offs for modeling hybrid task evolution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same boundary definitions could be tested on purely navigation or purely manipulation datasets to check whether hybrid-task gains generalize.
Isolating world from ego dynamics may reduce error accumulation in downstream planners that rely on multi-step rollouts.
Real-robot experiments with sensor noise and physical dynamics would reveal whether the video-clip improvements transfer outside simulation.

Load-bearing premise

The world-ego boundary defined from motion-, semantic-, and intention-based views permits effective post-, pre-, and full disentanglement that improves long-horizon hybrid task modeling.

What would settle it

A controlled ablation on HTEWorld in which removing the motion-semantic-intention boundary definitions and the associated disentanglement strategies produces equal or higher performance on long-horizon hybrid trajectories would falsify the central claim.
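
A sketch of how such an ablation could be scored, assuming one scalar score per evaluation trajectory for both the full model and the ablated variant, with higher being better. The function name, the bootstrap settings, and the example scores are placeholders invented for illustration; they are not results or protocol from the paper.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_ci(
    full: List[float],
    ablated: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for full minus ablated."""
    if len(full) != len(ablated) or not full:
        raise ValueError("need two non-empty score lists of equal length")
    # Pair scores by trajectory so trajectory difficulty cancels out.
    diffs = [f - a for f, a in zip(full, ablated)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample trajectories with replacement and record each resampled mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return mean(diffs), means[lo_idx], means[hi_idx]


# Placeholder per-trajectory scores (made up, NOT from the paper).
full_scores = [5.0, 6.0, 7.0, 8.0]
ablated_scores = [4.0, 5.0, 6.0, 7.0]
diff, ci_low, ci_high = paired_bootstrap_ci(full_scores, ablated_scores)

# The central claim survives this check only if the interval excludes zero
# on the positive side.
central_claim_survives = ci_low > 0.0
```

An interval that sits clearly above zero suggests the separation is doing real work. An interval that covers zero or lies below it is the outcome that would count against the central claim.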

Figures

Figures reproduced from arXiv: 2605.19957 by Jianhui Zhang, Peidong Jia, Shanghang Zhang, Xiaoguang Zhao, Xingyu Chen, Zuyao Lin.

**Figure 1.** Conceptual illustration of World-Ego Modeling. Embodied evolution involves two heterogeneous dynamics: the world ...

**Figure 2.** Three perspectives of the world-ego definition. The motion-based view separates the world and ego by the source of visual ...

**Figure 3.** General framework of World-Ego Modeling. Our framework contains two stages. (a) The prediction stage uses a vision ...

**Figure 4.** Overview of the World-Ego Model (WEM). WEM instantiates World-Ego Modeling with a semantic-based world-ego view ...

**Figure 5.** Qualitative comparison on HTEWorld. Given the same initial observation and five-step instruction sequence, each model ...

**Figure 6.** Statistics of the proposed HTEWorld benchmark. HTEWorld provides large-scale training clips and multi-turn evaluation ...

**Figure 7.** Mask-guided visualization of role expert specialization. We visualize the world expert and ego expert outputs in WEM ...

Original abstract

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper splits embodied world models into world and ego streams for hybrid navigation-manipulation tasks and ships a new benchmark, but the reported gains are not clearly tied to the disentanglement rather than the CP-MoE generator.

The main thing to know is that this paper introduces World-Ego Modeling as a way to break apart persistent scene prediction from robot-specific dynamics in long-horizon hybrid tasks. They define the split along motion, semantic, and intention lines, test post/pre/full disentanglement, and pair an implicit planner with a cascade-parallel MoE diffusion generator. They also release HTEWorld, a benchmark of 125K clips and 300 multi-turn trajectories with action annotations, which directly targets interleaved navigation and manipulation where single-stream models tend to degrade.

Referee Report

2 major / 2 minor

Summary. The paper proposes World-Ego Modeling, a new paradigm that decomposes future evolution in embodied tasks into separate world (persistent, instruction-agnostic scene regularities) and ego (robot-centric, instruction-conditioned dynamics) components to mitigate entanglement in long-horizon hybrid navigation-manipulation scenarios. It defines the world-ego boundary from motion-, semantic-, and intention-based views and analyzes post-, pre-, and full disentanglement strategies. The paradigm is instantiated as the World-Ego Model (WEM), which couples an implicit planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. The authors also introduce the HTEWorld benchmark, with 125K video clips (over 4.5M frames) and 300 multi-turn evaluation trajectories. The experiments report state-of-the-art performance on HTEWorld and competitive results on existing manipulation-only benchmarks.

Significance. If the reported gains are causally attributable to the world-ego boundary definitions and disentanglement strategies rather than the CP-MoE generator capacity alone, the work would provide a meaningful conceptual and technical advance for long-horizon embodied world modeling in hybrid tasks. The HTEWorld benchmark addresses a clear gap in evaluating interleaved navigation-manipulation behaviors and could serve as a community resource. The three-view boundary analysis offers a structured way to separate persistent vs. dynamic elements that may generalize beyond the specific instantiation.

major comments (2)

[Experiments] Experiments section: No ablations are described that hold the CP-MoE diffusion generator and planner architecture fixed while toggling only the disentanglement strategies (post-/pre-/full) or the boundary definitions (motion-/semantic-/intention-based). This is load-bearing for the central claim, as the abstract and paradigm definition locate the long-horizon improvements in the world-ego separation, yet the new generator's training dynamics and capacity could independently explain SOTA on the author-constructed HTEWorld benchmark.
[Method / Benchmark] §3 (Method) and HTEWorld construction: The paper does not report controls or statistics showing that the 300 multi-turn evaluation trajectories were selected independently of the proposed WEM (e.g., via held-out human demonstrations or cross-validation against baseline generators). Without this, it remains possible that benchmark construction interacts with the model design, weakening the claim that WEM's disentanglement drives the hybrid-task gains.
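
The controlled design requested in the first major comment can be sketched as a run grid in which only the disentanglement strategy and the boundary view vary. The config fields, the generator and planner labels, and the use of three seeds are hypothetical choices for illustration; the paper does not publish such a config.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence

# Factors named in the paper's abstract.
STRATEGIES = ("post", "pre", "full")
BOUNDARY_VIEWS = ("motion", "semantic", "intention")


@dataclass(frozen=True)
class AblationRun:
    strategy: str
    boundary_view: str
    generator: str  # held constant across the whole grid
    planner: str    # held constant across the whole grid
    seed: int


def build_ablation_grid(
    generator: str = "cp_moe_diffusion",  # hypothetical label
    planner: str = "implicit_planner",    # hypothetical label
    seeds: Sequence[int] = (0, 1, 2),     # assumed repeat count
) -> List[AblationRun]:
    """Enumerate every strategy/view/seed combination with fixed architecture."""
    return [
        AblationRun(strategy, view, generator, planner, seed)
        for strategy, view, seed in product(STRATEGIES, BOUNDARY_VIEWS, seeds)
    ]


grid = build_ablation_grid()
```

Because the generator and planner fields are identical in every run, any score difference between cells of this grid cannot be explained by generator capacity, which is the confound the comment raises.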

minor comments (2)

[Abstract] Abstract: The phrase 'fine-grained action annotations' is used without specifying the action vocabulary size, annotation protocol, or inter-annotator agreement for the 125K clips; adding one sentence would improve reproducibility.
[Method] Notation: The three boundary views and the mapping from disentanglement strategy to the implicit planner outputs are introduced conceptually but lack a compact equation or diagram reference early in the method; a small table summarizing the three strategies would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential value of the world-ego paradigm and HTEWorld benchmark. We address each major comment below.

Referee: [Experiments] Experiments section: No ablations are described that hold the CP-MoE diffusion generator and planner architecture fixed while toggling only the disentanglement strategies (post-/pre-/full) or the boundary definitions (motion-/semantic-/intention-based). This is load-bearing for the central claim, as the abstract and paradigm definition locate the long-horizon improvements in the world-ego separation, yet the new generator's training dynamics and capacity could independently explain SOTA on the author-constructed HTEWorld benchmark.

Authors: We agree that stronger isolation of the disentanglement strategies is needed to support the central claim. The current manuscript reports performance across post-, pre-, and full disentanglement as well as the three boundary definitions, but these comparisons do not hold the CP-MoE generator and planner completely fixed across all variants. We will add dedicated ablation experiments in the revision that fix the generator architecture and planner while varying only the disentanglement strategies and boundary views. This will more directly attribute gains to the world-ego separation.

revision: yes

Referee: [Method / Benchmark] §3 (Method) and HTEWorld construction: The paper does not report controls or statistics showing that the 300 multi-turn evaluation trajectories were selected independently of the proposed WEM (e.g., via held-out human demonstrations or cross-validation against baseline generators). Without this, it remains possible that benchmark construction interacts with the model design, weakening the claim that WEM's disentanglement drives the hybrid-task gains.

Authors: The 300 multi-turn trajectories were drawn from a held-out portion of the HTEWorld data that was not used to train WEM or any baseline models. However, the original manuscript does not include explicit statistics, cross-validation details, or controls demonstrating independence from the WEM design. We will revise the benchmark construction section to add these details, including data splits and selection criteria.

revision: yes

Circularity Check

0 steps flagged

No circularity: paradigm, disentanglement strategies, and HTEWorld benchmark introduced as independent contributions with external evaluation.

The paper defines a new World-Ego Modeling paradigm by decomposing future evolution into world and ego components using motion-, semantic-, and intention-based boundary views, then analyzes post-/pre-/full disentanglement strategies before instantiating WEM as an implicit planner coupled to a CP-MoE diffusion generator. It further constructs the HTEWorld benchmark with 125K clips and 300 trajectories for evaluation. No step reduces a claimed prediction or first-principles result to its own inputs by construction; the performance claims rest on empirical results on the new benchmark and existing manipulation suites rather than self-referential fitting or self-citation chains. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only view limits visibility; main reliance is on the domain assumption that world-ego disentanglement yields measurable gains in hybrid scenarios.

axioms (1)

domain assumption: Future evolution in embodied tasks can be usefully decomposed into instruction-agnostic world regularities and instruction-conditioned ego dynamics.
Foundational premise stated in the opening of the abstract.

invented entities (1)

World-Ego Model (WEM) with CP-MoE diffusion generator (no independent evidence)
purpose: Unified model coupling implicit planner and cascade-parallel mixture-of-experts for disentangled generation
New architecture introduced to instantiate the paradigm.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "WEM ... couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 19 internal anchors

[1]

Recurrent world models facilitate policy evolution

David Ha and J ¨urgen Schmidhuber. Recurrent world models facilitate policy evolution. InNeurIPS, 2018

work page 2018
[2]

Mastering diverse control tasks through world models.Nature, 2025

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 2025

work page 2025
[3]

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Dreamgen: Unlocking generalization in robot learning through video world models

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. In CoRL, 2025

work page 2025
[5]

Learning interactive real-world simulators

Mengjiao Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. InICLR, 2024

work page 2024
[6]

Daydreamer: World models for physical robot learning

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. InCoRL, 2023

work page 2023
[7]

FLARE: Robot learning with implicit world modeling

Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Lo¨ıc Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. FLARE: Robot learning with implicit world modeling. InCoRL, 2025

work page 2025
[8]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. In ICLR, 2026

work page 2026
[9]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

Any-point trajectory modeling for policy learning

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning. InRSS, 2024. 11 World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

work page 2024
[11]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

World Simulation with Video Foundation Models for Physical AI

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Owl-1: Omni world model for consistent long video generation.arXiv preprint arXiv:2412.09600, 2024

Yuanhui Huang, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie Zhou, and Jiwen Lu. Owl-1: Omni world model for consistent long video generation.arXiv preprint arXiv:2412.09600, 2024

work page arXiv 2024
[14]

WORLDMEM: Long-term consistent world simulation with memory

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. WORLDMEM: Long-term consistent world simulation with memory. InNeurIPS, 2025

work page 2025
[15]

StableWorld: Towards stable and consistent long interactive video generation.arXiv preprint arXiv:2601.15281, 2026

Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Ziwei Liu, and Chenyang Si. StableWorld: Towards stable and consistent long interactive video generation.arXiv preprint arXiv:2601.15281, 2026

work page arXiv 2026
[16]

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory.arXiv preprint arXiv:2604.08995, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InICML, 2024

work page 2024
[18]

Robodreamer: Learning composi- tional world models for robot imagination

Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning composi- tional world models for robot imagination. InICML, 2024

work page 2024
[19]

PAN: A world model for general, interactable, and long-horizon world simulation.arXiv preprint arXiv:2511.09057, 2025

Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, et al. PAN: A world model for general, interactable, and long-horizon world simulation.arXiv preprint arXiv:2511.09057, 2025

work page arXiv 2025
[20]

Tesseract: Learning 4d embodied world models

Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4d embodied world models. InICCV, 2025

work page 2025
[21]

Learning world models for interactive video generation

Taiye Chen, Xun Hu, Zihan Ding, and Chi Jin. Learning world models for interactive video generation. InNeurIPS, 2025

work page 2025
[22]

VideoREPA: Learning physics for video generation through relational alignment with foundation models

Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. VideoREPA: Learning physics for video generation through relational alignment with foundation models. InNeurIPS, 2025

work page 2025
[23]

A path towards autonomous machine intelligence, 2022

Yann LeCun. A path towards autonomous machine intelligence, 2022. URL https://openreview.net/forum? id=BZ5a1r-kVsf. OpenReview position paper, version 0.9.2

work page 2022
[24]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Br ¨uggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. InCVPR, 2025

work page 2025
[26]

Longscape: Advancing long-horizon embodied world models with context-aware moe.arXiv preprint arXiv:2509.21790, 2025

Yu Shang, Lei Jin, Yiding Ma, Xin Zhang, Chen Gao, Wei Wu, and Yong Li. Longscape: Advancing long-horizon embodied world models with context-aware moe.arXiv preprint arXiv:2509.21790, 2025

work page arXiv 2025
[27]

Dexterous world models

Byungjun Kim, Taeksoo Kim, Junyoung Lee, and Hanbyul Joo. Dexterous world models. InCVPR, 2026. 12 World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

work page 2026
[28]

Towards long-horizon vision- language navigation: Platform, benchmark and method

Xinshuai Song, Weixing Chen, Yang Liu, Vincent Chan, Guanbin Li, and Liang Lin. Towards long-horizon vision- language navigation: Platform, benchmark and method. InCVPR, 2025

work page 2025
[29]

Mobility VLA: Multimodal instruction navigation with long-context VLMs and topological graphs

Zhuo Xu, Hao-Tien Lewis Chiang, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, and Jie Tan. Mobility VLA: Multimodal instructi...

work page 2025
[30]

MoManipVLA: Transferring vision-language- action models for general mobile manipulation

Zhenyu Wu, Yuheng Zhou, Xiuwei Xu, Ziwei Wang, and Haibin Yan. MoManipVLA: Transferring vision-language- action models for general mobile manipulation. InCVPR, 2025

work page 2025
[31]

MAGI-1: Autoregressive Video Generation at Scale

Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Mart´ı Mons´o, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024

work page 2024
[33]

Rolling forcing: Autoregressive long video diffusion in real time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. InICLR, 2026

work page 2026
[34]

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion.arXiv preprint arXiv:2602.07775, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Infinity-RoPE: Action- controllable infinite video generation emerges from autoregressive self-rollout.arXiv preprint arXiv:2511.20649, 2025

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-RoPE: Action- controllable infinite video generation emerges from autoregressive self-rollout.arXiv preprint arXiv:2511.20649, 2025

work page arXiv 2025
[36]

VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE TPAMI, 2025

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE TPAMI, 2025

work page 2025
[37]

Worldmodelbench: Judging video generation models as world models

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. InNeurIPS, 2025

work page 2025
[38]

Rethinking video generation model for the embodied world

Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world. InICML, 2026

work page 2026
[39]

Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971, 2026

work page arXiv 2026
[40]

BEHA VIOR-1K: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Mart´ın-Mart´ın, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHA VIOR-1K: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InCoRL, 2023

work page 2023
[41]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InICLR, 2020

work page 2020
[42]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

work page 2020
[43]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

work page 2022
[44]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024. 13 World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

CogVideoX: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

[47]

Genie 2: A large-scale foundation world model

Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, et al. Genie 2: A large-scale foundation world model, 2024. URL https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/. Google DeepMind blog

[48]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

[49]

Vid2World: Crafting video diffusion models to interactive world models

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models. In ICLR, 2026

[50]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

[51]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

[52]

MotionCtrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A unified and flexible motion controller for video generation. In SIGGRAPH, 2024

[53]

FloVD: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis

Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. FloVD: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis. In CVPR, 2025

[54]

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, and Tao Xiang. Rays as pixels: Learning a joint distribution of videos and camera trajectories. arXiv preprint arXiv:2604.09429, 2026

[55]

Motion attribution for video generation

Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, and Jonathan Lorraine. Motion attribution for video generation. In ICML, 2026

[56]

The matrix: Infinite-horizon world generation with real-time moving control

Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024

[57]

Yume: An interactive world generation model

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025

[58]

LIVE: Long-horizon interactive video world modeling

Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, and Li Jiang. LIVE: Long-horizon interactive video world modeling. arXiv preprint arXiv:2602.03747, 2026

[59]

Roboscape: Physics-informed embodied world model

Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, and Yong Li. Roboscape: Physics-informed embodied world model. In NeurIPS, 2025

[60]

BridgeData V2: A dataset for robot learning at scale

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. In CoRL, 2023

[61]

Astra: Toward general-purpose mobile robots via hierarchical multimodal learning

Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, et al. Astra: Toward general-purpose mobile robots via hierarchical multimodal learning. arXiv preprint arXiv:2506.06205, 2025

[62]

Motus: A Unified Latent Action World Model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, and Jun Zhu. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

[63]

Large Video Planner Enables Generalizable Robot Control

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840, 2025

[64]

Wow: Towards a world omniscient world model through embodied interaction

Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou...

[65]

Ctrl-world: A controllable generative world model for robot manipulation

Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation. In ICLR, 2026

[66]

VLA-JEPA: Enhancing vision-language-action model with latent world model

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. VLA-JEPA: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

[67]

JanusVLN: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation

Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. JanusVLN: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. In ICLR, 2026

[68]

4D-VLA: Spatiotemporal vision-language-action pretraining with cross-scene calibration

Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. 4D-VLA: Spatiotemporal vision-language-action pretraining with cross-scene calibration. In NeurIPS, 2025

[69]

Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. In RSS, 2025

[70]

Iso-Dream: Isolating and leveraging noncontrollable visual dynamics in world models

Minting Pan, Xiangming Zhu, Yunbo Wang, and Xiaokang Yang. Iso-Dream: Isolating and leveraging noncontrollable visual dynamics in world models. In NeurIPS, 2022

[71]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

[72]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 2022

[73]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

[74]

Vision transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021

[75]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023

[76]

Dice loss for data-imbalanced NLP tasks

Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. Dice loss for data-imbalanced NLP tasks. In ACL, 2020

[77]

VBench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024

[78]

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

[79]

IRASim: A fine-grained world model for robot manipulation

Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipulation. In ICCV, 2025

[80]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020


Showing first 80 references.

[40] [40]

BEHA VIOR-1K: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Mart´ın-Mart´ın, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHA VIOR-1K: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InCoRL, 2023

work page 2023

[41] [41]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InICLR, 2020

work page 2020

[42] [42]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

work page 2020

[43] [43]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

work page 2022

[44] [44]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024. 13 World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

CogVideoX: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. InICLR, 2025

work page 2025

[47] [47]

Genie 2: A large-scale foundation world model, 2024

Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Mou- farek, Guy Scully, Jeremy Shar, Jimmy Shi, et al. Genie 2: A large-scale foundation world model, 2024. URL https: //deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/ . Google Deep- Mind blog

work page 2024

[48] [48]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

Vid2World: Crafting video diffusion models to interactive world models

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models. InICLR, 2026

work page 2026

[50] [50]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

MotionCtrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A unified and flexible motion controller for video generation. InSIGGRAPH, 2024

work page 2024

[53] [53]

FloVD: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis

Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. FloVD: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis. InCVPR, 2025

work page 2025

[54] [54]

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez- Rua, Yiannis Douratsos, and Tao Xiang. Rays as pixels: Learning a joint distribution of videos and camera trajectories. arXiv preprint arXiv:2604.09429, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[55] [55]

Motion attribution for video generation

Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taix´e, Olga Russakovsky, Sanja Fidler, and Jonathan Lorraine. Motion attribution for video generation. InICML, 2026

work page 2026

[56] [56]

The matrix: Infinite-horizon world generation with real-time moving control.arXiv preprint arXiv:2412.03568, 2024

Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control.arXiv preprint arXiv:2412.03568, 2024

work page arXiv 2024

[57] [57]

Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

work page arXiv 2025

[58] [58]

LIVE: Long-horizon interactive video world modeling.arXiv preprint arXiv:2602.03747, 2026

Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, and Li Jiang. LIVE: Long-horizon interactive video world modeling.arXiv preprint arXiv:2602.03747, 2026

work page arXiv 2026

[59] [59]

Roboscape: Physics-informed embodied world model

Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, and Yong Li. Roboscape: Physics-informed embodied world model. InNeurIPS, 2025

work page 2025

[60] [60]

BridgeData V2: A dataset for robot learning at scale

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. InCoRL, 2023. 14 World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

work page 2023

[61] [61]

Astra: Toward general-purpose mobile robots via hierarchical multimodal learning.arXiv preprint arXiv:2506.06205, 2025

Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, et al. Astra: Toward general-purpose mobile robots via hierarchical multimodal learning.arXiv preprint arXiv:2506.06205, 2025

work page arXiv 2025

[62] [62]

Motus: A Unified Latent Action World Model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, and Jun Zhu. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [63]

Large Video Planner Enables Generalizable Robot Control

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[64] Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou... Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025.

[65] Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation. In ICLR, 2026.

[66] Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. VLA-JEPA: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026.

[67] Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. JanusVLN: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. In ICLR, 2026.

[68] Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. 4D-VLA: Spatiotemporal vision-language-action pretraining with cross-scene calibration. In NeurIPS, 2025.

[69] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. In RSS, 2025.

[70] Minting Pan, Xiangming Zhu, Yunbo Wang, and Xiaokang Yang. Iso-Dream: Isolating and leveraging noncontrollable visual dynamics in world models. In NeurIPS, 2022.

[71] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.

[72] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 2022.

[73] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

[74] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.

[75] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023.

[76] Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. Dice loss for data-imbalanced NLP tasks. In ACL, 2020.

[77] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.

[78] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.

[79] Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipulation. In ICCV, 2025.

[80] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.