ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

Zhexiao Xiong, Yizhi Song, Hao Kang, Qing Yan, Liming Jiang, Jenson Yang, Zhoujie Fu, Stathi Fotiadis, Angtian Wang, Zichuan Liu, Bo Liu, Yiding Yang, Xin Lu, Nathan Jacobs

arxiv: 2606.17730 · v1 · pith:IDZYWIAWnew · submitted 2026-06-16 · 💻 cs.CV

Pith reviewed 2026-06-27 01:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords: interactive world models · action-aware memory · object interaction · video generation · hierarchical memory · persistent memory bank · human-object interaction · chunk-autoregressive generation

The pith

ActWorld extends world models to support object interactions by fixing data scarcity and action-forgetting in memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current world models remain limited to navigation because they lack dense interaction data and because their memory compression discards the event frames that determine later object states. It constructs a 100K video dataset of human-object interactions annotated with chain-of-thought per-chunk captions and introduces a hierarchical action-aware memory that routes compression according to interaction importance together with a persistent bank that keeps event-update and object-identity tokens. The resulting single model performs both flexible viewpoint control and mid-rollout actions such as opening doors or picking up objects, raising interaction fidelity over navigation-only baselines. A reader would care because this turns passive visual exploration into actionable simulation inside generated environments.

Core claim

ActWorld shows that the navigation-interaction gap arises from a data bottleneck and a memory bottleneck; these are resolved by a 100K interaction video dataset and by hierarchical action-aware memory plus a persistent memory bank that maintains causal event tokens across long rollouts, enabling one model to handle both navigation and object interaction without loss of viewpoint control.

What carries the argument

Hierarchical action-aware memory that routes history compression by interaction importance, together with a persistent memory bank that maintains event-update and object-identity tokens.
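As a rough illustration of the idea, and not the paper's implementation (which this page does not describe), the sketch below routes history chunks by a hypothetical per-chunk importance score. High-importance chunks are stored uncompressed, the rest are mean-pooled, and event tokens are copied into a persistent bank that eviction never touches. Every name here (`Chunk`, `ActionAwareMemory`, `importance`, the threshold, the window size) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """One generated video chunk; tokens stand in for latent features."""
    index: int
    tokens: List[float]
    importance: float   # hypothetical interaction-importance score in [0, 1]
    event: str = ""     # non-empty if the chunk contains a state-changing event


@dataclass
class ActionAwareMemory:
    """Toy memory: importance-routed compression plus a persistent event bank."""
    window: int = 4                  # max number of chunks in the rolling history
    keep_threshold: float = 0.5      # chunks at or above this stay uncompressed
    history: List[Chunk] = field(default_factory=list)
    bank: List[str] = field(default_factory=list)  # persistent event tokens

    def _compress(self, chunk: Chunk) -> Chunk:
        # Low-importance route: collapse the chunk to a single mean token.
        mean = sum(chunk.tokens) / len(chunk.tokens)
        return Chunk(chunk.index, [mean], chunk.importance, chunk.event)

    def add(self, chunk: Chunk) -> None:
        if chunk.event:
            # Event tokens go to the bank and are never evicted.
            self.bank.append(f"{chunk.index}:{chunk.event}")
        keep = chunk.importance >= self.keep_threshold
        self.history.append(chunk if keep else self._compress(chunk))
        if len(self.history) > self.window:
            # Evict the least important chunk (oldest first on ties)
            # instead of always evicting the oldest one.
            victim = min(self.history, key=lambda c: (c.importance, c.index))
            self.history.remove(victim)


memory = ActionAwareMemory(window=3)
memory.add(Chunk(0, [1.0, 3.0], importance=0.1))
memory.add(Chunk(1, [2.0, 2.0], importance=0.9, event="door_opened"))
for i in range(2, 6):
    memory.add(Chunk(i, [0.0, 4.0], importance=0.2))

kept = [c.index for c in memory.history]
print(kept, memory.bank)
```

In this toy run the door-opening chunk survives in the history even though it is the second-oldest, and its event token stays in the bank regardless of what the rolling window evicts. A purely recency-biased policy would have dropped it.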

If this is right

A single model can now generate both viewpoint changes and physical object responses within the same rollout.
Object states remain consistent across extended rollouts because event-update tokens are not overwritten by recency bias.
Interaction fidelity rises while navigation quality stays intact, removing the need to switch between separate navigation and interaction generators.
Mid-rollout actions become feasible inside chunk-autoregressive video generation, rather than only in prompt-to-full-video generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same memory routing principle could be tested on domains that require tracking multiple simultaneous object changes, such as multi-agent scenes.
Persistent identity tokens may allow the model to handle re-entrant objects that leave and return to view without identity drift.
If the chain-of-thought captions prove critical, replacing them with weaker labels would be expected to degrade interaction precision as label density falls.

Load-bearing premise

The navigation-interaction gap is caused mainly by missing dense interaction labels and by recency-biased memory that forgets the frames linking actions to later object states.

What would settle it

A controlled ablation in which the same 100K dataset is used but the persistent memory bank is removed, followed by measurement of whether object-state consistency collapses after the first interaction in rollouts longer than 30 seconds.
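One way such a measurement could be scored, sketched under assumptions: suppose each rollout yields a per-chunk object-state label (for example from a hypothetical state classifier) together with the index of the chunk where the first interaction occurs. The score is then the fraction of later chunks whose state matches the expected post-interaction state. The function name, the label strings, and both example rollouts are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def post_interaction_consistency(
    states: Sequence[str], event_chunk: int, expected: str
) -> float:
    """Fraction of chunks after `event_chunk` whose state equals `expected`."""
    later = states[event_chunk + 1:]
    if not later:
        raise ValueError("rollout ends at the interaction; nothing to score")
    return sum(s == expected for s in later) / len(later)


# Hypothetical per-chunk door states; the door is opened at chunk 2.
with_bank = ["closed", "closed", "opening", "open", "open", "open", "open"]
without_bank = ["closed", "closed", "opening", "open", "open", "closed", "closed"]

score_with = post_interaction_consistency(with_bank, event_chunk=2, expected="open")
score_without = post_interaction_consistency(without_bank, event_chunk=2, expected="open")
print(score_with, score_without)
```

A collapse of this score in the no-bank run, but not in the full model, would be the pattern the proposed ablation is looking for.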

Original abstract

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ActWorld adds a hierarchical action-aware memory and persistent bank plus a 100K interaction dataset to support object manipulation in world models, but the experiments do not isolate whether the architecture or the new data produces the reported gains.

The core contribution is a memory architecture that routes compression by interaction importance and keeps event-update and object-identity tokens in a persistent bank, paired with a new 100K video dataset annotated via chain-of-thought per chunk. This directly targets the stated bottlenecks of missing interaction labels and recency-biased forgetting.

The dataset construction and the explicit separation of navigation versus object-interaction actions are useful steps. Prior navigation-only models had no reason to handle pick-up or door-opening events, so extending the action space and supplying matching data is a concrete move forward.

The main weakness is that the evaluation does not separate the two changes. The baselines are navigation-only models that never saw the 100K interaction videos. Without a control that trains the original recency-biased compressor on the new data, any fidelity improvement could come from the data alone. The abstract also gives no numbers, error bars, or ablation tables, so the size of the claimed advance remains unclear.

The work is aimed at researchers building generative world models for robotics or simulation. Anyone already working on long-horizon video prediction would find the memory routing idea and the dataset worth examining, even if the causal claims need tighter controls.

It should go to peer review. The problem is well-posed and the proposed mechanisms are specific enough that referees can ask for the missing ablations and quantitative results.

Referee Report

1 major / 1 minor

Summary. The paper introduces ActWorld, an interactive world model that extends navigation-centric generators to support mid-rollout object interactions within a chunk-autoregressive framework. It identifies a data bottleneck (lack of dense human-object interaction labels) and a memory bottleneck (recency-biased history compression causing action-forgetting), addresses them via a new 100K interaction video dataset with per-chunk CoT captions plus a hierarchical action-aware memory design and persistent memory bank for event-update and object-identity tokens, and reports that experiments show improved interaction fidelity over navigation-only baselines without sacrificing viewpoint control.

Significance. If the results hold after isolating the architectural contribution, the work would be significant for enabling richer, actionable world models that combine flexible navigation with object interactions in a single model, moving beyond navigation-only or game-domain limitations toward more general interactive simulation.

major comments (1)

[Experiments] Experiments section: the central claim that the hierarchical action-aware memory and persistent bank resolve the memory bottleneck (action-forgetting via recency bias) beyond what the new data provides is not isolated. The reported gains are only versus navigation-only baselines (which lack the 100K interaction dataset with per-chunk CoT captions); no ablation trains a recency-biased baseline on the new data to test whether the memory mechanisms, rather than the dataset alone, drive the interaction fidelity improvements. This directly undercuts the claim that the proposed mechanisms close the navigation-interaction gap.
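The missing control can be made concrete as a 2×2 grid over training data and memory design. The sketch below enumerates the four runs, marks the cell the comment asks for, and shows how fidelity scores could be split into a data step and a memory step once all cells exist. The run labels and the `decompose_gain` helper are hypothetical, and no scores from the paper are assumed.

```python
from itertools import product
from typing import Dict, Tuple

Run = Tuple[str, str]  # (training data, memory design)

DATA = ("navigation_only", "navigation_plus_100k_interaction")
MEMORY = ("recency_biased", "action_aware_plus_bank")

# Full 2x2 grid of training runs.
RUNS = list(product(DATA, MEMORY))

# The cell the referee asks for: new data, old memory.
MISSING_CONTROL: Run = (DATA[1], MEMORY[0])


def decompose_gain(fidelity: Dict[Run, float]) -> Dict[str, float]:
    """Split the total gain into a data step and a memory step."""
    baseline = fidelity[(DATA[0], MEMORY[0])]
    control = fidelity[MISSING_CONTROL]
    full = fidelity[(DATA[1], MEMORY[1])]
    return {"data_gain": control - baseline, "memory_gain": full - control}


for data, memory in RUNS:
    marker = "  <- missing control" if (data, memory) == MISSING_CONTROL else ""
    print(f"{data:34s} {memory:24s}{marker}")
```

Only when `memory_gain` is clearly positive would the architectural claim stand independently of the dataset.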

minor comments (1)

[Abstract] Abstract: states that experiments show substantial improvements but provides no quantitative metrics, error bars, baseline details, or ablation descriptions, which weakens the ability to evaluate the magnitude and robustness of the reported gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on experimental isolation below and will revise the manuscript accordingly.

Referee: [Experiments] Experiments section: the central claim that the hierarchical action-aware memory and persistent bank resolve the memory bottleneck (action-forgetting via recency bias) beyond what the new data provides is not isolated. The reported gains are only versus navigation-only baselines (which lack the 100K interaction dataset with per-chunk CoT captions); no ablation trains a recency-biased baseline on the new data to test whether the memory mechanisms, rather than the dataset alone, drive the interaction fidelity improvements. This directly undercuts the claim that the proposed mechanisms close the navigation-interaction gap.

Authors: We agree that the current experimental design does not fully isolate the contribution of the hierarchical action-aware memory and persistent bank from that of the new 100K dataset. The navigation-only baselines lack both the interaction data and the proposed memory mechanisms, so the reported gains cannot be attributed solely to the architectural changes. To address this, we will add an ablation to the revised manuscript: a recency-biased baseline trained on the new 100K interaction dataset (with per-chunk CoT captions), whose interaction fidelity we will compare directly with ActWorld's. This will clarify whether the memory mechanisms provide benefits beyond the dataset alone.

Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on new dataset and architecture with external empirical validation

The paper identifies data and memory bottlenecks, constructs an independent 100K interaction dataset with per-chunk CoT captions, proposes a hierarchical action-aware memory plus persistent bank, and reports experiments against navigation-only baselines. No derivation step reduces by construction to its inputs, no parameters are fitted then relabeled as predictions, no load-bearing self-citations or uniqueness theorems are invoked, and no ansatzes are smuggled via prior work. The central claims are supported by new data and architecture choices that remain falsifiable outside the fitted values, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that the identified data and memory bottlenecks are the primary causes of the navigation-interaction gap and that the proposed fixes address them without introducing new failure modes. No free parameters are mentioned; the two invented entities listed below are architectural components, not physical entities.

axioms (1)

domain assumption: Chunk-autoregressive framework is suitable for extending navigation models to object interactions.
Invoked when stating the model operates within a chunk-autoregressive framework.

invented entities (2)

Hierarchical action-aware memory (no independent evidence)
purpose: Routes history compression by interaction importance to prevent action-forgetting.
New memory design introduced to address the memory bottleneck.

Persistent memory bank (no independent evidence)
purpose: Maintains event-update and object-identity tokens across long rollouts.
Introduced to handle long-term object state consistency.

pith-pipeline@v0.9.1-grok · 5865 in / 1406 out tokens · 26329 ms · 2026-06-27T01:19:28.804828+00:00 · methodology

Appendix A Data Generation Pipeline

This section details the offline data preparation pipeline summarised in Fig. 5. All stages are run once per video ...

compute pairwise frame differences and summarise the dominant motion
determine whether subject and object are in physical contact
check whether the supplied action label is consistent with the observed motion (otherwise emit ACTION_MISMATCH)
assign an interaction-phase tag $y^{\mathrm{ph}}_k \in P$ from the taxonomy of §3.1
write a 1–2 sentence semantic description grounded solely in the per-frame evidence and ignoring camera motion.

The structured output schema is fixed: HAS_INTERACTION ∈ {yes, no} (the interaction flag $y^{\mathrm{int}}_k$), ACTION_MISMATCH ∈ {yes, no}, PHASE ∈ P (the phase label $y^{\mathrm{ph}}_k$), plus a free-form DESCRIPTION string. Together with the video-level action class a ∈ A fr...
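The four-field schema named above (HAS_INTERACTION, ACTION_MISMATCH, PHASE, DESCRIPTION) could be validated with a small parser along the following lines. This is a sketch under assumptions: the `KEY: value` line format, the `ChunkCaption` type, and the contents of `PHASES` are all invented here, since the excerpt gives neither the serialisation nor the phase taxonomy of §3.1.

```python
from dataclasses import dataclass

# Hypothetical phase taxonomy; the real set P is defined in §3.1 of the paper.
PHASES = {"approach", "contact", "manipulate", "release", "none"}


@dataclass(frozen=True)
class ChunkCaption:
    has_interaction: bool
    action_mismatch: bool
    phase: str
    description: str


def _yes_no(value: str) -> bool:
    if value not in ("yes", "no"):
        raise ValueError(f"expected yes/no, got {value!r}")
    return value == "yes"


def parse_caption(text: str) -> ChunkCaption:
    """Parse 'KEY: value' lines (an assumed wire format) into a ChunkCaption."""
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    phase = fields["PHASE"]
    if phase not in PHASES:
        raise ValueError(f"unknown phase {phase!r}")
    return ChunkCaption(
        has_interaction=_yes_no(fields["HAS_INTERACTION"]),
        action_mismatch=_yes_no(fields["ACTION_MISMATCH"]),
        phase=phase,
        description=fields["DESCRIPTION"],
    )


example = """
HAS_INTERACTION: yes
ACTION_MISMATCH: no
PHASE: contact
DESCRIPTION: The hand grasps the door handle.
"""
caption = parse_caption(example)
print(caption)
```

Rejecting anything outside the fixed vocabulary at parse time keeps malformed captions out of the training set instead of letting them pass silently.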

