
arxiv: 2606.17730 · v1 · pith:IDZYWIAWnew · submitted 2026-06-16 · 💻 cs.CV

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

Pith reviewed 2026-06-27 01:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords interactive world models · action-aware memory · object interaction · video generation · hierarchical memory · persistent memory bank · human-object interaction · chunk-autoregressive generation

The pith

ActWorld extends world models to support object interactions by fixing data scarcity and action-forgetting in memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current world models remain limited to navigation because they lack dense interaction data and because their memory compression discards the event frames that determine later object states. It constructs a 100K video dataset of human-object interactions annotated with chain-of-thought per-chunk captions and introduces a hierarchical action-aware memory that routes compression according to interaction importance together with a persistent bank that keeps event-update and object-identity tokens. The resulting single model performs both flexible viewpoint control and mid-rollout actions such as opening doors or picking up objects, raising interaction fidelity over navigation-only baselines. A reader would care because this turns passive visual exploration into actionable simulation inside generated environments.

Core claim

ActWorld shows that the navigation-interaction gap arises from a data bottleneck and a memory bottleneck; these are resolved by a 100K interaction video dataset and by hierarchical action-aware memory plus a persistent memory bank that maintains causal event tokens across long rollouts, enabling one model to handle both navigation and object interaction without loss of viewpoint control.

What carries the argument

Hierarchical action-aware memory that routes history compression by interaction importance, together with a persistent memory bank that maintains event-update and object-identity tokens.
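
As a rough illustration of the idea described above, the following Python sketch routes each history chunk to a compression level by a scalar importance score and keeps event updates in a separate store that recency never evicts. It is a toy under stated assumptions, not the authors' implementation: the `Chunk` fields, the thresholds `high` and `low`, the average-pooling compressor, and the dictionary-based bank are all hypothetical stand-ins for mechanisms the summary names but does not specify.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    index: int
    tokens: list[float]            # stand-in for latent tokens of one video chunk
    importance: float              # hypothetical interaction-importance score in [0, 1]
    event: Optional[tuple[str, str]] = None  # hypothetical (object_id, new_state)


def compress(tokens: list[float], keep: int) -> list[float]:
    """Average-pool `tokens` down to at most `keep` values."""
    if keep >= len(tokens):
        return list(tokens)
    size = -(-len(tokens) // keep)  # ceiling division: tokens per pooled group
    groups = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [sum(g) / len(g) for g in groups]


def route(chunk: Chunk, high: float = 0.7, low: float = 0.3) -> int:
    """Pick a token budget from importance (thresholds are assumptions)."""
    if chunk.importance >= high:
        return len(chunk.tokens)              # keep interaction chunks in full
    if chunk.importance >= low:
        return max(1, len(chunk.tokens) // 2)  # halve moderately important chunks
    return 1                                   # collapse pure navigation to one token


@dataclass
class ActionAwareMemory:
    history: dict[int, list[float]] = field(default_factory=dict)
    bank: dict[str, str] = field(default_factory=dict)  # persistent object states

    def write(self, chunk: Chunk) -> None:
        self.history[chunk.index] = compress(chunk.tokens, route(chunk))
        if chunk.event is not None:
            object_id, state = chunk.event
            self.bank[object_id] = state       # survives regardless of chunk age


memory = ActionAwareMemory()
memory.write(Chunk(0, [1.0] * 8, importance=0.1))
memory.write(Chunk(1, [2.0] * 8, importance=0.9, event=("door", "open")))
memory.write(Chunk(2, [3.0] * 8, importance=0.5))
print({i: len(t) for i, t in memory.history.items()}, memory.bank)
```

In this toy, later navigation chunks shrink to a single token while the door's state stays in `bank`, which is the behaviour the summary attributes to the persistent memory bank.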

If this is right

  • A single model can now generate both viewpoint changes and physical object responses within the same rollout.
  • Object states remain consistent across extended rollouts because event-update tokens are not overwritten by recency bias.
  • Interaction fidelity rises while navigation quality stays intact, removing the need to switch between separate navigation and interaction generators.
  • Mid-rollout actions become feasible inside chunk-autoregressive video generation, rather than only in prompt-to-full-video settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory routing principle could be tested on domains that require tracking multiple simultaneous object changes, such as multi-agent scenes.
  • Persistent identity tokens may allow the model to handle re-entrant objects that leave and return to view without identity drift.
  • If the chain-of-thought captions prove critical, replacing them with weaker labels would be expected to degrade interaction precision as label density falls.

Load-bearing premise

The navigation-interaction gap is caused mainly by missing dense interaction labels and by recency-biased memory that forgets the frames linking actions to later object states.

What would settle it

A controlled ablation in which the same 100K dataset is used but the persistent memory bank is removed, followed by measurement of whether object-state consistency collapses after the first interaction in rollouts longer than 30 seconds.
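
One way to make "object-state consistency" in the test above measurable is sketched below. This is only an assumption about how such a check might be scored: the per-frame state labels are taken to come from some hypothetical object-state classifier, and the function name, arguments, and the 30-second cutoff default are illustrative choices rather than anything specified by the paper.

```python
from typing import Optional, Sequence


def post_event_consistency(
    states: Sequence[str],
    event_frame: int,
    expected_state: str,
    fps: float,
    min_seconds: float = 30.0,
) -> Optional[float]:
    """Fraction of frames after `event_frame` that still show `expected_state`.

    `states` holds one label per generated frame (hypothetical classifier output).
    Returns None when the rollout is not longer than `min_seconds` or when no
    frames follow the event, since the test only concerns long rollouts.
    """
    if len(states) / fps <= min_seconds:
        return None
    after = states[event_frame + 1:]
    if not after:
        return None
    return sum(s == expected_state for s in after) / len(after)
```

Comparing this score between the full model and the variant without the persistent bank, on the same rollouts, would show whether consistency collapses once the bank is removed.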

read the original abstract

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces ActWorld, an interactive world model that extends navigation-centric generators to support mid-rollout object interactions within a chunk-autoregressive framework. It identifies a data bottleneck (lack of dense human-object interaction labels) and a memory bottleneck (recency-biased history compression causing action-forgetting), addresses them via a new 100K interaction video dataset with per-chunk CoT captions plus a hierarchical action-aware memory design and persistent memory bank for event-update and object-identity tokens, and reports that experiments show improved interaction fidelity over navigation-only baselines without sacrificing viewpoint control.

Significance. If the results hold after isolating the architectural contribution, the work would be significant for enabling richer, actionable world models that combine flexible navigation with object interactions in a single model, moving beyond navigation-only or game-domain limitations toward more general interactive simulation.

major comments (1)
  1. [Experiments] Experiments section: the central claim that the hierarchical action-aware memory and persistent bank resolve the memory bottleneck (action-forgetting via recency bias) beyond what the new data provides is not isolated. The reported gains are only versus navigation-only baselines (which lack the 100K interaction dataset with per-chunk CoT captions); no ablation trains a recency-biased baseline on the new data to test whether the memory mechanisms, rather than the dataset alone, drive the interaction fidelity improvements. This directly undercuts the claim that the proposed mechanisms close the navigation-interaction gap.
minor comments (1)
  1. [Abstract] Abstract: states that experiments show substantial improvements but provides no quantitative metrics, error bars, baseline details, or ablation descriptions, which weakens the ability to evaluate the magnitude and robustness of the reported gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on experimental isolation below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that the hierarchical action-aware memory and persistent bank resolve the memory bottleneck (action-forgetting via recency bias) beyond what the new data provides is not isolated. The reported gains are only versus navigation-only baselines (which lack the 100K interaction dataset with per-chunk CoT captions); no ablation trains a recency-biased baseline on the new data to test whether the memory mechanisms, rather than the dataset alone, drive the interaction fidelity improvements. This directly undercuts the claim that the proposed mechanisms close the navigation-interaction gap.

    Authors: We agree that the current experimental design does not fully isolate the contribution of the hierarchical action-aware memory and persistent bank from that of the new 100K dataset. The navigation-only baselines lack both the interaction data and the proposed memory mechanisms, so the reported gains cannot be attributed solely to the architectural changes. To address this, we will add an ablation to the revised manuscript: we will train a recency-biased baseline on the new 100K interaction dataset (with per-chunk CoT captions) and compare its interaction fidelity directly with that of ActWorld. This will clarify whether the memory mechanisms provide benefits beyond the dataset alone.

    revision: yes
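
The arithmetic behind the proposed ablation can be sketched as follows. This is a minimal illustration, assuming a single higher-is-better interaction-fidelity score per configuration; the function and argument names are hypothetical, and no numbers from the paper are used.

```python
def attribute_gain(
    nav_baseline: float,
    recency_on_new_data: float,
    full_model: float,
) -> dict[str, float]:
    """Split the total gain over the navigation-only baseline into two parts.

    nav_baseline:        navigation-only model (old data, recency-biased memory)
    recency_on_new_data: recency-biased memory trained on the new interaction data
    full_model:          new data plus the action-aware memory and persistent bank
    """
    return {
        "data": recency_on_new_data - nav_baseline,    # gain explained by the dataset
        "memory": full_model - recency_on_new_data,    # gain explained by the memory design
        "total": full_model - nav_baseline,
    }
```

If the "memory" term is close to zero, the dataset alone accounts for the improvement, which is the possibility the referee raises.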

Circularity Check

0 steps flagged

No circularity; claims rest on new dataset and architecture with external empirical validation

full rationale

The paper identifies data and memory bottlenecks, constructs an independent 100K interaction dataset with per-chunk CoT captions, proposes a hierarchical action-aware memory plus persistent bank, and reports experiments against navigation-only baselines. No derivation step reduces by construction to its inputs, no parameters are fitted then relabeled as predictions, no load-bearing self-citations or uniqueness theorems are invoked, and no ansatzes are smuggled via prior work. The central claims are supported by new data and architecture choices that remain falsifiable outside the fitted values, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that the identified data and memory bottlenecks are the primary causes of the navigation-interaction gap and that the proposed fixes address them without introducing new failure modes. No free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption: Chunk-autoregressive framework is suitable for extending navigation models to object interactions.
    Invoked when stating the model operates within a chunk-autoregressive framework.
invented entities (2)
  • Hierarchical action-aware memory (no independent evidence)
    purpose: Routes history compression by interaction importance to prevent action-forgetting.
    New memory design introduced to address the memory bottleneck.
  • Persistent memory bank (no independent evidence)
    purpose: Maintains event-update and object-identity tokens across long rollouts.
    Introduced to handle long-term object state consistency.

pith-pipeline@v0.9.1-grok · 5865 in / 1406 out tokens · 26329 ms · 2026-06-27T01:19:28.804828+00:00 · methodology


Appendix A Data Generation Pipeline

This section details the offline data preparation pipeline summarised in Fig. 5. All stages are run once per video ...

  • compute pairwise frame differences and summarise the dominant motion
  • determine whether subject and object are in physical contact
  • check whether the supplied action label is consistent with the observed motion (otherwise emit ACTION_MISMATCH)
  • assign an interaction-phase tag y_k^{ph} ∈ P from the taxonomy of §3.1
  • write a 1–2 sentence semantic description grounded solely in the per-frame evidence and ignoring camera motion

The structured output schema is fixed: HAS_INTERACTION ∈ {yes, no} (the interaction flag y_k^{int}), ACTION_MISMATCH ∈ {yes, no}, PHASE ∈ P (the phase label y_k^{ph}), plus a free-form DESCRIPTION string. Together with the video-level action class a ∈ A fr...
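
The fixed output schema described above could be validated with a small parser along the following lines. This is a sketch under assumptions: the `KEY: value` line layout, the `ChunkCaption` and `parse_caption` names, and the contents of `PHASES` are hypothetical, since the source names the four fields and the set P but gives neither the serialisation nor the phase taxonomy.

```python
from dataclasses import dataclass

# Hypothetical phase taxonomy; the source only refers to a set P defined in §3.1.
PHASES = {"approach", "contact", "manipulate", "release"}
REQUIRED = {"HAS_INTERACTION", "ACTION_MISMATCH", "PHASE", "DESCRIPTION"}


@dataclass
class ChunkCaption:
    has_interaction: bool
    action_mismatch: bool
    phase: str
    description: str


def _yes_no(name: str, value: str) -> bool:
    """Map 'yes'/'no' to a bool and reject anything else."""
    v = value.strip().lower()
    if v not in {"yes", "no"}:
        raise ValueError(f"{name} must be yes or no, got {value!r}")
    return v == "yes"


def parse_caption(text: str) -> ChunkCaption:
    """Parse one per-chunk caption written as `KEY: value` lines (assumed layout)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)  # split once so descriptions may contain colons
        fields[key.strip().upper()] = value.strip()
    missing = REQUIRED - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    phase = fields["PHASE"].lower()
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    return ChunkCaption(
        has_interaction=_yes_no("HAS_INTERACTION", fields["HAS_INTERACTION"]),
        action_mismatch=_yes_no("ACTION_MISMATCH", fields["ACTION_MISMATCH"]),
        phase=phase,
        description=fields["DESCRIPTION"],
    )
```

Rejecting malformed captions at parse time keeps chunks with an unrecognised phase or a non-binary flag out of the training set instead of silently coercing them.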