ActWorld: From Explorable to Interactive World Model via Action-Aware Memory
Pith reviewed 2026-06-27 01:19 UTC · model grok-4.3
The pith
ActWorld extends world models to support object interactions by fixing data scarcity and action-forgetting in memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ActWorld shows that the navigation-interaction gap arises from a data bottleneck and a memory bottleneck; these are resolved by a 100K interaction video dataset and by hierarchical action-aware memory plus a persistent memory bank that maintains causal event tokens across long rollouts, enabling one model to handle both navigation and object interaction without loss of viewpoint control.
What carries the argument
Hierarchical action-aware memory that routes history compression by interaction importance, together with a persistent memory bank that maintains event-update and object-identity tokens.
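The routing idea can be pictured with a toy sketch. This is not the paper's implementation: it assumes each history chunk carries a scalar interaction-importance score, and the names `Chunk`, `recency_keep`, and `action_aware_keep` are hypothetical. It only contrasts which chunks stay uncompressed under a fixed budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int          # position in the rollout (0 = oldest)
    importance: float   # hypothetical interaction-importance score in [0, 1]


def recency_keep(history, budget):
    """Recency-biased baseline: keep only the newest `budget` chunks in full."""
    if budget <= 0:
        return []
    return sorted(c.index for c in history[-budget:])


def action_aware_keep(history, budget, reserved=1):
    """Reserve slots for the most interaction-relevant chunks,
    then fill the remaining slots with the most recent chunks."""
    if budget <= 0:
        return []
    reserved = min(reserved, budget)
    # Highest importance first; ties broken in favour of newer chunks.
    by_importance = sorted(history, key=lambda c: (-c.importance, -c.index))
    kept = {c.index for c in by_importance[:reserved]}
    for chunk in reversed(history):
        if len(kept) >= budget:
            break
        kept.add(chunk.index)
    return sorted(kept)


# Eight chunks of plain navigation, with one interaction event in chunk 2.
history = [Chunk(i, 0.05) for i in range(8)]
history[2] = Chunk(2, 0.95)

print(recency_keep(history, 3))       # chunk 2 is no longer kept in full
print(action_aware_keep(history, 3))  # chunk 2 is retained alongside recent chunks
```

With a budget of three chunks, the recency rule drops the interaction chunk as soon as three newer chunks exist, while the importance-aware rule keeps it. How the actual model scores importance and compresses the rest is not specified on this page.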
If this is right
- A single model can now generate both viewpoint changes and physical object responses in one forward pass.
- Object states remain consistent across extended rollouts because event-update tokens are not overwritten by recency bias.
- Interaction fidelity rises while navigation quality stays intact, removing the need to switch between separate navigation and interaction generators.
- Mid-rollout actions become feasible inside chunk-autoregressive video generation without external prompting.
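The consistency claim above can be illustrated with a toy contrast between a sliding window and a persistent store. This is a sketch under stated assumptions, not the paper's design: `PersistentBank`, `rollout`, the window size, and the string-valued states are all hypothetical stand-ins for the event-update and object-identity tokens.

```python
from collections import deque


class PersistentBank:
    """Toy persistent memory: one slot per object identity, overwritten only
    by a newer event on the same object, never evicted by recency."""

    def __init__(self):
        self._slots = {}

    def write_event(self, object_id, state, step):
        self._slots[object_id] = {"state": state, "step": step}

    def read(self, object_id):
        slot = self._slots.get(object_id)
        return None if slot is None else slot["state"]


def rollout(events, num_steps, window=4):
    """`events` maps step -> (object_id, new_state).
    Returns the recency window and the persistent bank after the rollout."""
    recent = deque(maxlen=window)   # recency-only memory
    bank = PersistentBank()
    for step in range(num_steps):
        event = events.get(step)
        recent.append((step, event))
        if event is not None:
            bank.write_event(event[0], event[1], step)
    return recent, bank


# A door is opened at step 3; the rollout continues for 20 steps.
recent, bank = rollout({3: ("door", "open")}, num_steps=20)
window_remembers = any(event is not None for _, event in recent)

print(window_remembers)   # whether the recency window still holds the event
print(bank.read("door"))  # state held by the persistent bank
```

After 20 steps the four-step window contains no trace of the door event, whereas the bank still returns the door's last written state.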
Where Pith is reading between the lines
- The same memory routing principle could be tested on domains that require tracking multiple simultaneous object changes, such as multi-agent scenes.
- Persistent identity tokens may allow the model to handle re-entrant objects that leave and return to view without identity drift.
- If the chain-of-thought captions prove critical, replacing them with weaker labels would be expected to degrade interaction precision in direct proportion to label density.
Load-bearing premise
The navigation-interaction gap is caused mainly by missing dense interaction labels and by recency-biased memory that forgets the frames linking actions to later object states.
What would settle it
A controlled ablation in which the same 100K dataset is used but the persistent memory bank is removed, followed by measurement of whether object-state consistency collapses after the first interaction in rollouts longer than 30 seconds.
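One possible way to score such an ablation is sketched below. The metric definitions, function names, chunk duration, and both state traces are illustrative assumptions; the traces are made-up inputs, not results from the paper.

```python
def state_consistency(observed_states, expected_state, first_interaction_idx):
    """Fraction of chunks after the first interaction whose object state
    matches the state the interaction should have produced."""
    after = observed_states[first_interaction_idx + 1:]
    if not after:
        raise ValueError("no chunks after the interaction")
    return sum(s == expected_state for s in after) / len(after)


def first_drift_time(observed_states, expected_state,
                     first_interaction_idx, chunk_seconds):
    """Seconds after the interaction at which the state first deviates,
    or None if it never does."""
    for i in range(first_interaction_idx + 1, len(observed_states)):
        if observed_states[i] != expected_state:
            return (i - first_interaction_idx) * chunk_seconds
    return None


# Made-up 40 s rollouts at 2 s per chunk; the door is opened in chunk 2.
with_bank = ["closed"] * 3 + ["open"] * 17
without_bank = ["closed"] * 3 + ["open"] * 5 + ["closed"] * 12

for name, trace in [("with bank", with_bank), ("without bank", without_bank)]:
    print(name,
          state_consistency(trace, "open", 2),
          first_drift_time(trace, "open", 2, chunk_seconds=2.0))
```

In a real evaluation the per-chunk object state would have to come from a detector or human annotation, which this sketch leaves out.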
Original abstract
Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ActWorld, an interactive world model that extends navigation-centric generators to support mid-rollout object interactions within a chunk-autoregressive framework. It identifies a data bottleneck (lack of dense human-object interaction labels) and a memory bottleneck (recency-biased history compression causing action-forgetting), addresses them via a new 100K interaction video dataset with per-chunk CoT captions plus a hierarchical action-aware memory design and persistent memory bank for event-update and object-identity tokens, and reports that experiments show improved interaction fidelity over navigation-only baselines without sacrificing viewpoint control.
Significance. If the results hold after isolating the architectural contribution, the work would be significant for enabling richer, actionable world models that combine flexible navigation with object interactions in a single model, moving beyond navigation-only or game-domain limitations toward more general interactive simulation.
major comments (1)
- [Experiments] Experiments section: the central claim that the hierarchical action-aware memory and persistent bank resolve the memory bottleneck (action-forgetting via recency bias) beyond what the new data provides is not isolated. The reported gains are only versus navigation-only baselines (which lack the 100K interaction dataset with per-chunk CoT captions); no ablation trains a recency-biased baseline on the new data to test whether the memory mechanisms, rather than the dataset alone, drive the interaction fidelity improvements. This directly undercuts the claim that the proposed mechanisms close the navigation-interaction gap.
minor comments (1)
- [Abstract] The abstract states that experiments show substantial improvements but gives no quantitative metrics, error bars, baseline details, or ablation descriptions, which makes the magnitude and robustness of the reported gains hard to evaluate.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment on experimental isolation below and will revise the manuscript accordingly.
- Referee: [Experiments] Experiments section: the central claim that the hierarchical action-aware memory and persistent bank resolve the memory bottleneck (action-forgetting via recency bias) beyond what the new data provides is not isolated. The reported gains are only versus navigation-only baselines (which lack the 100K interaction dataset with per-chunk CoT captions); no ablation trains a recency-biased baseline on the new data to test whether the memory mechanisms, rather than the dataset alone, drive the interaction fidelity improvements. This directly undercuts the claim that the proposed mechanisms close the navigation-interaction gap.
  Authors: We agree that the current experimental design does not fully isolate the contribution of the hierarchical action-aware memory and persistent bank from that of the new 100K dataset. The navigation-only baselines lack both the interaction data and the proposed memory mechanisms, so the reported gains cannot be attributed solely to the architectural changes. To address this, the revised manuscript will add an ablation that trains a recency-biased baseline on the new 100K interaction dataset (with per-chunk CoT captions) and compares its interaction fidelity directly with ActWorld's. This will clarify whether the memory mechanisms provide benefits beyond the dataset alone.
  revision: yes
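The isolation the referee asks for amounts to a two-factor design. The sketch below is only an illustration of that bookkeeping: the function names and the difference-based attribution are assumptions, and any scores passed in are supplied by the caller.

```python
from itertools import product


def ablation_grid():
    """Enumerate the four data x memory configurations."""
    return [
        {"interaction_data": data, "action_aware_memory": memory}
        for data, memory in product([False, True], repeat=2)
    ]


def decompose(scores):
    """`scores` maps (interaction_data, action_aware_memory) -> metric value.
    Returns the gain from the data alone and the further gain from the
    memory mechanisms once the data is held fixed."""
    data_gain = scores[(True, False)] - scores[(False, False)]
    memory_gain = scores[(True, True)] - scores[(True, False)]
    return {"data_gain": data_gain, "memory_gain_given_data": memory_gain}


for config in ablation_grid():
    print(config)
```

The cell with `interaction_data=True` and `action_aware_memory=False` is the recency-biased baseline trained on the new dataset, which is the run the current comparison lacks. `decompose` needs that cell's score before any gain can be attributed to the memory design.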
Circularity Check
No circularity; claims rest on new dataset and architecture with external empirical validation
The paper identifies data and memory bottlenecks, constructs an independent 100K interaction dataset with per-chunk CoT captions, proposes a hierarchical action-aware memory plus persistent bank, and reports experiments against navigation-only baselines. No derivation step reduces by construction to its inputs, no parameters are fitted then relabeled as predictions, no load-bearing self-citations or uniqueness theorems are invoked, and no ansatzes are smuggled via prior work. The central claims are supported by new data and architecture choices that remain falsifiable outside the fitted values, making the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Chunk-autoregressive framework is suitable for extending navigation models to object interactions.
invented entities (2)
- Hierarchical action-aware memory: no independent evidence
- Persistent memory bank: no independent evidence
Appendix A: Data Generation Pipeline
This section details the offline data preparation pipeline summarised in Fig. 5. All stages are run once per video ...
- compute pairwise frame differences and summarise the dominant motion
- determine whether subject and object are in physical contact
- check whether the supplied action label is consistent with the observed motion (otherwise emit ACTION_MISMATCH)
- assign an interaction-phase tag $y_k^{\mathrm{ph}} \in \mathcal{P}$ from the taxonomy of §3.1
- write a 1–2 sentence semantic description grounded solely in the per-frame evidence and ignoring camera motion.
The structured output schema is fixed: HAS_INTERACTION ∈ {yes, no} (the interaction flag $y_k^{\mathrm{int}}$), ACTION_MISMATCH ∈ {yes, no}, PHASE ∈ $\mathcal{P}$ (the phase label $y_k^{\mathrm{ph}}$), plus a free-form DESCRIPTION string. Together with the video-level action class $a \in \mathcal{A}$ fr...
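A fixed schema like this lends itself to strict validation before captions enter training. The sketch below shows one way that could look. It assumes a `KEY: value` line layout, which the page does not specify, and `PHASES` is a placeholder because the taxonomy of §3.1 is not reproduced here; `parse_caption` is a hypothetical helper.

```python
import re

# Placeholder phase labels: the real taxonomy P is defined in the paper's
# Section 3.1 and is not available on this page.
PHASES = frozenset({"phase_a", "phase_b"})

REQUIRED = ("HAS_INTERACTION", "ACTION_MISMATCH", "PHASE", "DESCRIPTION")
# Assumed layout: one "KEY: value" pair per line.
FIELD = re.compile(r"^(" + "|".join(REQUIRED) + r")\s*:\s*(.*)$")


def parse_caption(text, phases=PHASES):
    """Parse and validate one per-chunk caption; raise ValueError if malformed."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()

    missing = [key for key in REQUIRED if key not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key in ("HAS_INTERACTION", "ACTION_MISMATCH"):
        if fields[key] not in ("yes", "no"):
            raise ValueError(f"{key} must be 'yes' or 'no', got {fields[key]!r}")
    if fields["PHASE"] not in phases:
        raise ValueError(f"unknown phase: {fields['PHASE']!r}")

    return {
        "has_interaction": fields["HAS_INTERACTION"] == "yes",
        "action_mismatch": fields["ACTION_MISMATCH"] == "yes",
        "phase": fields["PHASE"],
        "description": fields["DESCRIPTION"],
    }


sample = "\n".join([
    "HAS_INTERACTION: yes",
    "ACTION_MISMATCH: no",
    "PHASE: phase_a",
    "DESCRIPTION: A hand lifts the plate from the table.",
])
print(parse_caption(sample))
```

Rejecting malformed captions outright, rather than repairing them, keeps label noise visible; whether the authors filter this way is not stated.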