Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
Pith reviewed 2026-05-20 19:58 UTC · model grok-4.3
The pith
Echo-Forcing separates stable scene anchors from recent dynamics in KV caches to support prompt switches and long-range recalls in interactive video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that functional entanglement of historical KV states is the root cause of contamination and memory loss in long interactive video generation. Echo-Forcing counters this with Hierarchical Temporal Memory that decouples stable anchors, compressed history, and recent windows under relative RoPE; Scene Recall Frames that turn past scenes into spatially structured KV representations for recall; and Difference-aware Memory Decay that forgets tokens based on scene mismatch. Together these keep cache use bounded while supporting smooth transitions, hard cuts, and distant scene recall, leading to top performance on VBench-Long for both long-video generation and interactive prompt-sw
What carries the argument
Echo-Forcing scene memory framework, built from Hierarchical Temporal Memory, Scene Recall Frames, and Difference-aware Memory Decay, that manages KV states by separating stable and dynamic elements.
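The separation into stable anchors, compressed history, and a recent window can be pictured as a tiered cache with fixed capacities. The following is a minimal sketch under assumptions, not the authors' implementation: the class name `TieredCache`, the capacities, the first-frames-as-anchors rule, and the oldest-first eviction are all hypothetical, and frames are stood in for by integer IDs rather than real key/value tensors.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TieredCache:
    """Hypothetical three-tier cache: anchors, compressed history, recent window."""

    max_anchors: int = 2
    max_compressed: int = 4
    max_recent: int = 3
    anchors: list = field(default_factory=list)
    compressed: deque = field(default_factory=deque)
    recent: deque = field(default_factory=deque)

    def append(self, frame_id: int) -> None:
        # Assumption: the first frames are pinned as anchors and never evicted.
        if len(self.anchors) < self.max_anchors:
            self.anchors.append(frame_id)
            return
        self.recent.append(frame_id)
        # A frame leaving the recent window is demoted rather than dropped.
        if len(self.recent) > self.max_recent:
            self.compressed.append(self.recent.popleft())
        # Once the compressed tier is full, its oldest entry is evicted.
        if len(self.compressed) > self.max_compressed:
            self.compressed.popleft()

    def layout(self) -> list:
        """Return (frame_id, relative_position) pairs in cache order."""
        frames = self.anchors + list(self.compressed) + list(self.recent)
        # Positions are re-indexed from 0, so they depend on the cache slot
        # and not on the absolute frame index in the video.
        return [(frame_id, pos) for pos, frame_id in enumerate(frames)]

    def size(self) -> int:
        return len(self.anchors) + len(self.compressed) + len(self.recent)


cache = TieredCache()
for frame_id in range(100):
    cache.append(frame_id)
```

However many frames are appended, `cache.size()` never exceeds the sum of the three capacities, which is the bounded-budget property in miniature. Actual compression of demoted frames is not modeled here.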
If this is right
- Videos can switch prompts mid-generation while keeping background consistency without growing memory use.
- Distant historical scenes become recallable through compressed structured representations.
- The same bounded cache works for both gradual transitions and sudden scene changes.
- No additional training is needed to achieve these interactive capabilities.
Where Pith is reading between the lines
- Similar decoupling of stable and recent states could be tested in other autoregressive models that use KV caches, such as long audio or text generation.
- The bounded-cache property suggests the method may scale to very long sequences where memory limits become critical.
- Interactive applications might benefit from combining this memory design with user-driven control interfaces.
Load-bearing premise
The three proposed mechanisms can resolve entanglement of historical KV states without creating new artifacts or lowering frame quality.
What would settle it
The central claim would be falsified by a controlled test on VBench-Long interactive sequences in which Echo-Forcing shows more background contamination, slower prompt response, or worse scene recall than a simple KV cache baseline.
Original abstract
Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, which makes it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at https://github.com/mingqiangWu/Echo-Forcing
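To make the third mechanism concrete, the idea of forgetting cached tokens in proportion to how much they conflict with the new scene can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: the function names, the use of cosine similarity as the discrepancy measure, the linear decay rule, and the `strength` parameter are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def decay_weights(old_tokens, new_scene, strength=1.0):
    """Return one retention weight in [0, 1] per cached token.

    Assumed rule: discrepancy = (1 - cos) / 2 lies in [0, 1], and the
    retention weight is 1 - strength * discrepancy, clamped at 0.
    """
    weights = []
    for token in old_tokens:
        discrepancy = (1.0 - cosine_similarity(token, new_scene)) / 2.0
        weights.append(max(0.0, 1.0 - strength * discrepancy))
    return weights


# Three cached tokens: aligned with, orthogonal to, and opposed to the new scene.
old_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
new_scene = [1.0, 0.0]
weights = decay_weights(old_tokens, new_scene)
```

Under this toy rule the aligned token is fully retained, the orthogonal one is half retained, and the opposed one is forgotten. A hard cut would correspond to most cached tokens falling near the opposed end.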
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Echo-Forcing, a training-free scene memory framework for interactive long video generation in autoregressive video diffusion models. It identifies functional entanglement of historical KV states as the core bottleneck leading to outdated background contamination, delayed prompt response, and loss of long-range memory. Three mechanisms are proposed: Hierarchical Temporal Memory (decoupling stable anchors, compressed history, and recent windows under relative RoPE), Scene Recall Frames (spatially structured compression of historical scenes for long-term recall), and Difference-aware Memory Decay (adaptive forgetting of conflicting tokens based on scene discrepancy). The paper claims these enable uniform support for smooth transitions, hard cuts, and long-range recall under bounded cache, with best overall performance on VBench-Long in both long-video and interactive settings. Code is released.
Significance. If the central claims hold, the work would advance interactive long-video generation by offering a practical training-free approach to KV cache management that explicitly handles scene changes and recall. The open release of code is a clear strength for reproducibility and follow-up research. The framework addresses a timely limitation in diffusion-based video models, though its impact depends on stronger empirical grounding for the performance and artifact-free claims.
major comments (3)
- [Abstract] Abstract: the claim that Echo-Forcing 'achieves the best overall performance' on VBench-Long is made without any quantitative metrics, ablation results, error bars, or details on how post-hoc scene conflicts were measured, leaving the central performance claim weakly supported by the provided text.
- [Method] Method (Scene Recall Frames): the compression step could discard high-frequency spatial details needed for accurate recall; the manuscript should demonstrate through targeted experiments that this does not degrade long-range scene recall or introduce artifacts under hard cuts.
- [Method] Method (Difference-aware Memory Decay): the discrepancy metric could misclassify tokens during rapid scene changes; evaluations must isolate hard-cut cases to confirm the mechanism resolves entanglement without new artifacts or incomplete forgetting.
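The concern about Scene Recall Frames can be made concrete with a toy example. The sketch below assumes the compression behaves like non-overlapping average pooling over a spatial grid, which the text reviewed here does not confirm; `average_pool` and both test grids are hypothetical, and each grid cell holds a single scalar rather than a key/value vector.

```python
def average_pool(grid, factor):
    """Downsample a 2D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows, factor):
        pooled_row = []
        for c in range(0, cols, factor):
            block = [
                grid[i][j]
                for i in range(r, min(r + factor, rows))
                for j in range(c, min(c + factor, cols))
            ]
            pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled


# High-frequency pattern: values alternate between 0 and 1 at every cell.
checkerboard = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
# Low-frequency pattern: left half 0, right half 1.
half_split = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]

pooled_checker = average_pool(checkerboard, 2)
pooled_split = average_pool(half_split, 2)
```

Pooling flattens the checkerboard to a uniform 0.5 while the coarse left/right split survives intact. That is the kind of high-frequency loss the referee asks the authors to rule out for their actual compression step.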
minor comments (2)
- The abstract and method descriptions would benefit from a clear diagram illustrating the three mechanisms and their interaction with the KV cache under different transition types.
- Notation for 'bounded cache budget' and 'relative RoPE' should be defined more explicitly with reference to standard attention formulations to aid readability.
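Both terms named in the last comment have conventional forms. The following is a sketch of one possible notation, not the manuscript's: the symbols $A_t$, $C_t$, $W_t$ (anchor, compressed-history, and recent-window token sets at step $t$) and the budget $B$ are assumptions introduced here. The first identity is the standard relative-position property of rotary embeddings.

```latex
% Rotary embedding: R_m is the block-diagonal rotation applied at position m.
% Rotations compose, so the attention logit depends only on the offset n - m.
\[
  (R_m q)^{\top} (R_n k) \;=\; q^{\top} R_m^{\top} R_n\, k \;=\; q^{\top} R_{n-m}\, k
\]
% Bounded cache budget (assumed notation): at every generation step t,
% the three tiers together hold at most B tokens.
\[
  |A_t| + |C_t| + |W_t| \;\le\; B \qquad \text{for all } t
\]
```

Because only the offset $n - m$ enters the logit, cached tokens can be re-indexed by their slot in the cache rather than by their absolute frame index, which is presumably what makes a fixed-size cache compatible with arbitrarily long videos.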
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and indicate the revisions we will make to the manuscript.
- Referee: [Abstract] Abstract: the claim that Echo-Forcing 'achieves the best overall performance' on VBench-Long is made without any quantitative metrics, ablation results, error bars, or details on how post-hoc scene conflicts were measured, leaving the central performance claim weakly supported by the provided text.
  Authors: We agree that the abstract would be strengthened by including specific quantitative support. The full manuscript reports comprehensive VBench-Long results with comparisons to baselines, ablations, and both long-video and interactive settings (Sections 4.2–4.3 and Tables 1–3), along with the scene discrepancy metric used for post-hoc conflict measurement (Section 3.3). In the revised manuscript we will add key numerical improvements and reference to error bars directly in the abstract.
  revision: yes
- Referee: [Method] Method (Scene Recall Frames): the compression step could discard high-frequency spatial details needed for accurate recall; the manuscript should demonstrate through targeted experiments that this does not degrade long-range scene recall or introduce artifacts under hard cuts.
  Authors: We acknowledge the concern that spatial compression might lose high-frequency details. Our existing VBench-Long evaluations cover diverse transitions including hard cuts and long-range recall, but we agree that dedicated isolation of these cases would provide clearer evidence. In the revision we will add targeted experiments and visualizations that measure recall accuracy and artifact presence before and after compression specifically on hard-cut sequences.
  revision: yes
- Referee: [Method] Method (Difference-aware Memory Decay): the discrepancy metric could misclassify tokens during rapid scene changes; evaluations must isolate hard-cut cases to confirm the mechanism resolves entanglement without new artifacts or incomplete forgetting.
  Authors: We appreciate the suggestion to isolate hard-cut behavior. While our current experiments include rapid scene changes and report overall entanglement reduction, we agree that dedicated hard-cut isolation would more directly validate the discrepancy metric. In the revised manuscript we will add evaluations that separate hard-cut cases, quantifying forgetting completeness and checking for introduced artifacts via both quantitative metrics and qualitative examples.
  revision: yes
Circularity Check
No circularity: framework components are independently specified engineering proposals
The paper introduces Echo-Forcing as a training-free framework whose three mechanisms (Hierarchical Temporal Memory with relative RoPE, Scene Recall Frames for spatially structured compression, and Difference-aware Memory Decay) are defined explicitly as new components to decouple KV states. The abstract and description present these as direct responses to the stated bottleneck of functional entanglement, without any equations that define a claimed performance metric in terms of parameters fitted to the same data or that reduce the decoupling claim to a self-citation chain. Evaluations on the external VBench-Long benchmark supply independent measurement, so the derivation chain remains self-contained and does not collapse to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Autoregressive video diffusion models enable open-ended generation through local attention and KV caching.
Reference graph
Works this paper leans on
- [1] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [2] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [3] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
- [4] Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. VideoPoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
- [5] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
- [6] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
- [7] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- [8] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
- [9] Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, et al. Q-VDiT: Towards accurate quantization and distillation of video-generation diffusion transformers. arXiv preprint arXiv:2505.22167, 2025.
- [10] Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, et al. Quantsparse: Comprehensively compressing video diffusion transformer with model quantization and attention sparsification. arXiv preprint arXiv:2509.23681, 2025.
- [11] Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, and Yongjun Xu. S2Q-VDiT: Accurate quantized video diffusion transformer with salient data and sparse token distillation. arXiv preprint arXiv:2508.04016, 2025.
- [12] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025.
- [13] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
- [14] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-Forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025.
- [15] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.
- [16] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling Forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025.
- [17] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.
- [18] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.
- [19] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36:52342–52364, 2023.
- [20] Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649, 2025.
- [21] Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025.
- [22] Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion. arXiv preprint arXiv:2602.07775, 2026.
- [23] Youngrae Kim, Qixin Hu, C-C Jay Kuo, and Peter A Beerel. Memrope: Training-free infinite video generation via evolving memory tokens. arXiv preprint arXiv:2603.12513, 2026.
- [24] Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, and Wenya Wang. Light forcing: Accelerating autoregressive video diffusion via sparse attention. arXiv preprint arXiv:2602.04789, 2026.
- [25] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.
- [26] Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. Shotstream: Streaming multi-shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746, 2026.
- [27] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [28] Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report. arXiv preprint arXiv:2512.16776, 2025.
- [29] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In European Conference on Computer Vision, pages 393–411. Springer, 2024.
- [30] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [31] Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
- [32] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
- [33] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024.
- [34] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems, 37:47455–47487, 2024.
- [35] Jintao Chen, Chengyu Bai, Xinda Xue, Mu Xu, et al. Grounded forcing: Bridging time-independent semantics and proximal dynamics in autoregressive video synthesis. arXiv preprint arXiv:2604.06939, 2026.
- [36] Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678, 2025.
- [37] Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. arXiv preprint arXiv:2502.15894, 2025.
- [38] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [39] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pages 21875–21895, 2024.
- [40] Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Qipeng Guo, Yuerong Song, Kai Lv, Hang Yan, Linlin Li, Qun Liu, and Xipeng Qiu. Reattention: Training-free infinite context with finite attention scope. In International Conference on Learning Representations, volume 2025, pages 95458–95478, 2025.
- [41] Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, and Yan Lu. Efficient autoregressive video diffusion with dummy head. arXiv preprint arXiv:2601.20499, 2026.
- [42] Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, and Yongdong Zhang. Mask^2DiT: Dual mask-based diffusion transformer for multi-scene long video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18837–18846, 2025.
- [43] Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M Rehg, and Tobias Hinz. Shotadapter: Text-to-multi-shot video generation with diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28405–28415, 2025.
- [44] Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17281–17291, 2025.
- [45] Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural Information Processing Systems, 37:110315–110340, 2024.
- [46] Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, and Lu Jiang. Captain cinema: Towards short movie generation. In The Fourteenth International Conference on Learning Representations, 2025.
- [47] Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Qiao Yu, Wanli Ouyang, and Ziwei Liu. Cut2next: Generating next shot via in-context tuning. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.
- [48] Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, et al. Onestory: Coherent multi-shot video generation with adaptive memory. arXiv preprint arXiv:2512.07802, 2025.
- [49] Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025.
- [50] Dvir Samuel, Issar Tzachor, Matan Levy, Micahel Green, Gal Chechik, and Rami Ben-Ari. Fast autoregressive video diffusion and world models with temporal cache compression and sparse attention. arXiv preprint arXiv:2602.01801, 2026.
- [51] Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. TriAttention: Efficient long reasoning with trigonometric KV compression. arXiv preprint arXiv:2604.04921, 2026.
- [52] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
- [53] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.
- [54] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [55] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
- [56] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
A Dataset and evaluation details
A.1 Dataset construction
Our evaluation datasets are built upon prompts sampled from MovieGenBench [55]. For long-video g...