Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

Chuanguang Yang; Guoxin Fan; Haotong Qin; Libo Huang; Mingqiang Wu; Weilun Feng; Xiaokun Liu; Yongjun Xu; Yuqi Li; Zhefeng Zhang; Zhulin An

arxiv: 2605.16003 · v1 · pith:57YLFBIGnew · submitted 2026-05-15 · 💻 cs.CV

Mingqiang Wu, Weilun Feng, Zhefeng Zhang, Haotong Qin, Yuqi Li, Guoxin Fan, Xiaokun Liu, Zhulin An, Libo Huang, Yongjun Xu, Chuanguang Yang

Pith reviewed 2026-05-20 19:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords long video generation · interactive video generation · scene memory · KV caching · video diffusion models · training-free method · temporal memory · scene recall

The pith

Echo-Forcing separates stable scene anchors from recent dynamics in KV caches to support prompt switches and long-range recalls in interactive video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current autoregressive video diffusion models mix old and new information in the same KV cache, which causes outdated backgrounds, slow responses to new prompts, and forgotten distant scenes during interactive use. It proposes Echo-Forcing, a training-free framework whose three targeted mechanisms manage memory without growing the cache or requiring retraining. A reader would care because this makes extended video creation with changing instructions more reliable for tasks like story continuation or real-time editing. The same design is presented as supporting smooth transitions, abrupt cuts, and scene recall within one framework. Evaluations on VBench-Long are reported to favor Echo-Forcing over prior long-video methods in both standard and interactive settings.

Core claim

The authors claim that functional entanglement of historical KV states is the root cause of contamination and memory loss in long interactive video generation. Echo-Forcing counters this with Hierarchical Temporal Memory that decouples stable anchors, compressed history, and recent windows under relative RoPE; Scene Recall Frames that turn past scenes into spatially structured KV representations for recall; and Difference-aware Memory Decay that forgets tokens based on scene mismatch. Together these keep cache use bounded while supporting smooth transitions, hard cuts, and distant scene recall, leading to top performance on VBench-Long for both long-video generation and interactive prompt-sw
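To make the three-tier idea concrete, here is a minimal Python sketch of a bounded cache that keeps stable anchors, a compressed history, and a recent window in separate tiers. It is an illustration under assumptions, not the authors' implementation: the class name `TieredKVCache`, the tier sizes, the per-frame `score`, and the top-score eviction rule are all hypothetical stand-ins, and frame ids stand in for actual key/value tensors.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TieredKVCache:
    """Toy three-tier cache: anchors, compressed history, recent window.

    Tier sizes and the eviction rule are illustrative assumptions.
    """
    n_anchor: int = 2
    n_history: int = 4
    n_recent: int = 3
    anchors: list = field(default_factory=list)
    history: list = field(default_factory=list)
    recent: deque = field(default_factory=deque)

    def append(self, frame_id, score=0.0):
        # The first frames become stable anchors and are never evicted.
        if len(self.anchors) < self.n_anchor:
            self.anchors.append(frame_id)
            return
        self.recent.append((frame_id, score))
        # Frames that fall out of the recent window move to the history tier.
        while len(self.recent) > self.n_recent:
            self.history.append(self.recent.popleft())
        # Keep only the top-scoring history entries (a stand-in for compression).
        if len(self.history) > self.n_history:
            self.history.sort(key=lambda item: item[1], reverse=True)
            del self.history[self.n_history:]
            self.history.sort(key=lambda item: item[0])  # restore temporal order

    def layout(self):
        """Pair each cached frame id with a cache-relative position 0..n-1."""
        ids = self.anchors + [f for f, _ in self.history] + [f for f, _ in self.recent]
        return list(enumerate(ids))

    def __len__(self):
        return len(self.anchors) + len(self.history) + len(self.recent)
```

However many frames are appended, the cache never holds more than `n_anchor + n_history + n_recent` entries, and `layout()` assigns positions by cache slot rather than by absolute frame index, which is one plausible reading of handling the tiers "under relative RoPE".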

What carries the argument

Echo-Forcing scene memory framework, built from Hierarchical Temporal Memory, Scene Recall Frames, and Difference-aware Memory Decay, that manages KV states by separating stable and dynamic elements.

If this is right

Videos can switch prompts mid-generation while keeping background consistency without growing memory use.
Distant historical scenes become recallable through compressed structured representations.
The same bounded cache works for both gradual transitions and sudden scene changes.
No additional training is needed to achieve these interactive capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar decoupling of stable and recent states could be tested in other autoregressive models that use KV caches, such as long audio or text generation.
The bounded-cache property suggests the method may scale to very long sequences where memory limits become critical.
Interactive applications might benefit from combining this memory design with user-driven control interfaces.

Load-bearing premise

The three proposed mechanisms can resolve entanglement of historical KV states without creating new artifacts or lowering frame quality.

What would settle it

A controlled test on VBench-Long interactive sequences where Echo-Forcing produces more background contamination, slower prompt response, or lost scene recall than a simple KV cache baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.16003 by Chuanguang Yang, Guoxin Fan, Haotong Qin, Libo Huang, Mingqiang Wu, Weilun Feng, Xiaokun Liu, Yongjun Xu, Yuqi Li, Zhefeng Zhang, Zhulin An.

**Figure 1.** Echo-Forcing enables autoregressive video diffusion models to support four interactive long-video generation modes: long-horizon generation, smooth transition, hard cut, and long-range scene recall, while maintaining temporal coherence and scene consistency.

**Figure 2.** Overview of the proposed Echo-Forcing framework. Our method integrates three scene-memory modules to preserve temporal continuity, recall historical scenes, and suppress conflicting memories during interactive long-video generation.

Excerpt: … defined as

$$A_r = \begin{cases} (E_{u_r}, E_{u_r+1}, \dots, E_{u_r+S-1}), & r \text{ is odd}, \\ (E_{u_r+S-1}, E_{u_r+S-2}, \dots, E_{u_r}), & r \text{ is even}, \end{cases} \qquad u_r = (rS) \bmod N_{\mathrm{anc}}, \tag{1}$$

where all indices are taken modulo $N_{\mathrm{anc}}$. Consec…

**Figure 3.** Visualization of historical token selection. Compared with alternative scoring strategies, our calibrated query with amplitude compensation and drift gating best matches the ground-truth future-query attention.

Excerpt: Drift gate. Although the calibration center provides a stable phase reference, the query distribution is not fixed throughout long-horizon generation. As the video evolves, recent queries may gradua…

**Figure 4.** Visualization of scene recall and memory decay. Echo-Forcing stores compact scene memories for recall and adaptively decays old memories according to old–new scene discrepancies.

Excerpt: 3.2 Scene Recall Frames. Interactive long-video generation requires compact scene-level memories that preserve useful priors without redundant historical noise. Storing all frames of a scene is costly and may introduce interference…

**Figure 5.** Qualitative comparison. Echo-Forcing improves long-horizon stability and interactive scene control across smooth transition, hard cut, scene recall, and long-video generation.

Excerpt: … ablation on Drift-Gated Phase Compression in the main paper, while the remaining studies are provided in Appendix C

**Figure 6.** Additional visualization of long-video generation. Echo-Forcing maintains subject identity, background structure, and visual fidelity over extended autoregressive rollouts.

Panel prompts: 0~10s: A girl reads a book in a library; 10~20s: She turns pages; 20~30s: She has tea; 30~50s: She looks out; 50~60s: She returns the book. 0~10s: A boy slices fruits in a kitchen; 10~20s: He spreads cream; 20~30s: He decorates cake; 30…

**Figure 7.** Additional visualization of smooth transitions. We show more examples of gradual prompt evolution under continuous scene dynamics. Echo-Forcing preserves compatible subject and scene priors across adjacent segments, producing smoother motion changes and more coherent visual transitions.

**Figure 8.** Additional visualization of hard cuts. We show more examples under abrupt semantic changes, where the subject is preserved while the background, action, or scene layout changes substantially. Echo-Forcing suppresses old-scene residuals and adapts more cleanly to the new prompt after each cut.

Panel labels: Deep-Forcing, ∞-RoPE, Echo-Forcing.

**Figure 9.** Qualitative comparison on 2-minute long-video generation. Echo-Forcing better preserves subject appearance, background coherence, and visual fidelity during 2-minute autoregressive rollout compared with representative baselines.

**Figure 10.** Visualization of Scene Recall Frames. Each row corresponds to one scene. The left image shows the scene reference, and the blue maps on the right visualize several recalled frames. Different recalled frames exhibit different temporal attention patterns, reflecting varying emphasis on scene information over time.

Panel prompts: 0~10s: Tighten Coat on Rooftop; 10~20s: Walk Subway Platform; 20~30s: Lower Headphones in Record Sh…

**Figure 11.** Additional scene-recall results. Echo-Forcing retrieves earlier scene memories and reduces semantic confusion across long-range shot intervals.
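Eq. (1), quoted in the Figure 2 entry above, alternates the reading direction of an anchor window between odd and even rounds. A minimal Python sketch of that index schedule follows. It reads E as a list of anchor entries, S as the window length, and N_anc as the number of anchors; these readings, and the function name `anchor_window`, are assumptions, because the surrounding definition is truncated in the excerpt.

```python
def anchor_window(r, S, n_anc):
    """Return the anchor indices selected in round r, following Eq. (1).

    u_r = (r * S) mod n_anc is the window start; indices wrap modulo n_anc.
    Odd rounds read the window forward, even rounds read it in reverse.
    """
    u = (r * S) % n_anc
    forward = [(u + i) % n_anc for i in range(S)]
    return forward if r % 2 == 1 else forward[::-1]
```

With S = 3 and N_anc = 8, rounds 1, 2, and 3 select indices [3, 4, 5], [0, 7, 6], and [1, 2, 3], so consecutive rounds sweep different parts of the anchor pool in alternating order.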

Original abstract

Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, making it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at https://github.com/mingqiangWu/Echo-Forcing
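The third mechanism, forgetting tokens according to old/new scene discrepancy, can be illustrated with a small pure-Python sketch. Everything specific here is an assumption rather than the paper's definition: the discrepancy is taken to be a rescaled cosine distance between a cached key and a single new-scene descriptor vector, the decay is exponential with a hypothetical `rate`, and pruning uses a hypothetical `keep_threshold`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def decay_weights(old_keys, new_scene, rate=4.0):
    """One weight in (0, 1] per cached key; larger mismatch gives a smaller weight."""
    weights = []
    for key in old_keys:
        discrepancy = 0.5 * (1.0 - cosine(key, new_scene))  # rescaled to [0, 1]
        weights.append(math.exp(-rate * discrepancy))
    return weights


def prune(old_keys, new_scene, keep_threshold=0.5, rate=4.0):
    """Drop cached keys whose decay weight falls below the threshold."""
    weights = decay_weights(old_keys, new_scene, rate)
    return [k for k, w in zip(old_keys, weights) if w >= keep_threshold]
```

In this toy setup a key aligned with the new scene keeps weight 1, an orthogonal key is strongly down-weighted, and an opposing key is almost fully forgotten, so a hard cut clears conflicting memory quickly while a smooth transition leaves compatible tokens in place.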

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Echo-Forcing gives a practical KV-cache split for handling prompt switches in long video diffusion, but the gains rest on experiments that need more isolation of the new components.

The main thing to know is that this paper tackles interactive long-video generation by reorganizing the KV cache so old scene info does not bleed into new prompts or get forgotten too slowly. They do this with three pieces: hierarchical temporal memory that keeps stable anchors, compressed history, and recent windows separate using relative RoPE; scene recall frames that pack historical scenes into spatially structured KV entries; and difference-aware memory decay that drops tokens when they conflict with the current scene. This is a new combination aimed at bounded-cache interactive use rather than single-prompt extension.
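As a rough illustration of packing a scene into a spatially structured entry, the sketch below pools a scene's frames over time while keeping the H × W spatial grid intact, so T frames collapse into one grid-shaped "recall frame". Weighted temporal pooling is a stand-in chosen for simplicity; the paper's actual compression is not described in the text above, and the function name `recall_frame` and its `weights` argument are hypothetical.

```python
def recall_frame(scene_frames, weights=None):
    """Collapse T frames (each an H x W grid of D-dim features) into one grid.

    The spatial layout is preserved; only the time axis is pooled.
    `weights` gives one non-negative weight per frame (uniform if omitted).
    """
    T = len(scene_frames)
    if weights is None:
        weights = [1.0 / T] * T
    total = sum(weights)
    H = len(scene_frames[0])
    W = len(scene_frames[0][0])
    D = len(scene_frames[0][0][0])
    return [
        [
            [
                sum(weights[t] * scene_frames[t][h][w][d] for t in range(T)) / total
                for d in range(D)
            ]
            for w in range(W)
        ]
        for h in range(H)
    ]
```

The storage cost for a scene drops from T·H·W entries to H·W, which is the kind of saving a bounded cache needs. Different weight vectors would produce different recall frames for the same scene, loosely echoing the differing temporal attention patterns mentioned in the Figure 10 caption.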

Referee Report

3 major / 2 minor

Summary. The manuscript presents Echo-Forcing, a training-free scene memory framework for interactive long video generation in autoregressive video diffusion models. It identifies functional entanglement of historical KV states as the core bottleneck leading to outdated background contamination, delayed prompt response, and loss of long-range memory. Three mechanisms are proposed: Hierarchical Temporal Memory (decoupling stable anchors, compressed history, and recent windows under relative RoPE), Scene Recall Frames (spatially structured compression of historical scenes for long-term recall), and Difference-aware Memory Decay (adaptive forgetting of conflicting tokens based on scene discrepancy). The paper claims these enable uniform support for smooth transitions, hard cuts, and long-range recall under bounded cache, with best overall performance on VBench-Long in both long-video and interactive settings. Code is released.

Significance. If the central claims hold, the work would advance interactive long-video generation by offering a practical training-free approach to KV cache management that explicitly handles scene changes and recall. The open release of code is a clear strength for reproducibility and follow-up research. The framework addresses a timely limitation in diffusion-based video models, though its impact depends on stronger empirical grounding for the performance and artifact-free claims.

major comments (3)

[Abstract] Abstract: the claim that Echo-Forcing 'achieves the best overall performance' on VBench-Long is made without any quantitative metrics, ablation results, error bars, or details on how post-hoc scene conflicts were measured, leaving the central performance claim weakly supported by the provided text.
[Method] Method (Scene Recall Frames): the compression step could discard high-frequency spatial details needed for accurate recall; the manuscript should demonstrate through targeted experiments that this does not degrade long-range scene recall or introduce artifacts under hard cuts.
[Method] Method (Difference-aware Memory Decay): the discrepancy metric could misclassify tokens during rapid scene changes; evaluations must isolate hard-cut cases to confirm the mechanism resolves entanglement without new artifacts or incomplete forgetting.

minor comments (2)

The abstract and method descriptions would benefit from a clear diagram illustrating the three mechanisms and their interaction with the KV cache under different transition types.
Notation for 'bounded cache budget' and 'relative RoPE' should be defined more explicitly with reference to standard attention formulations to aid readability.
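As a notational aid for the two terms flagged above, the standard RoPE identity and one possible way to write a cache budget are given below. The RoPE lines follow the usual rotary position embedding formulation. The budget inequality and the symbols $B$, $N_{\mathrm{hist}}$, and $N_{\mathrm{rec}}$ are assumptions introduced here for illustration and may not match the paper's notation.

```latex
% Standard RoPE: queries and keys are rotated by position-dependent rotations.
\tilde{q}_m = R_m q, \qquad \tilde{k}_n = R_n k, \qquad R_m^{\top} R_n = R_{n-m}

% Hence the attention logit depends only on the relative offset n - m:
\tilde{q}_m^{\top} \tilde{k}_n = q^{\top} R_m^{\top} R_n k = q^{\top} R_{n-m} k

% Hypothetical budget statement: at every step t the cache C_t holds
% anchors, compressed history, and a recent window, and never exceeds B entries.
|C_t| = N_{\mathrm{anc}} + N_{\mathrm{hist}} + N_{\mathrm{rec}} \le B \quad \text{for all } t
```

Because only the offset $n - m$ enters the logit, positions can be assigned relative to cache slots rather than absolute frame indices, which is presumably the sense in which the cache tiers are managed "under relative RoPE".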

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and indicate the revisions we will make to the manuscript.

Referee: [Abstract] Abstract: the claim that Echo-Forcing 'achieves the best overall performance' on VBench-Long is made without any quantitative metrics, ablation results, error bars, or details on how post-hoc scene conflicts were measured, leaving the central performance claim weakly supported by the provided text.

Authors: We agree that the abstract would be strengthened by including specific quantitative support. The full manuscript reports comprehensive VBench-Long results with comparisons to baselines, ablations, and both long-video and interactive settings (Sections 4.2–4.3 and Tables 1–3), along with the scene discrepancy metric used for post-hoc conflict measurement (Section 3.3). In the revised manuscript we will add key numerical improvements and reference to error bars directly in the abstract.

revision: yes

Referee: [Method] Method (Scene Recall Frames): the compression step could discard high-frequency spatial details needed for accurate recall; the manuscript should demonstrate through targeted experiments that this does not degrade long-range scene recall or introduce artifacts under hard cuts.

Authors: We acknowledge the concern that spatial compression might lose high-frequency details. Our existing VBench-Long evaluations cover diverse transitions including hard cuts and long-range recall, but we agree that dedicated isolation of these cases would provide clearer evidence. In the revision we will add targeted experiments and visualizations that measure recall accuracy and artifact presence before and after compression specifically on hard-cut sequences.

revision: yes

Referee: [Method] Method (Difference-aware Memory Decay): the discrepancy metric could misclassify tokens during rapid scene changes; evaluations must isolate hard-cut cases to confirm the mechanism resolves entanglement without new artifacts or incomplete forgetting.

Authors: We appreciate the suggestion to isolate hard-cut behavior. While our current experiments include rapid scene changes and report overall entanglement reduction, we agree that dedicated hard-cut isolation would more directly validate the discrepancy metric. In the revised manuscript we will add evaluations that separate hard-cut cases, quantifying forgetting completeness and checking for introduced artifacts via both quantitative metrics and qualitative examples.

revision: yes

Circularity Check

0 steps flagged

No circularity: framework components are independently specified engineering proposals

The paper introduces Echo-Forcing as a training-free framework whose three mechanisms (Hierarchical Temporal Memory with relative RoPE, Scene Recall Frames for spatially structured compression, and Difference-aware Memory Decay) are defined explicitly as new components to decouple KV states. The abstract and description present these as direct responses to the stated bottleneck of functional entanglement, without any equations that define a claimed performance metric in terms of parameters fitted to the same data or that reduce the decoupling claim to a self-citation chain. Evaluations on the external VBench-Long benchmark supply independent measurement, so the derivation chain remains self-contained and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about autoregressive diffusion and KV caching; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. Stated as the starting point for identifying the entanglement bottleneck.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 21 internal anchors

[1]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

work page 2024
[6]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

work page 2022
[9]

Q-vdit: Towards accurate quantization and distillation of video-generation diffusion transformers.arXiv preprint arXiv:2505.22167, 2025

Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, et al. Q-vdit: Towards accurate quantization and distillation of video-generation diffusion transformers.arXiv preprint arXiv:2505.22167, 2025

work page arXiv 2025
[10]

Quantsparse: Comprehensively compressing video diffusion transformer with model quantization and attention sparsification.arXiv preprint arXiv:2509.23681, 2025

Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, et al. Quantsparse: Comprehensively compressing video diffusion transformer with model quantization and attention sparsification.arXiv preprint arXiv:2509.23681, 2025. 10

work page arXiv 2025
[11]

S2Q-VDiT: Accurate quantized video diffusion transformer with salient data and sparse token distillation.arXiv preprint arXiv:2508.04016, 2025

Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, and Yongjun Xu. S2Q-VDiT: Accurate quantized video diffusion transformer with salient data and sparse token distillation.arXiv preprint arXiv:2508.04016, 2025

work page arXiv 2025
[12]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025

work page 2025
[13]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

MAGI-1: Autoregressive Video Generation at Scale

Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

work page 2023
[18]

Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

work page 2024
[19]

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of impor- tance hypothesis for llm kv cache compression at test time.Advances in Neural Information Processing Systems, 36:52342–52364, 2023

work page 2023
[20]

Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self- rollout.arXiv preprint arXiv:2511.20649, 2025

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self- rollout.arXiv preprint arXiv:2511.20649, 2025

work page arXiv 2025
[21]

H., Nam, J., Yoon, H., and Kim, S

Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025

work page arXiv 2025
[22]

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion.arXiv preprint arXiv:2602.07775, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Memrope: Training-free infinite video generation via evolving memory tokens.arXiv preprint arXiv:2603.12513, 2026

Youngrae Kim, Qixin Hu, C-C Jay Kuo, and Peter A Beerel. Memrope: Training-free infinite video generation via evolving memory tokens.arXiv preprint arXiv:2603.12513, 2026

work page arXiv 2026
[24]

Light forcing: Accelerating autoregressive video diffusion via sparse attention.arXiv preprint arXiv:2602.04789, 2026

Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, and Wenya Wang. Light forcing: Accelerating autoregressive video diffusion via sparse attention.arXiv preprint arXiv:2602.04789, 2026

work page arXiv 2026
[25]

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Shotstream: Streaming multi-shot video generation for interactive storytelling

Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. Shotstream: Streaming multi-shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746, 2026

work page arXiv 2026
[27]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Photorealistic video generation with diffusion models

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. InEuropean Confer- ence on Computer Vision, pages 393–411. Springer, 2024

work page 2024
[30]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[31]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023

work page 2023
[33]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

work page 2024
[34]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

work page 2024
[35]

Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

Jintao Chen, Chengyu Bai, Xinda Xue, Mu Xu, et al. Grounded forcing: Bridging time- independent semantics and proximal dynamics in autoregressive video synthesis.arXiv preprint arXiv:2604.06939, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Riflex: A free lunch for length extrapolation in video diffusion transformers.arXiv preprint arXiv:2502.15894, 2025a

Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers.arXiv preprint arXiv:2502.15894, 2025

work page arXiv 2025
[38]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

[39]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pages 21875–21895, 2024

[40]

Reattention: Training-free infinite context with finite attention scope

Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Qipeng Guo, Yuerong Song, Kai Lv, Hang Yan, Linlin Li, Qun Liu, and Xipeng Qiu. Reattention: Training-free infinite context with finite attention scope. In International Conference on Learning Representations, volume 2025, pages 95458–95478, 2025

[41]

Efficient autoregressive video diffusion with dummy head

Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, and Yan Lu. Efficient autoregressive video diffusion with dummy head. arXiv preprint arXiv:2601.20499, 2026

[42]

Mask^2DiT: Dual mask-based diffusion transformer for multi-scene long video generation

Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, and Yongdong Zhang. Mask^2DiT: Dual mask-based diffusion transformer for multi-scene long video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18837–18846, 2025

[43]

Shotadapter: Text-to-multi-shot video generation with diffusion models

Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M Rehg, and Tobias Hinz. Shotadapter: Text-to-multi-shot video generation with diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28405–28415, 2025

[44]

Long context tuning for video generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17281–17291, 2025

[45]

Storydiffusion: Consistent self-attention for long-range image and video generation

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural Information Processing Systems, 37:110315–110340, 2024

[46]

Captain cinema: Towards short movie generation

Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, and Lu Jiang. Captain cinema: Towards short movie generation. In The Fourteenth International Conference on Learning Representations, 2025

[47]

Cut2next: Generating next shot via in-context tuning

Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Qiao Yu, Wanli Ouyang, and Ziwei Liu. Cut2next: Generating next shot via in-context tuning. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

[48]

Onestory: Coherent multi-shot video generation with adaptive memory

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, et al. Onestory: Coherent multi-shot video generation with adaptive memory. arXiv preprint arXiv:2512.07802, 2025

[49]

Motionstream: Real-time video generation with interactive motion controls

Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025

[50]

Fast autoregressive video diffusion and world models with temporal cache compression and sparse attention

Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, and Rami Ben-Ari. Fast autoregressive video diffusion and world models with temporal cache compression and sparse attention. arXiv preprint arXiv:2602.01801, 2026

[51]

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Triattention: Efficient long reasoning with trigonometric kv compression. arXiv preprint arXiv:2604.04921, 2026

[52]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

[53]

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

[54]

Vbench++: Comprehensive and versatile benchmark suite for video generative models

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

[55]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

[56]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

A Dataset and evaluation details

A.1 Dataset construction

Our evaluation datasets are built upon prompts sampled from MovieGenBench [55]. For long-video g...

