pith. machine review for the scientific record.

arxiv: 2604.10103 · v2 · submitted 2026-04-11 · 💻 cs.CV

Recognition: unknown

Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming video generation · hybrid attention · video diffusion · decoupled distillation · long-horizon generation · autoregressive video · real-time synthesis

The pith

Hybrid attention with a compact linear state and block-sparse windows enables real-time unbounded streaming video generation from distilled diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an autoregressive streaming video generator by distilling a pretrained bidirectional video diffusion model. Standard sliding-window attention loses distant context and incurs high compute cost over long sequences. The proposed hybrid attention adds a lightweight linear temporal component that maintains a compact key-value state to absorb tokens evicted from the window, preserving long-range dependencies at low overhead. It also applies block-sparse attention inside the local window to reduce redundant short-range computation. A decoupled distillation first optimizes under dense attention and then activates the hybrid components, yielding stable training and state-of-the-art results on both short- and long-form video benchmarks.
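To make the mechanism concrete, the sketch below walks one streaming step under the description above: tokens leaving the sliding window are folded into a fixed-size linear-attention state, and the retained window is attended to locally. The feature map, the dense softmax window (standing in for the block-sparse attention), and the plain sum of the two branches are illustrative assumptions, not the paper's implementation.

```python
# Minimal single-head sketch of one hybrid-attention streaming step.
import torch
import torch.nn.functional as F


def phi(x):
    # Positive feature map commonly used in linear attention (assumed, not from the paper).
    return F.elu(x) + 1.0


def evict_into_state(evicted_k, evicted_v, state_S, state_z):
    """Absorb tokens leaving the sliding window into the fixed-size linear state.

    evicted_k, evicted_v: (m, d); state_S: (d, d); state_z: (d,)
    """
    kf = phi(evicted_k)                      # (m, d)
    state_S = state_S + kf.T @ evicted_v     # stays (d, d) regardless of video length
    state_z = state_z + kf.sum(dim=0)        # stays (d,)
    return state_S, state_z


def hybrid_attention_step(q, window_k, window_v, state_S, state_z):
    """Combine long-range (linear state) and short-range (local window) attention.

    q: (d,); window_k, window_v: (w, d)
    """
    # Long-range branch: read the compact state built from evicted tokens.
    qf = phi(q)
    linear_out = (qf @ state_S) / (qf @ state_z + 1e-6)

    # Short-range branch: softmax attention over the retained window
    # (dense here; the paper applies block-sparse attention inside the window).
    scores = window_k @ q / q.shape[-1] ** 0.5
    local_out = torch.softmax(scores, dim=-1) @ window_v

    # The two branches are summed (a simplification of how they are combined).
    return local_out + linear_out


# Tiny usage example with random tensors.
d, w = 64, 16
S, z = torch.zeros(d, d), torch.zeros(d)
S, z = evict_into_state(torch.randn(8, d), torch.randn(8, d), S, z)
out = hybrid_attention_step(torch.randn(d), torch.randn(w, d), torch.randn(w, d), S, z)
```

Whatever the exact parameterization, the key property is that `state_S` and `state_z` keep the same size no matter how many frames have been evicted.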

Core claim

Hybrid Forcing uses a hybrid attention design that jointly retains temporal information and improves efficiency: lightweight linear temporal attention maintains a compact key-value state to incrementally absorb evicted tokens beyond the sliding window, block-sparse attention reallocates computation within the local window by skipping redundant dependencies, and a decoupled distillation strategy first performs dense-attention pre-training and then activates the linear and sparse components for streaming modeling.

What carries the argument

Hybrid attention consisting of lightweight linear temporal attention that maintains a compact key-value state for long-range context plus block-sparse attention inside the sliding window, trained via a decoupled distillation schedule from a dense bidirectional teacher.
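A schematic of the decoupled schedule this describes; the model, loss, optimizer, data loader, and step counts are hypothetical stand-ins supplied by the caller, not the paper's training code.

```python
def decoupled_distillation(student, teacher, loader, opt, distill_loss,
                           dense_steps=5_000, hybrid_steps=20_000):
    """Two-stage schedule: dense-attention distillation first, then hybrid streaming."""
    # Stage 1: distill under full dense attention so the student first matches
    # the bidirectional teacher without the hybrid machinery in the way.
    student.use_hybrid_attention = False
    for _, batch in zip(range(dense_steps), loader):
        loss = distill_loss(student(batch), teacher(batch))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: activate the linear-state and block-sparse branches and continue
    # distilling in the autoregressive streaming setting.
    student.use_hybrid_attention = True
    for _, batch in zip(range(hybrid_steps), loader):
        loss = distill_loss(student(batch, streaming=True), teacher(batch))
        opt.zero_grad()
        loss.backward()
        opt.step()
```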

If this is right

  • Streaming video models can produce videos of unbounded length without progressive loss of distant history.
  • Real-time inference at 29.5 FPS for 832x480 video becomes practical on a single high-end GPU without quantization or compression (a back-of-envelope memory comparison follows this list).
  • Distillation pipelines for autoregressive video generators become more stable by separating dense pre-training from hybrid-component training.
  • Computational budget can be shifted from redundant local calculations to critical long-term temporal dependencies.
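A back-of-envelope comparison of key-value memory under a full cache versus a fixed window plus a compact linear state; every number here (tokens per frame, layers, heads, head dimension, window length) is illustrative rather than taken from the paper.

```python
def kv_cache_bytes(frames, tokens_per_frame=1_560, layers=30, heads=12,
                   head_dim=128, dtype_bytes=2):
    # Keys + values for every token ever generated (what a full cache would hold).
    per_token = layers * heads * head_dim * 2 * dtype_bytes
    return frames * tokens_per_frame * per_token


def hybrid_memory_bytes(window_frames=5, tokens_per_frame=1_560, layers=30,
                        heads=12, head_dim=128, dtype_bytes=2):
    # Fixed-size local window plus one (d x d) matrix and a d-vector per head and layer.
    window = kv_cache_bytes(window_frames, tokens_per_frame, layers, heads,
                            head_dim, dtype_bytes)
    state = layers * heads * (head_dim * head_dim + head_dim) * dtype_bytes
    return window + state


for frames in (100, 1_000, 10_000):
    full = kv_cache_bytes(frames) / 2**30
    hybrid = hybrid_memory_bytes() / 2**30
    print(f"{frames:>6} frames: full cache ~{full:7.1f} GiB, hybrid ~{hybrid:.2f} GiB")
```

Under these assumed sizes, the full cache grows linearly with frame count while the hybrid footprint stays constant, which is the arithmetic behind the first two bullets above.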

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linear-state plus sparse-window pattern could transfer to other autoregressive modalities such as long audio or 3D scene generation.
  • Continuous real-time video synthesis applications, such as interactive content creation, become feasible if the compact state truly scales without bound.
  • Empirical tests on multi-minute or hour-long videos would reveal whether the fixed-size KV state eventually saturates or requires periodic resetting.

Load-bearing premise

The lightweight linear temporal attention and block-sparse attention can be stably distilled from the dense bidirectional teacher model while preserving generation quality over arbitrarily long horizons.

What would settle it

Measuring a clear drop in perceptual quality or temporal coherence when generating videos thousands of frames longer than the training horizon, or failing to sustain 29.5 FPS at 832x480 resolution on a single H100 GPU.
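One way such a test could be scripted; the generator and the per-segment scoring function are hypothetical callables supplied by the experimenter, not part of the paper's released code.

```python
import time


def horizon_stress_test(generate_next_frame, score_segment,
                        total_frames=20_000, segment=160, target_fps=29.5):
    """Generate far beyond the training horizon; track per-segment quality and FPS.

    `score_segment` is any higher-is-better quality score, e.g. a CLIP-based or
    temporal-coherence metric chosen by the experimenter.
    """
    frames, scores = [], []
    start = time.perf_counter()
    for _ in range(total_frames):
        frames.append(generate_next_frame())
        if len(frames) == segment:
            scores.append(score_segment(frames))
            frames.clear()
    fps = total_frames / (time.perf_counter() - start)
    drifted = bool(scores) and scores[-1] < 0.9 * scores[0]  # crude drift criterion
    return {
        "fps": fps,
        "sustains_target": fps >= target_fps,
        "segment_scores": scores,
        "late_quality_drop": drifted,
    }
```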

Figures

Figures reproduced from arXiv: 2604.10103 by Bingyue Peng, Fangzhou Ai, Lei Zhang, Ruibin Li, Shilei Wen, Tao Yang, Tianhe Wu.

Figure 1
Illustration of our hybrid attention paradigm for SVG. (a) The standard SWA approach only caches the most recent frames, leading to significant error accumulation in long-form generation. (b) While sinking the first frame can reduce early drift, it sacrifices motion diversity and fails to capture long-horizon dependencies effectively. (c) Our hybrid attention combines long-term history retention with effic…
Figure 2
Framework of our training and inference pipeline. (a) An auxiliary regularization loss is introduced to mitigate training mode collapse. (b) The output of our hybrid attention is computed as the sum of sparse local window attention and linear history attention. (c) The linear state is updated prior to evicting outdated frames from the KV cache to preserve long-term information. (d) The temporal RoPE index…
Figure 3
Qualitative results on 5-second videos. Our method exhibits significantly stronger motion dynamics and richer camera movement diversity. Additional comparisons and video demonstrations are provided in the Supplemental Materials. Forcing and MemFlow show moderate performance. In contrast, our method faithfully reproduces the full "house melt" process with high motion dynamics and visual fidelity. Moreover,…
Figure 4
Frame-by-frame qualitative results on 30-second videos. More comparison results and video demos are provided in the Supplemental Materials.
Figure 5
Visual results with and without the regularization loss. Removing the regularization loss can lead to persistent camera drift and character duplication.
Figure 6
(a) VBench evaluation results over all 16 metrics on 5-second videos. (b) VBench-Long evaluation results over all 16 metrics on 30-second videos. Values are normalized by the maximum value for better visualization.
Figure 7
Performance (total score vs. throughput) comparison across seven methods on 30-second video inference. Our method achieves superior quality and efficiency.
Figure 8
Qualitative comparison on 5-second videos.
Figure 9
Qualitative comparison on 5-second videos.
Figure 10
Qualitative comparison on 30-second videos.
Figure 11
Qualitative comparison on 30-second videos.
Figure 12
Qualitative comparison on 30-second videos.
Figure 13
Visual result of a 10-minute video.
original abstract

Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is performed under dense attention, then the distillation of our proposed linear temporal and block-sparse attention is activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression. The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Hybrid Forcing for streaming video generation, distilling a bidirectional video diffusion model into an autoregressive model with a hybrid attention design: lightweight linear temporal attention (maintaining a compact KV state to absorb evicted tokens for long-range context) combined with block-sparse attention in the local sliding window, trained via a two-stage decoupled distillation procedure. It reports SOTA results on short- and long-form benchmarks and claims real-time unbounded 832x480 video generation at 29.5 FPS on a single H100 GPU without quantization.

Significance. If the empirical results and stability claims hold under scrutiny, the work would advance practical long-horizon video generation by addressing both temporal context retention and computational overhead in a single framework, with the open release of code and models aiding reproducibility. The hybrid design and distillation strategy offer a concrete path toward efficient streaming models.

major comments (2)
  1. [Experimental results and long-horizon evaluation sections] The central claim of unbounded real-time generation at constant memory depends on the compact linear temporal attention (plus block-sparse local attention) retaining necessary temporal context without capacity limits or drift over arbitrary horizons. However, the manuscript provides no analysis of the KV state's effective capacity (e.g., rank or compression loss) nor quantitative results for perceptual metrics (FVD, CLIP-T, temporal consistency) as a function of horizon length beyond the reported benchmarks.
  2. [Method and distillation procedure sections] The decoupled distillation strategy is described as performing initial dense-attention distillation followed by activation of linear temporal and block-sparse components, but no ablation or stability analysis is given for how this two-stage process prevents optimization issues specific to the hybrid design during streaming.
minor comments (2)
  1. [Abstract] The abstract states 'consistently achieves state-of-the-art performance' without listing the specific quantitative margins or baseline comparisons; adding a brief table reference or key metric deltas would improve clarity.
  2. [Hybrid attention design] Notation for the linear temporal attention state size and block-sparsity ratio is introduced but not explicitly tied to the free parameters listed in the design; a short equation or pseudocode reference would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment in detail below, providing clarifications and committing to revisions that strengthen the presentation of our results and method without altering the core contributions.

point-by-point responses
  1. Referee: [Experimental results and long-horizon evaluation sections] The central claim of unbounded real-time generation at constant memory depends on the compact linear temporal attention (plus block-sparse local attention) retaining necessary temporal context without capacity limits or drift over arbitrary horizons. However, the manuscript provides no analysis of the KV state's effective capacity (e.g., rank or compression loss) nor quantitative results for perceptual metrics (FVD, CLIP-T, temporal consistency) as a function of horizon length beyond the reported benchmarks.

    Authors: We appreciate the referee's emphasis on rigorously validating the long-horizon claims. Our long-form benchmarks already include extended sequences that exceed typical short-clip evaluations, and the compact KV state is designed to incrementally absorb evicted tokens with a fixed memory footprint. However, we acknowledge that explicit analysis of the state's effective capacity (e.g., via rank or reconstruction metrics) and metric trends versus horizon length would provide stronger quantitative support. In the revised manuscript, we will add plots of FVD, CLIP-T, and temporal consistency as functions of increasing horizon length on long-form data, along with a brief discussion of the linear attention state's compression properties derived from its incremental update rule (a generic form of this recurrence is sketched after these responses). These additions will directly address the concern while preserving the existing SOTA results. Revision: yes.

  2. Referee: [Method and distillation procedure sections] The decoupled distillation strategy is described as performing initial dense-attention distillation followed by activation of linear temporal and block-sparse components, but no ablation or stability analysis is given for how this two-stage process prevents optimization issues specific to the hybrid design during streaming.

    Authors: We agree that additional analysis of the distillation procedure would enhance clarity. The two-stage decoupled approach first stabilizes the model under dense attention to establish reliable gradients before activating the hybrid linear and block-sparse components, thereby avoiding the instability that can arise from jointly optimizing sparse patterns and incremental KV updates from the start. In the revision, we will include an ablation comparing the two-stage procedure against a single-stage joint optimization, reporting training loss curves, gradient norms, and final generation quality to demonstrate improved stability. This will substantiate how the decoupling mitigates hybrid-specific optimization challenges without requiring changes to the core method. Revision: yes.
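For reference, a standard linear-attention recurrence consistent with the behavior described in response 1, where evicted key-value pairs are folded into a fixed-size state that later queries read. This particular form is an assumption (the usual kernelized-attention update), not necessarily the paper's exact rule:

```latex
% Assumed linear-attention recurrence: evicted pairs (k_t, v_t) update the
% fixed-size state (S_t, z_t), which a query q_t later reads.
\begin{aligned}
S_t &= S_{t-1} + \phi(k_t)\, v_t^{\top}, \qquad
z_t = z_{t-1} + \phi(k_t),\\
o_t &= \frac{S_t^{\top} \phi(q_t)}{z_t^{\top} \phi(q_t) + \varepsilon}.
\end{aligned}
```

Because S_t is a d-by-d matrix and z_t a d-vector, the state never grows with video length; its capacity is bounded by the feature dimension d, which is exactly what the referee's capacity question probes.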

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with external benchmarks

full rationale

The paper introduces Hybrid Forcing as a practical distillation procedure combining linear temporal attention, block-sparse attention, and a two-stage decoupled training schedule. All claims rest on measured performance (FPS, FVD, CLIP scores) against standard video generation benchmarks rather than any closed-form derivation or self-referential prediction. No equations are presented that equate a fitted quantity to its own input, and no uniqueness theorem or ansatz is imported via self-citation to force the design. The method stands or falls on external evaluation protocols rather than on self-referential derivations.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on the assumption that a pretrained bidirectional diffusion model can be distilled into an efficient autoregressive form; design choices such as state size and sparsity level are free parameters tuned for the task (made concrete in the sketch after the ledger).

free parameters (2)
  • linear temporal attention state size
    Compact KV state dimension chosen to retain long-range context with low overhead.
  • block-sparsity ratio
    Sparsity level inside the sliding window selected to reduce compute while preserving local modeling.
axioms (1)
  • domain assumption: A pretrained bidirectional video diffusion model can be effectively distilled into an autoregressive streaming model without catastrophic quality loss.
    The entire pipeline depends on successful knowledge transfer from the dense teacher.
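A minimal configuration sketch that makes the ledger's free parameters explicit; the names and default values are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class HybridAttentionConfig:
    # Hypothetical names and defaults; the paper's actual values may differ.
    linear_state_dim: int = 128        # per-head compact KV-state dimension
    block_sparsity_ratio: float = 0.5  # fraction of blocks kept in the local window
    window_frames: int = 5             # sliding-window length in latent frames


print(HybridAttentionConfig())
```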

pith-pipeline@v0.9.0 · 5574 in / 1210 out tokens · 74074 ms · 2026-05-10T16:06:18.975031+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...

Reference graph

Works this paper leans on

74 extracted references · 35 canonical work pages · cited by 1 Pith paper · 15 internal anchors
