
arxiv: 2605.30409 · v1 · pith:MJKZ7TEEnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Pith reviewed 2026-06-29 07:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords streaming video editing · diffusion transformer · real-time inference · temporal consistency · hybrid architecture · mixed precision quantization · flow matching regularization

The pith

A hybrid diffusion transformer with cycle-reverse regularization and hardware co-design achieves real-time 1280x704 video editing at 24 FPS on one consumer GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SANA-Streaming as a system-algorithm co-design for streaming video-to-video editing that must meet strict demands for temporal consistency and high inference speed. It combines a hybrid transformer architecture that adds softmax attention selectively to linear blocks, a cycle-reverse training method that uses flow matching to enforce consistency by reversing edits, and targeted optimizations including fused kernels and mixed-precision quantization. The goal is to deliver usable performance for live applications without needing paired long edited video datasets. If the designs hold, they produce measurable gains in both coherence and throughput compared with prior methods. The reported outcome is end-to-end operation at 24 FPS with the core model at 58 FPS on a single RTX 5090.

Core claim

The central claim is that a Hybrid Diffusion Transformer using partial softmax attention, trained via Cycle-Reverse Regularization that predicts source frames from generated content through flow matching, and paired with Blackwell-specific fused GDN kernels plus mixed-precision quantization, produces real-time streaming video editing at 1280 x 704 resolution and 24 end-to-end FPS while improving temporal coherence over existing state-of-the-art approaches.

What carries the argument

The Hybrid Diffusion Transformer architecture, which mixes linear attention blocks with selective softmax attention blocks to strengthen local modeling while retaining linear efficiency.
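To make the efficiency trade-off concrete, the sketch below contrasts softmax attention, whose cost grows quadratically with the token count, with a kernelized linear attention that runs in linear time. It then builds a hypothetical layer schedule that places softmax attention in every fourth block. The ELU+1 feature map, the one-in-four ratio, and all function names are illustrative assumptions: the abstract describes the linear blocks as GDN-based and does not state where the softmax blocks sit.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard attention: builds an (n, n) score matrix, so cost is O(n^2 * d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """ELU(x) + 1: a positive feature map (an assumed choice, not the paper's)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Kernelized attention: summarizes keys/values in a (d, d) state, O(n * d^2)."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                  # (d, d_v) summary of all keys and values
    z = q @ k.sum(axis=0)         # (n,) normalizer, positive by construction
    return (q @ kv) / z[:, None]

def hybrid_schedule(num_blocks, softmax_every=4):
    """Hypothetical layout: one softmax block per `softmax_every` blocks."""
    return ["softmax" if (i + 1) % softmax_every == 0 else "linear"
            for i in range(num_blocks)]

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
x = v
for kind in hybrid_schedule(8):
    attn = softmax_attention if kind == "softmax" else linear_attention
    x = x + attn(q, k, x)         # toy stack: q and k are reused in every block
```

Both functions return convex combinations of the value rows. The difference is that the linear variant never materializes the token-by-token score matrix, which is what keeps cost manageable at high resolution.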

If this is right

  • Real-time 1280 x 704 editing reaches 24 end-to-end FPS with the diffusion transformer core at 58 FPS on a single RTX 5090 GPU.
  • Temporal consistency improves without access to paired long edited video datasets.
  • Mixed-precision quantization and fused kernels raise throughput while preserving generation quality.
  • The full co-design outperforms prior methods on both coherence and system speed metrics.
  • Interactive applications such as live broadcasting become feasible on consumer hardware.
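The throughput figures translate into per-frame latency budgets by simple arithmetic, worked through below. Attributing the remaining time to non-DiT stages (encoding, decoding, I/O) is an assumption. It also presumes the stages run sequentially, which the abstract does not state.

```python
def frame_time_ms(fps):
    """Per-frame time in milliseconds for a given frames-per-second rate."""
    return 1000.0 / fps

end_to_end_ms = frame_time_ms(24)   # about 41.7 ms per frame
dit_core_ms = frame_time_ms(58)     # about 17.2 ms per frame
# Assumption: stages run sequentially, so the remainder is everything but the DiT.
other_stages_ms = end_to_end_ms - dit_core_ms
pixels_per_second = 1280 * 704 * 24

print(f"end-to-end budget : {end_to_end_ms:.1f} ms")
print(f"DiT core          : {dit_core_ms:.1f} ms")
print(f"other stages      : {other_stages_ms:.1f} ms")
print(f"pixel throughput  : {pixels_per_second / 1e6:.1f} MP/s")
```

Under the sequential assumption, roughly 24 ms of the 41.7 ms budget lies outside the DiT core. Any overlap between stages in the real pipeline would change that split.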

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hybrid attention pattern and cycle training could reduce data needs for other generative video tasks that lack paired long sequences.
  • If the regularization generalizes across input lengths, it may lower the barrier for deploying streaming models on varied consumer GPUs.
  • Hardware-specific quantization choices may transfer to similar diffusion backbones, suggesting a template for co-design in other real-time generation settings.
  • Success here would encourage testing whether selective softmax blocks improve local fidelity in non-editing video diffusion pipelines.

Load-bearing premise

The cycle-reverse regularization produces stable temporal coherence on real streaming inputs without paired long edited videos for training.
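The abstract gives no loss formulation, so the following is only one plausible reading of "predicting source frames from generated content via flow matching". It assumes a straight-line path from the edited frames back to the source frames and regresses the constant velocity along that path. The interpolation form, the function names, and the toy tensors are all assumptions, not the authors' objective.

```python
import numpy as np

def cycle_reverse_loss(velocity_fn, edited, source, t):
    """Flow-matching loss for transporting edited frames back to source frames.

    Assumed form (not from the paper): a straight-line path
    x_t = (1 - t) * edited + t * source with target velocity source - edited.
    """
    t = t.reshape(-1, *([1] * (edited.ndim - 1)))   # broadcast t over frame dims
    x_t = (1.0 - t) * edited + t * source
    target = source - edited
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
source = rng.standard_normal((4, 3, 8, 8))                 # toy batch of source frames
edited = source + 0.1 * rng.standard_normal(source.shape)  # stand-in for model output
t = rng.uniform(size=4)

oracle = lambda x_t, t: source - edited       # perfect velocity predictor
zero = lambda x_t, t: np.zeros_like(x_t)      # untrained stand-in

loss_oracle = cycle_reverse_loss(oracle, edited, source, t)
loss_zero = cycle_reverse_loss(zero, edited, source, t)
```

The appeal of a term like this is that it needs only the source video and the model's own output, which matches the claim that no paired long edited videos are required.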

What would settle it

Running the system on extended unedited streaming video sequences and measuring whether temporal coherence metrics drop below those of prior methods when the cycle-reverse term is removed.
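A minimal sketch of how such an ablation comparison could be scored is given below. It uses mean cosine similarity between consecutive frames as a crude stand-in for a temporal coherence metric. The metric choice and the synthetic "steady" and "flicker" clips are assumptions for illustration; the abstract does not say which metrics the paper uses.

```python
import numpy as np

def temporal_consistency(frames):
    """Mean cosine similarity between consecutive frames (higher = smoother).

    A crude proxy; published metrics typically compare learned features
    or optical-flow-warped frames instead of raw pixels.
    """
    flat = frames.reshape(len(frames), -1).astype(np.float64)
    a, b = flat[:-1], flat[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(cos.mean())

rng = np.random.default_rng(0)
base = rng.uniform(size=(1, 16, 16, 3))
# Synthetic stand-ins for two model outputs: low vs. high per-frame flicker.
steady = base + 0.01 * rng.standard_normal((30, 16, 16, 3))
flicker = base + 0.30 * rng.standard_normal((30, 16, 16, 3))

score_with = temporal_consistency(steady)
score_without = temporal_consistency(flicker)
print(f"with term: {score_with:.3f}  without term: {score_without:.3f}")
```

In a real test, the two clips would be outputs of the same model trained with and without the cycle-reverse term, run on the same long input stream.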

Original abstract

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.
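The abstract says the mixed-precision assignment is driven by profiled throughput but does not describe the search. One simple way to frame such a search is a greedy trade of latency saved against quality lost under a fixed budget. The layer names, precisions, and every number below are made up for illustration, and the greedy rule is an assumption, not the authors' procedure.

```python
def choose_precisions(profiles, max_quality_loss):
    """Greedy mixed-precision assignment.

    profiles: {layer: {precision: (latency_ms, quality_loss)}}; every layer
    must include a "bf16" baseline. Returns {layer: precision}.
    """
    plan = {layer: "bf16" for layer in profiles}
    spent = 0.0
    candidates = []
    for layer, options in profiles.items():
        base_ms, base_loss = options["bf16"]
        for precision, (ms, loss) in options.items():
            saved, cost = base_ms - ms, loss - base_loss
            if precision != "bf16" and saved > 0:
                candidates.append((saved / max(cost, 1e-9), layer, precision, cost))
    # Take the best latency-saved-per-quality-lost trades first.
    for _, layer, precision, cost in sorted(candidates, reverse=True):
        if plan[layer] == "bf16" and spent + cost <= max_quality_loss:
            plan[layer] = precision
            spent += cost
    return plan

# Hypothetical profiled numbers: (latency in ms, quality loss in arbitrary units).
profiles = {
    "attn_qkv": {"bf16": (1.00, 0.0), "fp8": (0.60, 0.01), "fp4": (0.40, 0.08)},
    "mlp_up":   {"bf16": (2.00, 0.0), "fp8": (1.20, 0.01), "fp4": (0.70, 0.02)},
    "final":    {"bf16": (0.50, 0.0), "fp8": (0.45, 0.05)},
}
plan = choose_precisions(profiles, max_quality_loss=0.04)
print(plan)
```

The greedy rule is not optimal (here it settles on fp8 for `mlp_up` even though fp4 would also fit the budget), but it shows why measured latency matters: a layer is worth quantizing only if the profiled time saved justifies the quality cost.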

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SANA-Streaming, a system-algorithm co-designed framework for real-time streaming video-to-video editing. Core contributions include (1) a hybrid Diffusion Transformer that mixes softmax attention blocks with linear layers for improved local modeling, (2) cycle-reverse regularization that uses flow matching to predict source frames from generated content for semantic consistency without paired long edited videos, and (3) Blackwell-specific fused GDN kernels and mixed-precision quantization (MPQ) to maximize Tensor Core utilization. The paper claims the resulting system achieves 24 end-to-end FPS at 1280×704 resolution on a single RTX 5090, with the DiT core at 58 FPS, and significantly outperforms existing SOTA methods in temporal coherence and throughput.

Significance. If the performance numbers and outperformance claims hold under rigorous evaluation, the work would be significant for enabling interactive real-time V2V editing on consumer GPUs. The hybrid architecture, the cycle-reverse training signal that sidesteps the need for paired long videos, and the hardware-specific MPQ co-design represent concrete advances. The explicit, falsifiable targets (24 FPS end-to-end, 58 FPS DiT core at the stated resolution) and the absence of hidden parameters in the core claims are strengths that make the results directly testable.

major comments (2)
  1. Abstract and Experimental Results section: the central claims of 24 end-to-end FPS, 58 FPS DiT core, and significant outperformance over SOTA in temporal coherence and throughput are stated without any quantitative tables, error bars, baseline details, ablation studies, or metric definitions visible in the provided text. This absence makes the load-bearing performance assertions impossible to verify from the manuscript as presented.
  2. Core design 2 (cycle-reverse regularization): the description states that the strategy enforces semantic consistency by predicting source frames from generated content via flow matching without paired long edited videos, but no equations, loss formulation, or training details are supplied to show how this signal is implemented or stabilized; if the regularization fails to produce stable coherence on real streaming inputs, the claimed real-time advantage would not hold.
minor comments (2)
  1. Abstract: the resolution is given as '1280 x 704' without clarifying whether this is width × height or the exact aspect ratio used in all experiments.
  2. Abstract: 'Blackwell (RTX 5090)' should note that the RTX 5090 is a consumer Blackwell part; any architecture-specific claims should be scoped accordingly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript accordingly to improve clarity, verifiability, and completeness.

Point-by-point responses
  1. Referee: Abstract and Experimental Results section: the central claims of 24 end-to-end FPS, 58 FPS DiT core, and significant outperformance over SOTA in temporal coherence and throughput are stated without any quantitative tables, error bars, baseline details, ablation studies, or metric definitions visible in the provided text. This absence makes the load-bearing performance assertions impossible to verify from the manuscript as presented.

    Authors: We agree that the performance claims require explicit supporting data for verification. The complete manuscript contains an Experimental Results section with quantitative tables for FPS measurements and SOTA comparisons, ablation studies, error bars from multiple runs, baseline details, and metric definitions (including temporal coherence). In the revision we will add explicit cross-references from the abstract to these tables, include a compact summary table of key metrics near the introduction, and ensure all numerical claims are directly traceable to visible experimental results.
    Revision: yes

  2. Referee: Core design 2 (cycle-reverse regularization): the description states that the strategy enforces semantic consistency by predicting source frames from generated content via flow matching without paired long edited videos, but no equations, loss formulation, or training details are supplied to show how this signal is implemented or stabilized; if the regularization fails to produce stable coherence on real streaming inputs, the claimed real-time advantage would not hold.

    Authors: We agree that the current description of cycle-reverse regularization is insufficiently detailed. In the revised manuscript we will supply the full mathematical formulation, including the flow-matching loss for predicting source frames from generated content, the overall training objective, implementation specifics, and any stabilization methods used. This will enable assessment of the regularization's stability and its contribution to the reported real-time performance.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent architectural and empirical choices

full rationale

The paper's core claims rest on three explicit design decisions (hybrid DiT blocks with partial softmax attention, cycle-reverse flow-matching regularization, and Blackwell-specific MPQ + fused kernels) whose performance is reported via direct hardware measurements (24 end-to-end FPS at 1280×704, DiT core at 58 FPS). No equations, fitted parameters, or self-citations are shown that would reduce these metrics or the temporal-coherence improvement to quantities defined by the same inputs. The training signal is described as avoiding paired long videos, and the throughput numbers are presented as profiled results rather than derived predictions. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated beyond standard diffusion model assumptions.


