OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Bin Lin; Bin Zhu; Li Yuan; Xianyi He; Xinhua Cheng; Yunyang Ge; Zezhong Zhang

arxiv: 2605.28691 · v1 · pith:C67CPBKXnew · submitted 2026-05-27 · 💻 cs.CV

Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan

Pith reviewed 2026-06-29 13:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generation · sparse attention · sequence parallelism · diffusion transformers · quantization · reinforcement learning · text-to-video · efficiency

The pith

OSP-Next pairs fixed-pattern sparse attention with reduced-communication parallelism to produce higher-quality video than the dense baseline at lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that diffusion transformers for text-to-video can escape quadratic attention costs by replacing most attention with a fixed sparse pattern while still recovering full quality through targeted fine-tuning. It does so by defining Skiparse-2D Attention that keeps local spatial structure and pairs it with Sparse Sequence Parallelism that moves data only once across ranks. Quantization and reinforcement-learning post-training are then applied to stabilize the efficient pipeline. If the claim holds, 5-second 720P and 768P videos become feasible at 1.5× or greater speed on both NVIDIA and Ascend hardware without visible quality regression. Readers would care because current video models remain too slow for broad use; removing the quadratic bottleneck changes what resolutions and lengths are practical.

Core claim

OSP-Next builds a hybrid full-sparse attention architecture whose sparse part is Skiparse-2D Attention, a fixed token-wise and group-wise pattern along spatial dimensions that remains compatible with FlashAttention. From the local equivalence property of this rearrangement it derives Sparse Sequence Parallelism, which partitions subsequences and switches patterns via a single All-to-All step that cuts communication volume by 75 percent relative to Ulysses sequence parallelism. HiF8 quantization permits stable joint 8-bit training, and Mix-GRPO reinforcement learning then lifts the sparse model back above the Wan2.1 baseline to a VBench total of 83.73 percent, delivering measured speedups of

What carries the argument

Skiparse-2D Attention, the fixed-pattern sparse mechanism that applies token-wise and group-wise sparsity along spatial dimensions while preserving FlashAttention compatibility, together with Sparse Sequence Parallelism that partitions subsequences and performs pattern switching through one All-to-All collective.
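As a rough illustration of what a fixed token-wise and group-wise spatial pattern can look like, the sketch below partitions an H×W token grid with a stride `k`. The abstract does not give the exact Skiparse-2D layout, so the two index rules, the stride `k`, and the function names `token_wise_groups`, `group_wise_groups`, and `attention_pairs` are all assumptions modeled on stride-based skip patterns. They are not the authors' implementation.

```python
from collections import defaultdict


def token_wise_groups(height, width, k):
    """Assumed token-wise rule: token (i, j) joins group (i % k, j % k)."""
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i % k, j % k)].append((i, j))
    return dict(groups)


def group_wise_groups(height, width, k):
    """Assumed group-wise rule: k x k blocks stay intact, and block
    (bi, bj) joins group (bi % k, bj % k)."""
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            bi, bj = i // k, j // k
            groups[(bi % k, bj % k)].append((i, j))
    return dict(groups)


def attention_pairs(groups):
    """Query-key pairs when every group runs ordinary dense attention."""
    return sum(len(tokens) ** 2 for tokens in groups.values())


if __name__ == "__main__":
    H, W, K = 8, 8, 2  # illustrative sizes, not taken from the paper
    dense_pairs = (H * W) ** 2
    for name, fn in [("token-wise", token_wise_groups),
                     ("group-wise", group_wise_groups)]:
        g = fn(H, W, K)
        print(name, len(g), "groups,", attention_pairs(g),
              "pairs vs", dense_pairs, "dense")
```

Under these assumed rules every token lands in exactly one of k² equally sized groups, so the pair count falls by a factor of k². Each group is an ordinary dense subsequence, which is one way a fixed pattern can remain usable with dense kernels such as FlashAttention.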

If this is right

The hybrid architecture reaches a VBench total score of 83.73 percent, exceeding the Wan2.1 baseline.
Single-GPU inference reaches up to 1.64× speedup and eight-GPU inference exceeds 1.52× speedup on H200 hardware for 5-second 720P and 768P video.
HiF8-quantized OSP-Next-HiF8 incurs only a 0.4 percent VBench drop while achieving 1.69× and 2.27× speedups on a single Ascend 950PR under the same settings.
Sparse Sequence Parallelism reduces communication volume by 75 percent compared with prior sequence-parallel methods while remaining native to sparse attention.
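The 75 percent figure is consistent with a simple count of collectives, sketched below. The sketch assumes that Ulysses-style sequence parallelism issues four equally sized All-to-All exchanges per attention layer (one each for Q, K, V and one for the attention output), and that the single All-to-All in SSP moves one tensor of that same size. The second assumption is ours: the abstract states the reduction but not how it is counted. `all_to_all_volume`, `reduction`, and their parameters are hypothetical names.

```python
def all_to_all_volume(num_collectives, seq_len, hidden, world_size):
    """Elements one rank sends over all collectives in one attention layer.

    Each rank holds seq_len * hidden / world_size elements per tensor and
    keeps 1 / world_size of them locally during an All-to-All.
    """
    per_tensor = seq_len * hidden / world_size
    sent_fraction = (world_size - 1) / world_size
    return num_collectives * per_tensor * sent_fraction


def reduction(baseline, candidate):
    """Fractional reduction of candidate relative to baseline."""
    return 1.0 - candidate / baseline


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    n, d, p = 65536, 2048, 8
    ulysses = all_to_all_volume(4, n, d, p)  # Q, K, V, output
    ssp = all_to_all_volume(1, n, d, p)      # assumed single exchange
    print(f"reduction: {reduction(ulysses, ssp):.0%}")
```

Because the per-tensor size cancels, the ratio is 1/4 for any sequence length, hidden size, or rank count under this accounting. If the paper counts volume differently, the arithmetic above would need to change.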

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fixed-pattern sparsity plus single-All-to-All parallelism could be applied to image or audio diffusion models to lower memory use at comparable quality.
Because the pattern is static, hardware-specific kernels could be written once and reused across many model scales without retraining the attention layout.
The combination of 8-bit quantization and sparse fine-tuning may allow deployment of these models on consumer GPUs that previously could not hold a full dense attention map.
Extending the same locality assumption to video lengths beyond five seconds would test whether the spatial sparsity continues to suffice or whether temporal sparsity must be added.

Load-bearing premise

The fixed sparse pattern in Skiparse-2D keeps enough spatial information that Mix-GRPO fine-tuning can restore quality to or above the dense baseline without introducing new artifacts.
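Mix-GRPO belongs to the GRPO family, whose common ingredient is a group-relative advantage: several samples are generated for one prompt, each is scored by a reward model, and each reward is standardized against its own group. The sketch below shows only that generic step. It is not the Mix-GRPO procedure itself, and `group_relative_advantages` and `eps` are illustrative names rather than anything taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of samples.

    Samples that score above the group mean get a positive advantage,
    samples below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The premise above amounts to a bet that updates driven by such advantages can push a sparse model back toward dense-model quality, which is exactly what an ablation without the RL step would test.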

What would settle it

Run the sparse model without the Mix-GRPO step on the same 5-second 720P prompts and measure whether VBench total falls more than 1 percent or whether human raters detect increased artifacts relative to the dense Wan2.1 baseline.
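The quantitative half of this test can be written down as a small check, shown below as a sketch. It reads "1 percent" as one absolute percentage point of VBench total, which is an assumption, and it takes per-run totals as plain lists. `vbench_drop`, `premise_fails`, and `threshold_points` are hypothetical names, and no scores are supplied because the page reports none for the no-RL variant.

```python
from statistics import mean


def vbench_drop(dense_totals, sparse_totals):
    """Mean VBench-total drop of the sparse model, in percentage points."""
    return mean(dense_totals) - mean(sparse_totals)


def premise_fails(dense_totals, sparse_totals, threshold_points=1.0):
    """True when the sparse model without RL drops more than the threshold."""
    return vbench_drop(dense_totals, sparse_totals) > threshold_points
```

Passing several runs per model rather than a single score also makes it possible to report the spread, which the referee report on this page asks for.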

Original abstract

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64$\times$ single-GPU speedup and over 1.52$\times$ eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69$\times$ and 2.27$\times$ speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OSP-Next is a practical engineering stack for sparse video diffusion with new SSP parallelism and HiF8 quantization, but the abstract supplies no ablations or controls to verify the quality recovery claims.

OSP-Next combines Skiparse-2D sparse attention, Sparse Sequence Parallelism that cuts communication volume by 75% versus Ulysses, HiF8 8-bit quantization for joint training, and Mix-GRPO fine-tuning. The reported results are a VBench total of 83.73%, above the Wan2.1 baseline, plus speedups of up to 1.64× on a single GPU and over 1.52× on eight GPUs on H200 for 5-second 720P and 768P clips. The HiF8 variant reaches 1.69× and 2.27× speedups on a single Ascend 950PR at a 0.4% VBench cost.

The concrete new pieces are the SSP All-to-All pattern that stays native to the sparse layout and the HiF8 scheme that keeps training stable under quantization. Those are incremental extensions rather than a new framework, but they address real bottlenecks in sparse attention deployment.

The work shows cross-platform speedups while staying close to the dense baseline on the reported metric. That is useful for anyone who needs lower inference cost on existing hardware.

The soft spot is the complete absence of experimental detail: no ablation tables, no error bars, no dataset descriptions, and no evidence that the fixed sparse pattern plus Mix-GRPO recovers quality without new artifacts or distribution shift. The central assumption that fine-tuning closes the gap therefore cannot be checked from the given text.

This is for groups building production video models who care about measured speed on H200 or Ascend. A reader who needs the numbers to hold up under scrutiny would get value once the controls appear. It deserves peer review so the missing experimental sections can be examined.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces OSP-Next, a text-to-video diffusion transformer that combines a hybrid full-sparse attention architecture using Skiparse-2D (fixed-pattern token- and group-wise sparsity along spatial dimensions), Sparse Sequence Parallelism (SSP) that reduces communication volume by 75% via All-to-All, HiF8 8-bit quantization for stable training, and Mix-GRPO reinforcement learning post-training. It claims a VBench total score of 83.73% (surpassing Wan2.1), single-GPU speedups up to 1.64× and 8-GPU speedups over 1.52× on H200 GPUs for 5s 720P/768P settings, and 1.69×/2.27× speedups on Ascend 950PR with only 0.4% VBench drop for the HiF8 variant.

Significance. If the empirical results are reproducible with proper controls, the work would demonstrate a practical engineering integration of sparsity, sequence parallelism, quantization, and RL fine-tuning that delivers measurable efficiency gains while preserving video generation quality across NVIDIA and Ascend hardware. The native compatibility of Skiparse-2D with FlashAttention and the communication reduction in SSP represent concrete implementation advances.

major comments (1)

[Abstract] Abstract and experimental reporting: the central performance claims (VBench 83.73%, speedups of 1.64×/1.52× and 1.69×/2.27×) are presented without any ablation tables, error bars, dataset descriptions, training details, or component-wise breakdowns. This absence prevents verification of whether Skiparse-2D plus Mix-GRPO actually recovers quality close to the dense baseline or whether the reported speedups are load-bearing outcomes of SSP and HiF8.

minor comments (1)

Notation for Skiparse-2D and SSP could be clarified with a small diagram or pseudocode showing the token/group partitioning and the single All-to-All pattern switch.
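One way to picture the pattern switch the referee asks about is the single-process simulation below. It uses a 1D sequence for brevity and assumes two layouts that are our own stand-ins, not the paper's definitions: a token-wise layout where rank r holds tokens with `i % P == r`, and a group-wise layout where rank r holds tokens with `(i // P) % P == r`. `all_to_all` mimics the collective by transposing per-rank send buffers, and no real communication library is involved.

```python
def token_wise_owner(i, world_size):
    """Assumed token-wise layout: every world_size-th token per rank."""
    return i % world_size


def group_wise_owner(i, world_size):
    """Assumed group-wise layout: runs of world_size adjacent tokens,
    dealt out to ranks round-robin."""
    return (i // world_size) % world_size


def shard(seq_len, world_size, owner):
    """Token indices held by each rank under a given layout."""
    shards = [[] for _ in range(world_size)]
    for i in range(seq_len):
        shards[owner(i, world_size)].append(i)
    return shards


def all_to_all(send):
    """send[src][dst] -> recv[dst][src], as in an All-to-All collective."""
    world_size = len(send)
    return [[send[src][dst] for src in range(world_size)]
            for dst in range(world_size)]


def switch_pattern(shards, new_owner):
    """Re-shard from the current layout to new_owner with one All-to-All."""
    world_size = len(shards)
    send = [[[] for _ in range(world_size)] for _ in range(world_size)]
    for src, tokens in enumerate(shards):
        for i in tokens:
            send[src][new_owner(i, world_size)].append(i)
    recv = all_to_all(send)
    # Each rank concatenates what it received and restores sequence order.
    return [sorted(i for chunk in chunks for i in chunk) for chunks in recv]


if __name__ == "__main__":
    P, N = 4, 32  # illustrative sizes
    before = shard(N, P, token_wise_owner)
    after = switch_pattern(before, group_wise_owner)
    print(after == shard(N, P, group_wise_owner))
```

In this toy setting a single exchange is enough to move every rank from one layout to the other, and each rank's shard stays the same size. Whether the real SSP switch has the same structure depends on the paper's actual pattern definitions.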

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on experimental reporting. We agree that more detailed supporting evidence is needed to substantiate the central claims and will revise the manuscript accordingly.

Referee: [Abstract] Abstract and experimental reporting: the central performance claims (VBench 83.73%, speedups of 1.64×/1.52× and 1.69×/2.27×) are presented without any ablation tables, error bars, dataset descriptions, training details, or component-wise breakdowns. This absence prevents verification of whether Skiparse-2D plus Mix-GRPO actually recovers quality close to the dense baseline or whether the reported speedups are load-bearing outcomes of SSP and HiF8.

Authors: We agree that the abstract is a high-level summary and that the manuscript would benefit from explicit component-wise evidence. The full paper reports overall VBench and speedup numbers against Wan2.1 but does not currently contain dedicated ablation tables, error bars, or breakdowns isolating Skiparse-2D, SSP, HiF8, and Mix-GRPO. In the revised version we will add: (1) ablation tables measuring quality and latency with each technique enabled and disabled, (2) error bars from multiple training runs where feasible, (3) dataset descriptions and training hyperparameters, and (4) a component-wise analysis showing how closely the sparse+RL model recovers the dense baseline and which modules drive the reported speedups. These additions will directly address the verification concern.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an engineering integration of sparse attention (Skiparse-2D), Sparse Sequence Parallelism, HiF8 quantization, and Mix-GRPO reinforcement learning for video generation efficiency. No load-bearing equations, fitted parameters, or derivations are presented that reduce outputs to inputs by construction. Claims rest on empirical VBench scores and measured speedups rather than self-referential predictions or uniqueness theorems imported from prior self-work. The central argument is self-contained as a practical system combination without the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5885 in / 1184 out tokens · 37774 ms · 2026-06-29T13:02:43.819366+00:00 · methodology


[1] [1]

Sana-video: Efficient video generation with block linear diffusion transformer, 2025

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. Sana-video: Efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695, 2025

work page arXiv 2025

[2] [2]

Sparse-vdit: Unleashing the power of sparse attention to accelerate video diffusion transformers

Pengtao Chen, Xianfang Zeng, Maosen Zhao, Mingzhu Shen, Wei Cheng, Gang Yu, and Tao Chen. Sparse-vdit: Unleashing the power of sparse attention to accelerate video diffusion transformers. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2957–2965, 2026

2026

[3] [3]

Flashattention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. InInternational Conference on Learning Representations, volume 2024, pages 35549–35562, 2024

2024

[4] [4]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

2022

[5] [5]

Flex Attention: A Programming Model for Generating Optimized Attention Kernels

Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A programming model for generating optimized attention kernels.arXiv preprint arXiv:2412.05496, 2(3):4, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024

[7] [7]

Usp: A unified sequence parallelism approach for long context generative ai

Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024

work page arXiv 2024

[8] Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, and Li Yuan. Flashi2v: Fourier-guided latent shifting prevents conditional image leakage in image-to-video generation. arXiv preprint arXiv:2509.25187, 2025

[9] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

[10] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

[11] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

[12] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025

[13] Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models. arXiv preprint arXiv:2509.06040, 2025

[14] Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

[15] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ringattention with blockwise transformers for near-infinite context. In International Conference on Learning Representations, volume 2024, pages 3992–4008, 2024

[16] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. Advances in neural information processing systems, 38:40783–40818, 2026

[17] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. Advances in Neural Information Processing Systems, 38:82155–82192, 2026

[18] Yuanyong Luo, Zhongxing Zhang, Richard Wu, Hu Liu, Ying Jin, Kai Zheng, Minmin Wang, Zhanying He, Guipeng Hu, Luyao Chen, et al. Ascend hifloat8 format for deep learning. arXiv preprint arXiv:2409.16626, 2024

[19] Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

[20] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022

[21] Asit Mishra, Dusan Stosic, Simon Layton, and Paulius Micikevicius. Recipes for pre-training llms with mxfp8. arXiv preprint arXiv:2506.08027, 2025

[22] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

[25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

[26] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024

[27] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

[28] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[29] Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025

[30] Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776, 2025

[31] Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, and Bin Cui. Training-free and adaptive sparse attention for efficient long video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15982–15993, 2025

[32] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

[33] Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. Advances in Neural Information Processing Systems, 38:96965–96991, 2026

[34] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, volume 2025, pages 83048–83077, 2025

[35] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

[36] Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, et al. Sla: Beyond sparsity in diffusion transformers via fine-tunable sparse-linear attention. arXiv preprint arXiv:2509.24006, 2025

[37] Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference. arXiv preprint arXiv:2502.18137, 2025

[38] Jintao Zhang, Pengle Zhang, Jun Zhu, Jianfei Chen, et al. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. In International Conference on Learning Representations, volume 2025, pages 71566–71585, 2025

[39] Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E Gonzalez, Jianfei Chen, and Jun Zhu. Turbodiffusion: Accelerating video diffusion models by 100-200 times. arXiv preprint arXiv:2512.16093, 2025

[40] Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, and Joseph E Gonzalez. Sla2: Sparse-linear attention with learnable routing and qat. arXiv preprint arXiv:2602.12675, 2026

[41] Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention. arXiv preprint arXiv:2502.04507, 2025

[42] Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, and Hao Zhang. Faster video diffusion with trainable sparse attention. Advances in Neural Information Processing Systems, 38:152509–152534, 2026

[43] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

[44] Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117, 2025

[45] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

[46] Zangwei Zheng, Xiangyu Peng, Yuxuan Lou, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, et al. Open-sora 2.0: Training a commercial-level video generation model in $200k. arXiv preprint arXiv:2503.09642, 2025

Figure panels: Wan2.1, OSP-Next, OSP-Next-HiF8. Prompt: A low-angle tracking shot glides through knee-d...