OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning
Pith reviewed 2026-06-29 13:02 UTC · model grok-4.3
The pith
OSP-Next pairs fixed-pattern sparse attention with reduced-communication parallelism to produce higher-quality video than the dense baseline at lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OSP-Next builds a hybrid full-sparse attention architecture whose sparse part is Skiparse-2D Attention, a fixed token-wise and group-wise pattern along spatial dimensions that remains compatible with FlashAttention. From the local equivalence property of this rearrangement it derives Sparse Sequence Parallelism, which partitions subsequences and switches patterns via a single All-to-All step that cuts communication volume by 75 percent relative to Ulysses sequence parallelism. HiF8 quantization permits stable joint 8-bit training, and Mix-GRPO reinforcement learning then lifts the sparse model back above the Wan2.1 baseline to a VBench total of 83.73 percent, delivering measured speedups of
What carries the argument
Skiparse-2D Attention, the fixed-pattern sparse mechanism that applies token-wise and group-wise sparsity along spatial dimensions while preserving FlashAttention compatibility, together with Sparse Sequence Parallelism that partitions subsequences and performs pattern switching through one All-to-All collective.
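The token-wise and group-wise patterns named above can be pictured with a small index sketch. This is an illustrative guess rather than the paper's definition: it assumes "token-wise" means strided selection with a skip factor `k`, "group-wise" means strided selection of blocks of `k` adjacent positions, and that the 2D variant applies the same rule along height and width. The function names and `k` are hypothetical. Each resulting subset is a shorter dense sequence, which is one way a fixed pattern could stay compatible with a dense kernel such as FlashAttention.

```python
def token_skip(length, k):
    """Token-wise skip (assumed): subsequence r takes indices r, r+k, r+2k, ..."""
    return [list(range(r, length, k)) for r in range(k)]


def group_skip(length, k):
    """Group-wise skip (assumed): cut indices into blocks of k, then
    subsequence r takes every k-th block starting at block r."""
    blocks = [list(range(s, min(s + k, length))) for s in range(0, length, k)]
    return [[i for block in blocks[r::k] for i in block] for r in range(k)]


def skiparse_2d(height, width, k, pattern):
    """Apply one 1D rule along both spatial axes, giving k*k disjoint
    subsets of (row, col) positions. Attention would then run densely
    inside each subset."""
    if pattern == "token":
        split = token_skip
    elif pattern == "group":
        split = group_skip
    else:
        raise ValueError("pattern must be 'token' or 'group'")
    rows, cols = split(height, k), split(width, k)
    return [[(y, x) for y in rs for x in cs] for rs in rows for cs in cols]


token_subsets = skiparse_2d(4, 4, 2, "token")
group_subsets = skiparse_2d(4, 4, 2, "group")
```

Under these assumptions a frame of N spatial tokens splits into k² subsets of N/k² tokens each, so the pairwise attention work per frame falls from N² to N²/k².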
If this is right
- The hybrid architecture reaches a VBench total score of 83.73 percent, exceeding the Wan2.1 baseline.
- Single-GPU inference reaches up to 1.64× speedup and eight-GPU inference exceeds 1.52× speedup on H200 hardware for 5-second 720P and 768P video.
- HiF8-quantized OSP-Next-HiF8 incurs only a 0.4 percent VBench drop while achieving 1.69× and 2.27× speedups on a single Ascend 950PR under the same settings.
- Sparse Sequence Parallelism reduces communication volume by 75 percent compared with prior sequence-parallel methods while remaining native to sparse attention.
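The figures in the list above can be related with simple arithmetic. The sketch below shows one accounting that reproduces the 75 percent figure, under two assumptions not stated in the source text: that Ulysses sequence parallelism performs four All-to-All exchanges per attention layer (query, key, and value before attention, output after), and that the single SSP exchange moves a shard of the same size. The tensor sizes are hypothetical placeholders. A second helper converts a reported speedup into the fraction of wall-clock time saved.

```python
def all_to_all_elements(num_exchanges, tokens, hidden, ranks):
    """Elements each rank sends, assuming every exchange moves one
    (tokens * hidden / ranks) shard."""
    return num_exchanges * tokens * hidden / ranks


def time_fraction_saved(speedup):
    """A speedup of s means the job takes 1/s of the baseline time."""
    return 1.0 - 1.0 / speedup


# Hypothetical sizes, chosen only for illustration.
TOKENS, HIDDEN, RANKS = 32768, 1536, 8

ulysses = all_to_all_elements(4, TOKENS, HIDDEN, RANKS)  # assumed 4 exchanges
ssp = all_to_all_elements(1, TOKENS, HIDDEN, RANKS)      # single exchange
comm_reduction = 1.0 - ssp / ulysses

print(f"communication reduction: {comm_reduction:.0%}")
print(f"time saved at 1.64x: {time_fraction_saved(1.64):.1%}")
```

The reduction is independent of the placeholder sizes, because tokens, hidden width, and rank count cancel in the ratio.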
Where Pith is reading between the lines
- The same fixed-pattern sparsity plus single-All-to-All parallelism could be applied to image or audio diffusion models to lower memory use at comparable quality.
- Because the pattern is static, hardware-specific kernels could be written once and reused across many model scales without retraining the attention layout.
- The combination of 8-bit quantization and sparse fine-tuning may allow deployment of these models on consumer GPUs that previously could not hold a full dense attention map.
- Extending the same locality assumption to video lengths beyond five seconds would test whether the spatial sparsity continues to suffice or whether temporal sparsity must be added.
Load-bearing premise
The fixed sparse pattern in Skiparse-2D keeps enough spatial information that Mix-GRPO fine-tuning can restore quality to or above the dense baseline without introducing new artifacts.
What would settle it
Run the sparse model without the Mix-GRPO step on the same 5-second 720P prompts and measure whether VBench total falls more than 1 percent or whether human raters detect increased artifacts relative to the dense Wan2.1 baseline.
Original abstract
Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64$\times$ single-GPU speedup and over 1.52$\times$ eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69$\times$ and 2.27$\times$ speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.
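The idea of switching between two fixed partitions with one All-to-All can be simulated without any distributed runtime. The following is a minimal single-process sketch, not the paper's SSP implementation: it assumes one subsequence per rank, a 1D token index, a token-wise owner rule `i % k`, and a group-wise owner rule `(i // k) % k`. All function names are hypothetical, and `all_to_all` here is a plain-Python stand-in for the collective.

```python
def owner_token(i, k):
    """Assumed token-wise layout: rank r holds indices with i % k == r."""
    return i % k


def owner_group(i, k):
    """Assumed group-wise layout: blocks of k indices, dealt round-robin."""
    return (i // k) % k


def all_to_all(send):
    """Simulated collective: send[src][dst] becomes recv[dst][src]."""
    p = len(send)
    return [[send[src][dst] for src in range(p)] for dst in range(p)]


def switch_pattern(shards, k, new_owner):
    """Each rank buckets its tokens by destination rank, then one
    exchange delivers every token to its owner under the new layout."""
    send = [[[i for i in shard if new_owner(i, k) == dst] for dst in range(k)]
            for shard in shards]
    recv = all_to_all(send)
    return [sorted(i for part in parts for i in part) for parts in recv]


n, k = 32, 4
token_shards = [[i for i in range(n) if owner_token(i, k) == r] for r in range(k)]
group_shards = switch_pattern(token_shards, k, owner_group)
round_trip = switch_pattern(group_shards, k, owner_token)
```

In this toy setting every rank holds n/k tokens before and after the switch, and a second call with the other owner rule restores the original layout.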
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OSP-Next, a text-to-video diffusion transformer that combines a hybrid full-sparse attention architecture using Skiparse-2D (fixed-pattern token- and group-wise sparsity along spatial dimensions), Sparse Sequence Parallelism (SSP) that reduces communication volume by 75% via All-to-All, HiF8 8-bit quantization for stable training, and Mix-GRPO reinforcement learning post-training. It claims a VBench total score of 83.73% (surpassing Wan2.1), single-GPU speedups up to 1.64× and 8-GPU speedups over 1.52× on H200 GPUs for 5s 720P/768P settings, and 1.69×/2.27× speedups on Ascend 950PR with only 0.4% VBench drop for the HiF8 variant.
Significance. If the empirical results are reproducible with proper controls, the work would demonstrate a practical engineering integration of sparsity, sequence parallelism, quantization, and RL fine-tuning that delivers measurable efficiency gains while preserving video generation quality across NVIDIA and Ascend hardware. The native compatibility of Skiparse-2D with FlashAttention and the communication reduction in SSP represent concrete implementation advances.
major comments (1)
- [Abstract] Abstract and experimental reporting: the central performance claims (VBench 83.73%, speedups of 1.64×/1.52× and 1.69×/2.27×) are presented without any ablation tables, error bars, dataset descriptions, training details, or component-wise breakdowns. This absence prevents verification of whether Skiparse-2D plus Mix-GRPO actually recovers quality close to the dense baseline or whether the reported speedups are load-bearing outcomes of SSP and HiF8.
minor comments (1)
- Notation for Skiparse-2D and SSP could be clarified with a small diagram or pseudocode showing the token/group partitioning and the single All-to-All pattern switch.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on experimental reporting. We agree that more detailed supporting evidence is needed to substantiate the central claims and will revise the manuscript accordingly.
- Referee: [Abstract] Abstract and experimental reporting: the central performance claims (VBench 83.73%, speedups of 1.64×/1.52× and 1.69×/2.27×) are presented without any ablation tables, error bars, dataset descriptions, training details, or component-wise breakdowns. This absence prevents verification of whether Skiparse-2D plus Mix-GRPO actually recovers quality close to the dense baseline or whether the reported speedups are load-bearing outcomes of SSP and HiF8.
  Authors: We agree that the abstract is a high-level summary and that the manuscript would benefit from explicit component-wise evidence. The full paper reports overall VBench and speedup numbers against Wan2.1 but does not currently contain dedicated ablation tables, error bars, or breakdowns isolating Skiparse-2D, SSP, HiF8, and Mix-GRPO. In the revised version we will add: (1) ablation tables measuring quality and latency when each technique is enabled/disabled, (2) error bars from multiple training runs where feasible, (3) dataset descriptions and training hyper-parameters, and (4) component-wise analysis showing how close the sparse+RL model recovers to the dense baseline and which modules drive the reported speedups. These additions will directly address the verification concern.
  Revision: yes
Circularity Check
No significant circularity detected
The paper describes an engineering integration of sparse attention (Skiparse-2D), Sparse Sequence Parallelism, HiF8 quantization, and Mix-GRPO reinforcement learning for video generation efficiency. No load-bearing equations, fitted parameters, or derivations are presented that reduce outputs to inputs by construction. Claims rest on empirical VBench scores and measured speedups rather than self-referential predictions or uniqueness theorems imported from prior self-work. The central argument is self-contained as a practical system combination without the enumerated circularity patterns.
Axiom & Free-Parameter Ledger