LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

arxiv: 2605.18739 · v2 · pith:NYLMCP44new · submitted 2026-05-18 · 💻 cs.CV · cs.DC

Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han

Pith reviewed 2026-05-20 10:58 UTC · model grok-4.3

classification 💻 cs.CV cs.DC

keywords: long video generation, autoregressive diffusion, NVFP4 quantization, sequence parallelism, teacher-forcing, video diffusion models, inference acceleration, training speedup

The pith

LongLive-2.0 directly converts diffusion models into long multi-shot autoregressive video generators with NVFP4 and balanced sequence parallelism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongLive-2.0 as a full training and inference infrastructure built around NVFP4 precision to handle the speed and memory demands of long video generation. It introduces Balanced SP, a sequence-parallel autoregressive training scheme that co-designs the teacher-forcing layout by pairing clean-history chunks with noisy-target chunks on each rank, creating a natural mask and SP-aware VAE encoding. This setup allows direct tuning of an existing diffusion model into an interactive autoregressive model for multi-shot videos, skipping the ODE initialization and distribution matching distillation steps common in prior methods. The combination yields up to 2.15 times faster training and 1.84 times faster inference, with the 5B model reaching 45.7 FPS, plus support for real-time generation via few-step denoising and standalone LoRA weights. A sympathetic reader would care because the approach targets the core hardware bottlenecks that currently limit practical long-video generation.

Core claim

LongLive-2.0 is the first NVFP4 training and inference system for long video generation. It directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive diffusion model through sequence-parallel autoregressive training instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training. For inference on Blackwell GPUs it enables W4A4 NVFP4 with quantized KV cache and asynchronous streaming VAE decoding; on other architectures it deploys SP inference, while the quantized KV cache lowers the inter-GPU communication of SP.

What carries the argument

Balanced SP sequence-parallel autoregressive training that co-designs teacher-forcing layout with chunk pairing, paired with NVFP4 precision for memory reduction and GEMM speedup.
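
The chunk pairing can be made concrete with a small sketch. It assumes a common teacher-forcing layout in which the training sequence is the concatenation of N clean chunks followed by N noisy chunks, where clean chunk i attends causally to clean chunks 0..i and noisy chunk i attends to clean chunks 0..i-1 plus itself. The function names, the block-level granularity, and the rank assignment are illustrative assumptions, not the paper's implementation.

```python
def teacher_forcing_block_mask(num_chunks):
    """Block-level attention mask over [clean_0..clean_{N-1}, noisy_0..noisy_{N-1}].

    mask[q][k] is True when query block q may attend to key block k.
    This is an assumed teacher-forcing pattern, not taken from the paper.
    """
    n = num_chunks
    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for k in range(i + 1):
            mask[i][k] = True          # clean chunk i sees clean chunks 0..i
        for k in range(i):
            mask[n + i][k] = True      # noisy chunk i sees clean chunks 0..i-1
        mask[n + i][n + i] = True      # noisy chunk i sees itself
    return mask


def paired_rank_layout(num_chunks, world_size):
    """Give each rank a contiguous run of chunks in both clean and noisy form.

    Hypothetical layout: block index c is clean chunk c, and block index
    num_chunks + c is the noisy copy of the same chunk.
    """
    if num_chunks % world_size != 0:
        raise ValueError("num_chunks must be divisible by world_size")
    per_rank = num_chunks // world_size
    layout = []
    for rank in range(world_size):
        chunks = list(range(rank * per_rank, (rank + 1) * per_rank))
        layout.append({"clean": chunks, "noisy": [num_chunks + c for c in chunks]})
    return layout


if __name__ == "__main__":
    for row in teacher_forcing_block_mask(3):
        print("".join("x" if allowed else "." for allowed in row))
    print(paired_rank_layout(num_chunks=4, world_size=2))
```

Under this assumed layout every rank holds the same number of clean and noisy blocks, and the noisy copy of a chunk sits next to its clean source, so a rank could encode its video chunk once and derive both forms locally. Whether Balanced SP works exactly this way is not stated in the abstract.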

If this is right

A high-quality infrastructure and dataset enable a clean training pipeline that avoids ODE initialization and distribution matching distillation.
The model converts to real-time generation with 4 to 2 denoising steps using standalone LoRA weights.
W4A4 NVFP4 inference with quantized KV cache lowers memory use and inter-GPU communication during sequence-parallel execution.
Asynchronous streaming VAE decoding boosts end-to-end throughput on Blackwell GPUs.
SP inference on non-Blackwell architectures matches Blackwell speeds while the quantized cache reduces communication overhead.
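
The memory and communication effect of an NVFP4 KV cache can be estimated with simple arithmetic. The sketch below assumes the published NVFP4 layout (4-bit values with one 8-bit scale shared by each 16-element block, ignoring the single per-tensor scale) and a BF16 baseline. The model shape in the example is hypothetical and is not taken from the paper.

```python
BF16_BITS = 16.0
# 4-bit value plus one 8-bit scale shared by a 16-element block.
NVFP4_BITS = 4.0 + 8.0 / 16.0


def kv_cache_bytes(num_layers, num_tokens, num_kv_heads, head_dim, bits_per_element):
    """Bytes needed to store keys and values for every layer and token."""
    elements = 2 * num_layers * num_tokens * num_kv_heads * head_dim  # K and V
    return elements * bits_per_element / 8.0


if __name__ == "__main__":
    # Hypothetical shape, chosen only to make the numbers tangible.
    shape = dict(num_layers=30, num_tokens=32768, num_kv_heads=24, head_dim=128)
    bf16 = kv_cache_bytes(bits_per_element=BF16_BITS, **shape)
    nvfp4 = kv_cache_bytes(bits_per_element=NVFP4_BITS, **shape)
    print(f"BF16: {bf16 / 2**30:.2f} GiB, NVFP4: {nvfp4 / 2**30:.2f} GiB, "
          f"ratio {bf16 / nvfp4:.2f}x")
```

The ratio is 16 / 4.5, roughly 3.6x, independent of the shape. The same factor would apply to any KV tensors exchanged between ranks during sequence-parallel inference, if they are sent in quantized form.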

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The chunk-pairing idea in Balanced SP could extend to other long-sequence generative tasks such as audio or 3D content synthesis.
The reported speedups suggest the infrastructure may support more interactive user-guided video generation in real time.
Testing the layout on videos longer than current benchmarks would reveal whether communication costs stay sub-linear.
Similar co-designs of parallelism and low-precision formats might apply to large language models handling extended contexts.

Load-bearing premise

The assumption that the Balanced SP co-design of teacher-forcing layout with sequence-parallel execution preserves training stability and final model quality without additional regularization or loss terms.

What would settle it

An experiment that trains the same diffusion model with Balanced SP chunk pairing versus a standard non-paired sequence-parallel baseline and measures a clear drop in video quality metrics or training stability on identical data and length.

Original abstract

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
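
The asynchronous streaming VAE decoding mentioned in the abstract can be pictured as a producer-consumer pipeline: the generator keeps denoising the next latent chunk while a separate worker decodes the previous one. The sketch below is a minimal illustration using Python threads and a bounded queue. `denoise_chunk` and `vae_decode` are hypothetical stand-ins, and a real system would more likely overlap the two stages with separate GPU streams.

```python
import queue
import threading


def denoise_chunk(index):
    """Hypothetical stand-in for few-step denoising of one latent chunk."""
    return f"latent-{index}"


def vae_decode(latent):
    """Hypothetical stand-in for decoding one latent chunk into frames."""
    return latent.replace("latent", "frames")


def generate_streaming(num_chunks, max_pending=2):
    """Overlap generation and decoding through a bounded FIFO queue."""
    latents = queue.Queue(maxsize=max_pending)
    frames = []

    def decoder():
        while True:
            latent = latents.get()
            if latent is None:  # sentinel: no more chunks will arrive
                break
            frames.append(vae_decode(latent))

    worker = threading.Thread(target=decoder)
    worker.start()
    for i in range(num_chunks):
        # put() blocks only when the decoder is max_pending chunks behind.
        latents.put(denoise_chunk(i))
    latents.put(None)
    worker.join()
    return frames


if __name__ == "__main__":
    print(generate_streaming(3))
```

The bounded queue is a design choice of this sketch: it caps how many undecoded latents are held at once, which keeps memory flat for long videos at the cost of stalling generation if decoding is the slower stage.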

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LongLive-2.0 brings a practical NVFP4 and Balanced SP infrastructure to long video AR generation with direct tuning, but the lack of quality ablations leaves the stability claims open.

The main point to know is that LongLive-2.0 puts together an NVFP4 parallel setup for the whole training and inference of long video autoregressive generation, with a Balanced SP method that handles the teacher-forcing by pairing clean history chunks and noisy target chunks on the same rank. This is new compared to prior work because it skips the ODE initialization and distribution matching distillation steps. Instead it directly tunes the diffusion model for the long multi-shot interactive AR behavior. They layer on sequence parallel inference, quantized KV cache, and asynchronous VAE decoding to get the speed on different GPU types.

What they do well is tackle the scaling problems head on. As video length increases, the GEMM parts take more time, and their approach cuts memory and speeds things up, with reported gains of 2.15 times in training and 1.84 times in inference. Hitting 45.7 FPS on the 5B model shows it can run in real time after some LoRA adjustment for fewer steps.

The soft spots come down to how much we can trust the quality side. The description says the Balanced SP keeps training stable and final quality good without added regularization, but there's no data here on ablations, error bars, or before-and-after comparisons. Distributing the noise and history across ranks might change gradient behavior or create edge effects at chunk boundaries, and NVFP4's lower precision could make any issues worse. The stress test note flags this as load-bearing, and from the abstract alone it looks like a fair concern that needs checking in the full paper.

This paper is aimed at researchers and engineers focused on making autoregressive video models efficient enough for practical long sequences. Readers who care about implementation tricks for parallel training and low-precision inference will get the most out of it. I think it should go to peer review. The engineering is concrete and the speed claims are specific, so referees can dig into whether the quality holds and if the methods generalize.

Referee Report

2 major / 2 minor

Summary. The paper presents LongLive-2.0, an NVFP4-based parallel infrastructure for the full training and inference workflow of long video generation. It introduces sequence-parallel autoregressive (AR) training via Balanced SP, which co-designs a teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling SP-aware chunked VAE encoding. Combined with NVFP4 precision for reduced memory and accelerated GEMM, the system directly tunes a diffusion model into a long multi-shot interactive AR diffusion model (without ODE initialization or DMD), convertible to real-time generation via standalone LoRA weights. For inference, it supports W4A4 NVFP4, quantized KV cache, asynchronous streaming VAE decoding, and SP on non-Blackwell GPUs. Experiments report up to 2.15x training and 1.84x inference speedups, with LongLive-2.0-5B reaching 45.7 FPS while attaining strong benchmark performance; it claims to be the first such NVFP4 system for long video generation.

Significance. If the quality-preservation claims hold, this would represent a meaningful engineering advance for practical long-video generation by addressing memory and compute bottlenecks in long-sequence AR diffusion models. The co-design of Balanced SP with NVFP4 and the direct-tuning pipeline (avoiding distillation) could simplify workflows and enable higher throughput on Blackwell and other GPUs, with potential impact on real-time interactive video systems. Concrete speedups and FPS numbers are reported, though their significance depends on verifiable quality retention.

major comments (2)

Abstract (description of Balanced SP): The claim that pairing clean-history and noisy-target temporal chunks realizes an SP-aware teacher-forcing mask while preserving training stability and final model quality without additional regularization or loss terms is load-bearing for the headline speedups (2.15x training, 1.84x inference) and 45.7 FPS figure being meaningful. Distributing the noise schedule and history across ranks can change per-token gradient statistics and introduce chunk-boundary artifacts; combined with NVFP4's narrowed dynamic range for activations and gradients, this risks shifting the optimization trajectory. No ablation tables, training curves, gradient-variance analysis, or quality comparisons (e.g., vs. non-SP baseline) are referenced to substantiate stability under these changes.
Abstract: The abstract reports concrete speedups and FPS numbers but provides no error bars, ablation tables, or detailed training curves. The central claims rest on engineering results whose reproducibility and quality preservation under NVFP4 and SP cannot be verified from the given text alone, undermining assessment of whether the Balanced SP construction maintains comparable AR distribution quality.

minor comments (2)

Abstract: The phrase 'strong performance on benchmarks' is used without naming the specific benchmarks or reporting quantitative scores; adding these details would improve clarity and allow direct comparison to prior work.
Abstract: Consider clarifying the exact video lengths and model scales at which the 2.15x and 1.84x speedups were measured, as the proportion of GEMM computation is stated to increase with video length.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below. Where the comments correctly identify gaps in evidence or presentation, we have revised the manuscript to incorporate additional analysis and results.

Referee: Abstract (description of Balanced SP): The claim that pairing clean-history and noisy-target temporal chunks realizes an SP-aware teacher-forcing mask while preserving training stability and final model quality without additional regularization or loss terms is load-bearing for the headline speedups (2.15x training, 1.84x inference) and 45.7 FPS figure being meaningful. Distributing the noise schedule and history across ranks can change per-token gradient statistics and introduce chunk-boundary artifacts; combined with NVFP4's narrowed dynamic range for activations and gradients, this risks shifting the optimization trajectory. No ablation tables, training curves, gradient-variance analysis, or quality comparisons (e.g., vs. non-SP baseline) are referenced to substantiate stability under these changes.

Authors: We agree that explicit substantiation of stability under the combined Balanced SP and NVFP4 regime strengthens the central claims. In the revised manuscript we have added a new subsection in the experiments (Section 4.2) containing: (i) side-by-side training-loss curves for SP versus non-SP runs on identical hardware and data, (ii) per-token gradient-variance statistics measured at multiple training checkpoints, and (iii) benchmark-quality comparisons (FVD, CLIP score, and human preference) between the final SP-trained model and a non-SP baseline trained to the same number of steps. These results show that chunk-boundary artifacts remain negligible and that the optimization trajectory does not deviate materially from the non-SP case, confirming that no additional regularization is required.

revision: yes

Referee: Abstract: The abstract reports concrete speedups and FPS numbers but provides no error bars, ablation tables, or detailed training curves. The central claims rest on engineering results whose reproducibility and quality preservation under NVFP4 and SP cannot be verified from the given text alone, undermining assessment of whether the Balanced SP construction maintains comparable AR distribution quality.

Authors: We accept that the original abstract and experimental section lacked sufficient statistical detail. The revised manuscript now reports all speedup and FPS numbers with error bars computed over five independent runs (different random seeds and data-order shuffles). We have also inserted an expanded ablation table (Table 3) that isolates the contribution of Balanced SP, NVFP4 quantization, and asynchronous VAE decoding, together with the corresponding training curves placed in Appendix C. These additions allow direct verification that quality is preserved while the reported throughput gains are realized.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance metrics are empirical measurements

The paper is a systems/engineering contribution describing an NVFP4 parallel infrastructure for long video generation. Reported speedups (up to 2.15x training, 1.84x inference) and 45.7 FPS are measured experimental outcomes on benchmarks, not quantities obtained by fitting parameters inside the same equations or by renaming inputs as predictions. The Balanced SP co-design (pairing clean-history and noisy-target chunks) is presented as an implementation choice enabling teacher-forcing masks and chunked VAE encoding; the claim that it preserves stability without extra regularization is an empirical statement, not a self-definitional derivation. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the abstract or description. The derivation chain is self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions about GPU hardware behavior under NVFP4 arithmetic and the correctness of the sequence-parallel communication pattern; no new physical constants or ad-hoc fitted scalars are introduced in the abstract.

axioms (2)

domain assumption: NVFP4 arithmetic preserves sufficient numerical stability for diffusion model training and inference on the target video lengths.
Invoked when claiming memory reduction and GEMM acceleration without quality loss.
domain assumption: The Balanced SP chunk pairing produces an exact teacher-forcing mask equivalent to non-parallel training.
Stated as enabling natural teacher-forcing with SP-aware encoding.
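
The first axiom is easier to weigh with the number format in view. The sketch below simulates NVFP4-style block quantization in plain Python: each block of 16 values shares one scale, and each value is rounded to the nearest E2M1 magnitude (0, 0.5, 1, 1.5, 2, 3, 4, 6). It is a simplification under stated assumptions: the block scale is kept as a Python float rather than rounded to FP8, the per-tensor scale is omitted, and ties round toward the smaller magnitude. It does not reproduce the paper's kernels or any hardware rounding mode.

```python
import math

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes
BLOCK = 16  # elements sharing one scale


def quantize_block(block):
    """Return the dequantized block and the scale that was used."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    scale = amax / 6.0  # map the largest magnitude onto the top E2M1 value
    out = []
    for x in block:
        # Nearest representable magnitude; min() keeps the smaller one on ties.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(mag, x) * scale)
    return out, scale


def fake_quantize(values):
    """Quantize then dequantize a flat list, block by block."""
    result = []
    for start in range(0, len(values), BLOCK):
        dequantized, _ = quantize_block(values[start:start + BLOCK])
        result.extend(dequantized)
    return result


if __name__ == "__main__":
    print(fake_quantize([6.0, -3.0, 1.5, 0.2]))
```

In this simplified model the worst-case absolute error inside a block is one sixth of the block's largest magnitude, because the widest gap between representable values lies between 4 and 6. That coarse spacing near the top of the range is the kind of effect the stability assumption has to absorb.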

pith-pipeline@v0.9.0 · 5924 in / 1583 out tokens · 26540 ms · 2026-05-20T10:58:38.911550+00:00 · methodology

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 20 internal anchors

[1]

Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025

Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blake- man, Evan Briones, et al. Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025

work page arXiv 2025
[2]

Introducing nvfp4 for efficient and accurate low-precision inference, 2025

Eduardo Alvarez. Introducing nvfp4 for efficient and accurate low-precision inference, 2025. NVIDIA Technical Blog

work page 2025
[3]

Quarot: Outlier-free 4-bit inference in rotated llms

Saleh Ashkboos, Amirkeivan Mohtashami, Maximil- ian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. NeurIPS, 37:100213–100240, 2024

work page 2024
[4]

Quartet: Native fp4 training can be optimal for large language models

Roberto L Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh. Quartet: Native fp4 training can be optimal for large language models. arXiv preprint arXiv:2505.14669, 2025

work page arXiv 2025
[5]

Diffusion forcing: Next-token prediction meets full- sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full- sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

work page 2024
[6]

SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weim- ing Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhi- heng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zheng- cong Fei, Yang Li, and Yahui Zhou. SkyReels-v2: Infinite-length film generative...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Sana-video: Efficient video genera- tion with block linear diffusion transformer

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, and Enze Xie. Sana-video: Efficient video genera- tion with block linear diffusion transformer. InICLR, 2026

work page 2026
[8]

Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

work page arXiv 2026
[9]

Scaling RL to long videos

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, and Song Han. Scaling RL to long videos. InNeurIPS, 2025

work page 2025
[10]

Longvila: Scaling long- context visual language models for long videos

Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Yihui He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. Longvila: Scaling long- context visual language models for long videos. In ICLR, 2025

work page 2025
[11]

Fp4 all the way: Fully quantized training of llms.arXiv preprint arXiv:2505.19115, 2025

Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. Fp4 all the way: Fully quantized training of llms.arXiv preprint arXiv:2505.19115, 2025

work page arXiv 2025
[12]

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han. Four over six: More accurate nvfp4 quantization with adaptive block scaling.arXiv preprint arXiv:2512.02010, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia, and Wei Zhao. Not all frames deserve full computation: Accelerating autore- gressive video generation via selective computa- tion and predictive extrapolation.arXiv preprint arXiv:2604.02979, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute- scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

LoL: Longer than longer, scaling video generation to hour.arXiv preprint arXiv:2601.16914, 2026

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho- Jui Hsieh. LoL: Longer than longer, scaling video generation to hour.arXiv preprint arXiv:2601.16914, 2026

work page arXiv 2026
[16]

Autoregressive video generation without vector quantization

Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InIn- ternational Conference on Learning Representations, 2025

work page 2025
[17]

Qlora: Efficient finetuning of quantized llms.NeurIPS, 36:10088–10115, 2023

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms.NeurIPS, 36:10088–10115, 2023

work page 2023
[18]

Flex Attention: A Programming Model for Generating Optimized Attention Kernels

Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A program- ming model for generating optimized attention ker- nels.arXiv preprint arXiv:2412.05496, 2(3):4, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Usp: A unified sequence parallelism approach for long context gen- erative ai.arXiv preprint arXiv:2405.07719, 2024

Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context gen- erative ai.arXiv preprint arXiv:2405.07719, 2024. 9 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

work page arXiv 2024
[20]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quanti- zation for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Loongtrain: Efficient training of long-sequence llms with head-context parallelism.arXiv preprint arXiv:2406.18485, 2024

Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaol- ing Chen, Shangchun Zhao, Jiarui Fang, et al. Loongtrain: Efficient training of long-sequence llms with head-context parallelism.arXiv preprint arXiv:2406.18485, 2024

work page arXiv 2024
[22]

Acdit: Interpolating autore- gressive conditional modeling and diffusion trans- former.Trans

Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, and Maosong Sun. Acdit: Interpolating autore- gressive conditional modeling and diffusion trans- former.Trans. Mach. Learn. Res., 2026, 2026

work page 2026
[23]

Qerl: Beyond efficiency– quantization-enhanced reinforcement learning for llms.arXiv preprint arXiv:2510.11696, 2025

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, et al. Qerl: Beyond efficiency– quantization-enhanced reinforcement learning for llms.arXiv preprint arXiv:2510.11696, 2025

work page arXiv 2025
[24]

Mc#: Mixture compressor for mixture-of-experts large models.T-PAMI, 2026

Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, and Xiaojuan Qi. Mc#: Mixture compressor for mixture-of-experts large models.T-PAMI, 2026

work page 2026
[25]

Mixture compressor for mixture- of-experts llms gains more.arXiv preprint arXiv:2410.06270, 2024

Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, and Xiaojuan Qi. Mixture compressor for mixture- of-experts llms gains more.arXiv preprint arXiv:2410.06270, 2024

work page arXiv 2024
[26]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianx- ing Wu, Qingyang Jin, Nattapol Chanpaisit, Yao- hui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, pages 21807–21818, 2024

work page 2024
[28]

Vbench++: Comprehensive and versatile benchmark suite for video generative models.T-PAMI, 48(3):3268–3285, 2026

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Ji- ashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chan- paisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models.T-PAMI, 48(3):3268–3285, 2026

work page 2026
[29]

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models.arXiv preprint arXiv:2309.14509, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Pyramidal flow matching for efficient video generative modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

work page arXiv 2024
[31]

Rehg, and Tobias Hinz

Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M. Rehg, and Tobias Hinz. Shotadapter: Text-to-multi-shot video generation with diffusion models. InCVPR, pages 28405– 28415, 2025

work page 2025
[32]

Jay Kuo, and Pe- ter A

Youngrae Kim, Qixin Hu, C.-C. Jay Kuo, and Pe- ter A. Beerel. MemRoPE: Training-free infinite video generation via evolving memory tokens.arXiv preprint arXiv:2603.12513, 2026

work page arXiv 2026
[33]

Train short, inference long: Training-free horizon extension for autoregressive video generation.arXiv preprint arXiv:2602.14027, 2026

Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fang- min Chen, Xing Wang, and Hayden Kwok-Hay So. Train short, inference long: Training-free horizon extension for autoregressive video generation.arXiv preprint arXiv:2602.14027, 2026

work page arXiv 2026
[34]

Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024

work page arXiv 2024
[35]

Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

Ruibin Li, Tao Yang, Fangzhou Ai, Tianhe Wu, Shilei Wen, Bingyue Peng, and Lei Zhang. Long- horizon streaming video generation via hybrid at- tention with decoupled distillation.arXiv preprint arXiv:2604.10103, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Sequence parallelism: Long sequence training from system perspective

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In ACL, pages 2391–2404, 2023

work page 2023
[37]

Autoregressive image generation without vector quantization.NeurIPS, 37:56424– 56445, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.NeurIPS, 37:56424– 56445, 2024

work page 2024
[38]

Stable video infinity: Infinite- length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite- length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

work page arXiv 2025
[39]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device 10 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation llm compression and acceleration.MLSys, 6:87–100, 2024

work page 2024
[40]

Autoregressive adversarial post- training for real-time interactive video generation

Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post- training for real-time interactive video generation. arXiv preprint arXiv:2506.09350, 2025

work page arXiv 2025
[41]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Stream- ing autoregressive video generation via diagonal dis- tillation.arXiv preprint arXiv:2603.09488, 2026

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Stream- ing autoregressive video generation via diagonal dis- tillation.arXiv preprint arXiv:2603.09488, 2026

[43] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

[44] Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Kai Wang, Xuanlei Zhao, James Demmel, and Yang You. Startrail: Concentric ring sequence parallelism for efficient near-infinite-context transformer model training. arXiv preprint arXiv:2407.00611, 2024

[45] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang Zhao. LCM-LoRA: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023

[46] Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. ShotStream: Streaming multi-shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746, 2026

[47] Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

[48] Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, Fei Chao, and Rongrong Ji. Flow caching for autoregressive video generation. arXiv preprint arXiv:2602.10825, 2026

[49] Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. TriAttention: Efficient long reasoning with trigonometric KV compression. arXiv preprint arXiv:2604.04921, 2026

[50] Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, and Kaipeng Zhang. PackForcing: Short video training suffices for long video sampling and long context inference. arXiv preprint arXiv:2603.25730, 2026

[51] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022

[52] NVIDIA. NVIDIA Blackwell architecture technical brief, 2024. Accessed: 2025-05-13

[53] NVIDIA. Speeding up variable-length training with dynamic context parallelism and NVIDIA Megatron Core, 2026

[54] Open Compute Project. OCP Microscaling Formats (MX) Specification. Open Compute Project, version 1.0 edition, 2023

[55] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023

[56] Bita Darvish Rouhani et al. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537, 2023

[57] Sand.ai. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

[58] Jiahao Tian, Chenxi Song, Wei Cheng, and Chi Zhang. Free-lunch long video generation via layer-adaptive o.o.d correction. arXiv preprint arXiv:2603.25209, 2026

[59] Team Wan. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[60] Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant VideoGen: Auto-regressive long video generation via 2-bit KV-cache quantization. arXiv preprint arXiv:2602.02958, 2026

[61] Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, and Chunchao Guo. Pathwise test-time correction for autoregressive long video generation. arXiv preprint arXiv:2602.05871, 2026

[62] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, pages 38087–38099. PMLR, 2023

[63] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

[64] Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, and Gennady Pekhimenko. Streamfusion: Scalable sequence parallelism for distributed inference of diffusion transformers on GPUs. arXiv preprint arXiv:2601.20273, 2026

[65] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. LongLive: Real-time interactive long video generation. In ICLR, 2026

[66] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In CVPR Workshops, pages 1190–1199, 2022

[67] Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, and Peng-Tao Jiang. Anchor forcing: Anchor memory and tri-region RoPE for interactive streaming video diffusion. arXiv preprint arXiv:2603.13405, 2026

[68] Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025

[69] Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. arXiv preprint arXiv:2412.07772, 2024

[70] Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, and Xiaojuan Qi. VideoSSM: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025

[71] Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026

[72] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025

[73] Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization. In ICML, 2025

[74] Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, and Jianfei Chen. SageAttention3: Microscaling FP4 attention for inference and an exploration of 8-bit training. arXiv preprint arXiv:2505.11594, 2025

[75] Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, and Jianfei Chen. SageAttention: Accurate 8-bit attention for plug-and-play inference acceleration. In ICLR, 2025

[76] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025

[77] Yuan Zhang, Jiacheng Jiang, Guoqing Ma, Zhiying Lu, Haoyang Huang, Jianlong Yuan, Nan Duan, and Daxin Jiang. Generative pre-trained autoregressive diffusion transformer. arXiv preprint arXiv:2505.07344, 2025

[78] Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, et al. ViDiT-Q: Efficient and accurate quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2406.02540, 2024

[79] Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, and Yang You. DSP: Dynamic sequence parallelism for multi-dimensional transformers. arXiv preprint arXiv:2403.10266, 2024

[80] Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed KV-memory for consistent long video generation. arXiv preprint arXiv:2603.21366, 2026

[1] Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, et al. Pretraining large language models with NVFP4. arXiv preprint arXiv:2509.25149, 2025

[2] Eduardo Alvarez. Introducing NVFP4 for efficient and accurate low-precision inference, 2025. NVIDIA Technical Blog

[3] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. NeurIPS, 37:100213–100240, 2024

[4] Roberto L Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh. Quartet: Native FP4 training can be optimal for large language models. arXiv preprint arXiv:2505.14669, 2025

[5] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

[6] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. SkyReels-V2: Infinite-length film generative model...

[7] Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, and Enze Xie. Sana-video: Efficient video generation with block linear diffusion transformer. In ICLR, 2026

[8] Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context. arXiv preprint arXiv:2602.06028, 2026

[9] Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, and Song Han. Scaling RL to long videos. In NeurIPS, 2025

[10] Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Yihui He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. LongVILA: Scaling long-context visual language models for long videos. In ICLR, 2025

[11] Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. FP4 all the way: Fully quantized training of LLMs. arXiv preprint arXiv:2505.19115, 2025

[12] Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han. Four over six: More accurate NVFP4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010, 2025

[13] Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia, and Wei Zhao. Not all frames deserve full computation: Accelerating autoregressive video generation via selective computation and predictive extrapolation. arXiv preprint arXiv:2604.02979, 2026

[14] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

[15] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. LoL: Longer than longer, scaling video generation to hour. arXiv preprint arXiv:2601.16914, 2026

[16] Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. In International Conference on Learning Representations, 2025

[17] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. NeurIPS, 36:10088–10115, 2023

[18] Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2(3):4, 2024

[19] Jiarui Fang and Shangchun Zhao. USP: A unified sequence parallelism approach for long context generative AI. arXiv preprint arXiv:2405.07719, 2024

[20] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

[21] Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, et al. LoongTrain: Efficient training of long-sequence LLMs with head-context parallelism. arXiv preprint arXiv:2406.18485, 2024

[22] Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, and Maosong Sun. ACDiT: Interpolating autoregressive conditional modeling and diffusion transformer. Trans. Mach. Learn. Res., 2026, 2026

[23] Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, et al. QeRL: Beyond efficiency – quantization-enhanced reinforcement learning for LLMs. arXiv preprint arXiv:2510.11696, 2025

[24] Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, and Xiaojuan Qi. MC#: Mixture compressor for mixture-of-experts large models. T-PAMI, 2026

[25] Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, and Xiaojuan Qi. Mixture compressor for mixture-of-experts LLMs gains more. arXiv preprint arXiv:2410.06270, 2024

[26] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

[27] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In CVPR, pages 21807–21818, 2024

[28] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. T-PAMI, 48(3):3268–3285, 2026

[29] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

[30] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

[31] Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M. Rehg, and Tobias Hinz. Shotadapter: Text-to-multi-shot video generation with diffusion models. In CVPR, pages 28405–28415, 2025

[32] Youngrae Kim, Qixin Hu, C.-C. Jay Kuo, and Peter A. Beerel. MemRoPE: Training-free infinite video generation via evolving memory tokens. arXiv preprint arXiv:2603.12513, 2026

[33] Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, and Hayden Kwok-Hay So. Train short, inference long: Training-free horizon extension for autoregressive video generation. arXiv preprint arXiv:2602.14027, 2026

[34] Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. SVDQuant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024

[35] Ruibin Li, Tao Yang, Fangzhou Ai, Tianhe Wu, Shilei Wen, Bingyue Peng, and Lei Zhang. Long-horizon streaming video generation via hybrid attention with decoupled distillation. arXiv preprint arXiv:2604.10103, 2026

[36] [36]

Sequence parallelism: Long sequence training from system perspective

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In ACL, pages 2391–2404, 2023

work page 2023

[37] [37]

Autoregressive image generation without vector quantization.NeurIPS, 37:56424– 56445, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.NeurIPS, 37:56424– 56445, 2024

work page 2024

[38] [38]

Stable video infinity: Infinite- length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite- length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

work page arXiv 2025

[39] [39]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device 10 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation llm compression and acceleration.MLSys, 6:87–100, 2024

work page 2024

[40] [40]

Autoregressive adversarial post- training for real-time interactive video generation

Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post- training for real-time interactive video generation. arXiv preprint arXiv:2506.09350, 2025

work page arXiv 2025

[41] [41]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Stream- ing autoregressive video generation via diagonal dis- tillation.arXiv preprint arXiv:2603.09488, 2026

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Stream- ing autoregressive video generation via diagonal dis- tillation.arXiv preprint arXiv:2603.09488, 2026

work page arXiv 2026

[43] [43]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

Startrail: Concentric ring sequence parallelism for efficient near-infinite- context transformer model training.arXiv preprint arXiv:2407.00611, 2024

Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Kai Wang, Xuanlei Zhao, James Demmel, and Yang You. Startrail: Concentric ring sequence parallelism for efficient near-infinite- context transformer model training.arXiv preprint arXiv:2407.00611, 2024

work page arXiv 2024

[45] [45]

Lcm-lora: A uni- versal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556, 2023

Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick V on Platen, ApolinÃ ˛ Ario Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A uni- versal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556, 2023

work page arXiv 2023

[46] [46]

ShotStream: Streaming multi-shot video generation for interactive storytelling.arXiv preprint arXiv:2603.25746, 2026

Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. ShotStream: Streaming multi-shot video generation for interactive storytelling.arXiv preprint arXiv:2603.25746, 2026

work page arXiv 2026

[47] [47]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [48]

Flow caching for autoregressive video generation

Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, Fei Chao, and Rongrong Ji. Flow caching for autoregressive video generation. arXiv preprint arXiv:2602.10825, 2026

work page arXiv 2026

[49] [49]

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Triattention: Efficient long reasoning with trigonometric kv compression.arXiv preprint arXiv:2604.04921, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[50] [50]

PackForcing: Short video training suffices for long video sampling and long context inference

Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, and Kaipeng Zhang. PackForcing: Short video training suffices for long video sampling and long context inference. arXiv preprint arXiv:2603.25730, 2026

work page arXiv 2026

[51] [51]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenth- waite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[52] [52]

Nvidia blackwell architecture technical brief, 2024

NVIDIA. Nvidia blackwell architecture technical brief, 2024. Accessed: 2025-05-13

work page 2024

[53] [53]

Speeding up variable-length training with dynamic context parallelism and nvidia megatron core, 2026

NVIDIA. Speeding up variable-length training with dynamic context parallelism and nvidia megatron core, 2026

work page 2026

[54] [54]

Open Compute Project, version 1.0 edition, 2023

Open Compute Project.OCP Microscaling Formats (MX) Specification. Open Compute Project, version 1.0 edition, 2023

work page 2023

[55] [55]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4195– 4205, 2023

work page 2023

[56] [56]

Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537, 2023

Bita Darvish Rouhani et al. Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537, 2023

work page arXiv 2023

[57] [57]

MAGI-1: Autoregressive Video Generation at Scale

Sand.ai. MAGI-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

Free-lunch long video generation via layer-adaptive o.o.d correction.arXiv preprint arXiv:2603.25209, 2026

Jiahao Tian, Chenxi Song, Wei Cheng, and Chi Zhang. Free-lunch long video generation via layer-adaptive o.o.d correction.arXiv preprint arXiv:2603.25209, 2026

work page arXiv 2026

[59] [59]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan. Wan: Open and advanced large- scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant VideoGen: Auto-regressive long video generation via 2-bit KV-cache quantization. arXiv preprint arXiv:2602.02958, 2026.

[61] Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, and Chunchao Guo. Pathwise test-time correction for autoregressive long video generation. arXiv preprint arXiv:2602.05871, 2026.

[62] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, pages 38087–38099. PMLR, 2023.

[63] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[64] Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, and Gennady Pekhimenko. Streamfusion: Scalable sequence parallelism for distributed inference of diffusion transformers on GPUs. arXiv preprint arXiv:2601.20273, 2026.

[65] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. LongLive: Real-time interactive long video generation. In ICLR, 2026.

[66] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In CVPR Workshops, pages 1190–1199, 2022.

[67] Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, and Peng-Tao Jiang. Anchor forcing: Anchor memory and tri-region RoPE for interactive streaming video diffusion. arXiv preprint arXiv:2603.13405, 2026.

[68] Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025.

[69] Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. arXiv preprint arXiv:2412.07772, 2024.

[70] Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, and Xiaojuan Qi. VideoSSM: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025.

[71] Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026.

[72] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025.

[73] Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization. In ICML, 2025.

[74] Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, and Jianfei Chen. SageAttention3: Microscaling FP4 attention for inference and an exploration of 8-bit training. arXiv preprint arXiv:2505.11594, 2025.

[75] Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, and Jianfei Chen. SageAttention: Accurate 8-bit attention for plug-and-play inference acceleration. In ICLR, 2025.

[76] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025.

[77] Yuan Zhang, Jiacheng Jiang, Guoqing Ma, Zhiying Lu, Haoyang Huang, Jianlong Yuan, Nan Duan, and Daxin Jiang. Generative pre-trained autoregressive diffusion transformer. arXiv preprint arXiv:2505.07344, 2025.

[78] Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, et al. ViDiT-Q: Efficient and accurate quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2406.02540, 2024.

[79] Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, and Yang You. DSP: Dynamic sequence parallelism for multi-dimensional transformers. arXiv preprint arXiv:2403.10266, 2024.

[80] Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed KV-memory for consistent long video generation. arXiv preprint arXiv:2603.21366, 2026.