pith. machine review for the scientific record.

arxiv: 2604.08329 · v1 · submitted 2026-04-09 · 📡 eess.IV · cs.MM

Recognition: unknown

DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:17 UTC · model grok-4.3

classification 📡 eess.IV cs.MM
keywords video compression · implicit neural representations · diffusion models · low bitrate · perceptual quality · INR conditioning · generative priors
0 comments

The pith

INR conditioning of pre-trained diffusion models achieves better perceptual video quality than traditional codecs at bitrates below 0.05 bpp.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a hybrid compression method that uses compact implicit neural representations to condition pre-trained video diffusion models, replacing conventional keyframes with bit-efficient neural signals. This integration exploits the generative priors that diffusion models learn from large datasets to reconstruct videos at extremely low bitrates, where standard codecs produce poor perceptual results. Experiments across the UVG, MCL-JCV, and JVET Class-B datasets show clear gains in LPIPS, DISTS, and FID, including BD-LPIPS improvements up to 0.214 and BD-FID up to 91.14 over HEVC, while also beating VVC and prior neural and INR-only methods. The analysis further shows that conditioned diffusion first assembles scene layout and object identities before refining textures, indicating a semantic-to-visual processing order that supports faithful low-bitrate reconstruction.

Core claim

INR-conditioned diffusion-based video compression first composes the scene layout and object identities before refining textural accuracy, exposing the semantic-to-visual hierarchy that enables perceptually faithful compression at extremely low bitrates. Experiments on UVG, MCL-JCV, and JVET Class-B benchmarks demonstrate substantial improvements in perceptual metrics (LPIPS, DISTS, and FID) at extremely low bitrates, including BD-LPIPS up to 0.214 and BD-FID up to 91.14 relative to HEVC, while outperforming VVC and previous state-of-the-art neural and INR-only video codecs.
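
The BD numbers above are Bjøntegaard-style deltas. The review does not reproduce the paper's exact procedure, but a minimal sketch of the standard computation, assuming cubic fits of metric against log-bitrate, runs as follows:

```python
import numpy as np

def bd_delta(rate_ref, metric_ref, rate_test, metric_test):
    """Average vertical gap between two rate-quality curves: fit a cubic
    polynomial metric = f(log10 rate) to each codec's points, then integrate
    the difference over the overlapping log-rate interval."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(lr_ref, metric_ref, 3)
    p_test = np.polyfit(lr_test, metric_test, 3)
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())
    int_ref = np.diff(np.polyval(np.polyint(p_ref), [lo, hi]))[0]
    int_test = np.diff(np.polyval(np.polyint(p_test), [lo, hi]))[0]
    return (int_test - int_ref) / (hi - lo)

# Hypothetical four-point RD curves (bpp, LPIPS); the real values are in Fig. 4.
hevc = (np.array([0.01, 0.02, 0.04, 0.06]), np.array([0.30, 0.25, 0.20, 0.17]))
ours = (np.array([0.01, 0.02, 0.04, 0.06]), np.array([0.12, 0.10, 0.08, 0.07]))
print(bd_delta(*hevc, *ours))  # negative, since lower LPIPS is better
```

A BD-LPIPS improvement of 0.214 then corresponds to a delta of -0.214 under this sign convention.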

What carries the argument

INR-based conditioning that replaces traditional intra-coded keyframes with bit-efficient neural representations trained jointly with parameter-efficient adapters to estimate latent features and guide the diffusion process.
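
To make the moving parts concrete, here is a hedged sketch of the two trainable components: an INR as a coordinate MLP and a LoRA-style adapter around a frozen backbone layer. Layer sizes, activation choices, and the injection mechanism are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    """Hypothetical coordinate MLP standing in for the paper's INR: maps
    normalized (x, y, t) coordinates to latent conditioning features."""
    def __init__(self, hidden: int = 64, latent_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) with entries in [-1, 1]
        return self.net(coords)

class LoRALinear(nn.Module):
    """LoRA-style parameter-efficient adapter around a frozen linear layer
    of the diffusion backbone: y = Wx + (alpha / r) * B(Ax), training only
    the low-rank factors A and B."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)     # pre-trained backbone stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero perturbation
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Only the INR weights (sent per video) and the adapter factors carry new parameters; everything else is inherited from the pre-trained diffusion model.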

If this is right

  • The method delivers measurable gains in LPIPS, DISTS, and FID over HEVC at bitrates below 0.05 bpp.
  • It outperforms VVC and earlier neural or INR-only codecs on the same perceptual metrics.
  • Diffusion reconstruction under INR conditioning follows a semantic-to-visual hierarchy, first placing layout and identities, then adding texture.
  • Joint INR and adapter optimization keeps parameter overhead low while encoding video-specific information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could support progressive or layered streaming where early bits establish coarse structure and later bits add detail.
  • Similar INR conditioning might be tested on other generative models or modalities such as audio to check for comparable bitrate savings.
  • If the hierarchy holds, future codecs could allocate bits differently across semantic versus textural stages.

Load-bearing premise

Joint optimization of INR weights and parameter-efficient adapters produces reliable, generalizable conditioning signals that transfer across videos without overfitting to the training distribution or requiring per-video retraining at inference time.
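
A minimal, self-contained sketch of that joint optimization, with toy stand-ins (a frozen linear "denoiser", a small MLP INR, one low-rank adapter, additive conditioning); the paper's actual latent-diffusion objective and conditioning pathway are not reproduced:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

backbone = nn.Linear(16, 16)                  # stand-in for the pre-trained prior
for p in backbone.parameters():
    p.requires_grad_(False)                   # frozen at all times
inr = nn.Sequential(nn.Linear(3, 32), nn.SiLU(), nn.Linear(32, 16))
down = nn.Linear(16, 4, bias=False)           # LoRA-style low-rank delta
up = nn.Linear(4, 16, bias=False)
nn.init.zeros_(up.weight)

opt = torch.optim.AdamW(
    list(inr.parameters()) + list(down.parameters()) + list(up.parameters()),
    lr=1e-3,
)

for step in range(200):
    coords = torch.rand(64, 3) * 2 - 1        # sampled (x, y, t) coordinates
    latents = torch.randn(64, 16)             # stand-in video latents
    noise = torch.randn_like(latents)
    noisy = latents + noise                   # crude one-step "forward diffusion"
    cond = inr(coords)                        # bit-efficient conditioning signal
    h = noisy + cond                          # additive injection (an assumption)
    pred = backbone(h) + up(down(h))          # frozen base plus trainable delta
    loss = F.mse_loss(pred, noise)            # denoising objective
    opt.zero_grad(); loss.backward(); opt.step()
```

If the premise fails, a loop like this still converges on the training set but the adapter factors stop transferring: exactly the failure mode the following section asks an experiment to probe.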

What would settle it

Evaluating the method on a held-out video dataset drawn from a different distribution than the training data and measuring whether the reported gains in LPIPS, DISTS, and FID at under 0.05 bpp persist or collapse.
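
In code, that check is small once a decoder exists. A sketch using the widely available lpips package, with random tensors standing in for decoded and reference frames (a real evaluation would sweep held-out sequences at matched sub-0.05 bpp operating points):

```python
import torch
import lpips  # pip install lpips

metric = lpips.LPIPS(net="alex")   # standard LPIPS backbone
metric.eval()

# Hypothetical stand-ins for one decoded frame and its ground truth,
# scaled to [-1, 1] as LPIPS expects; shape (N, 3, H, W).
recon = torch.rand(1, 3, 256, 256) * 2 - 1
ref = torch.rand(1, 3, 256, 256) * 2 - 1
with torch.no_grad():
    d = metric(recon, ref)         # lower is better
print(float(d))
```

Running the same measurement on in-distribution and out-of-distribution held-out sets, at equal bitrate, is what would distinguish genuine generalization from overfitting to the training distribution.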

Figures

Figures reproduced from arXiv:2604.08329 by Christopher Schroers, Eren Çetin, Lucas Relic, Markus Gross, Roberto Azevedo, Yuanyi Xue.

Figure 1: Illustrative comparison of our approach with baselines. Our video compression method maintains pleasing details …
Figure 2: Architecture of our proposed framework. An INR-driven conditioning module and a parameter-efficient adapter …
Figure 3: Adapter placement with INR conditioning. The INR …
Figure 4: Rate-distortion curves on UVG, JVET-B, and MCL-JCV. Our approach (DiV-INR) is compared with traditional codecs …
Figure 5: Qualitative comparison on UVG, JVET-B, and MCL-JCV. Representative crops show that our approach preserves …
Figure 6: Training dynamics for DiV-INR. Snapshots across …
Original abstract

We present a perceptually-driven video compression framework integrating implicit neural representations (INRs) and pre-trained video diffusion models to address the extremely low bitrate regime (<0.05 bpp). Our approach exploits the complementary strengths of INRs, which provide a compact video representation, and diffusion models, which offer rich generative priors learned from large-scale datasets. The INR-based conditioning replaces traditional intra-coded keyframes with bit-efficient neural representations trained to estimate latent features and guide the diffusion process. Our joint optimization of INR weights and parameter-efficient adapters for diffusion models allows the model to learn reliable conditioning signals while encoding video-specific information with minimal parameter overhead. Our experiments on UVG, MCL-JCV, and JVET Class-B benchmarks demonstrate substantial improvements in perceptual metrics (LPIPS, DISTS, and FID) at extremely low bitrates, including improvements on BD-LPIPS up to 0.214 and BD-FID up to 91.14 relative to HEVC, while also outperforming VVC and previous strong state-of-the-art neural and INR-only video codecs. Moreover, our analysis shows that INR-conditioned diffusion-based video compression first composes the scene layout and object identities before refining textural accuracy, exposing the semantic-to-visual hierarchy that enables perceptually faithful compression at extremely low bitrates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DiV-INR, a perceptually-driven video compression framework that combines implicit neural representations (INRs) for compact video encoding with pre-trained video diffusion models. INR-based conditioning replaces traditional keyframes, and joint optimization of INR weights with parameter-efficient adapters generates conditioning signals to guide diffusion-based reconstruction. The central claim is that this enables superior perceptual quality (LPIPS, DISTS, FID) at extremely low bitrates (<0.05 bpp) compared to HEVC, VVC, and prior neural/INR codecs, with reported BD-LPIPS gains up to 0.214 and BD-FID up to 91.14 on UVG, MCL-JCV, and JVET Class-B benchmarks; an additional analysis highlights a semantic-to-visual generation hierarchy.

Significance. If the rate accounting and generalization claims hold, the work would meaningfully advance extreme low-bitrate video coding by showing how diffusion priors can be conditioned via compact INRs to outperform traditional codecs on perceptual metrics where pixel-level fidelity is secondary. The empirical results on standard benchmarks and the hierarchical generation observation provide concrete evidence of practical utility in this regime.

major comments (2)
  1. [§4] §4 (experimental setup and rate-distortion curves): the central claim of operating below 0.05 bpp while reporting BD-LPIPS gains of 0.214 and BD-FID of 91.14 relative to HEVC requires that the bitrate explicitly include the full transmission cost (quantization, entropy coding, and signaling) of per-video INR weights plus adapter deltas. If this overhead is omitted or undercounted, the effective operating point shifts and the comparisons to HEVC/VVC become invalid.
  2. [§3.2] §3.2 (INR conditioning and joint optimization): the claim that the learned conditioning signals transfer across videos without per-video retraining at inference rests on the assumption that the adapters and INR weights generalize reliably; no ablation or cross-video transfer experiment is described that would rule out overfitting to the training distribution, which directly affects the weakest assumption in the evaluation.
minor comments (2)
  1. [Abstract and §4.1] The abstract and §4.1 report aggregate BD metrics but do not tabulate per-sequence bitrates or list the exact HEVC/VVC encoder configurations (preset, GOP structure) used for fair comparison.
  2. [Figure 5] Figure 5 (qualitative comparison) would benefit from explicit bitrate annotations on each example to allow direct visual verification of the <0.05 bpp regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major point below with clarifications on our rate accounting and generalization assumptions. We are prepared to revise the manuscript accordingly to strengthen these aspects.

Point-by-point responses
  1. Referee: [§4] §4 (experimental setup and rate-distortion curves): the central claim of operating below 0.05 bpp while reporting BD-LPIPS gains of 0.214 and BD-FID of 91.14 relative to HEVC requires that the bitrate explicitly include the full transmission cost (quantization, entropy coding, and signaling) of per-video INR weights plus adapter deltas. If this overhead is omitted or undercounted, the effective operating point shifts and the comparisons to HEVC/VVC become invalid.

    Authors: The reported bitrates below 0.05 bpp explicitly incorporate the complete transmission costs for both the per-video INR weights and the adapter parameter deltas. INR weights are quantized to 8-bit precision and entropy-coded with a learned prior, while adapter updates are similarly compressed and signaled; all overhead from quantization, entropy coding, and metadata is included in the final bpp figures. This ensures the operating points and BD gains relative to HEVC and VVC are directly comparable. We will add an explicit bitrate-component table and pseudocode for the rate calculation in the revised §4 to eliminate any ambiguity (a sketch of such an accounting follows these responses). revision: yes

  2. Referee: [§3.2] §3.2 (INR conditioning and joint optimization): the claim that the learned conditioning signals transfer across videos without per-video retraining at inference rests on the assumption that the adapters and INR weights generalize reliably; no ablation or cross-video transfer experiment is described that would rule out overfitting to the training distribution, which directly affects the weakest assumption in the evaluation.

    Authors: The adapters are trained once on a diverse multi-video dataset and kept parameter-efficient (LoRA-style updates), enabling them to produce reliable conditioning signals for unseen videos without retraining at inference; INR weights are optimized per video but remain compact and video-specific. While the original manuscript did not contain a dedicated cross-video adapter-transfer ablation, the consistent gains across the UVG, MCL-JCV, and JVET Class-B benchmarks provide indirect evidence of generalization. We will insert a new ablation subsection in §3.2 that freezes the adapters and evaluates them on held-out videos to directly address this concern (a toy version of that parameter selection also follows below). revision: yes
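
Response 1 promises pseudocode for the rate calculation. A minimal sketch of such an accounting, with hypothetical component names; the paper's exact quantization and entropy-coding cost model is not reproduced here:

```python
def bits_per_pixel(inr_weight_bits: int, adapter_bits: int,
                   signaling_bits: int, width: int, height: int,
                   num_frames: int) -> float:
    """Effective bpp when the per-video payload is the entropy-coded INR
    weights plus the adapter deltas plus signaling metadata."""
    total_bits = inr_weight_bits + adapter_bits + signaling_bits
    return total_bits / (width * height * num_frames)

# A 1080p, 120-frame clip spans 1920 * 1080 * 120 ≈ 248.8 Mpixels, so the
# entire payload must fit under ~12.4 Mbit to stay below 0.05 bpp.
print(bits_per_pixel(9_000_000, 2_000_000, 50_000, 1920, 1080, 120))  # ≈ 0.044
```

The adapter-freezing ablation promised in response 2 reduces, in this framing, to selecting which parameters are optimized on each held-out clip (names hypothetical):

```python
import torch.nn as nn

def heldout_trainable_params(inr: nn.Module, adapters: nn.Module,
                             freeze_adapters: bool = True) -> list:
    """With adapters frozen, only the compact per-video INR is fit on the
    held-out clip: the transfer setting the referee asks to test."""
    for p in adapters.parameters():
        p.requires_grad_(not freeze_adapters)
    params = list(inr.parameters())
    if not freeze_adapters:
        params += list(adapters.parameters())
    return params
```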

Circularity Check

0 steps flagged

No circularity: empirical method validated against external codecs

Full rationale

The paper proposes an architecture that jointly optimizes INR weights with parameter-efficient adapters to condition a pre-trained diffusion model for low-bitrate video compression. All load-bearing claims (perceptual gains on UVG/MCL-JCV/JVET) are presented as outcomes of experiments that compare against independent external baselines (HEVC, VVC, prior neural codecs). No derivation chain, uniqueness theorem, or fitted parameter is invoked whose output is definitionally identical to its input; the reported BD-LPIPS and BD-FID deltas are measured quantities, not tautological re-expressions of the training objective. Self-citations, if present, are not load-bearing for the central result.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that pre-trained diffusion priors remain useful when conditioned by compact INRs and that joint optimization converges to effective adapters without introducing new failure modes at inference.

free parameters (2)
  • INR architecture and capacity hyperparameters
    Chosen to hit the target bitrate while providing useful conditioning features; directly affects the bit-efficiency claim.
  • Adapter rank and learning rate schedule
    Parameter-efficient fine-tuning knobs optimized jointly with the INR weights; they control how much video-specific information is injected. Both sets of knobs are collected into a configuration sketch after this ledger.
axioms (1)
  • domain assumption: Pre-trained video diffusion models encode sufficiently general generative priors that can be steered by external conditioning signals at inference time.
    Invoked when the paper states that diffusion models offer rich generative priors learned from large-scale datasets.
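
For orientation, the ledger's knobs could be collected into one hypothetical configuration object; every default below is illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class DiVINRConfig:
    """Hypothetical knob set mirroring the ledger above."""
    inr_hidden: int = 64        # INR capacity: trades payload bits vs conditioning fidelity
    inr_layers: int = 4
    adapter_rank: int = 8       # LoRA-style rank: how much video-specific detail is injected
    learning_rate: float = 1e-4
    lr_schedule: str = "cosine"
    target_bpp: float = 0.05    # the extreme low-bitrate regime targeted by the paper
```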

pith-pipeline@v0.9.0 · 5546 in / 1484 out tokens · 48979 ms · 2026-05-10T17:17:00.452548+00:00 · methodology

discussion (0)

