pith. machine review for the scientific record.

arxiv: 2604.08329 · v1 · submitted 2026-04-09 · 📡 eess.IV · cs.MM

Recognition: unknown

DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:17 UTC · model grok-4.3

classification 📡 eess.IV cs.MM
keywords video compression · implicit neural representations · diffusion models · low bitrate · perceptual quality · INR conditioning · generative priors
0 comments

The pith

INR conditioning of pre-trained diffusion models achieves better perceptual video quality than traditional codecs at bitrates below 0.05 bpp.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a hybrid compression method that uses compact implicit neural representations to condition pre-trained video diffusion models, replacing conventional keyframes with bit-efficient neural signals. This integration exploits the generative priors that diffusion models learn from large datasets to reconstruct videos at extremely low bitrates, where standard codecs produce poor perceptual results. Experiments across the UVG, MCL-JCV, and JVET Class-B datasets show clear gains in LPIPS, DISTS, and FID, including BD-LPIPS improvements up to 0.214 and BD-FID up to 91.14 over HEVC, while also beating VVC and prior neural and INR-only methods. The analysis further shows that conditioned diffusion first assembles scene layout and object identities before refining textures, indicating a semantic-to-visual processing order that supports faithful low-bitrate reconstruction.

Core claim

INR-conditioned diffusion-based video compression first composes the scene layout and object identities before refining textural accuracy, exposing the semantic-to-visual hierarchy that enables perceptually faithful compression at extremely low bitrates. Experiments on UVG, MCL-JCV, and JVET Class-B benchmarks demonstrate substantial improvements in perceptual metrics (LPIPS, DISTS, and FID) at extremely low bitrates, including BD-LPIPS up to 0.214 and BD-FID up to 91.14 relative to HEVC, while outperforming VVC and previous state-of-the-art neural and INR-only video codecs.
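
The BD numbers above are Bjøntegaard-style deltas. The review does not reproduce the paper's exact procedure, but a minimal sketch of the standard computation, assuming cubic fits of metric against log-bitrate, runs as follows:

```python
import numpy as np

def bd_delta(rate_ref, metric_ref, rate_test, metric_test):
    """Average vertical gap between two rate-quality curves: fit a cubic
    polynomial metric = f(log10 rate) to each codec's points, then integrate
    the difference over the overlapping log-rate interval."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(lr_ref, metric_ref, 3)
    p_test = np.polyfit(lr_test, metric_test, 3)
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())
    int_ref = np.diff(np.polyval(np.polyint(p_ref), [lo, hi]))[0]
    int_test = np.diff(np.polyval(np.polyint(p_test), [lo, hi]))[0]
    return (int_test - int_ref) / (hi - lo)

# Hypothetical four-point RD curves (bpp, LPIPS); the real values are in Fig. 4.
hevc = (np.array([0.01, 0.02, 0.04, 0.06]), np.array([0.30, 0.25, 0.20, 0.17]))
ours = (np.array([0.01, 0.02, 0.04, 0.06]), np.array([0.12, 0.10, 0.08, 0.07]))
print(bd_delta(*hevc, *ours))  # negative, since lower LPIPS is better
```

A BD-LPIPS improvement of 0.214 then corresponds to a delta of -0.214 under this sign convention.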

What carries the argument

INR-based conditioning that replaces traditional intra-coded keyframes with bit-efficient neural representations trained jointly with parameter-efficient adapters to estimate latent features and guide the diffusion process.
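
To make the moving parts concrete, here is a hedged sketch of the two trainable components: an INR as a coordinate MLP and a LoRA-style adapter around a frozen backbone layer. Layer sizes, activation choices, and the injection mechanism are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    """Hypothetical coordinate MLP standing in for the paper's INR: maps
    normalized (x, y, t) coordinates to latent conditioning features."""
    def __init__(self, hidden: int = 64, latent_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) with entries in [-1, 1]
        return self.net(coords)

class LoRALinear(nn.Module):
    """LoRA-style parameter-efficient adapter around a frozen linear layer
    of the diffusion backbone: y = Wx + (alpha / r) * B(Ax), training only
    the low-rank factors A and B."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)     # pre-trained backbone stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero perturbation
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Only the INR weights (sent per video) and the adapter factors carry new parameters; everything else is inherited from the pre-trained diffusion model.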

If this is right

  • The method delivers measurable gains in LPIPS, DISTS, and FID over HEVC at bitrates below 0.05 bpp.
  • It outperforms VVC and earlier neural or INR-only codecs on the same perceptual metrics.
  • Diffusion reconstruction under INR conditioning follows a semantic-to-visual hierarchy, first placing layout and identities, then adding texture.
  • Joint INR and adapter optimization keeps parameter overhead low while encoding video-specific information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could support progressive or layered streaming where early bits establish coarse structure and later bits add detail.
  • Similar INR conditioning might be tested on other generative models or modalities such as audio to check for comparable bitrate savings.
  • If the hierarchy holds, future codecs could allocate bits differently across semantic versus textural stages.

Load-bearing premise

Joint optimization of INR weights and parameter-efficient adapters produces reliable, generalizable conditioning signals that transfer across videos without overfitting to the training distribution or requiring per-video retraining at inference time.
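
A minimal, self-contained sketch of that joint optimization, with toy stand-ins (a frozen linear "denoiser", a small MLP INR, one low-rank adapter, additive conditioning); the paper's actual latent-diffusion objective and conditioning pathway are not reproduced:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

backbone = nn.Linear(16, 16)                  # stand-in for the pre-trained prior
for p in backbone.parameters():
    p.requires_grad_(False)                   # frozen at all times
inr = nn.Sequential(nn.Linear(3, 32), nn.SiLU(), nn.Linear(32, 16))
down = nn.Linear(16, 4, bias=False)           # LoRA-style low-rank delta
up = nn.Linear(4, 16, bias=False)
nn.init.zeros_(up.weight)

opt = torch.optim.AdamW(
    list(inr.parameters()) + list(down.parameters()) + list(up.parameters()),
    lr=1e-3,
)

for step in range(200):
    coords = torch.rand(64, 3) * 2 - 1        # sampled (x, y, t) coordinates
    latents = torch.randn(64, 16)             # stand-in video latents
    noise = torch.randn_like(latents)
    noisy = latents + noise                   # crude one-step "forward diffusion"
    cond = inr(coords)                        # bit-efficient conditioning signal
    h = noisy + cond                          # additive injection (an assumption)
    pred = backbone(h) + up(down(h))          # frozen base plus trainable delta
    loss = F.mse_loss(pred, noise)            # denoising objective
    opt.zero_grad(); loss.backward(); opt.step()
```

If the premise fails, a loop like this still converges on the training set but the adapter factors stop transferring: exactly the failure mode the following section asks an experiment to probe.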

What would settle it

Evaluating the method on a held-out video dataset drawn from a different distribution than the training data and measuring whether the reported gains in LPIPS, DISTS, and FID at under 0.05 bpp persist or collapse.
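
In code, that check is small once a decoder exists. A sketch using the widely available lpips package, with random tensors standing in for decoded and reference frames (a real evaluation would sweep held-out sequences at matched sub-0.05 bpp operating points):

```python
import torch
import lpips  # pip install lpips

metric = lpips.LPIPS(net="alex")   # standard LPIPS backbone
metric.eval()

# Hypothetical stand-ins for one decoded frame and its ground truth,
# scaled to [-1, 1] as LPIPS expects; shape (N, 3, H, W).
recon = torch.rand(1, 3, 256, 256) * 2 - 1
ref = torch.rand(1, 3, 256, 256) * 2 - 1
with torch.no_grad():
    d = metric(recon, ref)         # lower is better
print(float(d))
```

Running the same measurement on in-distribution and out-of-distribution held-out sets, at equal bitrate, is what would distinguish genuine generalization from overfitting to the training distribution.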

Figures

Figures reproduced from arXiv:2604.08329 by Christopher Schroers, Eren Çetin, Lucas Relic, Markus Gross, Roberto Azevedo, Yuanyi Xue.

Figure 1: Illustrative comparison of our approach with baselines. Our video compression method maintains pleasing details …
Figure 2: Architecture of our proposed framework. An INR-driven conditioning module and a parameter-efficient adapter …
Figure 3: Adapter placement with INR conditioning. The INR …
Figure 4: Rate-distortion curves on UVG, JVET-B, and MCL-JCV. Our approach (DiV-INR) is compared with traditional codecs …
Figure 5: Qualitative comparison on UVG, JVET-B, and MCL-JCV. Representative crops show that our approach preserves …
Figure 6: Training dynamics for DiV-INR. Snapshots across …
Original abstract

We present a perceptually-driven video compression framework integrating implicit neural representations (INRs) and pre-trained video diffusion models to address the extremely low bitrate regime (<0.05 bpp). Our approach exploits the complementary strengths of INRs, which provide a compact video representation, and diffusion models, which offer rich generative priors learned from large-scale datasets. The INR-based conditioning replaces traditional intra-coded keyframes with bit-efficient neural representations trained to estimate latent features and guide the diffusion process. Our joint optimization of INR weights and parameter-efficient adapters for diffusion models allows the model to learn reliable conditioning signals while encoding video-specific information with minimal parameter overhead. Our experiments on UVG, MCL-JCV, and JVET Class-B benchmarks demonstrate substantial improvements in perceptual metrics (LPIPS, DISTS, and FID) at extremely low bitrates, including improvements on BD-LPIPS up to 0.214 and BD-FID up to 91.14 relative to HEVC, while also outperforming VVC and previous strong state-of-the-art neural and INR-only video codecs. Moreover, our analysis shows that INR-conditioned diffusion-based video compression first composes the scene layout and object identities before refining textural accuracy, exposing the semantic-to-visual hierarchy that enables perceptually faithful compression at extremely low bitrates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DiV-INR, a perceptually-driven video compression framework that combines implicit neural representations (INRs) for compact video encoding with pre-trained video diffusion models. INR-based conditioning replaces traditional keyframes, and joint optimization of INR weights with parameter-efficient adapters generates conditioning signals to guide diffusion-based reconstruction. The central claim is that this enables superior perceptual quality (LPIPS, DISTS, FID) at extremely low bitrates (<0.05 bpp) compared to HEVC, VVC, and prior neural/INR codecs, with reported BD-LPIPS gains up to 0.214 and BD-FID up to 91.14 on UVG, MCL-JCV, and JVET Class-B benchmarks; an additional analysis highlights a semantic-to-visual generation hierarchy.

Significance. If the rate accounting and generalization claims hold, the work would meaningfully advance extreme low-bitrate video coding by showing how diffusion priors can be conditioned via compact INRs to outperform traditional codecs on perceptual metrics where pixel-level fidelity is secondary. The empirical results on standard benchmarks and the hierarchical generation observation provide concrete evidence of practical utility in this regime.

major comments (2)
  1. [§4] §4 (experimental setup and rate-distortion curves): the central claim of operating below 0.05 bpp while reporting BD-LPIPS gains of 0.214 and BD-FID of 91.14 relative to HEVC requires that the bitrate explicitly include the full transmission cost (quantization, entropy coding, and signaling) of per-video INR weights plus adapter deltas. If this overhead is omitted or undercounted, the effective operating point shifts and the comparisons to HEVC/VVC become invalid.
  2. [§3.2] §3.2 (INR conditioning and joint optimization): the claim that the learned conditioning signals transfer across videos without per-video retraining at inference rests on the assumption that the adapters and INR weights generalize reliably; no ablation or cross-video transfer experiment is described that would rule out overfitting to the training distribution, which directly affects the weakest assumption in the evaluation.
minor comments (2)
  1. [Abstract and §4.1] The abstract and §4.1 report aggregate BD metrics but do not tabulate per-sequence bitrates or list the exact HEVC/VVC encoder configurations (preset, GOP structure) used for fair comparison.
  2. [Figure 5] Figure 5 (qualitative comparison) would benefit from explicit bitrate annotations on each example to allow direct visual verification of the <0.05 bpp regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major point below with clarifications on our rate accounting and generalization assumptions. We are prepared to revise the manuscript accordingly to strengthen these aspects.

Point-by-point responses
  1. Referee: [§4] §4 (experimental setup and rate-distortion curves): the central claim of operating below 0.05 bpp while reporting BD-LPIPS gains of 0.214 and BD-FID of 91.14 relative to HEVC requires that the bitrate explicitly include the full transmission cost (quantization, entropy coding, and signaling) of per-video INR weights plus adapter deltas. If this overhead is omitted or undercounted, the effective operating point shifts and the comparisons to HEVC/VVC become invalid.

    Authors: The reported bitrates below 0.05 bpp explicitly incorporate the complete transmission costs for both the per-video INR weights and the adapter parameter deltas. INR weights are quantized to 8-bit precision and entropy-coded with a learned prior, while adapter updates are similarly compressed and signaled; all overhead from quantization, entropy coding, and metadata is included in the final bpp figures. This ensures the operating points and BD gains relative to HEVC and VVC are directly comparable. We will add an explicit bitrate-component table and pseudocode for the rate calculation in the revised §4 to eliminate any ambiguity (a sketch of such an accounting follows these responses). revision: yes

  2. Referee: [§3.2] §3.2 (INR conditioning and joint optimization): the claim that the learned conditioning signals transfer across videos without per-video retraining at inference rests on the assumption that the adapters and INR weights generalize reliably; no ablation or cross-video transfer experiment is described that would rule out overfitting to the training distribution, which directly affects the weakest assumption in the evaluation.

    Authors: The adapters are trained once on a diverse multi-video dataset and kept parameter-efficient (LoRA-style updates), enabling them to produce reliable conditioning signals for unseen videos without retraining at inference; INR weights are optimized per video but remain compact and video-specific. While the original manuscript did not contain a dedicated cross-video adapter-transfer ablation, the consistent gains across the UVG, MCL-JCV, and JVET Class-B benchmarks provide indirect evidence of generalization. We will insert a new ablation subsection in §3.2 that freezes the adapters and evaluates them on held-out videos to directly address this concern (a toy version of that parameter selection also follows below). revision: yes
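
Response 1 promises pseudocode for the rate calculation. A minimal sketch of such an accounting, with hypothetical component names; the paper's exact quantization and entropy-coding cost model is not reproduced here:

```python
def bits_per_pixel(inr_weight_bits: int, adapter_bits: int,
                   signaling_bits: int, width: int, height: int,
                   num_frames: int) -> float:
    """Effective bpp when the per-video payload is the entropy-coded INR
    weights plus the adapter deltas plus signaling metadata."""
    total_bits = inr_weight_bits + adapter_bits + signaling_bits
    return total_bits / (width * height * num_frames)

# A 1080p, 120-frame clip spans 1920 * 1080 * 120 ≈ 248.8 Mpixels, so the
# entire payload must fit under ~12.4 Mbit to stay below 0.05 bpp.
print(bits_per_pixel(9_000_000, 2_000_000, 50_000, 1920, 1080, 120))  # ≈ 0.044
```

The adapter-freezing ablation promised in response 2 reduces, in this framing, to selecting which parameters are optimized on each held-out clip (names hypothetical):

```python
import torch.nn as nn

def heldout_trainable_params(inr: nn.Module, adapters: nn.Module,
                             freeze_adapters: bool = True) -> list:
    """With adapters frozen, only the compact per-video INR is fit on the
    held-out clip: the transfer setting the referee asks to test."""
    for p in adapters.parameters():
        p.requires_grad_(not freeze_adapters)
    params = list(inr.parameters())
    if not freeze_adapters:
        params += list(adapters.parameters())
    return params
```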

Circularity Check

0 steps flagged

No circularity: empirical method validated against external codecs

Full rationale

The paper proposes an architecture that jointly optimizes INR weights with parameter-efficient adapters to condition a pre-trained diffusion model for low-bitrate video compression. All load-bearing claims (perceptual gains on UVG/MCL-JCV/JVET) are presented as outcomes of experiments that compare against independent external baselines (HEVC, VVC, prior neural codecs). No derivation chain, uniqueness theorem, or fitted parameter is invoked whose output is definitionally identical to its input; the reported BD-LPIPS and BD-FID deltas are measured quantities, not tautological re-expressions of the training objective. Self-citations, if present, are not load-bearing for the central result.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that pre-trained diffusion priors remain useful when conditioned by compact INRs and that joint optimization converges to effective adapters without introducing new failure modes at inference.

free parameters (2)
  • INR architecture and capacity hyperparameters
    Chosen to hit the target bitrate while providing useful conditioning features; directly affects the bit-efficiency claim.
  • Adapter rank and learning rate schedule
    Parameter-efficient fine-tuning knobs optimized jointly with the INR weights; they control how much video-specific information is injected. Both sets of knobs are collected into a configuration sketch after this ledger.
axioms (1)
  • domain assumption: Pre-trained video diffusion models encode sufficiently general generative priors that can be steered by external conditioning signals at inference time.
    Invoked when the paper states that diffusion models offer rich generative priors learned from large-scale datasets.
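
For orientation, the ledger's knobs could be collected into one hypothetical configuration object; every default below is illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class DiVINRConfig:
    """Hypothetical knob set mirroring the ledger above."""
    inr_hidden: int = 64        # INR capacity: trades payload bits vs conditioning fidelity
    inr_layers: int = 4
    adapter_rank: int = 8       # LoRA-style rank: how much video-specific detail is injected
    learning_rate: float = 1e-4
    lr_schedule: str = "cosine"
    target_bpp: float = 0.05    # the extreme low-bitrate regime targeted by the paper
```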

pith-pipeline@v0.9.0 · 5546 in / 1484 out tokens · 48979 ms · 2026-05-10T17:17:00.452548+00:00 · methodology

discussion (0)

