ZeroGVC: Zero-Shot Generative Video Compression with Autoregressive Diffusion Priors

Lin Liu; Qi Tian; Xiaohan Pan; Xin Li; Yixin Gao; Zhibo Chen

arxiv: 2606.22371 · v2 · pith:Q72VSXMQnew · submitted 2026-06-21 · 📡 eess.IV · cs.CV

Yixin Gao, Xiaohan Pan, Lin Liu, Xin Li, Zhibo Chen, Qi Tian

Pith reviewed 2026-06-26 10:05 UTC · model grok-4.3

keywords: generative video compression, zero-shot compression, diffusion priors, autoregressive models, low bitrate video, perceptual reconstruction, codebook guidance

The pith

ZeroGVC reconstructs video at ultra-low bitrates by steering pretrained diffusion trajectories with compact codebook noise vectors and no extra training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZeroGVC as a compression system that encodes only the first frame of each group of pictures (GOP) with a standard image codec and represents later frames by selecting small sets of reproducible noise vectors from a diffusion codebook. This selection steers the few-step denoising process toward the target frame, and because the codebook noise vectors are reproducible, the decoder can retrace the same trajectory from the selection alone. The method includes an optional bidirectional reference mode that uses the next I-frame to limit error buildup at no added bitrate. If the approach holds, it removes the need to train generative models specifically for compression while still delivering perceptually strong results on standard benchmarks. The design rests on the observation that codebook-based control remains effective when only a few denoising steps are used.

Core claim

ZeroGVC achieves superior perceptual reconstruction quality at ultra-low bitrates without any additional training by encoding the first frame of each GOP with an image codec and representing subsequent P-frames through Codebook-Guided Autoregressive Latent Compression, which selects compact combinations of reproducible codebook noise vectors to steer the latent denoising trajectory toward the target P-frame while allowing the decoder to reproduce the same trajectory in only a few denoising steps.

What carries the argument

Codebook-Guided Autoregressive Latent Compression, the mechanism that selects compact reproducible codebook noise vectors to direct the denoising trajectory in few steps.
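As a rough illustration of trajectory steering with reproducible noise, the sketch below runs a toy encoder and decoder round trip. It is not the paper's algorithm: `toy_denoiser`, the noise schedule, the codebook size, and the inner-product selection rule are all assumptions made for illustration, and a real system would call a pretrained autoregressive diffusion model where the toy denoiser stands.

```python
import numpy as np

def codebook(seed, step, size, dim):
    """Regenerate a step's noise codebook from a shared seed.

    Encoder and decoder call this with the same arguments, so the
    codebook itself never has to be transmitted.
    """
    rng = np.random.default_rng([seed, step])
    return rng.standard_normal((size, dim))

def toy_denoiser(x):
    """Hypothetical stand-in for a pretrained model's clean-latent estimate."""
    return 0.9 * x

def run(seed, dim, num_steps, size, choose):
    """Shared few-step loop; `choose` decides which codebook entry to use."""
    # Initial noise drawn from a codebook slot that no denoising step uses.
    x = codebook(seed, num_steps, 1, dim)[0]
    picked = []
    for step in range(num_steps):
        x0_hat = toy_denoiser(x)
        cb = codebook(seed, step, size, dim)
        k = choose(step, cb, x0_hat)
        picked.append(k)
        sigma = 0.5 * (1 - step / num_steps)  # assumed decreasing noise level
        x = x0_hat + sigma * cb[k]
    return picked, x

def encode(target, seed, num_steps=4, size=64):
    """Pick, at each step, the noise vector best aligned with the residual."""
    def choose(step, cb, x0_hat):
        return int(np.argmax(cb @ (target - x0_hat)))
    return run(seed, target.shape[0], num_steps, size, choose)

def decode(indices, seed, dim, num_steps=4, size=64):
    """Replay the transmitted indices; the target is never seen here."""
    def choose(step, cb, x0_hat):
        return indices[step]
    return run(seed, dim, num_steps, size, choose)[1]

target = np.random.default_rng(0).standard_normal(16)
indices, enc_latent = encode(target, seed=7)
dec_latent = decode(indices, seed=7, dim=16)
print(indices, np.array_equal(enc_latent, dec_latent))
```

Only `indices` and the shared seed cross the channel in this sketch. The decoder reaches the same latent as the encoder because both sides regenerate identical codebooks and run the same arithmetic in the same order.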

If this is right

Video sequences can be sent at ultra-low bitrates while preserving perceptual quality using only existing pretrained diffusion models.
Compression pipelines no longer require separate training stages to adapt generative priors for reconstruction tasks.
Error accumulation across frames can be reduced by bidirectional reference to the next I-frame at zero extra bitrate cost.
Low-delay decoding remains feasible because the decoder reproduces the steered trajectory in a small number of steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same codebook selection idea might let other diffusion-based generation tasks avoid fine-tuning when the target output can be reached by trajectory steering.
Updating the underlying diffusion model would automatically improve the compressor without retraining any codec components.
Real-time applications that cannot afford training runs could adopt this style of compression more readily than methods that demand per-task adaptation.

Load-bearing premise

Selecting a small set of codebook noise vectors can steer the diffusion denoising process precisely enough to match a target video frame in only a few steps.

What would settle it

Side-by-side perceptual tests at the reported low bitrates where ZeroGVC reconstructions score lower than trained generative compressors on standard video quality metrics.

Figures

Figures reproduced from arXiv: 2606.22371 by Lin Liu, Qi Tian, Xiaohan Pan, Xin Li, Yixin Gao, Zhibo Chen.

**Figure 1.** Overview of the proposed ZeroGVC framework. At the encoder, the first frame of each GOP is compressed by an image

**Figure 2.** Quantitative comparisons on HEVC Class C, HEVC Class B, UVG, and MCL-JCV.

**Figure 3.** Qualitative comparisons with baseline methods.

**Figure 4.** Ablation studies on hyperparameters. We report LPIPS and DISTS rate-distortion curves under different (a) sampling steps

**Figure 5.** Ablation study of the bidirectional reference mode under different numbers of future-reference latent frames. The dashed curves

**Figure 6.** Qualitative comparison of bidirectional reference mode. The future reference helps recover local structures and textures that are

**Figure 7.** Perceptual comparisons on HEVC Class D and Class E.

**Figure 8.** MS-SSIM comparisons on all evaluated datasets.

**Figure 9.** More qualitative comparisons with baseline methods.

**Figure 10.** More qualitative comparisons of the bidirectional reference mode.

Abstract

Recent generative video compression methods leverage powerful generative priors to achieve perceptually pleasing reconstructions. However, most existing approaches require additional training to adapt generative models to produce realistic reconstructions from compact representations. In this paper, we propose ZeroGVC, a zero-shot generative video compression framework that leverages pretrained autoregressive diffusion priors for low-delay video reconstruction. ZeroGVC encodes the first frame of each group of pictures (GOP) with an image codec and represents subsequent P-frames through Codebook-Guided Autoregressive Latent Compression. This design is motivated by our observation that the compression scheme of denoising diffusion codebook models is effective in few-step consistency sampling. By selecting compact combinations of reproducible codebook noise vectors, ZeroGVC steers the latent denoising trajectory toward the target P-frame while allowing the decoder to reproduce the same trajectory in only a few denoising steps. In addition, we design an optional bidirectional reference mode that mitigates error propagation by leveraging the next I-frame context without introducing any additional bitrate overhead. Extensive experiments on standard video compression benchmarks demonstrate that ZeroGVC achieves superior perceptual reconstruction quality at ultra-low bitrates without any additional training.
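The GOP layout described in the abstract can be pictured with a small planning function. This is a sketch under assumptions: the function name `gop_plan`, the fixed GOP size, and the choice to list only the immediately preceding frame as the past reference are illustrative, and the paper's actual conditioning window may differ.

```python
def gop_plan(num_frames, gop_size, bidirectional=False):
    """Label each frame as I or P and list the frames it may reference.

    I-frames go to an image codec. P-frames reference the previous frame
    and, in the optional bidirectional mode, the next I-frame, which is
    transmitted anyway and so adds no bits of its own.
    """
    plan = []
    for i in range(num_frames):
        start = (i // gop_size) * gop_size  # index of this GOP's I-frame
        if i == start:
            plan.append({"frame": i, "type": "I", "refs": []})
            continue
        refs = [i - 1]
        next_i = start + gop_size
        if bidirectional and next_i < num_frames:
            refs.append(next_i)
        plan.append({"frame": i, "type": "P", "refs": refs})
    return plan

for entry in gop_plan(6, 3, bidirectional=True):
    print(entry)
```

In this sketch the last GOP has no following I-frame, so its P-frames fall back to the past reference only.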

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ZeroGVC sketches a zero-shot diffusion approach for P-frame compression but the abstract gives no procedure for codebook vector selection and no numbers at all.

The main takeaway is that this paper tries to compress video P-frames at very low bitrates by picking compact sets of codebook noise vectors from a pretrained autoregressive diffusion model to steer the denoising trajectory, with the decoder reproducing the same path in a few steps. The claim is that this works zero-shot, without any extra training.

What stands out as new is the codebook-guided autoregressive latent compression step for P-frames, framed as coming from an observation about how denoising diffusion codebook models behave under few-step consistency sampling. The optional bidirectional reference mode is also a clean addition: it pulls in the next I-frame to limit error propagation while adding zero bitrate cost. These choices show some attention to low-delay constraints and to reusing existing models rather than retraining them.

The paper does a reasonable job laying out the motivation and the high-level architecture. It avoids the training overhead common in other generative compression papers, which is a practical angle if the method holds.

The soft spots are clear and central. The abstract states superior perceptual quality on standard benchmarks but shows no metrics, no baselines, and no ablations, so the performance claim cannot be checked from it. More importantly, there is no description of how the encoder actually selects the compact, reproducible codebook vectors for a given target P-frame: no distance metric, no search method, and no guarantee that the choice stays zero-shot and bitrate-efficient. If that step requires per-frame optimization or auxiliary networks, the zero-shot and ultra-low-bitrate advantages are at risk. The whole rate-distortion story rests on this step, and it is not addressed.

The work is aimed at people already working on diffusion priors for coding. A reader looking for new ways to apply pretrained models without fine-tuning might find the framework useful as an idea, but would need the full methods and results to judge whether it works. The thinking is coherent on its own terms with no obvious internal contradictions.

I would bring this to a reading group to talk through the selection mechanism. It deserves peer review so referees can see the experiments and the exact encoding procedure.

Referee Report

2 major / 2 minor

Summary. The paper proposes ZeroGVC, a zero-shot generative video compression framework that encodes I-frames of each GOP with a standard image codec and represents P-frames via Codebook-Guided Autoregressive Latent Compression. It selects compact, reproducible combinations of codebook noise vectors from pretrained denoising diffusion codebook models to steer the latent trajectory toward target P-frames, enabling decoder-side reproduction in few consistency sampling steps. An optional bidirectional reference mode is introduced to reduce error propagation without extra bitrate. The central claim is that this achieves superior perceptual reconstruction quality at ultra-low bitrates on standard benchmarks without any additional training or fine-tuning.

Significance. If the zero-shot selection procedure and few-step reproducibility claims hold with supporting experiments, the work would be significant for demonstrating how existing autoregressive diffusion priors can be directly repurposed for low-delay, training-free video compression, addressing a key limitation of prior generative codecs that require adaptation. The bidirectional mode and emphasis on perceptual quality at ultra-low rates could influence future codec design.

major comments (2)

[Abstract] The claim of 'superior perceptual reconstruction quality' and 'extensive experiments' is unsupported by any quantitative metrics, baselines, rate-distortion curves, or ablation results in the provided description, so the central performance claim rests on unverified statements.
[Method (Codebook-Guided Autoregressive Latent Compression)] No encoding algorithm, distance metric, search procedure, or bitrate accounting is supplied for selecting the compact combinations of codebook noise vectors in a zero-shot manner; if selection requires optimization or auxiliary components, the zero-shot and ultra-low-bitrate guarantees are compromised, which is load-bearing for the entire rate-distortion advantage.

minor comments (2)

Clarify the exact number of consistency steps used at the decoder and how reproducibility of the selected noise vectors is guaranteed across encoder and decoder.
Add a diagram or pseudocode for the vector selection process to make the zero-shot claim concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: [Abstract] The claim of 'superior perceptual reconstruction quality' and 'extensive experiments' is unsupported by any quantitative metrics, baselines, rate-distortion curves, or ablation results in the provided description, so the central performance claim rests on unverified statements.

Authors: The abstract is a concise summary of the work and its claims, as is standard. The full manuscript includes a dedicated Experiments section that reports quantitative perceptual metrics (e.g., LPIPS, FID), comparisons against baselines, rate-distortion curves on standard benchmarks (UVG, MCL-JCV), and ablation studies. These results support the stated claims of superior perceptual quality at ultra-low bitrates. To improve clarity, we will revise the abstract to briefly reference the experimental validation and key findings.

Revision: partial

Referee: [Method (Codebook-Guided Autoregressive Latent Compression)] No encoding algorithm, distance metric, search procedure, or bitrate accounting is supplied for selecting the compact combinations of codebook noise vectors in a zero-shot manner; if selection requires optimization or auxiliary components, the zero-shot and ultra-low-bitrate guarantees are compromised, which is load-bearing for the entire rate-distortion advantage.

Authors: We agree that additional implementation details are needed in this section to fully substantiate the zero-shot procedure. The manuscript will be revised to explicitly describe: (1) the encoding algorithm for selecting compact, reproducible codebook noise vector combinations; (2) the distance metric (latent-space L2 distance to the target P-frame latent); (3) the search procedure (deterministic nearest-neighbor lookup over the pretrained codebook without iterative optimization); and (4) bitrate accounting (fixed-length indices into the codebook, incurring no extra overhead beyond the I-frame). This selection uses only the pretrained autoregressive diffusion model in a feed-forward manner, preserving the zero-shot and training-free guarantees.

Revision: yes
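Items (2) to (4) of the simulated response can be made concrete with a short sketch. It assumes a flat codebook stored as a NumPy array and a fixed-length index code; the helper names `nearest_index` and `index_bits`, and the example sizes, are hypothetical and are not taken from the paper.

```python
import numpy as np

def nearest_index(codebook, target):
    """Deterministic nearest-neighbor lookup under squared L2 distance."""
    d2 = np.sum((codebook - target) ** 2, axis=1)
    return int(np.argmin(d2))

def index_bits(codebook_size, vectors_per_step, num_steps):
    """Bits for one P-frame when every chosen index uses a fixed-length code."""
    bits_per_index = (codebook_size - 1).bit_length()  # ceil(log2(size))
    return num_steps * vectors_per_step * bits_per_index

cb = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
print(nearest_index(cb, np.array([0.9, 1.2])))
print(index_bits(codebook_size=1024, vectors_per_step=2, num_steps=4))
```

Under these assumptions the lookup needs no gradient steps or auxiliary network, but the indices themselves still cost bits, so the per-frame rate grows with the number of steps, the number of vectors chosen per step, and the logarithm of the codebook size.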

Circularity Check

0 steps flagged

No circularity; zero-shot claim relies on external pretrained priors without self-referential reduction.

The paper presents ZeroGVC as a zero-shot framework that encodes I-frames with a standard image codec and steers P-frames via selection of codebook noise vectors from pretrained autoregressive diffusion models. The motivation is an observation about few-step consistency sampling in denoising diffusion codebook models, but no equations, fitted parameters, or self-citations are shown that would make the claimed perceptual quality or bitrate performance equivalent to the inputs by construction. The derivation remains self-contained against external benchmarks, with the selection mechanism described as leveraging existing model properties rather than redefining or fitting to the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the untested assumption that pretrained diffusion codebook models already support effective few-step steering via noise vector selection; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: Pretrained autoregressive diffusion models support effective few-step consistency sampling when guided by compact codebook noise vectors.
This observation is stated as the motivation for the compression scheme and is required for the zero-shot reconstruction to succeed.

1942

[1] Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J Sullivan, and Jens-Rainer Ohm. Overview of the versatile video coding (VVC) standard and its applications. TCSVT, 2021.

[2] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[3] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025.

[4] Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. In International Conference on Learning Representations, pages 44730–44745, 2025.

[5] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2567–2581, 2020.

[6] D Flynn, K Sharman, and C Rosewarne. Common test conditions and software reference configurations for HEVC range extensions, document JCTVC-N1006. Joint Collaborative Team Video Coding ITU-T SG, 16.

[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 27.

[8] Zongyu Guo, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Bin Li, and Yan Lu. Generative latent video compression. arXiv preprint arXiv:2510.09987, 2025.

[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

[10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[11] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. Advances in Neural Information Processing Systems, 38:167283–167308, 2026.

[12] Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, and Yan Lu. Towards practical real-time neural video compression. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12543–12552.

[13] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In International Conference on Learning Representations, pages 23378–23402, 2025.

[14] Jiahao Li, Bin Li, and Yan Lu. Deep contextual video compression. Advances in Neural Information Processing Systems, 34:18114–18125, 2021.

[15] Jiahao Li, Bin Li, and Yan Lu. Neural video compression with diverse contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22616–22626, 2023.

[16] Jiahao Li, Bin Li, and Yan Lu. Neural video compression with feature modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26099–26108, 2024.

[17] Xingchen Li, Junzhe Zhang, Junqi Shi, Ming Lu, and Zhan Ma. Yoda: Yet another one-step diffusion-based video compressor. arXiv preprint arXiv:2601.01141, 2026.

[18] Xiaoyue Ling, Chuqin Zhou, Chunyi Li, Yunuo Chen, Yuan Tian, Guo Lu, and Wenjun Zhang. Free-gvc: Towards training-free extreme generative video compression with temporal coherence. arXiv preprint arXiv:2602.09868.

[19] Wenzhuo Ma and Zhenzhong Chen. Diffusion-based perceptual neural video compression with temporal diffusion information reuse. ACM Transactions on Multimedia Computing, Communications and Applications, 21(12):1–22, 2025.

[20] Wenzhuo Ma and Zhenzhong Chen. Diffvc-osd: One-step diffusion-based perceptual neural video compression framework. arXiv preprint arXiv:2508.07682, 2025.

[21] Wenzhuo Ma and Zhenzhong Chen. Diffvc-rt: Towards practical real-time diffusion-based perceptual neural video compression. arXiv preprint arXiv:2601.20564, 2026.

[22] Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, and Siwei Ma. Generative neural video compression via video diffusion prior. arXiv preprint arXiv:2512.05016, 2025.

[23] Fabian Mentzer, George Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity generative image compression. arXiv preprint arXiv:2006.09965, 2020.

[24] Fabian Mentzer, Eirikur Agustsson, Johannes Ballé, David Minnen, Nick Johnston, and George Toderici. Neural video compression using GANs for detail synthesis and propagation. In European Conference on Computer Vision, pages 562–. Springer, 2022.

[26] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. UVG dataset: 50/120fps 4K sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, pages 297–302, 2020.

[27] Matthew J Muckley, Alaaeldin El-Nouby, Karen Ullrich, Hervé Jégou, and Jakob Verbeek. Improving statistical fidelity for neural image compression with implicit local likelihood models. In International Conference on Machine Learning, pages 25426–25443. PMLR, 2023.

[28] Guy Ohayon, Hila Manor, Tomer Michaeli, and Michael Elad. Compressed image generation with denoising diffusion codebook models. In Forty-second International Conference on Machine Learning, 2025.

[29] Linfeng Qi, Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Generative latent coding for ultra-low bitrate image and video compression. IEEE Transactions on Circuits and Systems for Video Technology, 2025.

[30] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

[31] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, pages 32211–32252. PMLR, 2023.

[32] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the high efficiency video coding (HEVC) standard. TCSVT, 22(12):1649–1668, 2012.

[33] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.

[34] Lucas Theis, Tim Salimans, Matthew D Hoffman, and Fabian Mentzer. Lossy compression with Gaussian diffusion. arXiv preprint arXiv:2206.08889, 2022.

[35] Amit Vaisman, Guy Ohayon, Hila Manor, Michael Elad, and Tomer Michaeli. Turbo-DDCM: Fast and flexible zero-shot diffusion-based image compression. In The Fourteenth International Conference on Learning Representations, 2026.

[36] Jeremy Vonderfecht and Feng Liu. Lossy compression with pretrained diffusion models. In The Thirteenth International Conference on Learning Representations, 2025.

[37] Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang, Ioannis Katsavounidis, Anne Aaron, and C-C Jay Kuo. MCL-JCV: a JND-based H.264/AVC video quality assessment dataset. In 2016 IEEE International Conference on Image Processing (ICIP), pages 1509–1513. IEEE, 2016.

[38] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 2024.

[39] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, pages 1398–1402. IEEE, 2003.

[40] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In International Conference on Computer Vision (ICCV).

[41] Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, and Yan Lu. One-step diffusion-based image compression with semantic distillation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[42] Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Zihan Zheng, Yuan Zhang, and Yan Lu. Single-step diffusion-based video coding with semantic-temporal guidance. arXiv preprint arXiv:2512.07480, 2025.

[43] Ren Yang, Radu Timofte, and Luc Van Gool. Perceptual learned video compression with recurrent conditional GAN. In IJCAI, pages 1537–1544, 2022.

[44] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974.

[45] Ziyue Zeng, Xun Su, Haoyuan Liu, Bingyu Lu, Yui Tatsumi, and Hiroshi Watanabe. GVCC: Zero-shot video compression via codebook-driven stochastic rectified flow. arXiv preprint arXiv:2603.26571, 2026.

[46] Pingping Zhang, Jinlong Li, Kecheng Chen, Meng Wang, Long Xu, Haoliang Li, Nicu Sebe, Sam Kwong, and Shiqi Wang. When video compression meets multimodal large language models: A unified paradigm for cross-modality video compression. IEEE Signal Processing Letters, 2026.

[47] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595.

[48] Tianyu Zhang, Xin Luo, Li Li, and Dong Liu. Stablecodec: Taming one-step diffusion for extreme image compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17379–17389, 2025.

[49] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214.

ZeroGVC: Zero-Shot Generative Video Compression with Autoregressive Diffusion Priors

Supplementary Material

Additional Experiments

7.1. More Quantitative Results

Fig. 7 further presents perceptual results on HEVC Class D and Class E. For the HEVC Class D dataset, whose resolution (416×240) is relatively low, we instead compute FID using 128×128 patches. On Class D, ZeroGVC consistently outperforms competing methods across LPIPS, DISTS, FID, and DOVER. Because t...
