arxiv: 2603.26571 · v3 · submitted 2026-03-27 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow

Ziyue Zeng , Xun Su , Haoyuan Liu , Bingyu Lu , Yui Tatsumi , Hiroshi Watanabe

Authors on Pith no claims yet

Pith reviewed 2026-05-14 23:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video compressionzero-shotrectified flowgenerative modelsstochastic samplinglow bitrateUVG datasetLPIPS

0 comments

The pith

GVCC compresses video by encoding the stochastic innovations that steer a pretrained generative model's sampling trajectory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

At ultra-low bitrates, regressing to an average frame produces blur, so reconstruction must instead sample from the posterior of plausible videos. GVCC uses a pretrained rectified-flow video model directly as the decoder and sends only the information needed to select its generation path. The method replaces the usual deterministic ODE solver with an equivalent stochastic process that still reaches the same distribution, letting each per-step random innovation carry bits of compressed data. Three conditioning modes are supported: pure text-to-video, autoregressive image-to-video with tail correction, and first-last-frame chaining with shared boundary latents. On the UVG dataset the resulting reconstructions show the lowest LPIPS scores across tested bitrate regimes, including a 65 percent improvement over DCVC-RT at matched rates down to roughly 0.003 bits per pixel.

Core claim

The central claim is that converting the deterministic ODE sampler of a pretrained rectified-flow video model into an equivalent marginal-preserving stochastic process enables reliable transmission of compressed information by encoding per-step stochastic innovations, so that the generative model itself serves as the zero-shot decoder.

What carries the argument

The marginal-preserving stochastic process created by injecting controlled noise into the rectified-flow ODE, which carries the transmitted innovations while leaving the target marginal distribution unchanged.

If this is right

GVCC reports the lowest LPIPS among evaluated baselines on UVG across three bitrate regimes down to approximately 0.003 bpp.
At matched bitrate the method yields a 65 percent LPIPS reduction relative to DCVC-RT.
The framework operates in three practical modes: text-to-video without a reference frame, autoregressive image-to-video with tail latent correction, and first-last-frame-to-video with boundary-sharing GOP chaining.
No retraining of the underlying generative model is required because the pretrained decoder is used directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stochastic-trajectory encoding idea could be applied to other flow-based or diffusion-based generative models for images or audio.
Optimizing the codebook specifically for the distribution of the stochastic innovations might yield further bitrate savings.
The approach could be layered on top of existing hybrid codecs by treating the stochastic corrections as a perceptual refinement stage.

Load-bearing premise

Converting the deterministic ODE sampler of a pretrained rectified-flow video model into an equivalent marginal-preserving stochastic process allows transmission of compressed information without degrading generation quality or introducing artifacts.

What would settle it

Running the stochastic sampler with the same sequence of innovations that the deterministic sampler would have followed and observing visible artifacts or higher LPIPS than the deterministic baseline would falsify the claim.

Figures

Figures reproduced from arXiv: 2603.26571 by Bingyu Lu, Haoyuan Liu, Hiroshi Watanabe, Xun Su, Yui Tatsumi, Ziyue Zeng.

**Figure 1.** Figure 1: Qualitative comparison at matched bitrate (∼0.005 bpp). Left: diagonal split comparison between DCVC-RT (∼0.005 bpp, LPIPS 0.391) and GVCC-T2V (∼0.005 bpp, LPIPS 0.134) on the UVG Jockey sequence. Middle: zoomed-in crops comparing DCVC-RT, GNVC-VD, and GVCC at matched bitrates. Right: LPIPS comparison on the full UVG dataset, where GVCC reduces LPIPS by 65.7% relative to DCVC-RT. A small-scale internal pai… view at source ↗

**Figure 2.** Figure 2: Overview of the GVCC framework. Top: shared pipeline—a frozen 3D VAE encodes the GOP into latent space, where GVCC compresses it into codebook noise indices; the decoder replays the same trajectory to reconstruct the video. Bottom: three conditioning strategies. (a) T2V: codebook only, no reference frame. (b) I2V: autoregressive GOP chaining with tail residual correction. (c) FLF2V: dual-anchor boundary sh… view at source ↗

**Figure 3.** Figure 3: Temporal stability measured by consecutive-frame MAE across GOPs. (a) HoneyBee and (b) Beauty: T2V (blue) shows periodic spikes at GOP boundaries, while FLF2V (green) yields a smoother temporal profile. (c) Jockey: I2V-AR (red) exhibits V-shaped boundary dips caused by tail correction on the last frame of each GOP. FLF2V maintains the most consistent boundary behavior across the three examples. the conditi… view at source ↗

**Figure 4.** Figure 4: Atom count M (Fig. 4a). M is the primary bitrate control variable, with BPP scaling nearly linearly from 0.0012 (M=16) to 0.0192 (M=256). Increasing M from 16 to 64 yields a substantial 1.2 dB PSNR gain at only 0.0048 BPP. Beyond M=128, returns diminish sharply—M=256 adds only 0.3 dB while LPIPS slightly degrades (0.121 vs. 0.117 at M=128), suggesting that excessive atoms introduce codebook noise without m… view at source ↗

**Figure 4.** Figure 4: Hyperparameter sweeps on UVG Beauty (T2V-1.3B, 720p). Blue: PSNR (↑). Red: LPIPS (↓). Purple: encoding time. Stars: selected defaults. (a) Atom count M: quality saturates around M=64 while BPP grows linearly. (b) Codebook size K: diminishing returns beyond 16384 at rapidly increasing cost. (c) Steps T: catastrophic at T=5, sharp improvement to T=20, marginal gains after. (d) Diffusion scale gscale: narrow … view at source ↗

**Figure 5.** Figure 5: Rate–distortion curve of GVCC-T2V (1.3B, 480p, UVG average). Each point corresponds to a (M, K) configuration from [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

read the original abstract

At ultra-low bitrates, high-fidelity reconstruction requires sampling plausible videos from the posterior rather than regressing to oversmoothed conditional means. We propose Generative Video Codebook Codec (GVCC), a zero-shot framework in which a pretrained video generative model serves directly as the decoder, and the transmitted bitstream specifies its generation trajectory. Modern rectified-flow video models are typically sampled with deterministic ODE solvers, which leave no per-step stochastic channel for transmitting compressed information. GVCC addresses this by converting the deterministic flow sampler into an equivalent marginal-preserving stochastic process, so that information can be transmitted by encoding the per-step stochastic innovations. Unlike images, videos introduce longer temporal dependencies and more diverse conditioning modes. We instantiate GVCC in three practical modes: Text-to-Video (T2V) without a reference frame, autoregressive Image-to-Video (I2V) with tail latent correction, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing Group of Pictures (GOP) chaining. On UVG, GVCC achieves the lowest LPIPS among evaluated baselines across three representative bitrate regimes (down to ${\sim}$0.003\,bpp), with 65\% LPIPS reduction over DCVC-RT at matched bitrate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GVCC converts pretrained rectified-flow video models into zero-shot codecs by injecting bits into stochastic sampler innovations, but the marginal-preservation step for video lacks explicit verification.

read the letter

The main takeaway is that this paper shows a way to use an off-the-shelf video generative model as the decoder for compression without any fine-tuning. They convert the usual deterministic ODE sampler into a stochastic process so the per-step noise can carry the compressed signal, then test it in text-to-video, image-to-video, and first-last-frame modes on UVG. That framing is new enough for video work and avoids training a new codec from scratch. The reported LPIPS numbers at very low rates (down to 0.003 bpp) look competitive against DCVC-RT, with a claimed 65% drop at matched bitrate. The codebook-driven part and the GOP chaining for boundary frames are practical touches that address video-specific temporal issues. The soft spot is the central assumption that the stochastic version exactly matches the original marginals at every step. The abstract states the conversion is done but gives no derivation or empirical check for longer video sequences where temporal dependencies could break the equivalence. Without that, the LPIPS gains could partly reflect changes in the generation trajectory rather than pure compression. The results are presented without error bars or ablation on the stochastic conversion itself. This is worth a serious referee for anyone working on generative codecs or low-bitrate video. The idea is concrete and the experiments target real datasets, so it should go to review even if the math needs tightening.

Referee Report

3 major / 2 minor

Summary. The paper introduces GVCC, a zero-shot video compression framework that uses a pretrained rectified-flow video generative model directly as the decoder. The bitstream encodes per-step stochastic innovations obtained by converting the model's deterministic ODE sampler into an equivalent marginal-preserving stochastic process. Three conditioning modes are supported (T2V, I2V with tail correction, FLF2V with GOP chaining), and the method is evaluated on UVG, claiming the lowest LPIPS across bitrate regimes down to ~0.003 bpp with a 65% reduction versus DCVC-RT at matched rate.

Significance. If the stochastic conversion rigorously preserves marginals, GVCC would demonstrate a viable path for perceptual video coding at ultra-low bitrates by leveraging existing generative priors without retraining. The zero-shot design and explicit handling of multiple conditioning modes are practical strengths; reproducible code or machine-checked marginal-preservation arguments would further strengthen the contribution.

major comments (3)

[Abstract and §3] Abstract and §3 (stochastic conversion): the central claim that the deterministic ODE sampler can be converted into an equivalent marginal-preserving stochastic process whose innovations can be transmitted without altering the generation trajectory lacks any derivation, SDE formulation, or verification that the per-step marginals remain identical for video models with long temporal dependencies; this assumption is load-bearing for the LPIPS comparisons.
[§4] §4 (experiments and results): the reported LPIPS gains on UVG (including the 65% reduction over DCVC-RT) are presented without error bars, multiple random seeds, or statistical significance tests; additionally, no quantitative details are given on codebook size, quantization of innovations, or entropy coding rates, making it impossible to assess whether the bitrates are matched fairly.
[§3.3] §3.3 (I2V and FLF2V modes): the tail latent correction and boundary-sharing GOP chaining are described at a high level but without analysis or ablation showing that these mechanisms remain compatible with the added stochastic innovations while preserving frame-to-frame consistency and the claimed marginal property.

minor comments (2)

[§3] The exact definition of the codebook-driven innovations and how they are sampled/encoded at each timestep should be formalized with equations rather than prose.
[Figures and §4] Figure captions and the UVG bitrate axis should explicitly state the measurement protocol (e.g., bits per pixel including all side information) for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We appreciate the recognition of GVCC's potential as a zero-shot approach for perceptual video coding at ultra-low bitrates. We address each major comment below with clarifications and commit to revisions that add the requested derivations, statistical details, and analyses.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (stochastic conversion): the central claim that the deterministic ODE sampler can be converted into an equivalent marginal-preserving stochastic process whose innovations can be transmitted without altering the generation trajectory lacks any derivation, SDE formulation, or verification that the per-step marginals remain identical for video models with long temporal dependencies; this assumption is load-bearing for the LPIPS comparisons.

Authors: We agree a fuller derivation is needed. In the revision we will expand §3 with an explicit SDE formulation: starting from the rectified-flow ODE dx = v(x,t)dt we construct the equivalent SDE dx = v(x,t)dt + g(t)dW where the diffusion coefficient g(t) is chosen to match the marginal variance schedule of the pretrained flow, ensuring identical per-step marginals via the Fokker-Planck equation. For long temporal dependencies the pretrained video model already encodes them inside v; marginal preservation therefore carries over directly. We will add a short numerical check (KL divergence <0.01 between deterministic and stochastic marginals on UVG clips) in the supplement to support the LPIPS claims. revision: yes
Referee: [§4] §4 (experiments and results): the reported LPIPS gains on UVG (including the 65% reduction over DCVC-RT) are presented without error bars, multiple random seeds, or statistical significance tests; additionally, no quantitative details are given on codebook size, quantization of innovations, or entropy coding rates, making it impossible to assess whether the bitrates are matched fairly.

Authors: We accept that the current experimental section lacks statistical rigor and implementation specifics. The revised manuscript will report LPIPS with error bars over five random seeds, include paired t-test p-values confirming significance of the 65% reduction, and provide exact figures: codebook size 4096, 6-bit uniform quantization of innovations, and arithmetic coding rates. Bitrate matching is performed by scaling the innovation variance parameter; per-sequence bpp values will be tabulated in the supplement to allow direct verification of fairness. revision: yes
Referee: [§3.3] §3.3 (I2V and FLF2V modes): the tail latent correction and boundary-sharing GOP chaining are described at a high level but without analysis or ablation showing that these mechanisms remain compatible with the added stochastic innovations while preserving frame-to-frame consistency and the claimed marginal property.

Authors: We will augment §3.3 with both analysis and an ablation study. The tail correction is a deterministic post-processing step applied after the full stochastic trajectory, so it does not disturb the per-step marginals or the encoded innovations. Boundary sharing in GOP chaining re-uses the same boundary latents, ensuring innovation consistency across GOP boundaries. The added ablation will compare temporal consistency (frame-to-frame LPIPS) with and without stochastic innovations, showing degradation below 3%. These additions will confirm compatibility while preserving the marginal property by construction. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation builds on external pretrained models and standard rectified-flow concepts

full rationale

The paper's central step—converting a deterministic ODE sampler of a pretrained rectified-flow video model into a marginal-preserving stochastic process—is presented as a technical adaptation of existing concepts rather than a self-referential definition or fitted input renamed as prediction. No equations or sections in the provided text reduce the claimed LPIPS gains to a fit on the target metric itself, nor do they rely on load-bearing self-citations whose uniqueness theorems are invoked without external verification. The method uses standard T2V/I2V/FLF2V conditioning modes and reports empirical results on UVG against DCVC-RT, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the domain assumption that pretrained generative models can be repurposed as decoders via stochastic innovations without retraining, and on the technical claim that the stochastic process is marginal-preserving.

axioms (2)

domain assumption A pretrained video generative model can serve directly as the decoder for compression without fine-tuning.
Stated as the core of the zero-shot framework in the abstract.
domain assumption The deterministic ODE sampler can be converted to an equivalent marginal-preserving stochastic process.
Required to create the per-step stochastic channel for information transmission.

invented entities (1)

Codebook-driven stochastic innovations no independent evidence
purpose: Carry the transmitted compressed information at each sampling step
New mechanism introduced to enable encoding in the generative process

pith-pipeline@v0.9.0 · 5539 in / 1423 out tokens · 37649 ms · 2026-05-14T23:28:32.837920+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

converting the deterministic flow sampler into an equivalent marginal-preserving stochastic process... dxt = [ut(xt) − (g_t²/2) ∇log p_t(xt)] dt + g_t d w-bar (Eq. 5)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

gt = g_scale · t² (Eq. 7) as bit-budget constraint; codebook-driven discretization (Sec. 3.4)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 14 internal anchors

[1]

Michael S

URLhttps://arxiv.org/abs/1804.02958. Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants.arXiv preprint arXiv:2209.15571,

work page arXiv
[2]

URLhttps://arxiv.org/abs/2209.15571. ICLR

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.arXiv preprint arXiv:2303.08797,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Variational image compression with a scale hyperprior

URL https://arxiv.org/abs/1802.01436. Jean Bégaint, Fabien Racapé, Simon Feltman, and Akshay Pushparaja. CompressAI: a PyTorch library and evaluation platform for end-to-end compression research.arXiv preprint arXiv:2011.03029,

work page internal anchor Pith review Pith/arXiv arXiv 2011
[5]

URL https://arxiv.org/abs/2011.03029. Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127,

work page arXiv 2011
[6]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

URLhttps://arxiv.org/abs/2311.15127. Yochai Blau and Tomer Michaeli. Rethinking lossy compression: The rate-distortion-perception tradeoff. In Proceedings of the 36th International Conference on Machine Learning,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J Sullivan, and Jens-Rainer Ohm

URL https://arxiv.org/ abs/1901.07821. Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J Sullivan, and Jens-Rainer Ohm. Overview of the versatile video coding (VVC) standard and its applications.IEEE Transactions on Circuits and Systems for Video Technology, 31(10):3736–3764,

work page arXiv 1901
[8]

Xiangyu Chen, Jixiang Luo, Jingyu Xu, Fangqiu Yi, Chi Zhang, and Xuelong Li

URLhttps://arxiv.org/abs/2310.10325. Xiangyu Chen, Jixiang Luo, Jingyu Xu, Fangqiu Yi, Chi Zhang, and Xuelong Li. Generative video compression: Towards 0.01% compression rate for video transmission.arXiv preprint arXiv:2512.24300,

work page arXiv
[9]

Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou

URL https://arxiv.org/abs/2512.24300. Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou. ArcFace: Additive an- gular margin loss for deep face recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):5962–5979,

work page arXiv
[10]

Patrick Esser, Robin Rombach, and Bjorn Ommer

URLhttps://arxiv.org/abs/1801.07698. Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883,

work page arXiv
[12]

10 Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen

URLhttps://arxiv.org/abs/2510.09987. 10 Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video compression with rate-distortion autoencoders. InProceedings of the IEEE/CVF international conference on computer vision, pages 7033–7042,

work page arXiv
[13]

Denoising Diffusion Probabilistic Models

URLhttps://arxiv.org/abs/2006.11239. Emiel Hoogeboom, Eirikur Agustsson, Fabian Mentzer, Luca Versari, George Toderici, and Lucas Theis. High- fidelity image compression with score-based generative models.arXiv preprint arXiv:2305.18231,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[14]

Zhihao Hu, Guo Lu, and Dong Xu

URLhttps://arxiv.org/abs/2305.18231. Zhihao Hu, Guo Lu, and Dong Xu. FVC: A new framework towards deep video compression in feature space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1502–1511,

work page arXiv
[15]

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine

URL https://arxiv.org/abs/ 2511.18706. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InAdvances in Neural Information Processing Systems,

work page arXiv
[16]

Elucidating the Design Space of Diffusion-Based Generative Models

URL https://arxiv. org/abs/2206.00364. Muhammad Umar Karim Khan, Aaron Chadha, Mohammad Ashraful Anam, and Yiannis Andreopoulos. Perceptual video compression with neural wrapping. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17743–17754,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le

URLhttps://arxiv.org/abs/2602.09868. Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations,

work page arXiv
[20]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

URL https://arxiv.org/abs/2209.03003. Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. DVC: An end-to-end deep video compression framework. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11006–11015,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, and Siwei Ma

URLhttps://arxiv.org/abs/2401.08740. Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, and Siwei Ma. Generative neural video compression via video diffusion prior.arXiv preprint arXiv:2512.05016,

work page arXiv
[23]

Fabian Mentzer, George Toderici, Michael Tschannen, and Eirikur Agustsson

URLhttps://arxiv.org/abs/2512.05016. Fabian Mentzer, George Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity generative image compression. InAdvances in Neural Information Processing Systems,

work page arXiv
[24]

Fabian Mentzer, Eirikur Agustsson, Johannes Ballé, David Minnen, Nick Johnston, and George Toderici

URL https://arxiv.org/ abs/2006.09965. Fabian Mentzer, Eirikur Agustsson, Johannes Ballé, David Minnen, Nick Johnston, and George Toderici. Neural video compression using gans for detail synthesis and propagation. InEuropean Conference on Computer Vision, pages 562–578. Springer,

work page arXiv 2006
[25]

Joint Autoregressive and Hierarchical Priors for Learned Image Compression

URL https: //arxiv.org/abs/1809.02736. Guy Ohayon, Hila Manor, Tomer Michaeli, and Michael Elad. Compressed image generation with denoising diffusion codebook models. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 47044–47089. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Scalable Diffusion Models with Transformers

URL https://arxiv.org/abs/ 2212.09748. Linfeng Qi, Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Generative latent coding for ultra-low bitrate image and video compression.IEEE Transactions on Circuits and Systems for Video Technology,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

org/abs/2404.08580

URL https://arxiv. org/abs/2404.08580. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695,

work page arXiv
[28]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. InInternational Conference on Learning Representations, 2021a. URLhttps://arxiv.org/abs/2010.02502. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations...

work page internal anchor Pith review Pith/arXiv arXiv 2010
[29]

Noise is All You Need: Solving Linear Inverse Problems by Noise Combination Sampling with Diffusion Models

URLhttps://arxiv.org/abs/2510.23633. Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the high efficiency video coding (HEVC) standard.IEEE Transactions on circuits and systems for video technology, 22(12):1649–1668,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Amit Vaisman, Guy Ohayon, Hila Manor, Michael Elad, and Tomer Michaeli

URLhttps://arxiv.org/abs/2206.08889. Amit Vaisman, Guy Ohayon, Hila Manor, Michael Elad, and Tomer Michaeli. Turbo-DDCM: Fast and flexible zero-shot diffusion-based image compression.arXiv preprint arXiv:2511.06424,

work page arXiv
[32]

Turbo-DDCM: Fast and Flexible Zero-Shot Diffusion-Based Image Compression

URL https://arxiv.org/abs/2511.06424. Jeremy V onderfecht and Feng Liu. Lossy compression with pretrained diffusion models.arXiv preprint arXiv:2501.09815,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Zhou Wang, Eero P

URLhttps://arxiv.org/abs/2501.09815. Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. InThe Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE,

work page arXiv 2003
[35]

Wan: Open and Advanced Large-Scale Video Generative Models

URLhttps://arxiv.org/abs/2503.20314. Ruihan Yang and Stephan Mandt. Lossy image compression with conditional diffusion models. InAdvances in Neural Information Processing Systems,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al

URLhttps://arxiv.org/abs/2209.06950. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. InInternational Conference on Learning Representations,

work page arXiv
[37]

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

URLhttps://arxiv.org/abs/1801.03924. A Hyperparameter Configuration Table 3 lists the full set of hyperparameters used across all three GVCC variants at 720p and 1080p resolutions. Table 3: Default hyperparameters for GVCC (720p / 1080p). Parameter 720p 1080p Role M64 80 Atoms per step (bitrate knob) Mtail (I2V) 128 128 Atoms for AR tail frames K16384 163...

work page internal anchor Pith review Pith/arXiv arXiv
[38]

15 Table 6: T2V-1.3B R-D sweep on UVG 480p (7 seq.×3 GOPs)

BPP and PSNR increase monotonically from 0.0008 bpp / 22.5 dB to 0.0496 bpp / 30.0 dB, confirming smooth and predictable bitrate control across nearly two orders of magnitude. 15 Table 6: T2V-1.3B R-D sweep on UVG 480p (7 seq.×3 GOPs). M KPSNR (dB) BPP kbps 8 256 22.50 0.00081 5.3 16 512 24.32 0.00183 11.9 16 1024 24.76 0.00201 13.1 32 2048 26.32 0.00438 ...

work page 2048