pith. machine review for the scientific record.

arxiv: 2604.10546 · v2 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords vector quantization · rate-distortion optimization · image compression · differentiable relaxation · generative compression · low bitrate · autoregressive entropy model · perceptual quality

The pith

Making the codebook distribution differentiable allows joint rate-distortion optimization in vector-quantized image compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the longstanding separation between vector quantization for representation and separate entropy modeling, which has prevented true end-to-end rate-distortion training in this class of compressors. By introducing a differentiable relaxation of the discrete codebook selection, gradients from the rate term can now flow back and shape the latent prior during training. The resulting RDVQ framework adds an autoregressive entropy model that supplies both accurate probability estimates and practical test-time rate control. If the approach holds, it would produce lightweight models that deliver competitive perceptual quality at far lower bitrates than prior methods while using substantially fewer parameters.

Core claim

RDVQ shows that relaxing the codebook distribution to be differentiable removes the barrier between representation learning and entropy modeling in vector quantization. This permits the entropy loss to directly influence the latent prior inside a single end-to-end optimization loop. Paired with an autoregressive entropy model, the formulation supports both precise rate estimation and adjustable rate-distortion trade-offs at inference time. Experiments indicate that the resulting lightweight networks achieve strong perceptual quality at extremely low bitrates, with reported bitrate savings of up to 75.71 percent on DISTS and 37.63 percent on LPIPS relative to RDEIC on DIV2K-val.
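The core objective is concrete enough to write down. A minimal sketch of the joint loss, assuming an MSE distortion and a scalar trade-off weight `lam` (both illustrative choices, not the paper's exact formulation), with the rate term taken as the cross-entropy between the relaxed code distribution and the entropy model's prediction, as Figure 2 describes:

```python
import math

def rd_loss(x, x_hat, p_soft, q_model, lam=0.01):
    # Joint rate-distortion objective, sketched under stated assumptions:
    # distortion on the reconstruction plus a rate term given by the
    # cross-entropy between the relaxed code distribution p_soft and the
    # entropy model's prediction q_model, weighted by lam.
    n = len(x)
    distortion = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n
    rate = 0.0
    for p_row, q_row in zip(p_soft, q_model):
        # cross-entropy in bits; clip q to avoid log(0)
        rate += -sum(p * math.log2(max(q, 1e-12)) for p, q in zip(p_row, q_row))
    rate /= len(p_soft)
    return distortion + lam * rate
```

Because `p_soft` is differentiable in the encoder outputs, gradients of the rate term can reach the latent prior; with a hard one-hot assignment they would vanish, which is the barrier the paper claims to remove.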

What carries the argument

Differentiable relaxation of the codebook distribution, which lets the entropy loss directly shape the latent prior.
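Per the relaxation described in the figures (Eq. 6), codebook selection is softened into a temperature-controlled softmax over distances. A self-contained sketch, with names and shapes illustrative rather than taken from the authors' code:

```python
import math

def soft_codebook_distribution(y, codebook, tau=1.0):
    # Distance-aware soft distribution over codebook entries:
    # p_soft(k) = softmax_k(-d_k / tau), where d_k is the squared distance
    # from the latent vector y to codebook entry k. As tau -> 0 this
    # concentrates on the nearest entry, recovering hard quantization.
    d = [sum((yi - ci) ** 2 for yi, ci in zip(y, c)) for c in codebook]
    m = max(-dk / tau for dk in d)                 # stabilize the softmax
    w = [math.exp(-dk / tau - m) for dk in d]
    z = sum(w)
    return [wi / z for wi in w]
```

The hard index used at inference is simply the argmax of this distribution (equivalently the argmin of the distances), which is why the temperature controls how closely the training-time proxy tracks test-time quantization.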

If this is right

  • Entropy-constrained vector quantization becomes feasible, letting the rate term directly influence which codes are selected during training.
  • An autoregressive entropy model supplies both accurate probability estimates and test-time rate control without separate post-processing.
  • Lightweight architectures can match or exceed prior perceptual quality while using significantly fewer parameters at very low bitrates.
  • The framework unifies tokenization and compression under a single entropy-constrained objective.
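The dependency-aware mask behind the autoregressive entropy model (Figure 3) has a one-line construction, M = (o > o⊤), where o is the unified order vector. A sketch, noting the strict inequality means same-order tokens do not attend to each other:

```python
def dependency_mask(order):
    # Attention mask M = (o > o^T) from the unified order vector o:
    # M[i][j] is True when token i may attend to token j, i.e. when j
    # comes strictly earlier in the scale-wise autoregressive order.
    return [[oi > oj for oj in order] for oi in order]
```

For a plain raster order [0, 1, 2] this yields a strictly lower-triangular mask, the usual causal mask; the scale-wise ordering generalizes it across scales.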

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relaxation technique could be applied to other discrete latent models in generative pipelines where compression cost must be considered during training.
  • Because the method separates the quantization step from the entropy model only through a differentiable bridge, it may simplify integration with downstream generative tasks that already rely on discrete tokens.
  • The reported parameter efficiency suggests the approach could be attractive for on-device or bandwidth-constrained deployment scenarios that current heavier generative compressors cannot address.

Load-bearing premise

The differentiable relaxation approximates the true discrete codebook selection closely enough that the entropy loss can guide optimization without introducing bias or instability that would erase the claimed rate-distortion gains.

What would settle it

Ablating the differentiable relaxation, retraining from scratch, and checking whether the entropy term still produces the reported bitrate reductions on DIV2K-val while keeping perceptual metrics stable; instability or loss of the gains would falsify the central mechanism.

Figures

Figures reproduced from arXiv: 2604.10546 by Ce Zhu, Minghao Han, Shiyin Jiang, Shuhang Gu, Wei Long, Zhenghao Chen.

Figure 1. Gradient-based rate control in VQ compression is chal…
Figure 2. Overview of RDVQ. The analysis transform ga extracts multi-scale features, which are flattened into a sequence y for vector quantization and entropy modeling. The VQ module produces hard-quantized embeddings yq, discrete indices yind, and a relaxed distribution psoft. During training, reconstruction is performed from yq, while the rate objective is computed as the cross-entropy between the relaxed distrib…
Figure 3. Dependency-aware ordering for autoregressive entropy modeling. Scale-wise spatial orders are defined within each scale and concatenated into a unified order vector o, from which the attention mask M = (o > o⊤) is constructed. (The adjoining text defines the distance-aware soft distribution p_soft(b, l, k) = softmax_k(−d_{b,l,k} / τ), Eq. 6.)
Figure 4. Rate-distortion curves on the Kodak, DIV2K-val, and CLIC2020-test datasets.
Figure 5. Visual examples on the CLIC2020 test set. Zoom in for better view.
Figure 6. Model efficiency comparison on DIV2K [1]. RDVQ achieves the best BD-DISTS with less than 20% of the parameters of most baselines, while maintaining competitive latency.
Figure 7. Visual examples on the Kodak dataset.
Figure 8. Codebook usage becomes more concentrated under … (plot of Codebook Usage (%) versus BPP under a rate constraint).
Figure 9. Test-time rate adjustment via prefix transmission.
Figure 10. PCA visualization of the largest-scale encoder features.
Original abstract

The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code will be available at https://github.com/CVL-UESTC/RDVQ.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RDVQ, a framework for vector quantization in generative image compression that employs a differentiable relaxation of the codebook distribution to enable joint end-to-end rate-distortion optimization. It introduces an autoregressive entropy model supporting accurate modeling and test-time rate control. Experiments claim strong low-bitrate performance on DIV2K-val with a lightweight architecture, including up to 75.71% bitrate reduction on DISTS and 37.63% on LPIPS versus RDEIC, while highlighting an entropy-constrained view of VQ.

Significance. If the relaxation permits unbiased optimization without material train-test discrepancy, the work offers a principled unification of VQ tokenization and entropy modeling that could advance efficient generative compression. The lightweight architecture and reported perceptual gains at extreme low rates, combined with planned code release, represent practical strengths for the field.

major comments (2)
  1. [§3] §3 (differentiable relaxation derivation): The entropy model is trained on the soft (relaxed) codebook distribution, yet inference uses hard argmax quantization. No bound, KL(soft || hard) measurement, or empirical comparison of model-estimated rate versus actual arithmetic-coded bitrate is reported. This mismatch is load-bearing for the central claim of unbiased RD optimization and directly affects the reliability of the 75.71% and 37.63% bitrate reductions.
  2. [Abstract, §4] Abstract and §4 (experimental claims): The specific percentage reductions versus RDEIC are presented without accompanying ablation studies on the relaxation temperature, entropy model architecture, or error analysis on the rate gap. This leaves the robustness of the perceptual metric improvements (DISTS, LPIPS) difficult to verify.
minor comments (2)
  1. The abstract states that code will be available at a GitHub link; confirming this link is active and includes training scripts would aid reproducibility.
  2. [§3] Notation for the relaxation (e.g., temperature parameter or Gumbel-softmax formulation) should be introduced with an explicit equation reference in §3 to improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify the presentation of our differentiable relaxation and experimental claims. We address each major comment below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [§3] §3 (differentiable relaxation derivation): The entropy model is trained on the soft (relaxed) codebook distribution, yet inference uses hard argmax quantization. No bound, KL(soft || hard) measurement, or empirical comparison of model-estimated rate versus actual arithmetic-coded bitrate is reported. This mismatch is load-bearing for the central claim of unbiased RD optimization and directly affects the reliability of the 75.71% and 37.63% bitrate reductions.

    Authors: We agree that explicit quantification of the train-test gap is necessary to support the unbiased RD optimization claim. The relaxation (a temperature-controlled softmax over codebook distances) is constructed so that the soft distribution converges pointwise to the hard argmax as temperature approaches zero; this is the standard justification for using the soft proxy during training. In the revision we will add: (i) empirical KL(soft || hard) statistics computed on the DIV2K-val set at the operating temperature, and (ii) a direct comparison of the entropy-model rate estimate versus the actual arithmetic-coded bitrate obtained with hard quantization. These measurements will be reported in an expanded §3. A formal error bound is not currently derived in the manuscript; we will therefore limit the addition to the empirical evidence rather than claiming a new theoretical guarantee. revision: partial

  2. Referee: [Abstract, §4] Abstract and §4 (experimental claims): The specific percentage reductions versus RDEIC are presented without accompanying ablation studies on the relaxation temperature, entropy model architecture, or error analysis on the rate gap. This leaves the robustness of the perceptual metric improvements (DISTS, LPIPS) difficult to verify.

    Authors: The reported bitrate reductions are obtained with the full RDVQ model (fixed temperature, autoregressive entropy model) on DIV2K-val. We concur that additional controls would strengthen the claims. In the revised manuscript we will insert a new ablation subsection in §4 that varies (a) the relaxation temperature over a small grid around the value used in the main experiments and (b) the entropy-model context size. We will also report the mean and standard deviation of the rate-estimation error (estimated entropy minus actual coded rate) across the test images. These results will be summarized in the text and in a supplementary table, allowing readers to assess the sensitivity of the DISTS and LPIPS gains. revision: yes
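The promised error statistics have a simple form. A hypothetical sketch (function names are illustrative, not the authors' code), with the rate estimate given by the cross-entropy of the transmitted symbols under the entropy model:

```python
import math

def estimated_bits(p_model, indices):
    # Cross-entropy rate estimate, in bits, for the symbols actually sent.
    # p_model: per-position probability rows from the entropy model;
    # indices: the hard codebook indices transmitted. An ideal entropy
    # coder's output length approaches this value, so the difference
    # (estimate - actual coded bits) measures the train-test rate gap.
    return sum(-math.log2(row[k]) for row, k in zip(p_model, indices))

def rate_gap_stats(estimates, coded_bits):
    # Mean and standard deviation of the per-image rate-estimation error.
    errs = [e - c for e, c in zip(estimates, coded_bits)]
    mean = sum(errs) / len(errs)
    var = sum((e - mean) ** 2 for e in errs) / len(errs)
    return mean, math.sqrt(var)
```

Reporting these two numbers on DIV2K-val would directly quantify how far the soft-distribution training proxy drifts from the hard-quantized coded rate.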

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent modeling choice and empirical validation

Full rationale

The paper's central contribution is a differentiable relaxation of the codebook distribution to enable joint RD optimization in VQ-based compression, followed by an autoregressive entropy model. This relaxation is presented as a technical innovation (e.g., allowing entropy loss to shape the latent prior) without reducing to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. Performance claims rest on external benchmarks (DIV2K-val comparisons to RDEIC) rather than tautological equivalence to inputs. No equations or steps in the provided abstract or description collapse the claimed bitrate reductions or perceptual gains to prior author work by construction. The train-test distribution mismatch is a potential correctness concern but does not constitute circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the approach appears to rest on standard VQ assumptions plus the validity of the differentiable relaxation.

pith-pipeline@v0.9.0 · 5525 in / 1042 out tokens · 46348 ms · 2026-05-10T16:00:32.779389+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 28 canonical work pages · 3 internal anchors

  1. [1]

    Ntire 2017 challenge on single image super-resolution: Dataset and study

    Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. InPro- ceedings of the IEEE conference on computer vision and pat- tern recognition workshops, pages 126–135, 2017. 2, 6

  2. [2]

    Generative adversarial networks for extreme learned image compression

    Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 221–231, 2019. 3

  3. [3]

    Multi-realism image compression with a conditional generator

    Eirikur Agustsson, David Minnen, George Toderici, and Fabian Mentzer. Multi-realism image compression with a conditional generator. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22324–22333, 2023. 3

  4. [4]

    End-to-end optimized image compression.arXiv preprint arXiv:1611.01704, 2016

    Johannes Ball ´e, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression.arXiv preprint arXiv:1611.01704, 2016. 2

  5. [5]

    Variational image compression with a scale hyperprior

    Johannes Ball ´e, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior.arXiv preprint arXiv:1802.01436,

  6. [6]

    Bpg image format.https://bellard

    Fabrice Bellard. Bpg image format.https://bellard. org/bpg/. Accessed: 2024-10-26. 1

  7. [7]

    Overview of the versatile video coding (vvc) standard and its applications

    Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J Sullivan, and Jens-Rainer Ohm. Overview of the versatile video coding (vvc) standard and its applications. IEEE Transactions on Circuits and Systems for Video Tech- nology, 31(10):3736–3764, 2021. 1, 3, 6

  8. [8]

    Muckley, Jakob Verbeek, and St´ephane Lathuili`ere

    Marlene Careil, Matthew J. Muckley, Jakob Verbeek, and St´ephane Lathuili`ere. Towards image compression with per- fect realism at ultra-low bitrates. InThe Twelfth International Conference on Learning Representations, 2024. 3, 6

  9. [9]

    Learned image compression with discretized gaussian mixture likelihoods and attention modules

    Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Learned image compression with discretized gaussian mixture likelihoods and attention modules. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7939–7948, 2020. 2

  10. [10]

    Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020. 2, 6

  11. [11]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 6

  12. [12]

    Qarv: Quantization-aware resnet vae for lossy image compression.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):436–450, 2023

    Zhihao Duan, Ming Lu, Jack Ma, Yuning Huang, Zhan Ma, and Fengqing Zhu. Qarv: Quantization-aware resnet vae for lossy image compression.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):436–450, 2023. 3

  13. [13]

    Image com- pression with product quantized masked image modeling

    Alaaeldin El-Nouby, Matthew J Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, and Herv´e J´egou. Image com- pression with product quantized masked image modeling. arXiv preprint arXiv:2212.07372, 2022. 2

  14. [14]

    Taming transformers for high-resolution image synthesis, 2020

    Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Taming transformers for high-resolution image synthesis, 2020. 2, 3

  15. [15]

    Kodak dataset

    Rich Franzen. Kodak dataset. Online, 2013. Available athttps://r0k.us/graphics/kodak/(accessed: 2025–11–13). 6

  16. [16]

    Springer Science & Business Media,

    Allen Gersho and Robert M Gray.Vector quantization and signal compression. Springer Science & Business Media,

  17. [17]

    A residual diffusion model for high perceptual quality codec augmentation.arXiv preprint arXiv:2301.05489, 2023

    Noor Fathima Ghouse, Jens Petersen, Auke Wiggers, Tianlin Xu, and Guillaume Sautiere. A residual diffusion model for high perceptual quality codec augmentation.arXiv preprint arXiv:2301.05489, 2023. 3

  18. [18]

    Generative adversarial networks.Commu- nications of the ACM, 63(11):139–144, 2020

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Commu- nications of the ACM, 63(11):139–144, 2020. 2, 3, 6

  19. [19]

    Oscar: One-step diffusion codec across multiple bit-rates.arXiv preprint arXiv:2505.16091,

    Jinpei Guo, Yifei Ji, Zheng Chen, Kai Liu, Min Liu, Wang Rao, Wenbo Li, Yong Guo, and Yulun Zhang. Oscar: One- step diffusion codec across multiple bit-rates.arXiv preprint arXiv:2505.16091, 2025. 3, 6

  20. [20]

    Causal context adjust- ment loss for learned image compression.arXiv preprint arXiv:2410.04847, 2024

    Minghao Han, Shiyin Jiang, Shengxi Li, Xin Deng, Mai Xu, Ce Zhu, and Shuhang Gu. Causal context adjust- ment loss for learned image compression.arXiv preprint arXiv:2410.04847, 2024. 2

  21. [21]

    Generative image compression by estimating gradients of the rate-variable feature distribution

    Minghao Han, Weiyi You, Jinhua Zhang, Leheng Zhang, Ce Zhu, and Shuhang Gu. Generative image compression by estimating gradients of the rate-variable feature distribution. arXiv preprint arXiv:2505.20984, 2025. 3

  22. [22]

    Elic: Efficient learned image compres- sion with unevenly grouped space-channel contextual adap- tive coding

    Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang. Elic: Efficient learned image compres- sion with unevenly grouped space-channel contextual adap- tive coding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5718– 5727, 2022. 2

  23. [23]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017. 6

  24. [24]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.arXiv preprint arxiv:2006.11239,

  25. [25]

    Zhihao Hu, Guo Lu, and Dong Xu

    Emiel Hoogeboom, Eirikur Agustsson, Fabian Mentzer, Luca Versari, George Toderici, and Lucas Theis. High- fidelity image compression with score-based generative models.arXiv preprint arXiv:2305.18231, 2023. 1, 3

  26. [26]

    Generative latent coding for ultra-low bitrate image com- pression

    Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Generative latent coding for ultra-low bitrate image com- pression. InProceedings of the IEEE/CVF Conference on 9 Computer Vision and Pattern Recognition, pages 26088– 26098, 2024. 3

  27. [27]

    Mlic++: Linear complexity multi-reference entropy modeling for learned image compression,

    Wei Jiang, Jiayu Yang, Yongqi Zhai, Feng Gao, and Rong- gang Wang. Mlic++: Linear complexity multi-reference entropy modeling for learned image compression.arXiv preprint arXiv:2307.15421, 2023. 2, 3

  28. [28]

    Ultra lowrate image compression with semantic residual coding and compression-aware diffu- sion.arXiv preprint arXiv:2505.08281, 2025

    Anle Ke, Xu Zhang, Tong Chen, Ming Lu, Chao Zhou, Ji- awen Gu, and Zhan Ma. Ultra lowrate image compression with semantic residual coding and compression-aware diffu- sion.arXiv preprint arXiv:2505.08281, 2025. 6, 3

  29. [29]

    Egic: enhanced low-bit-rate generative image compression guided by semantic segmentation

    Nikolai K ¨orber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, and Bj ¨orn Schuller. Egic: enhanced low-bit-rate generative image compression guided by semantic segmentation. InEuropean Conference on Computer Vision, pages 202–220. Springer, 2024. 3

  30. [30]

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui- jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.Interna- tional journal of computer vision, 128(7):1956–1981, 2020. 6, 2

  31. [31]

    Neural image compres- sion with text-guided encoding for both pixel-level and per- ceptual fidelity

    Hagyeong Lee, Minkyu Kim, Jun-Hyuk Kim, Seungeon Kim, Dokwan Oh, and Jaeho Lee. Neural image compres- sion with text-guided encoding for both pixel-level and per- ceptual fidelity. InInternational Conference on Machine Learning, 2024. 3

  32. [32]

    Text+ sketch: Image compression at ultra low rates

    Eric Lei, Yi ˘git Berkay Uslu, Hamed Hassani, and Shirin Saeedi Bidokhti. Text+ sketch: Image compression at ultra low rates. InICML 2023 Workshop on Neural Com- pression: From Information Theory to Applications, 2023. 3

  33. [33]

    Once-for-all: Controllable generative image compression with dynamic granularity adaptation

    Anqi Li, Feng Li, Yuxi Liu, Runmin Cong, Yao Zhao, and Huihui Bai. Once-for-all: Controllable generative image compression with dynamic granularity adaptation. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 1, 2, 3

  34. [34]

    Frequency-aware transformer for learned image compression.arXiv preprint arXiv:2310.16387, 2023

    Han Li, Shaohui Li, Wenrui Dai, Chenglin Li, Junni Zou, and Hongkai Xiong. Frequency-aware transformer for learned image compression.arXiv preprint arXiv:2310.16387, 2023. 2

  35. [35]

    Texture vector-quantization and recon- struction aware prediction for generative super-resolution

    Qifan Li, Jiale Zou, Jinhua Zhang, Wei Long, Xingyu Zhou, and Shuhang Gu. Texture vector-quantization and recon- struction aware prediction for generative super-resolution. arXiv preprint arXiv:2509.23774, 2025. 3

  36. [36]

    Learned image compression with hierarchical progressive context modeling,

    Yuqi Li, Haotian Zhang, Li Li, and Dong Liu. Learned image compression with hierarchical progressive context modeling. arXiv preprint arXiv:2507.19125, 2025. 3

  37. [37]

    Toward extreme image compression with latent feature guidance and diffusion prior.IEEE Transactions on Circuits and Systems for Video Technology, 35(1):888–899,

    Zhiyuan Li, Yanhui Zhou, Hao Wei, Chenyang Ge, and Jing- wen Jiang. Toward extreme image compression with latent feature guidance and diffusion prior.IEEE Transactions on Circuits and Systems for Video Technology, 35(1):888–899,

  38. [38]

    Rdeic: Accelerating diffusion-based extreme im- age compression with relay residual diffusion.IEEE Trans- actions on Circuits and Systems for Video Technology, 2025

    Zhiyuan Li, Yanhui Zhou, Hao Wei, Chenyang Ge, and Aj- mal Mian. Rdeic: Accelerating diffusion-based extreme im- age compression with relay residual diffusion.IEEE Trans- actions on Circuits and Systems for Video Technology, 2025. 2, 6, 3

  39. [39]

    Synonymous variational inference for perceptual im- age compression.arXiv preprint arXiv:2505.22438, 2025

    Zijian Liang, Kai Niu, Changshuo Wang, Jin Xu, and Ping Zhang. Synonymous variational inference for perceptual im- age compression.arXiv preprint arXiv:2505.22438, 2025. 3

  40. [40]

    Learned image compression with mixed transformer-cnn architectures

    Jinming Liu, Heming Sun, and Jiro Katto. Learned image compression with mixed transformer-cnn architectures. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 14388–14397, 2023. 2

  41. [41]

    Icmh- net: Neural image compression towards both machine vision and human vision

    Lei Liu, Zhihao Hu, Zhenghao Chen, and Dong Xu. Icmh- net: Neural image compression towards both machine vision and human vision. InProceedings of the 31st ACM Interna- tional Conference on Multimedia, pages 8047–8056, 2023. 2, 3

  42. [42]

    An efficient adaptive compression method for human perception and machine vision tasks.arXiv preprint arXiv:2501.04329,

    Lei Liu, Zhenghao Chen, Zhihao Hu, and Dong Xu. An efficient adaptive compression method for human perception and machine vision tasks.arXiv preprint arXiv:2501.04329,

  43. [43]

    Adaptive bitrate quantization scheme without codebook for learned image compression

    Jonas L ¨ohdefink, Jonas Sitzmann, Andreas B ¨ar, and Tim Fingscheidt. Adaptive bitrate quantization scheme without codebook for learned image compression. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1732–1737, 2022. 2

  44. [44]

    Learned image compression with dictionary- based entropy model.arXiv preprint arXiv:2504.00496,

    Jingbo Lu, Leheng Zhang, Xingyu Zhou, Mu Li, Wen Li, and Shuhang Gu. Learned image compression with dictionary- based entropy model.arXiv preprint arXiv:2504.00496,

  45. [45]

    Hybridflow: Infusing continuity into masked codebook for extreme low-bitrate image compression

    Lei Lu, Yanyue Xie, Wei Jiang, Wei Wang, Xue Lin, and Yanzhi Wang. Hybridflow: Infusing continuity into masked codebook for extreme low-bitrate image compression. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3010–3018, 2024. 1, 2

  46. [46]

    Correcting diffusion-based perceptual image compression with privi- leged end-to-end decoder.arXiv preprint arXiv:2404.04916,

    Yiyang Ma, Wenhan Yang, and Jiaying Liu. Correcting diffusion-based perceptual image compression with privi- leged end-to-end decoder.arXiv preprint arXiv:2404.04916,

  47. [47]

    Extreme im- age compression using fine-tuned vqgans

    Qi Mao, Tinghan Yang, Yinuo Zhang, Zijian Wang, Meng Wang, Shiqi Wang, Libiao Jin, and Siwei Ma. Extreme im- age compression using fine-tuned vqgans. In2024 Data Compression Conference (DCC), pages 203–212. IEEE,

  48. [48]

    High-fidelity generative image compres- sion.Advances in neural information processing systems, 33:11913–11924, 2020

    Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity generative image compres- sion.Advances in neural information processing systems, 33:11913–11924, 2020. 1, 3

  49. [49]

    M2t: Masking transformers twice for faster decoding

    Fabian Mentzer, Eirikur Agustson, and Michael Tschannen. M2t: Masking transformers twice for faster decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5340–5349, 2023. 2, 3

  50. [50]

    Channel-wise autoregres- sive entropy models for learned image compression

    David Minnen and Saurabh Singh. Channel-wise autoregres- sive entropy models for learned image compression. In2020 IEEE International Conference on Image Processing (ICIP), pages 3339–3343. IEEE, 2020. 2

  51. [51]

    Muckley, Alaaeldin El-Nouby, Karen Ullrich, Herv´e J ´egou, and Jakob Verbeek

    Matthew J. Muckley, Alaaeldin El-Nouby, Karen Ullrich, Herv´e J ´egou, and Jakob Verbeek. Improving statistical fi- delity for neural image compression with implicit local like- lihood models. InInternational Conference on Machine Learning, 2023. 6, 3 10

  52. [52]

    Compressed image generation with denoising diffu- sion codebook models.arXiv preprint arXiv:2502.01189,

    Guy Ohayon, Hila Manor, Tomer Michaeli, and Michael Elad. Compressed image generation with denoising diffu- sion codebook models.arXiv preprint arXiv:2502.01189,

  53. [53]

    Compressed image generation with denoising diffusion codebook models

    Guy Ohayon, Hila Manor, Tomer Michaeli, and Michael Elad. Compressed image generation with denoising diffusion codebook models. InForty-second International Conference on Machine Learning, 2025. 3

  54. [54]

    Chanung Park, Joo Chan Lee, and Jong Hwan Ko. Diffo: Single-step diffusion for image compression at ultra-low bitrates. arXiv preprint arXiv:2506.16572, 2025. 2, 3

  55. [55]

    Yunpeng Qu, Kun Yuan, Jinhua Hao, Kai Zhao, Qizhi Xie, Ming Sun, and Chao Zhou. Visual autoregressive modeling for image super-resolution. arXiv preprint arXiv:2501.18993, 2025. 3

  56. [56]

    Lucas Relic, Roberto Azevedo, Yang Zhang, Markus Gross, and Christopher Schroers. Bridging the gap between Gaussian diffusion models and universal quantization for image compression. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2449–2458, 2025. 3

  57. [57]

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 2

  58. [58]

    Heming Sun, Zhengxue Cheng, Masaru Takeuchi, and Jiro Katto. End-to-end learned image compression with fixed point weight quantization. In 2020 IEEE International Conference on Image Processing (ICIP), pages 3359–3363. IEEE, 2020. 2

  59. [59]

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024. 2, 3, 5, 1

  60. [60]

    Lucas Theis, Tim Salimans, Matthew D Hoffman, and Fabian Mentzer. Lossy compression with Gaussian diffusion. arXiv preprint arXiv:2206.08889, 2022. 3

  61. [61]

    George Toderici, Wenzhe Shi, Radu Timofte, Lucas Theis, Johannes Ballé, Eirikur Agustsson, Nick Johnston, and Fabian Mentzer. Workshop and challenge on learned image compression (CLIC2020). In CVPR, 2020. 6

  62. [62]

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023. 6

  63. [63]

    Yongbo Wang, Haonan Wang, Guodong Mu, Ruixin Zhang, Jiaqi Chen, Jingyun Zhang, Jun Wang, Yuan Xie, Zhizhong Zhang, and Shouhong Ding. Switchable token-specific codebook quantization for face image compression. arXiv preprint arXiv:2510.22943, 2025. 2

  64. [64]

    Yueqi Xie, Ka Leong Cheng, and Qifeng Chen. Enhanced invertible encoding for learned image compression. In Proceedings of the 29th ACM International Conference on Multimedia, pages 162–170, 2021. 2

  65. [65]

    Naifu Xue, Qi Mao, Zijian Wang, Yuan Zhang, and Siwei Ma. Unifying generation and compression: Ultra-low bitrate image coding via multi-stage transformer. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024. 2, 3, 5

  66. [66]

    Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, and Yan Lu. DLF: Extreme image compression with dual-generative latent fusion. arXiv preprint arXiv:2503.01428, 2025.

  67. [67]

    Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, and Yan Lu. One-step diffusion-based image compression with semantic distillation. arXiv preprint arXiv:2505.16687, 2025.

  68. [68]

    Ruihan Yang and Stephan Mandt. Lossy image compression with conditional diffusion models. Advances in Neural Information Processing Systems, 36:64971–64995, 2023. 3

  69. [69]

    Jinhua Zhang, Wei Long, Minghao Han, Weiyi You, and Shuhang Gu. MVAR: Visual autoregressive modeling with scale and spatial Markovian conditioning. In The Fourteenth International Conference on Learning Representations, 2026. 3

  70. [70]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018. 2, 6

  71. [71]

    Tianyu Zhang, Xin Luo, Li Li, and Dong Liu. StableCodec: Taming one-step diffusion for extreme image compression. arXiv preprint arXiv:2506.21977, 2025. 6, 7, 3

  72. [72]

    Xiaosu Zhu, Jingkuan Song, Lianli Gao, Feng Zheng, and Heng Tao Shen. Unified multivariate Gaussian mixture for efficient neural image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17612–17621, 2022. 2, 3

  73. [73]

    Renjie Zou, Chunfeng Song, and Zhaoxiang Zhang. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17492–17501, 2022. 2

Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression: Supplementary Material

    to provide stable probability estimation. Finally, joint RD optimization is performed with the full objective L = L_D + λL_R and the codebook fixed. Models are first trained at relatively high bitrates and then progressively fine-tuned toward lower bitrates. From this stage onward, all models are trained in FP32 precision for stable optimization. High-resolut...
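The joint objective described above, L = L_D + λL_R, combines a distortion term with an entropy-model rate term weighted by λ. A minimal sketch of such a loss, assuming MSE as the distortion stand-in and bits-per-token (negative log-likelihood under the entropy model) as the rate term; the function name and signature are hypothetical, not from the paper:

```python
import math

def rate_distortion_loss(x, x_hat, token_probs, lam):
    """Hypothetical sketch of a joint RD objective L = L_D + lambda * L_R.

    x, x_hat    : original and reconstructed pixel values
    token_probs : probabilities the (frozen-codebook) entropy model
                  assigns to the selected codebook indices
    lam         : rate-distortion trade-off weight lambda
    """
    # Distortion term L_D: mean squared error (a stand-in for the
    # paper's actual distortion/perceptual losses).
    l_d = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    # Rate term L_R: average code length in bits per token, i.e. the
    # negative log2-likelihood under the autoregressive entropy model.
    l_r = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return l_d + lam * l_r
```

A larger λ penalizes rate more heavily, steering training toward lower bitrates, which matches the progressive high-to-low bitrate fine-tuning schedule described above.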