Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation
Pith reviewed 2026-05-08 13:34 UTC · model grok-4.3
The pith
Variable codebook sizes that grow along the token sequence prevent rapid entropy collapse in autoregressive image generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The per-position conditional entropy of visual token sequences on ImageNet decays so fast that the cliff position is $t^{*} = \lceil \log_2 N / \log_2 K \rceil$, which for $K = 16384$ falls within the first 2 of 256 positions. Variable Codebook Size Quantization counters this by assigning each position its own codebook size $K_t$ that increases from 2 to $K_{\max}$, leaving the autoregressive training objective and model architecture unchanged, and thereby produces higher-fidelity samples and an emergent semantic hierarchy in the early tokens.
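To make the cliff position concrete, a quick numeric check (assuming $N$ is the number of distinct training images, roughly 1.28M for ImageNet-1k; the rebuttal below uses $N \approx 2^{20}$, which gives the same result):

```python
import math

# Entropy-cliff position: t* = ceil(log2(N) / log2(K)).
# Assumption: N = number of distinct training images (ImageNet-1k, ~1.28M).
N = 1_281_167
K = 16_384            # fixed codebook size, 2^14

t_star = math.ceil(math.log2(N) / math.log2(K))
print(t_star)         # -> 2: by position 2 of 256, K^t already exceeds N
```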
What carries the argument
Variable Codebook Size Quantization (VCQ), which replaces a single fixed codebook with a sequence of increasing codebook sizes $K_t$, from $K_{\min} = 2$ to $K_{\max}$.
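The abstract states only that $K_t$ grows monotonically; the exact growth law is not given. A geometric interpolation in log space, rounded to powers of two, is one plausible sketch (the function name and the curve are assumptions, not the authors' schedule):

```python
import math

def vcq_schedule(seq_len: int, k_min: int = 2, k_max: int = 16_384) -> list[int]:
    """Monotone codebook-size schedule K_t from k_min up to k_max.

    Geometric interpolation in log2 space, rounded to powers of two.
    The paper states only that K_t grows monotonically, so this exact
    curve is a guess, not the authors' schedule.
    """
    lo, hi = math.log2(k_min), math.log2(k_max)
    return [2 ** round(lo + (hi - lo) * t / (seq_len - 1)) for t in range(seq_len)]

ks = vcq_schedule(256)
print(ks[0], ks[-1])   # 2 ... 16384, non-decreasing along the sequence
```

Rounding in $\log_2$ space keeps every $K_t$ a power of two, which under a nested-codebook reading (again an assumption) makes each position's codebook a prefix of the largest one.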
If this is right
- A base autoregressive transformer reaches gFID 14.80 without CFG, down from 27.98.
- A 684-million-parameter model reaches gFID 1.71 using only standard next-token prediction.
- The first 10 tokens alone yield 43.8% top-1 accuracy in a linear probe on ImageNet.
- No semantic regularization or causal alignment is required to obtain these gains.
Where Pith is reading between the lines
- The same position-dependent capacity allocation could be tested on other modalities whose information density varies along sequences, such as video or audio tokens.
- The induced coarse-to-fine hierarchy may enable new forms of progressive or controllable generation by operating only on early tokens.
- Future schedules for K_t could be learned rather than fixed to be strictly monotonic.
Load-bearing premise
The rapid drop in conditional uncertainty after the first few tokens is the main bottleneck of fixed codebooks, and letting the codebook size grow along the sequence will remove that bottleneck without changing training dynamics or introducing new problems.
What would settle it
Train both a fixed large codebook baseline and a VCQ model on the same ImageNet data, then measure the actual per-position conditional entropy on a held-out set; if the entropy curves remain identical and gFID scores show no gap, the mechanism is falsified.
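A minimal sketch of that measurement, estimating per-position conditional entropy as held-out cross-entropy (an upper bound on $H(x_t \mid x_{<t})$); `model` and `loader` are hypothetical stand-ins for a trained AR transformer and a held-out token dataset:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_position_entropy_bits(model, loader, seq_len=256, device="cuda"):
    """Held-out cross-entropy at each position, in bits: an upper bound
    on the true conditional entropy H(x_t | x_<t). Assumes `model(tokens)`
    returns teacher-forced logits of shape (B, seq_len, K) aligned with
    the targets in `tokens` -- both objects are hypothetical here."""
    totals = torch.zeros(seq_len, device=device)
    n = 0
    for tokens in loader:                                  # (B, seq_len) ints
        tokens = tokens.to(device)
        logp = F.log_softmax(model(tokens), dim=-1)
        nll = -logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (B, seq_len)
        totals += nll.sum(dim=0)
        n += tokens.shape[0]
    return (totals / n / math.log(2)).cpu()   # identical curves would falsify VCQ
```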
Original abstract
Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes that fixed codebook sizes in autoregressive visual tokenizers lead to an 'Entropy Cliff,' where per-position conditional entropy decays rapidly (formalized as t^* = ceil(log2 N / log2 K)), turning most of the 256 positions on ImageNet into memorization tasks after only t^*=2 for K=16384. It proposes Variable Codebook Size Quantization (VCQ) that grows K_t monotonically from K_min=2 to K_max, claiming this leaves the AR loss, parameter count, and training unchanged. Empirical results show gFID (w/o CFG) dropping from 27.98 to 14.80 on ImageNet 256x256 with a base model, reaching 1.71 when scaled to 684M parameters; additionally, the first 10 tokens alone yield 43.8% linear-probe top-1 accuracy vs. 27.1% for uniform codebooks, indicating an induced coarse-to-fine hierarchy.
Significance. If the gains are causally tied to redistributing capacity to match the observed entropy decay rather than incidental effects, VCQ offers a simple, training-procedure-preserving way to improve discrete AR visual generation. The reported gFID numbers and the linear-probe hierarchy result would be notable contributions, highlighting that capacity organization matters beyond total bits. The information-theoretic observation on language vs. vision entropy profiles is also potentially useful for tokenizer design.
major comments (3)
- [Abstract] The claim that 'the loss function, parameter count, and AR training procedure remain unchanged' is load-bearing for the method's simplicity, yet variable per-position K_t requires a position-dependent output projection (or padding to K_max); without explicit implementation details or a parameter-count table, it is unclear whether the total parameters are truly identical to the fixed-K baseline or whether the output head is modified.
- [Abstract] The t^* formula: t^* = ceil(log2 N / log2 K) is presented as the point where the conditional distribution becomes deterministic, but N is undefined in the given text (presumably the effective number of distinct images or patterns); a precise derivation showing how this yields t^*=2 for ImageNet with K=16384 is needed to ground the 'memorization problem' claim, as the formula appears to be an approximation rather than a direct per-position entropy calculation.
- [Experiments] The gFID reductions (27.98 to 14.80 base; 1.71 scaled) are attributed to taming the entropy cliff via monotonic K_t growth, but no ablation is described that holds total effective capacity fixed (e.g., the product of K_t across positions, or equivalently the sum of log2 K_t, matched to the baseline) while varying the schedule; without this, the causal link between the entropy profile and the performance gain cannot be isolated from effects such as the induced hierarchy or altered token statistics.
minor comments (1)
- [Abstract] The linear-probe comparison (43.8% vs 27.1% top-1 on the first 10 tokens) would benefit from details on the probe architecture, training protocol, and whether the uniform baseline uses the same first-10-token restriction.
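For reference, a standard linear-probe protocol on discrete tokens looks like the sketch below (an assumed setup; the paper's exact probe is unspecified): one-hot encode the first 10 tokens and fit multinomial logistic regression.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

def onehot_first_tokens(tokens: np.ndarray, n_tokens: int = 10,
                        k_max: int = 16_384) -> csr_matrix:
    """Sparse one-hot features from the first n_tokens discrete tokens.
    tokens: (N, seq_len) integer array of codebook indices."""
    t = tokens[:, :n_tokens]
    n = t.shape[0]
    rows = np.repeat(np.arange(n), n_tokens)
    cols = (np.arange(n_tokens) * k_max + t).ravel()   # position-offset indices
    data = np.ones(n * n_tokens, dtype=np.float32)
    return csr_matrix((data, (rows, cols)), shape=(n, n_tokens * k_max))

# Assumed protocol, with hypothetical arrays Xtr/ytr (train) and Xva/yva (val):
# clf = LogisticRegression(max_iter=1000).fit(onehot_first_tokens(Xtr), ytr)
# top1 = clf.score(onehot_first_tokens(Xva), yva)
```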
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate planned revisions to improve the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that 'the loss function, parameter count, and AR training procedure remain unchanged' is load-bearing for the method's simplicity, yet variable per-position K_t requires a position-dependent output projection (or padding to K_max); without explicit implementation details or a parameter-count table, it is unclear whether the total parameters are truly identical to the fixed-K baseline or whether the output head is modified.
Authors: We appreciate this observation. In our implementation, the autoregressive Transformer uses a single output projection layer with output dimension K_max, identical to a fixed-K baseline using K = K_max. For positions with K_t < K_max, we mask the logits beyond the first K_t entries before the softmax and restrict the cross-entropy loss accordingly; no additional parameters or per-position heads are introduced. This keeps the total parameter count and training procedure unchanged from the baseline. We will add explicit implementation details and a parameter-count comparison table in the revised method section. revision: partial
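A minimal sketch of the masking scheme this response describes (a single shared head of width K_max, with logits at indices at or beyond K_t suppressed before the softmax); this is a reading of the rebuttal, not the authors' code:

```python
import torch
import torch.nn.functional as F

def vcq_loss(logits: torch.Tensor, targets: torch.Tensor,
             k_t: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy with per-position codebook sizes, as the
    rebuttal describes: one shared output head of width K_max, logits
    beyond the first K_t entries masked to -inf before the softmax.
    A sketch under stated assumptions, not the authors' implementation.

    logits: (B, T, K_max)   targets: (B, T)   k_t: (T,), each entry <= K_max
    """
    B, T, K_max = logits.shape
    idx = torch.arange(K_max, device=logits.device)          # (K_max,)
    invalid = idx.unsqueeze(0) >= k_t.unsqueeze(1)           # (T, K_max), True = masked
    masked = logits.masked_fill(invalid.unsqueeze(0), float("-inf"))
    return F.cross_entropy(masked.reshape(B * T, K_max), targets.reshape(B * T))
```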
- Referee: [Abstract] The t^* formula: t^* = ceil(log2 N / log2 K) is presented as the point where the conditional distribution becomes deterministic, but N is undefined in the given text (presumably the effective number of distinct images or patterns); a precise derivation showing how this yields t^*=2 for ImageNet with K=16384 is needed to ground the 'memorization problem' claim, as the formula appears to be an approximation rather than a direct per-position entropy calculation.
Authors: We agree that the presentation of t^* requires more rigor. N is the effective number of distinct visual patterns (approximately 2^20 for ImageNet given its size and diversity). With K=16384 (log2 K ≈ 14), the ratio log2 N / log2 K ≈ 1.43 yields ceil(1.43) = 2. This follows from the pigeonhole principle on cumulative capacity: after t tokens at most K^t sequences are distinguishable. For t=2 this exceeds the dataset size, but the formula approximates the position at which observed conditional entropy H(x_t | x_<t) collapses in practice. We will include a full derivation linking the formula to per-position entropy measurements plus supporting plots in a revised appendix. revision: partial
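The pigeonhole argument in this response, restated in compact form:

```latex
\underbrace{K^{t} \ge N}_{\text{$t$ tokens distinguish at most $K^{t}$ sequences}}
\iff t \ge \frac{\log_2 N}{\log_2 K}
\quad\Rightarrow\quad
t^{*} = \left\lceil \frac{\log_2 N}{\log_2 K} \right\rceil,
\qquad
t^{*}\big|_{N = 2^{20},\, K = 2^{14}} = \lceil 20/14 \rceil = 2 .
```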
- Referee: [Experiments] The gFID reductions (27.98 to 14.80 base; 1.71 scaled) are attributed to taming the entropy cliff via monotonic K_t growth, but no ablation is described that holds total effective capacity fixed (e.g., the product of K_t across positions, or equivalently the sum of log2 K_t, matched to the baseline) while varying the schedule; without this, the causal link between the entropy profile and the performance gain cannot be isolated from effects such as the induced hierarchy or altered token statistics.
Authors: We acknowledge the merit of isolating the monotonic schedule from total capacity effects. While the original submission compared directly to the standard fixed-K baseline, we will add a controlled ablation in the revision: a capacity-matched (same sum of log2 K_t) but non-monotonic schedule (e.g., random or decreasing K_t) trained under identical conditions. Preliminary internal checks suggest the monotonic entropy-aligned schedule outperforms such controls, consistent with the linear-probe hierarchy results. This will be reported to strengthen the causal claim. revision: yes
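The control described here is easy to pin down in code; a hedged sketch, reusing the hypothetical `vcq_schedule` from the earlier sketch (both helpers are assumptions, not the authors' ablation code):

```python
import random

def capacity_matched_controls(ks: list[int], seed: int = 0):
    """Capacity-matched controls for the proposed ablation: the same
    multiset of K_t (hence the same sum of log2 K_t) as the VCQ schedule
    `ks`, but with the monotone ordering broken."""
    shuffled = ks[:]
    random.Random(seed).shuffle(shuffled)          # random-order control
    decreasing = sorted(ks, reverse=True)          # fine-to-coarse control
    return shuffled, decreasing

# e.g. with the vcq_schedule sketch from above:
# shuffled, decreasing = capacity_matched_controls(vcq_schedule(256))
```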
Circularity Check
No circularity: t* is an explicit information-theoretic bound; VCQ is a design proposal; gains are empirical.
Full rationale
The derivation chain begins with an empirical observation of rapid conditional-entropy decay on ImageNet, formalized by the closed-form bound t* = ceil(log2 N / log2 K) that simply equates cumulative codebook capacity to dataset cardinality. VCQ is introduced as an explicit monotonic schedule K_t from 2 to K_max with the explicit statement that loss, parameter count, and training procedure are unchanged. All performance numbers (gFID 27.98→14.80, 1.71 scaled) are reported as direct experimental comparisons against a fixed-K baseline. No equation reduces to a fitted parameter, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The paper is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (3)
- K_min = 2
- K_max
- growth_schedule
axioms (2)
- domain assumption: The per-position conditional entropy decays rapidly along the sequence in visual data.
- standard math: t* = ceil(log2 N / log2 K) formalizes the entropy cliff position.
invented entities (1)
- Entropy Cliff (no independent evidence)