Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation
Pith reviewed 2026-05-08 13:34 UTC · model grok-4.3
The pith
Variable codebook sizes that grow along the token sequence prevent rapid entropy collapse in autoregressive image generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The per-position conditional entropy of visual token sequences on ImageNet decays so fast that the cliff position is $t^{*} = \lceil \log_2 N / \log_2 K \rceil$, which for $K = 16384$ falls within the first 2 of 256 positions. Variable Codebook Size Quantization counters this by assigning each position its own codebook size $K_t$ that increases from 2 to $K_{\max}$, leaving the autoregressive training objective and model architecture unchanged, and thereby produces higher-fidelity samples and an emergent semantic hierarchy in the early tokens.
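To make the cliff position concrete, a quick numeric check (assuming $N$ is the number of distinct training images, roughly 1.28M for ImageNet-1k; the rebuttal below uses $N \approx 2^{20}$, which gives the same result):

```python
import math

# Entropy-cliff position: t* = ceil(log2(N) / log2(K)).
# Assumption: N = number of distinct training images (ImageNet-1k, ~1.28M).
N = 1_281_167
K = 16_384            # fixed codebook size, 2^14

t_star = math.ceil(math.log2(N) / math.log2(K))
print(t_star)         # -> 2: by position 2 of 256, K^t already exceeds N
```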
What carries the argument
Variable Codebook Size Quantization (VCQ), which replaces a single fixed codebook with a sequence of increasing codebook sizes $K_t$, from $K_{\min} = 2$ to $K_{\max}$.
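The abstract states only that $K_t$ grows monotonically; the exact growth law is not given. A geometric interpolation in log space, rounded to powers of two, is one plausible sketch (the function name and the curve are assumptions, not the authors' schedule):

```python
import math

def vcq_schedule(seq_len: int, k_min: int = 2, k_max: int = 16_384) -> list[int]:
    """Monotone codebook-size schedule K_t from k_min up to k_max.

    Geometric interpolation in log2 space, rounded to powers of two.
    The paper states only that K_t grows monotonically, so this exact
    curve is a guess, not the authors' schedule.
    """
    lo, hi = math.log2(k_min), math.log2(k_max)
    return [2 ** round(lo + (hi - lo) * t / (seq_len - 1)) for t in range(seq_len)]

ks = vcq_schedule(256)
print(ks[0], ks[-1])   # 2 ... 16384, non-decreasing along the sequence
```

Rounding in $\log_2$ space keeps every $K_t$ a power of two, which under a nested-codebook reading (again an assumption) makes each position's codebook a prefix of the largest one.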
If this is right
- A base autoregressive transformer reaches gFID 14.80 without CFG, down from 27.98.
- A 684-million-parameter model reaches gFID 1.71 using only standard next-token prediction.
- The first 10 tokens alone yield 43.8% top-1 accuracy in a linear probe on ImageNet.
- No semantic regularization or causal alignment is required to obtain these gains.
Where Pith is reading between the lines
- The same position-dependent capacity allocation could be tested on other modalities whose information density varies along sequences, such as video or audio tokens.
- The induced coarse-to-fine hierarchy may enable new forms of progressive or controllable generation by operating only on early tokens.
- Future schedules for K_t could be learned rather than fixed to be strictly monotonic.
Load-bearing premise
The rapid drop in conditional uncertainty after the first few tokens is the main bottleneck of fixed codebooks, and letting the codebook size grow along the sequence will remove that bottleneck without changing training dynamics or introducing new problems.
What would settle it
Train both a fixed large codebook baseline and a VCQ model on the same ImageNet data, then measure the actual per-position conditional entropy on a held-out set; if the entropy curves remain identical and gFID scores show no gap, the mechanism is falsified.
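A minimal sketch of that measurement, estimating per-position conditional entropy as held-out cross-entropy (an upper bound on $H(x_t \mid x_{<t})$); `model` and `loader` are hypothetical stand-ins for a trained AR transformer and a held-out token dataset:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_position_entropy_bits(model, loader, seq_len=256, device="cuda"):
    """Held-out cross-entropy at each position, in bits: an upper bound
    on the true conditional entropy H(x_t | x_<t). Assumes `model(tokens)`
    returns teacher-forced logits of shape (B, seq_len, K) aligned with
    the targets in `tokens` -- both objects are hypothetical here."""
    totals = torch.zeros(seq_len, device=device)
    n = 0
    for tokens in loader:                                  # (B, seq_len) ints
        tokens = tokens.to(device)
        logp = F.log_softmax(model(tokens), dim=-1)
        nll = -logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (B, seq_len)
        totals += nll.sum(dim=0)
        n += tokens.shape[0]
    return (totals / n / math.log(2)).cpu()   # identical curves would falsify VCQ
```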
Original abstract
Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes that fixed codebook sizes in autoregressive visual tokenizers lead to an 'Entropy Cliff,' where per-position conditional entropy decays rapidly (formalized as t^* = ceil(log2 N / log2 K)), turning most of the 256 positions on ImageNet into memorization tasks after only t^*=2 for K=16384. It proposes Variable Codebook Size Quantization (VCQ) that grows K_t monotonically from K_min=2 to K_max, claiming this leaves the AR loss, parameter count, and training unchanged. Empirical results show gFID (w/o CFG) dropping from 27.98 to 14.80 on ImageNet 256x256 with a base model, reaching 1.71 when scaled to 684M parameters; additionally, the first 10 tokens alone yield 43.8% linear-probe top-1 accuracy vs. 27.1% for uniform codebooks, indicating an induced coarse-to-fine hierarchy.
Significance. If the gains are causally tied to redistributing capacity to match the observed entropy decay rather than incidental effects, VCQ offers a simple, training-procedure-preserving way to improve discrete AR visual generation. The reported gFID numbers and the linear-probe hierarchy result would be notable contributions, highlighting that capacity organization matters beyond total bits. The information-theoretic observation on language vs. vision entropy profiles is also potentially useful for tokenizer design.
major comments (3)
- [Abstract] The claim that 'the loss function, parameter count, and AR training procedure remain unchanged' is load-bearing for the method's simplicity, yet variable per-position K_t requires a position-dependent output projection (or padding to K_max); without explicit implementation details or a parameter-count table, it is unclear whether the total parameters are truly identical to the fixed-K baseline or whether the output head is modified.
- [Abstract] The t^* formula: t^* = ceil(log2 N / log2 K) is presented as the point where the conditional distribution becomes deterministic, but N is undefined in the given text (presumably the effective number of distinct images or patterns); a precise derivation showing how this yields t^*=2 for ImageNet with K=16384 is needed to ground the 'memorization problem' claim, as the formula appears to be an approximation rather than a direct per-position entropy calculation.
- [Experiments] The gFID reductions (27.98 to 14.80 base; 1.71 scaled) are attributed to taming the entropy cliff via monotonic K_t growth, but no ablation is described that holds total effective capacity fixed (e.g., the product of K_t across positions, or equivalently the sum of log2 K_t, matched to the baseline) while varying the schedule; without this, the causal link between the entropy profile and the performance gain cannot be isolated from effects such as the induced hierarchy or altered token statistics.
minor comments (1)
- [Abstract] The linear-probe comparison (43.8% vs 27.1% top-1 on the first 10 tokens) would benefit from details on the probe architecture, training protocol, and whether the uniform baseline uses the same first-10-token restriction.
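For reference, a standard linear-probe protocol on discrete tokens looks like the sketch below (an assumed setup; the paper's exact probe is unspecified): one-hot encode the first 10 tokens and fit multinomial logistic regression.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

def onehot_first_tokens(tokens: np.ndarray, n_tokens: int = 10,
                        k_max: int = 16_384) -> csr_matrix:
    """Sparse one-hot features from the first n_tokens discrete tokens.
    tokens: (N, seq_len) integer array of codebook indices."""
    t = tokens[:, :n_tokens]
    n = t.shape[0]
    rows = np.repeat(np.arange(n), n_tokens)
    cols = (np.arange(n_tokens) * k_max + t).ravel()   # position-offset indices
    data = np.ones(n * n_tokens, dtype=np.float32)
    return csr_matrix((data, (rows, cols)), shape=(n, n_tokens * k_max))

# Assumed protocol, with hypothetical arrays Xtr/ytr (train) and Xva/yva (val):
# clf = LogisticRegression(max_iter=1000).fit(onehot_first_tokens(Xtr), ytr)
# top1 = clf.score(onehot_first_tokens(Xva), yva)
```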
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate planned revisions to improve the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that 'the loss function, parameter count, and AR training procedure remain unchanged' is load-bearing for the method's simplicity, yet variable per-position K_t requires a position-dependent output projection (or padding to K_max); without explicit implementation details or a parameter-count table, it is unclear whether the total parameters are truly identical to the fixed-K baseline or whether the output head is modified.
Authors: We appreciate this observation. In our implementation, the autoregressive Transformer uses a single output projection layer with output dimension K_max, identical to a fixed-K baseline using K = K_max. For positions with K_t < K_max, we mask the logits beyond the first K_t entries before the softmax and restrict the cross-entropy loss accordingly; no additional parameters or per-position heads are introduced. This keeps the total parameter count and training procedure unchanged from the baseline. We will add explicit implementation details and a parameter-count comparison table in the revised method section. revision: partial
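A minimal sketch of the masking scheme this response describes (a single shared head of width K_max, with logits at indices at or beyond K_t suppressed before the softmax); this is a reading of the rebuttal, not the authors' code:

```python
import torch
import torch.nn.functional as F

def vcq_loss(logits: torch.Tensor, targets: torch.Tensor,
             k_t: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy with per-position codebook sizes, as the
    rebuttal describes: one shared output head of width K_max, logits
    beyond the first K_t entries masked to -inf before the softmax.
    A sketch under stated assumptions, not the authors' implementation.

    logits: (B, T, K_max)   targets: (B, T)   k_t: (T,), each entry <= K_max
    """
    B, T, K_max = logits.shape
    idx = torch.arange(K_max, device=logits.device)          # (K_max,)
    invalid = idx.unsqueeze(0) >= k_t.unsqueeze(1)           # (T, K_max), True = masked
    masked = logits.masked_fill(invalid.unsqueeze(0), float("-inf"))
    return F.cross_entropy(masked.reshape(B * T, K_max), targets.reshape(B * T))
```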
- Referee: [Abstract] The t^* formula: t^* = ceil(log2 N / log2 K) is presented as the point where the conditional distribution becomes deterministic, but N is undefined in the given text (presumably the effective number of distinct images or patterns); a precise derivation showing how this yields t^*=2 for ImageNet with K=16384 is needed to ground the 'memorization problem' claim, as the formula appears to be an approximation rather than a direct per-position entropy calculation.
Authors: We agree that the presentation of t^* requires more rigor. N is the effective number of distinct visual patterns (approximately 2^20 for ImageNet given its size and diversity). With K=16384 (log2 K ≈ 14), the ratio log2 N / log2 K ≈ 1.43 yields ceil(1.43) = 2. This follows from the pigeonhole principle on cumulative capacity: after t tokens at most K^t sequences are distinguishable. For t=2 this exceeds the dataset size, but the formula approximates the position at which observed conditional entropy H(x_t | x_<t) collapses in practice. We will include a full derivation linking the formula to per-position entropy measurements plus supporting plots in a revised appendix. revision: partial
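The pigeonhole argument in this response, restated in compact form:

```latex
\underbrace{K^{t} \ge N}_{\text{$t$ tokens distinguish at most $K^{t}$ sequences}}
\iff t \ge \frac{\log_2 N}{\log_2 K}
\quad\Rightarrow\quad
t^{*} = \left\lceil \frac{\log_2 N}{\log_2 K} \right\rceil,
\qquad
t^{*}\big|_{N = 2^{20},\, K = 2^{14}} = \lceil 20/14 \rceil = 2 .
```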
- Referee: [Experiments] The gFID reductions (27.98 to 14.80 base; 1.71 scaled) are attributed to taming the entropy cliff via monotonic K_t growth, but no ablation is described that holds total effective capacity fixed (e.g., the product of K_t across positions, or equivalently the sum of log2 K_t, matched to the baseline) while varying the schedule; without this, the causal link between the entropy profile and the performance gain cannot be isolated from effects such as the induced hierarchy or altered token statistics.
Authors: We acknowledge the merit of isolating the monotonic schedule from total capacity effects. While the original submission compared directly to the standard fixed-K baseline, we will add a controlled ablation in the revision: a capacity-matched (same sum of log2 K_t) but non-monotonic schedule (e.g., random or decreasing K_t) trained under identical conditions. Preliminary internal checks suggest the monotonic entropy-aligned schedule outperforms such controls, consistent with the linear-probe hierarchy results. This will be reported to strengthen the causal claim. revision: yes
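The control described here is easy to pin down in code; a hedged sketch, reusing the hypothetical `vcq_schedule` from the earlier sketch (both helpers are assumptions, not the authors' ablation code):

```python
import random

def capacity_matched_controls(ks: list[int], seed: int = 0):
    """Capacity-matched controls for the proposed ablation: the same
    multiset of K_t (hence the same sum of log2 K_t) as the VCQ schedule
    `ks`, but with the monotone ordering broken."""
    shuffled = ks[:]
    random.Random(seed).shuffle(shuffled)          # random-order control
    decreasing = sorted(ks, reverse=True)          # fine-to-coarse control
    return shuffled, decreasing

# e.g. with the vcq_schedule sketch from above:
# shuffled, decreasing = capacity_matched_controls(vcq_schedule(256))
```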
Circularity Check
No circularity: t* is an explicit information-theoretic bound; VCQ is a design proposal; gains are empirical.
Full rationale
The derivation chain begins with an empirical observation of rapid conditional-entropy decay on ImageNet, formalized by the closed-form bound t* = ceil(log2 N / log2 K) that simply equates cumulative codebook capacity to dataset cardinality. VCQ is introduced as an explicit monotonic schedule K_t from 2 to K_max with the explicit statement that loss, parameter count, and training procedure are unchanged. All performance numbers (gFID 27.98→14.80, 1.71 scaled) are reported as direct experimental comparisons against a fixed-K baseline. No equation reduces to a fitted parameter, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The paper is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (3)
- K_min = 2
- K_max
- growth_schedule
axioms (2)
- domain assumption: The per-position conditional entropy decays rapidly along the sequence in visual data.
- standard math: t* = ceil(log2 N / log2 K) formalizes the entropy cliff position.
invented entities (1)
- Entropy Cliff (no independent evidence)