pith. sign in

arxiv: 2606.11363 · v1 · pith:G7JVHPMMnew · submitted 2026-06-09 · 💻 cs.CV

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

Pith reviewed 2026-06-27 13:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords vector quantizationcodebook collapseencoder driftnon-stationary lossgenerative modelingImageNet reconstructionlatent diffusionfull codebook utilization
0
0 comments X

The pith

NSVQ prevents codebook collapse by making the codebook track then lock to encoder movement in vector quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that codebook collapse arises when the encoder shifts its latent outputs faster than the codebook can update, leaving vectors unassigned and raising quantization error through the straight-through estimator. NSVQ counters this with three coordinated changes: a dense non-stationary loss that keeps every code updated to current encoder statistics, a replacement step for codes that lose assignments, and a staged schedule that first lets the codebook follow the encoder and then freezes the encoder to fix the latent geometry. Experiments on ImageNet-1k at 128 by 128 resolution with 65,536 codes show that this yields lower reconstruction FID than SimVQ while keeping every code in use. The same trained models also improve FID when plugged into latent diffusion pipelines for image generation.

Core claim

NSVQ is a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. It first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. On ImageNet-1k at 128×128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.

What carries the argument

The NSVQ training schedule, which pairs a dense non-stationary embedding loss with codebook replacement and staged encoder freezing to first follow then stabilize the latent distribution.

If this is right

  • Full codebook utilization is preserved at 65,536 entries on ImageNet-1k.
  • Reconstruction rFID improves from 2.39 to 2.10 relative to SimVQ under identical codebook size.
  • Downstream latent diffusion models trained on the resulting latents achieve lower generation FID on ImageNet.
  • The staged freeze-then-refine schedule can be inserted into existing VQ pipelines without changing the encoder or decoder architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar drift-tracking losses could be tested on audio or video VQ models where codebook collapse is also reported.
  • The replacement step might be replaced by softer reassignment rules in future variants without losing the core benefit.
  • Freezing the encoder after initial alignment may reduce the total number of training steps needed to reach stable utilization.
  • The method's emphasis on non-stationary statistics could be adapted to other quantization schemes that rely on straight-through gradients.

Load-bearing premise

Encoder drift is the main driver of codebook collapse, and the three-part NSVQ schedule will track and then consolidate the codebook without creating new instabilities or needing heavy retuning.

What would settle it

Run NSVQ and SimVQ side-by-side on ImageNet-1k at 128×128 with 65,536 codes and check whether rFID drops below 2.39 or codebook utilization falls below 100 percent.

Figures

Figures reproduced from arXiv: 2606.11363 by Abbas Alili, Hao Lu, Metin N. Gurcan, Onur Koyun, Yongxin Guo, Zhengjie Zhu.

Figure 1
Figure 1. Figure 1: Overview of NSVQ. Stage 1 prevents early collapse using the non-stationary embedding [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Encoder-drift stress test. We vary the encoder learning-rate multiplier to intervene on [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Controlled evidence that encoder drift and codebook collapse remain coupled under [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Sensitivity analysis of NSVQ hyperparameters. [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Detailed training dynamics on CelebA 128 × 128, including codebook utilization, com￾mitment loss, and perceptual loss. Encoder freezing begins at step 45,350 in this run, marking the transition from Stage 1 to the frozen-encoder warm-up stage [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training dynamics of generator and discriminator losses. The adversarial objective is [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Detailed encoder drift across the proposed training phases. Drift is high during Stage 1, [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative reconstruction comparison. From left to right: ground truth, NSVQ (our [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
read the original abstract

Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization. On ImageNet-1k at 128$\times$128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100\% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies encoder drift as a driver of codebook collapse in large-codebook vector quantization and proposes NSVQ, a three-component training strategy (dense non-stationary embedding loss, codebook replacement, stage-wise encoder freezing) that first tracks drift, then consolidates the codebook under a fixed encoder, and finally allows adversarial refinement. On ImageNet-1k at 128×128 with 65,536 codes it reports an rFID reduction from 2.39 (SimVQ) to 2.10 while preserving 100 % utilization; downstream latent-diffusion experiments also show improved generation FID.

Significance. If the reported metric gains are reproducible and the three-stage procedure proves robust, NSVQ supplies a practical, empirically grounded intervention for a recurring failure mode in VQ-based generative pipelines; the concrete rFID delta on a standard benchmark with full utilization maintained is a clear strength.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the central empirical claim (rFID drop from 2.39 to 2.10, 100 % utilization) is presented without ablations that isolate the contribution of the dense non-stationary loss, the replacement rule, or the stage-wise freezing schedule; this absence directly undermines in the causal role attributed to encoder drift.
  2. [Method] Method description: no quantitative verification (e.g., code-assignment histograms, drift-norm plots, or controlled ablations) is supplied to establish that encoder drift is the primary driver rather than a correlate; the three-stage procedure therefore rests on an untested mechanistic assumption that is load-bearing for the proposed remedy.
minor comments (2)
  1. [Abstract] The abstract states the three stages at a high level; a concise numbered list or diagram in the method section would improve readability.
  2. [Experiments] No error bars, multiple random seeds, or training-curve figures are referenced; adding these would strengthen the reproducibility of the reported FID numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical isolation of components and mechanistic verification. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central empirical claim (rFID drop from 2.39 to 2.10, 100 % utilization) is presented without ablations that isolate the contribution of the dense non-stationary loss, the replacement rule, or the stage-wise freezing schedule; this absence directly undermines in the causal role attributed to encoder drift.

    Authors: We agree that the current experiments do not include ablations that separately quantify the contribution of each NSVQ component. This limits the strength of the causal attribution to encoder drift. In the revised manuscript we will add a dedicated ablation study that reports rFID and utilization when each component is removed or disabled in turn, while keeping all other training details fixed. revision: yes

  2. Referee: [Method] Method description: no quantitative verification (e.g., code-assignment histograms, drift-norm plots, or controlled ablations) is supplied to establish that encoder drift is the primary driver rather than a correlate; the three-stage procedure therefore rests on an untested mechanistic assumption that is load-bearing for the proposed remedy.

    Authors: The manuscript motivates encoder drift from observed training dynamics and the straight-through estimator feedback loop, yet we acknowledge that direct quantitative evidence such as drift-norm trajectories or assignment histograms is not provided. We will incorporate these visualizations and supporting measurements in the Method and Experiments sections of the revision to substantiate the mechanistic premise. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method only

full rationale

The paper describes an empirical training procedure (dense non-stationary loss + replacement + stage-wise freezing) and reports benchmark rFID numbers on ImageNet. No derivation chain, uniqueness theorem, or prediction is claimed; the central results are direct experimental comparisons rather than quantities forced by the method's own definitions or self-citations. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the method relies on the standard straight-through estimator and the assumption that encoder drift is observable and correctable through the listed interventions.

pith-pipeline@v0.9.1-grok · 5751 in / 1223 out tokens · 61452 ms · 2026-06-27T13:17:02.190448+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 4 canonical work pages

  1. [1]

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville

    URLhttps://arxiv.org/abs/2106.08254. Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432,

  2. [2]

    Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, and Xingang Wang

    URL https://arxiv.org/ abs/1308.3432. Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, and Xingang Wang. Scalable training for vector-quantized networks with 100% codebook utilization.arXiv preprint arXiv:2509.10140,

  3. [3]

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever

    URL https://arxiv.org/abs/2509.10140. Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music.arXiv preprint arXiv:2005.00341,

  4. [4]

    Patrick Esser, Robin Rombach, and Björn Ommer

    URL https://arxiv.org/abs/ 2005.00341. Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883,

  5. [5]

    Xianghong Fang, Yuan Yuan, Dehan Kong, and Tim G

    URLhttps://arxiv.org/abs/2012.09841. Xianghong Fang, Yuan Yuan, Dehan Kong, and Tim G. J. Rudner. VQ-Transplant: Efficient VQ-module integration for pre-trained visual tokenizers. InInternational Conference on Learning Representations,

  6. [6]

    Wengang Guo, Kaiyan Lin, and Wei Ye

    doi: 10.1109/18.720541. Wengang Guo, Kaiyan Lin, and Wei Ye. Deep embedded k-means clustering. In2021 IEEE International Conference on Data Mining Workshops (ICDMW), pages 686–694. IEEE,

  7. [7]

    2021.00090

    doi: 10.1109/ICDMW53433. 2021.00090. Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 14096–14113,

  8. [9]

    org/abs/2503.17760

    URL https://arxiv. org/abs/2503.17760. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738,

  9. [10]

    In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019

    doi: 10.1109/ICCV .2015.425. Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-V AE made simple. InInternational Conference on Learning Representations,

  10. [11]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer

    URL https://arxiv.org/abs/ 1906.00446. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695,

  11. [12]

    Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar

    URLhttps://arxiv.org/abs/2112.10752. Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments on vector quantized autoencoders.arXiv preprint arXiv:1805.11063,

  12. [13]

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C

    URLhttps://arxiv.org/abs/1805.11063. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge.International Journal of Computer Vision, 115(3):211–252,

  13. [14]

    Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, and Xihui Liu

    URL https://arxiv.org/abs/ 1711.00937. Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, and Xihui Liu. Bridging continuous and discrete tokens for autoregressive visual generation. InProceedings of the IEEE/CVF International Conference on Computer Vision,

  14. [15]

    Will Williams, Sam Ringer, Tom Ash, John Hughes, David MacLeod, and Jamie Dougherty

    URLhttps://arxiv.org/abs/2503.16430. Will Williams, Sam Ringer, Tom Ash, John Hughes, David MacLeod, and Jamie Dougherty. Hierarchical quantized autoencoders.arXiv preprint arXiv:2002.08111,

  15. [16]

    URL https://arxiv.org/abs/2002. 08111. Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. InInternational Con- ference on Learning Representations,

  16. [17]

    Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, and Jiwen Lu

    doi: 10.1109/TASLP.2021.3129994. Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Quantize-then-rectify: Efficient VQ-V AE training.arXiv preprint arXiv:2507.10547,

  17. [18]

    Jiahui Zhang, Fangneng Zhan, Christian Theobalt, and Shijian Lu

    URLhttps://arxiv.org/abs/2507.10547. Jiahui Zhang, Fangneng Zhan, Christian Theobalt, and Shijian Lu. Regularized vector quantization for tokenized image synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18467–18476,

  18. [19]

    Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, and De-An Huang

    URLhttps://arxiv.org/abs/2303.06424. Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, and De-An Huang. QLIP: Text-aligned visual tokenization unifies auto-regressive multimodal understanding and generation.arXiv preprint arXiv:2502.05178,

  19. [20]

    Chuanxia Zheng and Andrea Vedaldi

    URLhttps://arxiv.org/abs/2502.05178. Chuanxia Zheng and Andrea Vedaldi. Online clustered codebook. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22798–22807,

  20. [21]

    Scaling the codebook size of VQGAN to 100,000 with a utilization rate of 99%

    Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of VQGAN to 100,000 with a utilization rate of 99%. InAdvances in Neural Information Processing Systems, 2024a. URL https://arxiv.org/abs/2406.11837. Yongxin Zhu, Bocheng Li, Yifei Xin, Zhihua Xia, and Linli Xu. Addressing representation collapse in vector quantized models with one l...

  21. [22]

    Because the encoder has already been updated usingx 1, the latent position ofx 2 changes approximately as z(1) e (x2)≈z (0) e (x2) +J (0) θ (x2)∆θ(0).(30) However, becausec q2 was not selected when processingx 1, it does not directly move: c(1) q2 =c (0) q2 .(31) This creates an immediate asymmetry: the latent feature ofx2 has drifted, but its previously ...

  22. [23]

    Stage 2 then reintroduces adversarial refinement under the same fixed latent geometry

    At this point, freezing no longer prevents the encoder from learning a VQ-compatible representation; rather, it removes further encoder drift and converts codebook learning into a stable consolidation problem. Stage 2 then reintroduces adversarial refinement under the same fixed latent geometry. Therefore, in NSVQ, encoder freezing is not used as a stand-...