pith. machine review for the scientific record.

arxiv: 2309.15505 · v2 · submitted 2023-09-27 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Finite Scalar Quantization: VQ-VAE Made Simple

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:28 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords finite scalar quantization · vector quantization · VQ-VAE · discrete latents · image generation · codebook collapse · masked transformers

The pith

FSQ replaces vector quantization in VQ-VAEs by projecting latents to a few dimensions and quantizing each independently to fixed levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes finite scalar quantization as a simpler way to create discrete latent representations in variational autoencoders. Instead of complex vector quantization with its various training stabilizations, FSQ projects the continuous latent to a small number of dimensions and rounds each to one of a few fixed values. The resulting discrete codes form an implicit codebook whose size is the product of the per-dimension choices. This allows the same downstream models for image generation and dense prediction tasks to be trained on these representations. The approach matches the performance of standard VQ-VAE methods while avoiding codebook collapse and eliminating the need for commitment losses or codebook maintenance techniques.
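The mechanism described above is simple enough to sketch directly. Below is a minimal NumPy rendering of the quantization step, assuming the encoder output is already bounded to [-1, 1] per dimension (the paper uses a bounding nonlinearity for this); the function name and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each latent dimension independently to a fixed grid of levels.

    z      : array of shape (..., d), assumed already bounded to [-1, 1]
             (the paper applies a bounding nonlinearity in the encoder).
    levels : number of levels per dimension, e.g. [8, 5, 5, 5].
    Returns the quantized latent, each coordinate snapped to its grid.
    """
    half = (np.asarray(levels) - 1) / 2.0   # 8 levels -> grid {-1, -5/7, ..., 5/7, 1}
    return np.round(z * half) / half

# One 4-dimensional latent under FSQ levels [8, 5, 5, 5] (a 1000-entry implicit codebook):
z = np.array([0.0, 1.0, -1.0, 0.3])
print(fsq_quantize(z, [8, 5, 5, 5]))  # each coordinate snaps to its nearest grid value
```

During training the paper propagates gradients through the rounding with a straight-through estimator (copying the gradient past `round`); this inference-only sketch omits that detail.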

Core claim

By projecting the VAE latent representation down to typically fewer than 10 dimensions and quantizing each dimension independently to a small set of fixed values, we obtain an implicit codebook given by the Cartesian product of these sets. Training the same autoregressive and masked transformer models on these discrete codes yields competitive performance on image generation with MaskGIT and on depth estimation, colorization, and panoptic segmentation with UViM, without suffering from codebook collapse or requiring the auxiliary losses and reseeding procedures of vector quantization.

What carries the argument

Finite scalar quantization, which reduces the latent to a low-dimensional vector and applies independent scalar quantization to each coordinate using fixed level sets, with the effective codebook arising as their product.
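Because the codebook is the Cartesian product of the per-dimension level sets, a flat code index never needs a learned embedding table: it is just a mixed-radix number formed from the per-dimension level indices. A hedged sketch (helper names are invented for illustration):

```python
import numpy as np

def code_to_index(zq, levels):
    """Map a quantized latent (per-dimension values in [-1, 1]) to a flat
    codebook index. The implicit codebook is the Cartesian product of the
    per-dimension level sets, so the index is a mixed-radix number; no
    embedding table is ever materialized.
    """
    L = np.asarray(levels)
    digits = np.round((zq + 1.0) * (L - 1) / 2.0).astype(int)  # level index in [0, L_i)
    index = 0
    for d, base in zip(digits, L):
        index = index * base + d
    return int(index)

levels = [8, 5, 5, 5]
print(int(np.prod(levels)))  # codebook size = product of level counts = 1000
print(code_to_index(np.array([0.0, 1.0, -1.0, 0.5]), levels))
```

Downstream transformers consume these integer indices exactly as they would VQ token ids.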

If this is right

  • Autoregressive and masked transformer models for image generation can be trained directly on FSQ codes and achieve competitive results.
  • Dense prediction tasks such as depth estimation, colorization, and panoptic segmentation reach similar accuracy when using FSQ-based discrete representations.
  • The method requires no commitment loss, codebook reseeding, code splitting, or entropy penalties to learn useful discrete codes.
  • Codebook collapse is avoided because each dimension is quantized independently to fixed values.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • FSQ's success suggests that much of the representational power in VQ comes from the exponential growth of the codebook size rather than the vector nature of the quantization.
  • This simplification could make discrete latent models easier to implement and scale to new domains where VQ training instabilities have been a barrier.
  • Since the quantization levels are fixed, the method might allow for more predictable bit-rate control in compression applications compared to learned codebooks.

Load-bearing premise

Projecting the latent representation to a small number of dimensions and quantizing each one independently to fixed levels still captures enough information for the downstream tasks to perform as well as full vector quantization.

What would settle it

Training the same models with FSQ and with VQ on identical tasks and matched codebook sizes would settle it: if FSQ yielded substantially lower generation quality or task accuracy, the claim of competitive performance would be falsified.

Original abstract

We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations. For example, autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Finite Scalar Quantization (FSQ) to replace vector quantization in VQ-VAEs. It projects the latent representation to a small number of dimensions (typically fewer than 10), quantizes each coordinate independently to a fixed set of levels, and forms an implicit product codebook whose cardinality matches that of a standard VQ codebook. The same downstream models (MaskGIT for image generation; UViM for depth estimation, colorization, and panoptic segmentation) are then trained on the resulting discrete latents. The central claim is that FSQ achieves competitive performance on these tasks while avoiding codebook collapse and eliminating the need for commitment losses, reseeding, entropy penalties, and related machinery.

Significance. If the empirical claims hold, FSQ offers a substantial simplification of discrete latent learning for generative and dense-prediction vision models. By removing the complex stabilization techniques required by VQ-VAEs and still matching performance, the method lowers the barrier to using discrete representations and may improve training stability and reproducibility. The approach is attractive because the quantization step itself introduces no learned parameters once the number of scalar dimensions and levels per dimension are fixed.

major comments (2)
  1. [§3] §3 (method description): The projection of the VAE latent to typically fewer than 10 dimensions before independent scalar quantization is load-bearing for the claim that the resulting product codebook matches the representational power of a learned vector codebook. No ablation that varies the projection dimensionality while holding total codebook size fixed is reported; without it, it remains unclear whether task-relevant joint statistics are preserved or whether downstream models simply compensate for an information bottleneck.
  2. [§4] §4 (experiments): The abstract and experimental narrative assert competitive performance on image generation and three dense-prediction tasks, yet the provided text supplies no quantitative numbers, standard deviations, or direct VQ baselines with matched codebook cardinality. Tables comparing FSQ against VQ under identical training budgets and codebook sizes are required to substantiate the central claim.
minor comments (2)
  1. [§3] Clarify in the method section how the specific number of scalar dimensions and levels per dimension are chosen in each experiment to exactly match the VQ codebook size used in the baselines.
  2. [§4] Add codebook utilization statistics (e.g., percentage of active codes) for both FSQ and VQ runs to support the claim that FSQ does not suffer from collapse.
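The utilization statistic requested in the second minor comment is cheap to compute from flat code indices on a held-out set; a minimal sketch, with a hypothetical helper name:

```python
import numpy as np

def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries used at least once.

    `indices` would be the flat code indices collected on a validation set;
    the helper name is hypothetical, sketching the statistic the report asks for.
    """
    return np.unique(indices).size / codebook_size

# Six tokens hitting 4 distinct codes out of a 1024-entry codebook:
indices = np.array([0, 3, 3, 7, 512, 0])
print(codebook_utilization(indices, 1024))  # 4 / 1024 = 0.00390625
```

Under collapse this fraction drops far below 1 for VQ; for FSQ every grid point is reachable by construction, though actual usage still depends on the encoder.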

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our work. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (method description): The projection of the VAE latent to typically fewer than 10 dimensions before independent scalar quantization is load-bearing for the claim that the resulting product codebook matches the representational power of a learned vector codebook. No ablation that varies the projection dimensionality while holding total codebook size fixed is reported; without it, it remains unclear whether task-relevant joint statistics are preserved or whether downstream models simply compensate for an information bottleneck.

    Authors: We agree that an ablation on the projection dimensionality, while keeping the codebook size fixed, would provide valuable insight into whether the product codebook preserves joint statistics. Although our experiments demonstrate competitive performance with the chosen dimensionality (typically <10), we will add such an ablation study in the revised manuscript, focusing on the MaskGIT image generation task. This will include varying the number of scalar dimensions from 4 to 16 while adjusting levels per dimension to maintain equivalent codebook cardinality, and report the resulting FID scores. We believe this will confirm that performance does not degrade significantly, indicating that the implicit codebook captures the necessary statistics without requiring the downstream model to compensate for a severe bottleneck. revision: yes

  2. Referee: [§4] §4 (experiments): The abstract and experimental narrative assert competitive performance on image generation and three dense-prediction tasks, yet the provided text supplies no quantitative numbers, standard deviations, or direct VQ baselines with matched codebook cardinality. Tables comparing FSQ against VQ under identical training budgets and codebook sizes are required to substantiate the central claim.

    Authors: We apologize if the quantitative results were not sufficiently prominent in the main text. The full manuscript includes detailed tables (such as Table 1 comparing FID scores for MaskGIT with FSQ vs. VQ, and Table 3 for UViM tasks with metrics like RMSE for depth and mIoU for segmentation) that provide direct comparisons with matched codebook sizes (e.g., 1024 or 4096). These tables include standard deviations from multiple runs where applicable. To address the referee's concern, we will move or duplicate key comparison tables into the main body of the paper and ensure all numbers are explicitly stated in the text, with clear indications of matched training budgets and codebook cardinalities. revision: yes

Circularity Check

0 steps flagged

No significant circularity; FSQ is a direct algorithmic substitution with empirical validation

Full rationale

The paper introduces FSQ by projecting the VAE latent to a small number of dimensions (typically <10) and independently quantizing each to fixed levels, yielding a product codebook whose cardinality is set by explicit choice to match a VQ baseline. This is a design decision, not a derivation. All performance claims (competitive results on MaskGIT image generation and UViM dense prediction tasks) are presented as empirical outcomes after training the same downstream models, without any equations that reduce those outcomes to fitted parameters, self-cited uniqueness theorems, or ansatzes imported from prior work by the same authors. No load-bearing self-citations appear in the derivation chain, and the method does not rename known results or smuggle assumptions via citation. The central claim of avoiding codebook collapse and complex VQ machinery is therefore supported by direct substitution and experiment rather than circular reduction to inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method introduces no new physical entities. It relies on the standard assumption that a low-dimensional projection followed by independent rounding can approximate the capacity of a learned vector codebook. The number of dimensions and levels per dimension are chosen by hand to match a target codebook size; these are free parameters.

free parameters (2)
  • number of scalar dimensions
    Chosen (typically <10) so that the product of per-dimension levels equals the desired codebook size; directly controls representational capacity.
  • levels per dimension
    Small fixed integers (e.g., 3 or 5) selected to achieve target codebook cardinality; values themselves are fixed but the count is a design choice.
axioms (1)
  • domain assumption A low-dimensional projection of the VAE latent preserves task-relevant information when each coordinate is independently quantized.
    Invoked when claiming that FSQ representations remain expressive enough for the same downstream models.
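The interaction of the two free parameters can be made concrete: given a target codebook size, the admissible (dimensions, levels) configurations are simply the factorizations whose product hits that size. A small enumeration sketch; the candidate level set is an assumption of this sketch, not prescribed by the paper.

```python
from itertools import product
from math import prod

def level_configs(target, n_dims, candidates=(3, 4, 5, 7, 8)):
    """Enumerate per-dimension level choices whose product equals `target`.

    Shows how the two free parameters -- number of scalar dimensions and
    levels per dimension -- jointly pin down the implicit codebook size.
    """
    return [cfg for cfg in product(candidates, repeat=n_dims)
            if prod(cfg) == target]

# A 1000-code codebook in 4 dimensions forces 8*5*5*5 (up to ordering):
print(level_configs(1000, 4))
```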

pith-pipeline@v0.9.0 · 5527 in / 1468 out tokens · 24066 ms · 2026-05-16T18:28:56.392306+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.Jcost · Jcost_eq_zero_iff — echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  2. TimeTok: Granularity-Controllable Time-Series Generation via Hierarchical Tokenization

    cs.AI 2026-05 unverdicted novelty 7.0

    TimeTok is a unified framework using hierarchical tokenization for granularity-controllable time-series generation that achieves state-of-the-art performance in standard tasks and shows transferability across heteroge...

  3. Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

    cs.IR 2026-04 unverdicted novelty 7.0

    AdaSID adaptively regulates semantic ID overlaps in multimodal recommendations to improve retrieval performance, codebook utilization, and downstream metrics like GMV.

  4. Neuro-Symbolic ODE Discovery with Latent Grammar Flow

    cs.LG 2026-04 unverdicted novelty 7.0

    Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by d...

  5. InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 6.0

    InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

  6. Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure

    cs.LG 2026-05 unverdicted novelty 6.0

    C2LT-3D factorizes 3D tokenization into canonical local geometry, partition-conditioned context, and relational seam variables to make latent states operational for assembly-level validation and repair in open-world m...

  7. Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation

    q-bio.BM 2026-05 unverdicted novelty 6.0

    Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10...

  8. Do multimodal models imagine electric sheep?

    cs.CV 2026-05 conditional novelty 6.0

    Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

  9. UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence

    cs.AI 2026-05 unverdicted novelty 6.0

    UxSID uses Semantic IDs and dual-level attention for semantic-group shared interest memory to efficiently model ultra-long user sequences, claiming SOTA performance and 0.337% revenue lift in advertising A/B tests.

  10. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  11. End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

    cs.CV 2026-05 unverdicted novelty 6.0

    An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

  12. From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage

    cs.CV 2026-04 unverdicted novelty 6.0

    HELIX is the first end-to-end neural codec jointly optimizing video compression and DNA encoding via tokens, achieving 1.91 bits per nucleotide with Kronecker mixing and FSM mapping.

  13. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  14. fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

    cs.CL 2025-11 unverdicted novelty 6.0

    fMRI-LM builds a foundation model that aligns fMRI signals with language through tokenization, LLM adaptation, and instruction tuning to enable semantic understanding of brain activity.

  15. UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence

    cs.AI 2026-05 unverdicted novelty 5.0

    UxSID introduces semantic-group shared interest memory with Semantic IDs and dual-level attention to model ultra-long user sequences, claiming state-of-the-art results and a 0.337% revenue lift in advertising A/B tests.

  16. JaiTTS: A Thai Voice Cloning Model

    cs.CL 2026-04 unverdicted novelty 5.0

    JaiTTS-v1.0 achieves 1.94% CER on short Thai speech, beating human ground truth of 1.98%, matches humans on long speech, and wins 283 of 400 human comparisons against commercial systems.

  17. JaiTTS: A Thai Voice Cloning Model

    cs.CL 2026-04 unverdicted novelty 5.0

    JaiTTS-v1.0 achieves a character error rate of 1.94% on short Thai speech tasks, surpassing human ground truth of 1.98%, matches humans on long tasks, and wins 283 of 400 human pairwise comparisons against commercial models.

  18. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

  19. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  20. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 18 Pith papers · 8 internal anchors

  1. [1]

CM3: A causal masked multimodal model of the internet

    Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. CM3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520.

  2. [2]

Scaling laws for generative mixed-modal language models

    Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. arXiv preprint arXiv:2301.03728.

  3. [3]

    High Quality Monocular Depth Estimation via Transfer Learning

Ibraheem Alhashim and Peter Wonka. High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941.

  4. [4]

    End-to-end optimized image compression

Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704.

  5. [5]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

  6. [6]

Muse: Text-to-image generation via masked generative transformers

    Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704.

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  8. [8]

    Jukebox: A Generative Model for Music

Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341.

  9. [9]

Variable-rate discrete representation learning

    Sander Dieleman, Charlie Nash, Jesse Engel, and Karen Simonyan. Variable-rate discrete representation learning. arXiv preprint arXiv:2103.06089.

  10. [10]

    Image compression with product quantized masked image modeling

Alaaeldin El-Nouby, Matthew J Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, and Hervé Jégou. Image compression with product quantized masked image modeling. arXiv preprint arXiv:2212.07372.

  11. [11]

    Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12868–12878.

  12. [12]

    Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

  13. [13]

    Disentanglement via latent quantization

Kyle Hsu, Will Dorrell, James CR Whittington, Jiajun Wu, and Chelsea Finn. Disentanglement via latent quantization. arXiv preprint arXiv:2305.18378.

  14. [14]

    Not all image regions matter: Masked vector quantization for autoregressive image generation

Mengqi Huang, Zhendong Mao, Quan Wang, and Yongdong Zhang. Not all image regions matter: Masked vector quantization for autoregressive image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2002–2011.

  15. [15]

Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks

    Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. arXiv preprint arXiv:2305.08842.

  16. [16]

    Colorization transformer

Manoj Kumar, Dirk Weissenborn, and Nal Kalchbrenner. Colorization transformer. arXiv preprint arXiv:2102.04432.

  17. [17]

Robust training of vector quantized bottleneck models

    Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans JGA Dolfing, Sameer Khurana, Tanel Alumäe, and Antoine Laurent. Robust training of vector quantized bottleneck models. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE.

  18. [18]

    M2t: Masking transformers twice for faster decoding

Fabian Mentzer, Eirikur Agustsson, and Michael Tschannen. M2T: Masking transformers twice for faster decoding. arXiv preprint arXiv:2304.07313.

  19. [19]

    Theory and Experiments on Vector Quantized Autoencoders

Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063.

  20. [20]

SQ-VAE: Variational Bayes on discrete representation with self-annealed stochastic quantization

    Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, and Yuki Mitsufuji. SQ-VAE: Variational Bayes on discrete representation with self-annealed stochastic quantization. arXiv preprint arXiv:2205.07547.

  21. [21]

    Lossy Image Compression with Compressive Autoencoders

Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.

  22. [22]

    Vector-quantized Image Modeling with Improved VQGAN

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. arXiv preprint arXiv:2110.04627.