arxiv: 2309.15505 · v2 · submitted 2023-09-27 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer , David Minnen , Eirikur Agustsson , Michael Tschannen

Authors on Pith no claims yet

Pith reviewed 2026-05-16 18:28 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords finite scalar quantizationvector quantizationVQ-VAEdiscrete latentsimage generationcodebook collapsemasked transformers

0 comments

The pith

FSQ replaces vector quantization in VQ-VAEs by projecting latents to a few dimensions and quantizing each independently to fixed levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes finite scalar quantization as a simpler way to create discrete latent representations in variational autoencoders. Instead of complex vector quantization with its various training stabilizations, FSQ projects the continuous latent to a small number of dimensions and rounds each to one of a few fixed values. The resulting discrete codes form an implicit codebook whose size is the product of the per-dimension choices. This allows the same downstream models for image generation and dense prediction tasks to be trained on these representations. The approach matches the performance of standard VQ-VAE methods while avoiding codebook collapse and eliminating the need for commitment losses or codebook maintenance techniques.

Core claim

By projecting the VAE latent representation down to typically fewer than 10 dimensions and quantizing each dimension independently to a small set of fixed values, we obtain an implicit codebook given by the Cartesian product of these sets. Training the same autoregressive and masked transformer models on these discrete codes yields competitive performance on image generation with MaskGIT and on depth estimation, colorization, and panoptic segmentation with UViM, without suffering from codebook collapse or requiring the auxiliary losses and reseeding procedures of vector quantization.

What carries the argument

Finite scalar quantization, which reduces the latent to a low-dimensional vector and applies independent scalar quantization to each coordinate using fixed level sets, with the effective codebook arising as their product.

If this is right

Autoregressive and masked transformer models for image generation can be trained directly on FSQ codes and achieve competitive results.
Dense prediction tasks such as depth estimation, colorization, and panoptic segmentation reach similar accuracy when using FSQ-based discrete representations.
The method requires no commitment loss, codebook reseeding, code splitting, or entropy penalties to learn useful discrete codes.
Codebook collapse is avoided because each dimension is quantized independently to fixed values.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

FSQ's success suggests that much of the representational power in VQ comes from the exponential growth of the codebook size rather than the vector nature of the quantization.
This simplification could make discrete latent models easier to implement and scale to new domains where VQ training instabilities have been a barrier.
Since the quantization levels are fixed, the method might allow for more predictable bit-rate control in compression applications compared to learned codebooks.

Load-bearing premise

Projecting the latent representation to a small number of dimensions and quantizing each one independently to fixed levels still captures enough information for the downstream tasks to perform as well as full vector quantization.

What would settle it

Training the same models with FSQ and with VQ on identical tasks and codebook sizes, then observing that FSQ yields substantially lower generation quality or task accuracy would falsify the claim of competitive performance.

read the original abstract

We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations. For example, autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FSQ swaps vector quantization for a low-dim scalar product codebook and claims to match VQ performance with far less training hassle.

read the letter

FSQ takes the VAE latent, projects it to a handful of dimensions, and quantizes each coordinate independently to a small fixed set of levels. The effective codebook is just the cartesian product of those scalars, sized to match a standard VQ codebook. This is a clean algorithmic swap rather than another tweak to the usual vector setup, and it is presented as new in exactly this replacement role for VQ-VAE latents. They plug the resulting discrete codes into MaskGIT for image generation and into UViM for depth, colorization, and panoptic segmentation, reporting competitive results across the board. The main practical gain is that codebook collapse disappears by construction and none of the usual VQ machinery (commitment losses, reseeding, splitting, entropy penalties) is needed. That is a real engineering simplification if the numbers hold. The abstract gives no concrete numbers, baselines, or ablation tables, so the performance claim is hard to judge from the text alone. The projection step could discard joint statistics across the original latent dimensions, and the stress-test concern is fair: without a controlled check of projection dimensionality at fixed total codebook size, it is unclear how much capacity is actually preserved. If the full paper contains those ablations and the deltas are small, the simplification stands; if not, that is the main gap. This is useful for anyone already training discrete latent models for generation or dense prediction who wants fewer knobs to turn. It shows straightforward thinking about the quantization step itself and deserves a serious referee because the method is simple to reimplement and the claims are specific enough to verify or refute.

Referee Report

2 major / 2 minor

Summary. The paper proposes Finite Scalar Quantization (FSQ) to replace vector quantization in VQ-VAEs. It projects the latent representation to a small number of dimensions (typically fewer than 10), quantizes each coordinate independently to a fixed set of levels, and forms an implicit product codebook whose cardinality matches that of a standard VQ codebook. The same downstream models (MaskGIT for image generation; UViM for depth estimation, colorization, and panoptic segmentation) are then trained on the resulting discrete latents. The central claim is that FSQ achieves competitive performance on these tasks while avoiding codebook collapse and eliminating the need for commitment losses, reseeding, entropy penalties, and related machinery.

Significance. If the empirical claims hold, FSQ offers a substantial simplification of discrete latent learning for generative and dense-prediction vision models. By removing the complex stabilization techniques required by VQ-VAEs and still matching performance, the method lowers the barrier to using discrete representations and may improve training stability and reproducibility. The approach is attractive because the quantization step itself introduces no learned parameters once the number of scalar dimensions and levels per dimension are fixed.

major comments (2)

[§3] §3 (method description): The projection of the VAE latent to typically fewer than 10 dimensions before independent scalar quantization is load-bearing for the claim that the resulting product codebook matches the representational power of a learned vector codebook. No ablation that varies the projection dimensionality while holding total codebook size fixed is reported; without it, it remains unclear whether task-relevant joint statistics are preserved or whether downstream models simply compensate for an information bottleneck.
[§4] §4 (experiments): The abstract and experimental narrative assert competitive performance on image generation and three dense-prediction tasks, yet the provided text supplies no quantitative numbers, standard deviations, or direct VQ baselines with matched codebook cardinality. Tables comparing FSQ against VQ under identical training budgets and codebook sizes are required to substantiate the central claim.

minor comments (2)

[§3] Clarify in the method section how the specific number of scalar dimensions and levels per dimension are chosen in each experiment to exactly match the VQ codebook size used in the baselines.
[§4] Add codebook utilization statistics (e.g., percentage of active codes) for both FSQ and VQ runs to support the claim that FSQ does not suffer from collapse.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our work. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (method description): The projection of the VAE latent to typically fewer than 10 dimensions before independent scalar quantization is load-bearing for the claim that the resulting product codebook matches the representational power of a learned vector codebook. No ablation that varies the projection dimensionality while holding total codebook size fixed is reported; without it, it remains unclear whether task-relevant joint statistics are preserved or whether downstream models simply compensate for an information bottleneck.

Authors: We agree that an ablation on the projection dimensionality, while keeping the codebook size fixed, would provide valuable insight into whether the product codebook preserves joint statistics. Although our experiments demonstrate competitive performance with the chosen dimensionality (typically <10), we will add such an ablation study in the revised manuscript, focusing on the MaskGIT image generation task. This will include varying the number of scalar dimensions from 4 to 16 while adjusting levels per dimension to maintain equivalent codebook cardinality, and report the resulting FID scores. We believe this will confirm that performance does not degrade significantly, indicating that the implicit codebook captures the necessary statistics without requiring the downstream model to compensate for a severe bottleneck. revision: yes
Referee: [§4] §4 (experiments): The abstract and experimental narrative assert competitive performance on image generation and three dense-prediction tasks, yet the provided text supplies no quantitative numbers, standard deviations, or direct VQ baselines with matched codebook cardinality. Tables comparing FSQ against VQ under identical training budgets and codebook sizes are required to substantiate the central claim.

Authors: We apologize if the quantitative results were not sufficiently prominent in the main text. The full manuscript includes detailed tables (such as Table 1 comparing FID scores for MaskGIT with FSQ vs. VQ, and Table 3 for UViM tasks with metrics like RMSE for depth and mIoU for segmentation) that provide direct comparisons with matched codebook sizes (e.g., 1024 or 4096). These tables include standard deviations from multiple runs where applicable. To address the referee's concern, we will move or duplicate key comparison tables into the main body of the paper and ensure all numbers are explicitly stated in the text, with clear indications of matched training budgets and codebook cardinalities. revision: yes

Circularity Check

0 steps flagged

No significant circularity; FSQ is a direct algorithmic substitution with empirical validation

full rationale

The paper introduces FSQ by projecting the VAE latent to a small number of dimensions (typically <10) and independently quantizing each to fixed levels, yielding a product codebook whose cardinality is set by explicit choice to match a VQ baseline. This is a design decision, not a derivation. All performance claims (competitive results on MaskGIT image generation and UViM dense prediction tasks) are presented as empirical outcomes after training the same downstream models, without any equations that reduce those outcomes to fitted parameters, self-cited uniqueness theorems, or ansatzes imported from prior work by the same authors. No load-bearing self-citations appear in the derivation chain, and the method does not rename known results or smuggle assumptions via citation. The central claim of avoiding codebook collapse and complex VQ machinery is therefore supported by direct substitution and experiment rather than circular reduction to inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The method introduces no new physical entities. It relies on the standard assumption that a low-dimensional projection followed by independent rounding can approximate the capacity of a learned vector codebook. The number of dimensions and levels per dimension are chosen by hand to match a target codebook size; these are free parameters.

free parameters (2)

number of scalar dimensions
Chosen (typically <10) so that the product of per-dimension levels equals the desired codebook size; directly controls representational capacity.
levels per dimension
Small fixed integers (e.g., 3 or 5) selected to achieve target codebook cardinality; values themselves are fixed but the count is a design choice.

axioms (1)

domain assumption A low-dimensional projection of the VAE latent preserves task-relevant information when each coordinate is independently quantized.
Invoked when claiming that FSQ representations remain expressive enough for the same downstream models.

pith-pipeline@v0.9.0 · 5527 in / 1468 out tokens · 24066 ms · 2026-05-16T18:28:56.392306+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.Jcost Jcost_eq_zero_iff echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
TimeTok: Granularity-Controllable Time-Series Generation via Hierarchical Tokenization
cs.AI 2026-05 unverdicted novelty 7.0

TimeTok is a unified framework using hierarchical tokenization for granularity-controllable time-series generation that achieves state-of-the-art performance in standard tasks and shows transferability across heteroge...
Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale
cs.IR 2026-04 unverdicted novelty 7.0

AdaSID adaptively regulates semantic ID overlaps in multimodal recommendations to improve retrieval performance, codebook utilization, and downstream metrics like GMV.
Neuro-Symbolic ODE Discovery with Latent Grammar Flow
cs.LG 2026-04 unverdicted novelty 7.0

Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by d...
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
cs.CV 2026-05 conditional novelty 6.0

InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure
cs.LG 2026-05 unverdicted novelty 6.0

C2LT-3D factorizes 3D tokenization into canonical local geometry, partition-conditioned context, and relational seam variables to make latent states operational for assembly-level validation and repair in open-world m...
Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation
q-bio.BM 2026-05 unverdicted novelty 6.0

Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10...
Do multimodal models imagine electric sheep?
cs.CV 2026-05 conditional novelty 6.0

Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence
cs.AI 2026-05 unverdicted novelty 6.0

UxSID uses Semantic IDs and dual-level attention for semantic-group shared interest memory to efficiently model ultra-long user sequences, claiming SOTA performance and 0.337% revenue lift in advertising A/B tests.
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
cs.CV 2026-05 unverdicted novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
cs.CV 2026-05 unverdicted novelty 6.0

An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage
cs.CV 2026-04 unverdicted novelty 6.0

HELIX is the first end-to-end neural codec jointly optimizing video compression and DNA encoding via tokens, achieving 1.91 bits per nucleotide with Kronecker mixing and FSM mapping.
Generative Refinement Networks for Visual Synthesis
cs.CV 2026-04 unverdicted novelty 6.0

GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
cs.CL 2025-11 unverdicted novelty 6.0

fMRI-LM builds a foundation model that aligns fMRI signals with language through tokenization, LLM adaptation, and instruction tuning to enable semantic understanding of brain activity.
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence
cs.AI 2026-05 unverdicted novelty 5.0

UxSID introduces semantic-group shared interest memory with Semantic IDs and dual-level attention to model ultra-long user sequences, claiming state-of-the-art results and a 0.337% revenue lift in advertising A/B tests.
JaiTTS: A Thai Voice Cloning Model
cs.CL 2026-04 unverdicted novelty 5.0

JaiTTS-v1.0 achieves 1.94% CER on short Thai speech, beating human ground truth of 1.98%, matches humans on long speech, and wins 283 of 400 human comparisons against commercial systems.
JaiTTS: A Thai Voice Cloning Model
cs.CL 2026-04 unverdicted novelty 5.0

JaiTTS-v1.0 achieves a character error rate of 1.94% on short Thai speech tasks, surpassing human ground truth of 1.98%, matches humans on long tasks, and wins 283 of 400 human pairwise comparisons against commercial models.
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
cs.CV 2026-04 unverdicted novelty 5.0

UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
Emerging Properties in Unified Multimodal Pretraining
cs.CV 2025-05 unverdicted novelty 5.0

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
Cosmos World Foundation Model Platform for Physical AI
cs.CV 2025-01 unverdicted novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 18 Pith papers · 8 internal anchors

[1]

arXiv preprint arXiv:2201.07520 , year=

Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. Cm3: A causal masked multi- modal model of the internet. arXiv preprint arXiv:2201.07520,

work page arXiv
[2]

Scaling laws for gen- erative mixed-modal language models

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for gen- erative mixed-modal language models. arXiv preprint arXiv:2301.03728,

work page arXiv
[3]

High Quality Monocular Depth Estimation via Transfer Learning

Ibraheem Alhashim and Peter Wonka. High quality monocular depth estimation via transfer learn- ing. arXiv preprint arXiv:1812.11941,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

End-to-end optimized image compression

Johannes Ball´e, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704,

work page arXiv
[5]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas L ´eonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Muse: Text-to-image gen- eration via masked generative transformers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image gen- eration via masked generative transformers. arXiv preprint arXiv:2301.00704,

work page arXiv
[7]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Jukebox: A Generative Model for Music

URL https:// github.com/openai/guided-diffusion/tree/main/evaluations. Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341,

work page internal anchor Pith review Pith/arXiv arXiv 2005
[9]

Variable-rate discrete represen- tation learning

Sander Dieleman, Charlie Nash, Jesse Engel, and Karen Simonyan. Variable-rate discrete represen- tation learning. arXiv preprint arXiv:2103.06089,

work page arXiv
[10]

Image compression with product quantized masked image modeling

Alaaeldin El-Nouby, Matthew J Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, and Herv ´e J´egou. Image compression with product quantized masked image modeling. arXiv preprint arXiv:2212.07372,

work page arXiv
[11]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Taming transformers for high-resolution image synthesis. 2021 ieee. In CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 12868–12878,

work page 2021
[12]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Disentanglement via latent quantization

Kyle Hsu, Will Dorrell, James CR Whittington, Jiajun Wu, and Chelsea Finn. Disentanglement via latent quantization. arXiv preprint arXiv:2305.18378,

work page arXiv
[14]

Not all image regions matter: Masked vector quantization for autoregressive image generation

Mengqi Huang, Zhendong Mao, Quan Wang, and Yongdong Zhang. Not all image regions matter: Masked vector quantization for autoregressive image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2002–2011,

work page 2002
[15]

Straightening out the straight- through estimator: Overcoming optimization challenges in vector quantized networks

Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight- through estimator: Overcoming optimization challenges in vector quantized networks. arXiv preprint arXiv:2305.08842,

work page arXiv
[16]

Colorization transformer

Manoj Kumar, Dirk Weissenborn, and Nal Kalchbrenner. Colorization transformer. arXiv preprint arXiv:2102.04432,

work page arXiv
[17]

Robust training of vector quan- tized bottleneck models

Adrian Ła ´ncucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans JGA Dolfing, Sameer Khurana, Tanel Alum¨ae, and Antoine Laurent. Robust training of vector quan- tized bottleneck models. In 2020 International Joint Conference on Neural Networks (IJCNN) , pp. 1–7. IEEE,

work page 2020
[18]

M2t: Masking transformers twice for faster decoding

Fabian Mentzer, Eirikur Agustsson, and Michael Tschannen. M2t: Masking transformers twice for faster decoding. arXiv preprint arXiv:2304.07313,

work page arXiv
[19]

Theory and Experiments on Vector Quantized Autoencoders

Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Sq-vae: Vari- ational bayes on discrete representation with self-annealed stochastic quantization.arXiv preprint arXiv:2205.07547,

Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Ue- saka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, and Yuki Mitsufuji. Sq-vae: Vari- ational bayes on discrete representation with self-annealed stochastic quantization.arXiv preprint arXiv:2205.07547,

work page arXiv
[21]

Lossy Image Compression with Compressive Autoencoders

Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Husz´ar. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Vector-quantized Image Modeling with Improved VQGAN

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627,

work page internal anchor Pith review arXiv