MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging
Pith reviewed 2026-06-28 22:40 UTC · model grok-4.3
The pith
MergeTok unifies continuous VAE and discrete VQ visual tokenizers in one architecture by clustering tokens during encoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MergeTok jointly optimizes a continuous VAE and a discrete VQ tokenizer within a shared encoder-decoder by leveraging token merging during encoding to establish a semantic bridge, imposing merged-token semantic alignment on the VAE for disentangled representations and group-wise constraints on the VQ for training stability.
What carries the argument
Token merging during encoding, which clusters similar tokens to supply dual supervision signals for semantic alignment in the VAE and group-wise constraints in the VQ.
If this is right
- The VAE branch produces more disentangled and semantic-aware latents from the merged-token alignment.
- The VQ branch gains training stability from intra-group diversity and inter-group exclusivity constraints.
- The unified model achieves lower rFID than strong VAE and VQ baselines on ImageNet-256 at matched token budgets.
- The resulting tokens remain compatible with both autoregressive and diffusion generators.
Where Pith is reading between the lines
- The merging step could be tested as a general bridge for unifying continuous and discrete models in other modalities.
- Semantic organization induced by merging might support finer control in editing or conditional generation tasks.
- Dual supervision from merging could lower reliance on hand-designed auxiliary losses during tokenizer training.
Load-bearing premise
Token merging during encoding will reliably create dual supervision signals that improve VAE disentanglement and VQ stability without introducing new training instabilities or reconstruction artifacts.
What would settle it
If MergeTok on ImageNet-256 produces higher rFID or codebook collapse than matched separate VAE and VQ baselines, the unification benefit would be refuted.
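For reference, rFID conventionally denotes the Fréchet Inception Distance computed between original images and their reconstructions. The abstract does not restate the definition, so the formula below is the standard one and is an assumption about what the paper reports. Here $(\mu_x, \Sigma_x)$ and $(\mu_{\hat{x}}, \Sigma_{\hat{x}})$ are the mean and covariance of Inception features for originals and reconstructions:

```latex
\mathrm{rFID}
  = \lVert \mu_x - \mu_{\hat{x}} \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_x + \Sigma_{\hat{x}}
  - 2 \left( \Sigma_x \Sigma_{\hat{x}} \right)^{1/2} \right)
```

Lower is better, so the refutation condition amounts to MergeTok scoring a higher rFID than a separate VAE or VQ baseline at the same token budget.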
Original abstract
Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within a shared encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MergeTok, a unified encoder-decoder architecture for visual tokenization that jointly optimizes a continuous VAE branch and a discrete VQ branch by applying token merging during encoding. The merging operation is claimed to create dual supervision signals: semantic alignment that regularizes the VAE latent space toward disentanglement, and group-wise constraints that stabilize VQ codebook usage by promoting intra-group diversity and inter-group exclusivity. Experiments on ImageNet-256 report competitive rFID under matched token budgets, with tokens that are semantically organized and compatible with both autoregressive and diffusion generators.
Significance. If the dual-supervision mechanism holds, the work would meaningfully advance visual tokenization by unifying the complementary strengths of VAE and VQ families within a single model. The reported lower rFID relative to strong baselines and the production of generator-compatible discrete tokens constitute falsifiable experimental evidence; the token-merging bridge is presented as the key innovation enabling semantic organization without sacrificing reconstruction fidelity.
major comments (1)
- [Abstract and §4] Experimental results: the central claim that token merging supplies dual supervision responsible for semantic organization and VQ stability is load-bearing, yet the manuscript provides no ablation that isolates the merge step (e.g., joint VAE+VQ optimization without merging) and no quantitative metrics such as per-group codebook usage diversity or training-curve stability comparisons. Without these, the observed gains could arise from joint optimization alone rather than from the asserted structural prior.
minor comments (1)
- [§3] Notation for the merged-token loss terms is introduced without an explicit equation reference; adding a numbered equation in §3 would clarify how the group-wise constraints are formalized.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for major revision. We address the single major comment below.
Referee: [Abstract and §4] Experimental results: the central claim that token merging supplies dual supervision responsible for semantic organization and VQ stability is load-bearing, yet the manuscript provides no ablation that isolates the merge step (e.g., joint VAE+VQ optimization without merging) and no quantitative metrics such as per-group codebook usage diversity or training-curve stability comparisons. Without these, the observed gains could arise from joint optimization alone rather than from the asserted structural prior.
Authors: We agree that an ablation isolating the token-merging operation (e.g., joint VAE+VQ training without merging) would strengthen the central claim. The current results compare MergeTok to standalone VAE and VQ baselines under matched token budgets, but do not directly test joint optimization absent the merge step. We will add this ablation in the revision, together with quantitative metrics for per-group codebook usage diversity (e.g., intra-group entropy) and side-by-side training-curve stability comparisons. These additions will clarify whether the observed improvements in rFID and semantic organization stem specifically from the merging-induced structural prior.
Revision: yes
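The intra-group entropy metric mentioned in the response could be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `per_group_code_entropy` and the inputs `code_ids` (the VQ code index assigned to each token) and `group_ids` (the merge group each token belongs to) are hypothetical, and the paper may define the metric differently.

```python
import math
from collections import Counter, defaultdict


def per_group_code_entropy(code_ids, group_ids):
    """Shannon entropy (in nats) of codebook usage inside each group.

    code_ids and group_ids are parallel sequences, one entry per token.
    Both names are illustrative assumptions, not taken from the paper.
    """
    usage = defaultdict(Counter)
    for code, group in zip(code_ids, group_ids):
        usage[group][code] += 1

    entropies = {}
    for group, counts in usage.items():
        total = sum(counts.values())
        # H = -sum_c p(c) log p(c), with p(c) the within-group code frequency.
        entropies[group] = -sum(
            (n / total) * math.log(n / total) for n in counts.values()
        )
    return entropies
```

A group whose tokens all map to a single code scores 0, while uniform usage over n distinct codes scores log n, so persistently low values would be one possible indicator of within-group collapse.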
Circularity Check
No circularity detected; method is a proposed design without self-referential reductions
The paper introduces MergeTok as an architectural design that applies token merging to create dual supervision signals between VAE and VQ branches. The abstract and claims describe this as a structural prior imposed by clustering, without any equations, fitted parameters renamed as predictions, or derivations that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the central claims rest on the proposed mechanism rather than renaming known results or smuggling ansatzes. The derivation chain is self-contained as an engineering proposal, consistent with the reader's preliminary score of 2.0 indicating no evaluable circularity from the abstract alone.
Reference graph
Works this paper leans on
- [1] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. In International Conference on Learning Representations (ICLR).
- [2] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [3] Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [4] Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. In International Conference on Learning Representations (ICLR), 2025.
- [5] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. ArXiv, abs/2403.06764, 2024.
- [6] Yizhu Chen, Chen Ju, Zhicheng Wang, Shuai Xiao, Xu Chen, Jinsong Lan, Xiaoyong Zhu, and Ying Chen. Wave-particle (continuous-discrete) dualistic visual tokenization for unified understanding and generation. ArXiv, 2025.
- [7] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. ArXiv, abs/2105.05233.
- [8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
- [9] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883, June.
- [10] Gao et al. Vtbench: A standardized benchmark for visual tokenizers in autoregressive image generation. ArXiv, 2024.
- [11] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [12] Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. ArXiv, abs/2501.07730, 2025.
- [13] Diederik P Kingma. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2013.
- [14] Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, and Zhen Lei. Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [15] Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan. Mage: Masked generative encoder to unify representation learning and image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [16] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [18] Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, and Gao Huang. Coda: Repurposing continuous vaes for discrete tokenization. ArXiv, abs/2503.17760, 2025.
- [19] Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, and Yinfei Yang. Atoken: A unified tokenizer for vision. ArXiv, abs/2509.14476, 2025. URL https://api.semanticscholar.org/CorpusID:281394841.
- [20] Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410.
- [21] Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. ArXiv, 2025.
- [22] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision (ECCV), 2024.
- [23] William S. Peebles and Saining Xie. Scalable diffusion models with transformers. In International Conference on Computer Vision (ICCV), pages 4172–4182, 2023.
- [24] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022.
- [25] Kyle Sargent, Kyle Hsu, Justin Johnson, Fei-Fei Li, and Jiajun Wu. Flow to the mode: Mode-seeking diffusion autoencoders for state-of-the-art image tokenization. In International Conference on Computer Vision (ICCV), 2025.
- [26] Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. arXiv preprint, 2024.
- [27] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [28] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [29] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), pages 10347–10357, 2021.
- [30] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. ArXiv, 2017.
- [31] Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. arXiv preprint arXiv:2406.09399, 2024.
- [32] Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, and Xihui Liu. Bridging continuous and discrete tokens for autoregressive visual generation. In International Conference on Computer Vision (ICCV), 2025.
- [33] Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. In International Conference on Computer Vision (ICCV), 2025.
- [34] Xu et al. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. ArXiv, 2024.
- [35] Sicheng Yang, Xing Hu, Qiang Wu, and Dawei Yang. Vaevq: Enhancing discrete visual tokenization through variational modeling. ArXiv, abs/2511.06863, 2025.
- [36] Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15703–15712, 2025.
- [37] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.
- [38] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10459–10469, 2023.
- [39] Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming Yang, Kevin P. Murphy, Alexander G. Hauptmann, and Lu Jiang. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. ArXiv, 2023.
- [40] Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [41] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations (ICLR), 2025.
- [42] Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, and Biye Li. V2flow: Unifying visual tokenization and large language model vocabularies for autoregressive image generation. ArXiv, abs/2503.07493, 2025.
- [43] Zhao et al. Mingtok: A continuous unified tokenizer for autoregressive visual understanding and generation. ArXiv, 2024.
- [44] Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, and Xiaojuan Qi. Hita: Holistic tokenizer for autoregressive image generation. In International Conference on Computer Vision (ICCV), 2025.
- [45] Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation. In Conference on Neural Information Processing Systems (NeurIPS), 2025.
- [46] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. ArXiv, abs/2510.11690, 2025.
Appendix for MergeTok
Roadmap
The appendix is organized to follow the structure of the main paper, with each part providing deeper coverage of the corresponding main-text section. Part I – Extended Discussion (A...
1. Evenly partition tokens into two sets A and B (e.g., by odd/even indices).
2. For each token in A, find its most similar token in B according to the current-layer attention features.
3. Select the top pairs for merging under a pre-defined schedule.
4. Aggregate features within each selected pair (e.g., by averaging) to form merged tokens.
After one merge operation, the sequence length is reduced from L to K (K < L), producing a compressed sequence $Z_K \in \mathbb{R}^{K \times D}$ that is fed into the next layer. Token similarity is computed using attention features from the current layer, typically the self-attention keys K...
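The four merge steps listed above (partition, match, select, aggregate) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the names `cosine` and `merge_step`, the parameter `r` (number of pairs merged in one operation, standing in for the pre-defined schedule), the choice of cosine similarity, and the optional `metric` argument (per-token similarity features such as attention keys) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def merge_step(tokens, r, metric=None):
    """One merge operation: L tokens in, K = L - r tokens out.

    tokens: list of L feature vectors (lists of floats).
    r:      number of pairs to merge (hypothetical schedule parameter).
    metric: optional similarity features per token (e.g. attention keys);
            falls back to the token features themselves.
    """
    metric = tokens if metric is None else metric

    # Step 1: split tokens into set A (even indices) and set B (odd indices).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx:
        return [list(t) for t in tokens]
    r = max(0, min(r, len(a_idx)))

    # Step 2: for each token in A, find its most similar token in B.
    matches = []
    for a in a_idx:
        best_sim, best_b = max((cosine(metric[a], metric[b]), b) for b in b_idx)
        matches.append((best_sim, a, best_b))

    # Step 3: keep only the r most similar pairs for merging.
    matches.sort(key=lambda m: m[0], reverse=True)
    merged, unmerged = matches[:r], matches[r:]

    # Step 4: average each B token with the A tokens merged into it.
    groups = {b: [tokens[b]] for b in b_idx}
    for _, a, b in merged:
        groups[b].append(tokens[a])
    dim = len(tokens[0])
    merged_b = [
        [sum(vec[d] for vec in groups[b]) / len(groups[b]) for d in range(dim)]
        for b in b_idx
    ]

    # Output: unmerged A tokens (original order) followed by the B tokens.
    kept_a = [list(tokens[a]) for a in sorted(a for _, a, _ in unmerged)]
    return kept_a + merged_b
```

Several A tokens may select the same B token, in which case all of them are averaged into one merged token. The output places unmerged A tokens before the B tokens, so original token order is not preserved; an implementation that needs positional information would have to track indices separately.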
- K-means on encoder output: offline k-means clustering applied to the final encoder output $Z_L$, run once at each training step on the current batch.
- K-means on frozen DINOv2 features: clustering on the teacher's patch features rather than the student's.
- Spatial grid grouping: a fixed 16×16→8×8 spatial pooling that ignores content and groups tokens purely by image position.
- Random partition: tokens are randomly assigned to K groups with a fixed seed per image.
All five variants share the same VAE and VQ branches, alignment loss, and group-aware losses $L_{\text{div}}$ and $L_{\text{cons}}$. Only the source of S differs.
Table A7: Impact of alternative grouping strategies on MergeTok-SB. All variants share the same VAE/VQ branches, alignment loss, and ...
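The two content-agnostic grouping baselines described above, spatial grid grouping and random partition, are simple enough to sketch directly. The code below is a hedged illustration only: the function names, the row-major token layout, and the use of independent uniform draws for the random partition are assumptions, since the excerpt does not specify them.

```python
import random


def spatial_grid_groups(height=16, width=16, pool=2):
    """Group id per token from fixed spatial pooling (ignores image content).

    Assumes tokens are laid out row-major on a height x width grid; with the
    defaults this maps a 16x16 grid onto 8x8 = 64 groups of 4 tokens each.
    """
    groups_per_row = width // pool
    return [
        (row // pool) * groups_per_row + (col // pool)
        for row in range(height)
        for col in range(width)
    ]


def random_partition_groups(num_tokens, num_groups, seed):
    """Random group id per token, reproducible for a fixed per-image seed.

    Draws are independent, so group sizes are not guaranteed to be balanced.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_groups) for _ in range(num_tokens)]
```

Either function returns one group id per token, which is the same kind of assignment that a content-aware strategy would produce, so the downstream losses can stay unchanged while only the grouping source varies.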