MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

Anna Wang; Chen Chen; Cheng Tan; Haonan Lu; Haoqian Wang; Luyuan Zhang; Qingsong Xie; Siyuan Li; Yanhao Zhang; Zedong Wang

arxiv: 2605.30904 · v1 · pith:3KLKFESYnew · submitted 2026-05-29 · 💻 cs.CV

MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

Luyuan Zhang , Siyuan Li , Zedong Wang , Qingsong Xie , Cheng Tan , Anna Wang , Yanhao Zhang , Chen Chen

show 2 more authors

Haonan Lu Haoqian Wang

This is my paper

Pith reviewed 2026-06-28 22:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual tokenizationVAEvector quantizationtoken mergingimage generationdisentangled representationsautoregressive generation

0 comments

The pith

MergeTok unifies continuous VAE and discrete VQ visual tokenizers in one architecture by clustering tokens during encoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MergeTok to address the split between continuous VAEs, which reconstruct images well but have entangled latents, and discrete VQ models, which support generation but train unstably. By clustering similar tokens during encoding, the method creates a structural prior that aligns semantics in the VAE branch and adds group constraints to the VQ branch. This dual signal leads to more organized representations while maintaining reconstruction quality. The result is a tokenizer that works with both autoregressive and diffusion generators and shows lower rFID on ImageNet-256 under matched budgets. A sympathetic reader would care because it offers a path to tokenizers that are simultaneously high-fidelity and generator-friendly without separate models.

Core claim

MergeTok jointly optimizes a continuous VAE and a discrete VQ tokenizer within a shared encoder-decoder by leveraging token merging during encoding to establish a semantic bridge, imposing merged-token semantic alignment on the VAE for disentangled representations and group-wise constraints on the VQ for training stability.

What carries the argument

Token merging during encoding, which clusters similar tokens to supply dual supervision signals for semantic alignment in the VAE and group-wise constraints in the VQ.

If this is right

The VAE branch produces more disentangled and semantic-aware latents from the merged-token alignment.
The VQ branch gains training stability from intra-group diversity and inter-group exclusivity constraints.
The unified model achieves lower rFID than strong VAE and VQ baselines on ImageNet-256 at matched token budgets.
The resulting tokens remain compatible with both autoregressive and diffusion generators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The merging step could be tested as a general bridge for unifying continuous and discrete models in other modalities.
Semantic organization induced by merging might support finer control in editing or conditional generation tasks.
Dual supervision from merging could lower reliance on hand-designed auxiliary losses during tokenizer training.

Load-bearing premise

Token merging during encoding will reliably create dual supervision signals that improve VAE disentanglement and VQ stability without introducing new training instabilities or reconstruction artifacts.

What would settle it

If MergeTok on ImageNet-256 produces higher rFID or codebook collapse than matched separate VAE and VQ baselines, the unification benefit would be refuted.

Figures

Figures reproduced from arXiv: 2605.30904 by Anna Wang, Chen Chen, Cheng Tan, Haonan Lu, Haoqian Wang, Luyuan Zhang, Qingsong Xie, Siyuan Li, Yanhao Zhang, Zedong Wang.

**Figure 1.** Figure 1: (a) Discrete VQ. Features are quantized by nearest-neighbor codebook lookup, but codebook updates can be sparse. (b) Continuous VAE. Features are mapped to continuous Gaussian latents through reparameterization for stable reconstruction. (c) MergeTok. MergeTok adopts a dual-branch design to jointly optimize VAE and VQ tokenization. The VAE branch introduces online token merging to inject semantic structure… view at source ↗

**Figure 2.** Figure 2: Overall Framework of MergeTok. We propose a dual-branch architecture that jointly optimizes continuous and discrete representations with shared encoder and decoder. (i) VAE Branch (Bottom) applies ToMe [1] to extract dense semantic tokens, which are aligned with a teacher model (also equipped with ToMe). The resulting source map is then employed to unmerge the groups back to the full lattice for reconstruc… view at source ↗

**Figure 3.** Figure 3: Semantic Condensing Effects. We visualize PCA-3 components of raw/reconstructed images and the corresponding ToMe source maps to show how MergeTok organizes visual information. The VAE branch is constrained by token-wise aggregation, yielding semantic separability comparable to discrete models. The VQ branch without ToMe shows inherent clustering due to quantization. branch, ToMe is bypassed and S enters o… view at source ↗

**Figure 4.** Figure 4: Reconstruction in both VQ and VAE branches across Token Granularities. We visualize reconstructions from both branches while sweeping the target sampling center K∗ controlled by merge ratio r from #256 (left) to #64 (right). It shows MergeTok’s robustness to varying compression rates. The red marker indicates the optimal kept-token count (K∗ = 128) found during training. 3.3 Improving VQ with VAE-Derived G… view at source ↗

**Figure 5.** Figure 5: Kept token number vs rFID/gFID. With K∗ = 128, MergeTok achieves competitive rFID and gFID. where K∗ denotes a hyperparameter that approximates the dataset’s information density, and {ki} enumerates the admissible kept-token counts; σ controls the dispersion of the discrete Gaussian, and sampling is clipped to the valid set. The corresponding merge ratio is then computed by a scheduling function, r = … view at source ↗

read the original abstract

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within a encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically-organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MergeTok's token-merging bridge between VAE and VQ is a clean idea but the abstract gives no evidence that the merge step itself produces the claimed dual supervision gains.

read the letter

The paper introduces MergeTok, which adds token merging inside a shared encoder-decoder to link a continuous VAE branch with a discrete VQ branch. Clustering similar tokens is meant to regularize the VAE latent space toward semantic alignment while imposing group constraints that stabilize VQ codebook usage.

What the work does reasonably well is frame the complementary weaknesses of the two tokenizer families and propose a single structural prior to address both. The reported competitive rFID on ImageNet-256, lower than matched VAE and VQ baselines under the same token budget, is the concrete result they put forward, and the claim that the resulting tokens work with both autoregressive and diffusion generators is a practical plus.

The soft spots are straightforward. The abstract describes dual supervision from the merge operation but supplies no ablations that isolate the merging step, no training-curve comparisons, and no metrics on disentanglement or per-group codebook diversity. Without those, it is impossible to tell whether the merging itself drives the improvements or whether simply training the two branches together would be enough. The assertion that this happens without new instabilities or reconstruction artifacts is stated but not shown.

This paper is for researchers who build or tune visual tokenizers for generation tasks. A reader already working on VAE-VQ hybrids might pick up the merging concept as worth testing, but anyone expecting ready-to-use results would need the full experiments first.

It deserves peer review because the unification mechanism is distinct enough from prior work to merit expert examination, even though the current evidence is limited to the abstract-level claims.

Referee Report

1 major / 1 minor

Summary. The paper introduces MergeTok, a unified encoder-decoder architecture for visual tokenization that jointly optimizes a continuous VAE branch and a discrete VQ branch by applying token merging during encoding. The merging operation is claimed to create dual supervision signals: semantic alignment that regularizes the VAE latent space toward disentanglement, and group-wise constraints that stabilize VQ codebook usage by promoting intra-group diversity and inter-group exclusivity. Experiments on ImageNet-256 report competitive rFID under matched token budgets, with tokens that are semantically organized and compatible with both autoregressive and diffusion generators.

Significance. If the dual-supervision mechanism holds, the work would meaningfully advance visual tokenization by unifying the complementary strengths of VAE and VQ families within a single model. The reported lower rFID relative to strong baselines and the production of generator-compatible discrete tokens constitute falsifiable experimental evidence; the token-merging bridge is presented as the key innovation enabling semantic organization without sacrificing reconstruction fidelity.

major comments (1)

[Abstract and §4] Abstract and §4 (experimental results): the central claim that token merging supplies dual supervision responsible for semantic organization and VQ stability is load-bearing, yet the manuscript provides no ablation that isolates the merge step (e.g., joint VAE+VQ optimization without merging) nor quantitative metrics such as per-group codebook usage diversity or training-curve stability comparisons. Without these, it remains possible that observed gains arise from joint optimization alone rather than the asserted structural prior.

minor comments (1)

[§3] Notation for the merged-token loss terms is introduced without an explicit equation reference; adding a numbered equation in §3 would clarify how the group-wise constraints are formalized.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address the single major comment below.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (experimental results): the central claim that token merging supplies dual supervision responsible for semantic organization and VQ stability is load-bearing, yet the manuscript provides no ablation that isolates the merge step (e.g., joint VAE+VQ optimization without merging) nor quantitative metrics such as per-group codebook usage diversity or training-curve stability comparisons. Without these, it remains possible that observed gains arise from joint optimization alone rather than the asserted structural prior.

Authors: We agree that an ablation isolating the token-merging operation (e.g., joint VAE+VQ training without merging) would strengthen the central claim. The current results compare MergeTok to standalone VAE and VQ baselines under matched token budgets, but do not directly test joint optimization absent the merge step. We will add this ablation in the revision, together with quantitative metrics for per-group codebook usage diversity (e.g., intra-group entropy) and side-by-side training-curve stability comparisons. These additions will clarify whether the observed improvements in rFID and semantic organization stem specifically from the merging-induced structural prior. revision: yes

Circularity Check

0 steps flagged

No circularity detected; method is a proposed design without self-referential reductions

full rationale

The paper introduces MergeTok as an architectural design that applies token merging to create dual supervision signals between VAE and VQ branches. The abstract and claims describe this as a structural prior imposed by clustering, without any equations, fitted parameters renamed as predictions, or derivations that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the central claims rest on the proposed mechanism rather than renaming known results or smuggling ansatzes. The derivation chain is self-contained as an engineering proposal, consistent with the reader's preliminary score of 2.0 indicating no evaluable circularity from the abstract alone.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5778 in / 938 out tokens · 15896 ms · 2026-06-28T22:40:30.430232+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 11 canonical work pages · 3 internal anchors

[1]

Token merging: Your vit but faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. InInternational Conference on Learning Representations (ICLR),
[2]

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 12

2022
[3]

Softvq-vae: Efficient 1-dimensional continuous tokenizer

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 9, 12

2025
[4]

Deep compression autoencoder for efficient high-resolution diffusion models

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. InInternational Conference on Learning Representations (ICLR), 2025. 3, 9

2025
[5]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models.ArXiv, abs/2403.06764, 2024

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models.ArXiv, abs/2403.06764, 2024. 2

work page arXiv 2024
[6]

Wave-particle (continuous-discrete) dualistic visual tokenization for unified understanding and generation

Yizhu Chen, Chen Ju, Zhicheng Wang, Shuai Xiao, Xu Chen, Jinsong Lan, Xiaoyong Zhu, and Ying Chen. Wave-particle (continuous-discrete) dualistic visual tokenization for unified understanding and generation. InArXiv, 2025. 3, 2

2025
[7]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis.ArXiv, abs/2105.05233,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. 6

2021
[9]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883, June
[10]

Vtbench: A standardized benchmark for visual tokenizers in autoregressive image generation

Gao et al. Vtbench: A standardized benchmark for visual tokenizers in autoregressive image generation. ArXiv, 2024. 3

2024
[11]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 7

2022
[12]

Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens

Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. ArXiv, abs/2501.07730, 2025. 1

work page arXiv 2025
[13]

Auto-encoding variational bayes

Diederik P Kingma. Auto-encoding variational bayes. InInternational Conference on Learning Represen- tations (ICLR), 2013. 1, 9, 12 11

2013
[14]

Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization

Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, and Zhen Lei. Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 1, 3, 8, 2, 7

2025
[15]

Mage: Masked generative encoder to unify representation learning and image synthesis

Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan. Mage: Masked generative encoder to unify representation learning and image synthesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 7

2023
[16]

Autoregressive image generation without vector quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. InConference on Neural Information Processing Systems (NeurIPS), 2024. 9

2024
[17]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. InECCV, 2014. 13

2014
[18]

Coda: Repurposing continuous vaes for discrete tokenization.ArXiv, abs/2503.17760, 2025

Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, and Gao Huang. Coda: Repurposing continuous vaes for discrete tokenization.ArXiv, abs/2503.17760, 2025. 2

work page arXiv 2025
[19]

Atoken: A unified tokenizer for vision.ArXiv, abs/2509.14476, 2025

Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, and Yinfei Yang. Atoken: A unified tokenizer for vision.ArXiv, abs/2509.14476, 2025. URL https: //api.semanticscholar.org/CorpusID:281394841. 3

work page arXiv 2025
[20]

Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

work page arXiv
[21]

Unitok: A unified tokenizer for visual generation and understanding.ArXiv, 2025

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding.ArXiv, 2025. 3, 8, 2

2025
[22]

Boffi, Eric Vanden-Eijnden, and Saining Xie

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision (ECCV), 2024. 6

2024
[23]

Peebles and Saining Xie

William S. Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision (ICCV), pages 4172–4182, 2023. 1, 8, 6

2023
[24]

Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer

Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022. 6

2022
[25]

Flow to the mode: Mode-seeking diffusion autoencoders for state-of-the-art image tokenization

Kyle Sargent, Kyle Hsu, Justin Johnson, Fei-Fei Li, and Jiajun Wu. Flow to the mode: Mode-seeking diffusion autoencoders for state-of-the-art image tokenization. InInternational Conference on Computer Vision (ICCV), 2025. 8

2025
[26]

Scalable image tokenization with index backpropagation quantization.arXiv preprint, 2024

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization.arXiv preprint, 2024. 2, 8

2024
[27]

Autoregres- sive model beats diffusion: Llama for scalable image generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregres- sive model beats diffusion: Llama for scalable image generation. InConference on Neural Information Processing Systems (NeurIPS), 2024. 8, 6, 12

2024
[28]

Visual autoregressive modeling: Scalable image generation via next-scale prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. InConference on Neural Information Processing Systems (NeurIPS), 2024. 7

2024
[29]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. InInternational Conference on Machine Learning (ICML), pages 10347–10357, 2021. 7

2021
[30]

Neural discrete representation learning

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In ArXiv, 2017. 1

2017
[31]

Omnitokenizer: A joint image-video tokenizer for visual generation.arXiv preprint arXiv:2406.09399, 2024

Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation.arXiv preprint arXiv:2406.09399, 2024. 8

work page arXiv 2024
[32]

Bridging continuous and discrete tokens for autoregressive visual generation

Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, and Xihui Liu. Bridging continuous and discrete tokens for autoregressive visual generation. InInternational Conference on Computer Vision (ICCV), 2025. 3 12

2025
[33]

Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation

Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. InInternational Conference on Computer Vision (ICCV), 2025. 8, 6, 9, 13

2025
[34]

Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding

Xu et al. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. ArXiv, 2024. 3

2024
[35]

Vaevq: Enhancing discrete visual tokenization through variational modeling.ArXiv, abs/2511.06863, 2025

Sicheng Yang, Xing Hu, Qiang Wu, and Dawei Yang. Vaevq: Enhancing discrete visual tokenization through variational modeling.ArXiv, abs/2511.06863, 2025. 3, 8

work page arXiv 2025
[36]

Reconstruction vs

Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15703–15712, 2025. 1, 9

2025
[37]

Vector-quantized Image Modeling with Improved VQGAN

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan.arXiv preprint arXiv:2110.04627, 2021. 8

work page internal anchor Pith review Pith/arXiv arXiv 2021
[38]

Magvit: Masked generative video transformer

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10459–10469, 2023. 2, 3

2023
[39]

Ross, Irfan Essa, Yonatan Bisk, Ming Yang, Kevin P

Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming Yang, Kevin P. Murphy, Alexander G. Hauptmann, and Lu Jiang. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. InArXiv, 2023. 3

2023
[40]

An image is worth 32 tokens for reconstruction and generation

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. InConference on Neural Information Processing Systems (NeurIPS), 2024. 3, 8, 9, 12

2024
[41]

Representation alignment for generation: Training diffusion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations (ICLR), 2025. 3, 9

2025
[42]

V2flow: Unifying visual tokenization and large language model vocabularies for autoregressive image generation.ArXiv, abs/2503.07493, 2025

Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, and Biye Li. V2flow: Unifying visual tokenization and large language model vocabularies for autoregressive image generation.ArXiv, abs/2503.07493, 2025. 1

work page arXiv 2025
[43]

Mingtok: A continuous unified tokenizer for autoregressive visual understanding and generation

Zhao et al. Mingtok: A continuous unified tokenizer for autoregressive visual understanding and generation. ArXiv, 2024. 2

2024
[44]

Hita: Holistic tokenizer for autoregressive image generation

Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, and Xiaojuan Qi. Hita: Holistic tokenizer for autoregressive image generation. InInternational Conference on Computer Vision (ICCV), 2025. 8

2025
[45]

Vision foundation models as effective visual tokenizers for autoregressive image generation

Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation. InConference on Neural Information Processing Systems (NeurIPS), 2025. 8

2025
[46]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.ArXiv, abs/2510.11690, 2025. 9 13 Appendix for MergeTok Roadmap The appendix is organized to follow the structure of the main paper, with each part providing deeper coverage of the corresponding main-text section. Part I – Extended Discussion(A...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Evenly partition tokens into two setsAandB(e.g., by odd/even indices)
[48]

For each token in A, find its most similar token in B according to the current-layer attention features
[49]

Select the top pairs for merging under a pre-defined schedule
[50]

After one merge operation, the sequence length is reduced from L to K (K < L ), producing a compressed sequence ZK ∈R K×D that is fed into the next layer

Aggregate features within each selected pair (e.g., by averaging) to form merged tokens. After one merge operation, the sequence length is reduced from L to K (K < L ), producing a compressed sequence ZK ∈R K×D that is fed into the next layer. Token similarity is computed using attention features from the current layer, typically the self-attention keys K...
[51]

K-means on encoder output: offline k-means clustering applied to the final encoder output ZL, run once at each training step on the current batch
[52]

K-means on frozen DINOv2 features: clustering on the teacher’s patch features rather than the student’s
[53]

Spatial grid grouping: a fixed 16×16→8×8 spatial pooling that ignores content and groups tokens purely by image position
[54]

All five variants share the same V AE and VQ branches, alignment loss, and group-aware lossesLdiv andL cons

Random partition: tokens are randomly assigned to K groups with a fixed seed per image. All five variants share the same V AE and VQ branches, alignment loss, and group-aware lossesLdiv andL cons. Only the source ofSdiffers. Table A7:Impact of alternative grouping strategies on MergeTok-SB.All variants share the same V AE/VQ branches, alignment loss, and ...

2017

[1] [1]

Token merging: Your vit but faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. InInternational Conference on Learning Representations (ICLR),

[2] [2]

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 12

2022

[3] [3]

Softvq-vae: Efficient 1-dimensional continuous tokenizer

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 9, 12

2025

[4] [4]

Deep compression autoencoder for efficient high-resolution diffusion models

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. InInternational Conference on Learning Representations (ICLR), 2025. 3, 9

2025

[5] [5]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models.ArXiv, abs/2403.06764, 2024

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models.ArXiv, abs/2403.06764, 2024. 2

work page arXiv 2024

[6] [6]

Wave-particle (continuous-discrete) dualistic visual tokenization for unified understanding and generation

Yizhu Chen, Chen Ju, Zhicheng Wang, Shuai Xiao, Xu Chen, Jinsong Lan, Xiaoyong Zhu, and Ying Chen. Wave-particle (continuous-discrete) dualistic visual tokenization for unified understanding and generation. InArXiv, 2025. 3, 2

2025

[7] [7]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis.ArXiv, abs/2105.05233,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. 6

2021

[9] [9]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883, June

[10] [10]

Vtbench: A standardized benchmark for visual tokenizers in autoregressive image generation

Gao et al. Vtbench: A standardized benchmark for visual tokenizers in autoregressive image generation. ArXiv, 2024. 3

2024

[11] [11]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 7

2022

[12] [12]

Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens

Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. ArXiv, abs/2501.07730, 2025. 1

work page arXiv 2025

[13] [13]

Auto-encoding variational bayes

Diederik P Kingma. Auto-encoding variational bayes. InInternational Conference on Learning Represen- tations (ICLR), 2013. 1, 9, 12 11

2013

[14] [14]

Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization

Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, and Zhen Lei. Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 1, 3, 8, 2, 7

2025

[15] [15]

Mage: Masked generative encoder to unify representation learning and image synthesis

Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan. Mage: Masked generative encoder to unify representation learning and image synthesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 7

2023

[16] [16]

Autoregressive image generation without vector quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. InConference on Neural Information Processing Systems (NeurIPS), 2024. 9

2024

[17] [17]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. InECCV, 2014. 13

2014

[18] [18]

Coda: Repurposing continuous vaes for discrete tokenization.ArXiv, abs/2503.17760, 2025

Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, and Gao Huang. Coda: Repurposing continuous vaes for discrete tokenization.ArXiv, abs/2503.17760, 2025. 2

work page arXiv 2025

[19] [19]

Atoken: A unified tokenizer for vision.ArXiv, abs/2509.14476, 2025

Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, and Yinfei Yang. Atoken: A unified tokenizer for vision.ArXiv, abs/2509.14476, 2025. URL https: //api.semanticscholar.org/CorpusID:281394841. 3

work page arXiv 2025

[20] [20]

Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

work page arXiv

[21] [21]

Unitok: A unified tokenizer for visual generation and understanding.ArXiv, 2025

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding.ArXiv, 2025. 3, 8, 2

2025

[22] [22]

Boffi, Eric Vanden-Eijnden, and Saining Xie

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision (ECCV), 2024. 6

2024

[23] [23]

Peebles and Saining Xie

William S. Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision (ICCV), pages 4172–4182, 2023. 1, 8, 6

2023

[24] [24]

Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer

Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022. 6

2022

[25] [25]

Flow to the mode: Mode-seeking diffusion autoencoders for state-of-the-art image tokenization

Kyle Sargent, Kyle Hsu, Justin Johnson, Fei-Fei Li, and Jiajun Wu. Flow to the mode: Mode-seeking diffusion autoencoders for state-of-the-art image tokenization. InInternational Conference on Computer Vision (ICCV), 2025. 8

2025

[26] [26]

Scalable image tokenization with index backpropagation quantization.arXiv preprint, 2024

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization.arXiv preprint, 2024. 2, 8

2024

[27] [27]

Autoregres- sive model beats diffusion: Llama for scalable image generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregres- sive model beats diffusion: Llama for scalable image generation. InConference on Neural Information Processing Systems (NeurIPS), 2024. 8, 6, 12

2024

[28] [28]

Visual autoregressive modeling: Scalable image generation via next-scale prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. InConference on Neural Information Processing Systems (NeurIPS), 2024. 7

2024

[29] [29]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. InInternational Conference on Machine Learning (ICML), pages 10347–10357, 2021. 7

2021

[30] [30]

Neural discrete representation learning

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In ArXiv, 2017. 1

2017

[31] [31]

Omnitokenizer: A joint image-video tokenizer for visual generation.arXiv preprint arXiv:2406.09399, 2024

Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation.arXiv preprint arXiv:2406.09399, 2024. 8

work page arXiv 2024

[32] [32]

Bridging continuous and discrete tokens for autoregressive visual generation

Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, and Xihui Liu. Bridging continuous and discrete tokens for autoregressive visual generation. InInternational Conference on Computer Vision (ICCV), 2025. 3 12

2025

[33] [33]

Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation

Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. InInternational Conference on Computer Vision (ICCV), 2025. 8, 6, 9, 13

2025

[34] [34]

Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding

Xu et al. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. ArXiv, 2024. 3

2024

[35] [35]

Vaevq: Enhancing discrete visual tokenization through variational modeling.ArXiv, abs/2511.06863, 2025

Sicheng Yang, Xing Hu, Qiang Wu, and Dawei Yang. Vaevq: Enhancing discrete visual tokenization through variational modeling.ArXiv, abs/2511.06863, 2025. 3, 8

work page arXiv 2025

[36] [36]

Reconstruction vs

Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15703–15712, 2025. 1, 9

2025

[37] [37]

Vector-quantized Image Modeling with Improved VQGAN

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan.arXiv preprint arXiv:2110.04627, 2021. 8

work page internal anchor Pith review Pith/arXiv arXiv 2021

[38] [38]

Magvit: Masked generative video transformer

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10459–10469, 2023. 2, 3

2023

[39] [39]

Ross, Irfan Essa, Yonatan Bisk, Ming Yang, Kevin P

Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming Yang, Kevin P. Murphy, Alexander G. Hauptmann, and Lu Jiang. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. InArXiv, 2023. 3

2023

[40] [40]

An image is worth 32 tokens for reconstruction and generation

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. InConference on Neural Information Processing Systems (NeurIPS), 2024. 3, 8, 9, 12

2024

[41] [41]

Representation alignment for generation: Training diffusion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations (ICLR), 2025. 3, 9

2025

[42] [42]

V2flow: Unifying visual tokenization and large language model vocabularies for autoregressive image generation.ArXiv, abs/2503.07493, 2025

Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, and Biye Li. V2flow: Unifying visual tokenization and large language model vocabularies for autoregressive image generation.ArXiv, abs/2503.07493, 2025. 1

work page arXiv 2025

[43] [43]

Mingtok: A continuous unified tokenizer for autoregressive visual understanding and generation

Zhao et al. Mingtok: A continuous unified tokenizer for autoregressive visual understanding and generation. ArXiv, 2024. 2

2024

[44] [44]

Hita: Holistic tokenizer for autoregressive image generation

Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, and Xiaojuan Qi. Hita: Holistic tokenizer for autoregressive image generation. InInternational Conference on Computer Vision (ICCV), 2025. 8

2025

[45] [45]

Vision foundation models as effective visual tokenizers for autoregressive image generation

Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation. InConference on Neural Information Processing Systems (NeurIPS), 2025. 8

2025

[46] [46]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.ArXiv, abs/2510.11690, 2025. 9 13 Appendix for MergeTok Roadmap The appendix is organized to follow the structure of the main paper, with each part providing deeper coverage of the corresponding main-text section. Part I – Extended Discussion(A...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Evenly partition tokens into two setsAandB(e.g., by odd/even indices)

[48] [48]

For each token in A, find its most similar token in B according to the current-layer attention features

[49] [49]

Select the top pairs for merging under a pre-defined schedule

[50] [50]

After one merge operation, the sequence length is reduced from L to K (K < L ), producing a compressed sequence ZK ∈R K×D that is fed into the next layer

Aggregate features within each selected pair (e.g., by averaging) to form merged tokens. After one merge operation, the sequence length is reduced from L to K (K < L ), producing a compressed sequence ZK ∈R K×D that is fed into the next layer. Token similarity is computed using attention features from the current layer, typically the self-attention keys K...

[51] [51]

K-means on encoder output: offline k-means clustering applied to the final encoder output ZL, run once at each training step on the current batch

[52] [52]

K-means on frozen DINOv2 features: clustering on the teacher’s patch features rather than the student’s

[53] [53]

Spatial grid grouping: a fixed 16×16→8×8 spatial pooling that ignores content and groups tokens purely by image position

[54] [54]

All five variants share the same V AE and VQ branches, alignment loss, and group-aware lossesLdiv andL cons

Random partition: tokens are randomly assigned to K groups with a fixed seed per image. All five variants share the same V AE and VQ branches, alignment loss, and group-aware lossesLdiv andL cons. Only the source ofSdiffers. Table A7:Impact of alternative grouping strategies on MergeTok-SB.All variants share the same V AE/VQ branches, alignment loss, and ...

2017