
arxiv: 2605.30904 · v1 · pith:3KLKFESYnew · submitted 2026-05-29 · 💻 cs.CV

MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

Pith reviewed 2026-06-28 22:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual tokenization · VAE · vector quantization · token merging · image generation · disentangled representations · autoregressive generation

The pith

MergeTok unifies continuous VAE and discrete VQ visual tokenizers in one architecture by clustering tokens during encoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MergeTok to address the split between continuous VAEs, which reconstruct images well but have entangled latents, and discrete VQ models, which support generation but train unstably. By clustering similar tokens during encoding, the method creates a structural prior that aligns semantics in the VAE branch and adds group constraints to the VQ branch. This dual signal leads to more organized representations while maintaining reconstruction quality. The result is a tokenizer that works with both autoregressive and diffusion generators and shows lower rFID on ImageNet-256 under matched budgets. A sympathetic reader would care because it offers a path to tokenizers that are simultaneously high-fidelity and generator-friendly without separate models.

Core claim

MergeTok jointly optimizes a continuous VAE and a discrete VQ tokenizer within a shared encoder-decoder by leveraging token merging during encoding to establish a semantic bridge, imposing merged-token semantic alignment on the VAE for disentangled representations and group-wise constraints on the VQ for training stability.

What carries the argument

Token merging during encoding, which clusters similar tokens to supply dual supervision signals for semantic alignment in the VAE and group-wise constraints in the VQ.

If this is right

  • The VAE branch produces more disentangled and semantic-aware latents from the merged-token alignment.
  • The VQ branch gains training stability from intra-group diversity and inter-group exclusivity constraints.
  • The unified model achieves lower rFID than strong VAE and VQ baselines on ImageNet-256 at matched token budgets.
  • The resulting tokens remain compatible with both autoregressive and diffusion generators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The merging step could be tested as a general bridge for unifying continuous and discrete models in other modalities.
  • Semantic organization induced by merging might support finer control in editing or conditional generation tasks.
  • Dual supervision from merging could lower reliance on hand-designed auxiliary losses during tokenizer training.

Load-bearing premise

Token merging during encoding will reliably create dual supervision signals that improve VAE disentanglement and VQ stability without introducing new training instabilities or reconstruction artifacts.

What would settle it

If MergeTok on ImageNet-256 produces higher rFID or codebook collapse than matched separate VAE and VQ baselines, the unification benefit would be refuted.

Figures

Figures reproduced from arXiv: 2605.30904 by Anna Wang, Chen Chen, Cheng Tan, Haonan Lu, Haoqian Wang, Luyuan Zhang, Qingsong Xie, Siyuan Li, Yanhao Zhang, Zedong Wang.

Figure 1: (a) Discrete VQ. Features are quantized by nearest-neighbor codebook lookup, but codebook updates can be sparse. (b) Continuous VAE. Features are mapped to continuous Gaussian latents through reparameterization for stable reconstruction. (c) MergeTok. MergeTok adopts a dual-branch design to jointly optimize VAE and VQ tokenization. The VAE branch introduces online token merging to inject semantic structure…

Figure 2: Overall Framework of MergeTok. We propose a dual-branch architecture that jointly optimizes continuous and discrete representations with shared encoder and decoder. (i) VAE Branch (Bottom) applies ToMe [1] to extract dense semantic tokens, which are aligned with a teacher model (also equipped with ToMe). The resulting source map is then employed to unmerge the groups back to the full lattice for reconstruc…

Figure 3: Semantic Condensing Effects. We visualize PCA-3 components of raw/reconstructed images and the corresponding ToMe source maps to show how MergeTok organizes visual information. The VAE branch is constrained by token-wise aggregation, yielding semantic separability comparable to discrete models. The VQ branch without ToMe shows inherent clustering due to quantization.
… branch, ToMe is bypassed and S enters o…

Figure 4: Reconstruction in both VQ and VAE branches across Token Granularities. We visualize reconstructions from both branches while sweeping the target sampling center K∗ controlled by merge ratio r from #256 (left) to #64 (right). It shows MergeTok’s robustness to varying compression rates. The red marker indicates the optimal kept-token count (K∗ = 128) found during training.
3.3 Improving VQ with VAE-Derived G…

Figure 5: Kept token number vs. rFID/gFID. With K∗ = 128, MergeTok achieves competitive rFID and gFID.
… where K∗ denotes a hyperparameter that approximates the dataset’s information density, and {k_i} enumerates the admissible kept-token counts; σ controls the dispersion of the discrete Gaussian, and sampling is clipped to the valid set. The corresponding merge ratio is then computed by a scheduling function, r = …
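The text accompanying Figure 5 describes drawing the kept-token count from a discrete Gaussian centered on K∗ and restricted to an admissible set. A minimal sketch of that sampling step follows. The grid of admissible counts, the value of σ, and the function name `sample_kept_tokens` are assumptions made for illustration, and "clipped to the valid set" is read here as restricting the support of the distribution. The scheduling function that maps the sampled count to a merge ratio r is truncated in the source and is not reproduced.

```python
import math
import random

def sample_kept_tokens(admissible, k_star, sigma, rng):
    """Draw one kept-token count from a discrete Gaussian over `admissible`."""
    # Unnormalized Gaussian weight for every admissible count k_i.
    weights = [math.exp(-((k - k_star) ** 2) / (2.0 * sigma ** 2)) for k in admissible]
    # random.choices normalizes the weights internally.
    return rng.choices(admissible, weights=weights, k=1)[0]

rng = random.Random(0)
admissible = [64, 96, 128, 160, 192, 224, 256]  # hypothetical grid, not from the paper
draws = [sample_kept_tokens(admissible, k_star=128, sigma=32.0, rng=rng) for _ in range(2000)]
counts = {k: draws.count(k) for k in admissible}
```

With these assumed settings most draws land on or next to K∗ = 128, so training would see a range of compression rates while concentrating on the target.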
Original abstract

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within a shared encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.
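For readers less familiar with the two families the abstract contrasts, the sketch below shows the textbook form of each bottleneck: nearest-neighbor codebook lookup for VQ and Gaussian reparameterization for a VAE. It is a generic illustration, not MergeTok's implementation. The toy codebook, the feature values, and the function names `quantize` and `reparameterize` are assumptions.

```python
import math
import random

def quantize(feature, codebook):
    """Return (index, code) of the codebook entry nearest to `feature` in squared L2."""
    def sq_dist(code):
        return sum((f - c) ** 2 for f, c in zip(feature, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

def reparameterize(mean, log_var, rng):
    """Sample z = mean + std * eps elementwise, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # toy codebook with three entries
index, code = quantize([0.9, 1.2], codebook)

# A very negative log-variance makes the sample collapse onto the mean.
z = reparameterize([0.5, -0.5], [-40.0, -40.0], random.Random(0))
```

Only the selected code receives a learning signal in the lookup, which is the sparse-update issue the abstract mentions. The reparameterized sample stays differentiable with respect to the mean and the log-variance.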

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MergeTok, a unified encoder-decoder architecture for visual tokenization that jointly optimizes a continuous VAE branch and a discrete VQ branch by applying token merging during encoding. The merging operation is claimed to create dual supervision signals: semantic alignment that regularizes the VAE latent space toward disentanglement, and group-wise constraints that stabilize VQ codebook usage by promoting intra-group diversity and inter-group exclusivity. Experiments on ImageNet-256 report competitive rFID under matched token budgets, with tokens that are semantically organized and compatible with both autoregressive and diffusion generators.

Significance. If the dual-supervision mechanism holds, the work would meaningfully advance visual tokenization by unifying the complementary strengths of VAE and VQ families within a single model. The reported lower rFID relative to strong baselines and the production of generator-compatible discrete tokens constitute falsifiable experimental evidence; the token-merging bridge is presented as the key innovation enabling semantic organization without sacrificing reconstruction fidelity.

major comments (1)
  1. [Abstract and §4] Abstract and §4 (experimental results): the central claim that token merging supplies dual supervision responsible for semantic organization and VQ stability is load-bearing, yet the manuscript provides no ablation that isolates the merge step (e.g., joint VAE+VQ optimization without merging) nor quantitative metrics such as per-group codebook usage diversity or training-curve stability comparisons. Without these, it remains possible that observed gains arise from joint optimization alone rather than the asserted structural prior.
minor comments (1)
  1. [§3] Notation for the merged-token loss terms is introduced without an explicit equation reference; adding a numbered equation in §3 would clarify how the group-wise constraints are formalized.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (experimental results): the central claim that token merging supplies dual supervision responsible for semantic organization and VQ stability is load-bearing, yet the manuscript provides no ablation that isolates the merge step (e.g., joint VAE+VQ optimization without merging) nor quantitative metrics such as per-group codebook usage diversity or training-curve stability comparisons. Without these, it remains possible that observed gains arise from joint optimization alone rather than the asserted structural prior.

    Authors: We agree that an ablation isolating the token-merging operation (e.g., joint VAE+VQ training without merging) would strengthen the central claim. The current results compare MergeTok to standalone VAE and VQ baselines under matched token budgets, but do not directly test joint optimization absent the merge step. We will add this ablation in the revision, together with quantitative metrics for per-group codebook usage diversity (e.g., intra-group entropy) and side-by-side training-curve stability comparisons. These additions will clarify whether the observed improvements in rFID and semantic organization stem specifically from the merging-induced structural prior. revision: yes
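The intra-group entropy mentioned in the response can be made concrete. The sketch below computes the Shannon entropy of codebook indices inside each merge group, which is one plausible reading of "per-group codebook usage diversity". The exact metric the authors intend is not specified, so the function name `intra_group_entropy` and the toy inputs are assumptions.

```python
import math
from collections import Counter, defaultdict

def intra_group_entropy(code_indices, group_ids):
    """Shannon entropy in bits of the codebook indices used within each group."""
    by_group = defaultdict(list)
    for code, group in zip(code_indices, group_ids):
        by_group[group].append(code)
    entropies = {}
    for group, codes in by_group.items():
        total = len(codes)
        entropies[group] = -sum(
            (n / total) * math.log2(n / total) for n in Counter(codes).values()
        )
    return entropies

codes = [3, 3, 3, 3, 7, 9, 7, 9]   # codebook index assigned to each token
groups = [0, 0, 0, 0, 1, 1, 1, 1]  # merge group of each token
entropy = intra_group_entropy(codes, groups)
```

Group 0 reuses a single code and scores zero bits, while group 1 splits evenly between two codes and scores one bit. Tracking this value over training would give the stability comparison the referee asks for a quantitative handle.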

Circularity Check

0 steps flagged

No circularity detected; method is a proposed design without self-referential reductions

full rationale

The paper introduces MergeTok as an architectural design that applies token merging to create dual supervision signals between VAE and VQ branches. The abstract and claims describe this as a structural prior imposed by clustering, without any equations, fitted parameters renamed as predictions, or derivations that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the central claims rest on the proposed mechanism rather than renaming known results or smuggling ansatzes. The derivation chain is self-contained as an engineering proposal, consistent with the reader's preliminary score of 2.0 indicating no evaluable circularity from the abstract alone.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5778 in / 938 out tokens · 15896 ms · 2026-06-28T22:40:30.430232+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. In International Conference on Learning Representations (ICLR),

  2. [2] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  3. [3] Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  4. [4] Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. In International Conference on Learning Representations (ICLR), 2025.

  5. [5] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. ArXiv, abs/2403.06764, 2024.

  6. [6] Yizhu Chen, Chen Ju, Zhicheng Wang, Shuai Xiao, Xu Chen, Jinsong Lan, Xiaoyong Zhu, and Ying Chen. Wave-particle (continuous-discrete) dualistic visual tokenization for unified understanding and generation. In ArXiv, 2025.

  7. [7] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. ArXiv, abs/2105.05233,

  8. [8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

  9. [9] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883, June

  10. [10] Gao et al. Vtbench: A standardized benchmark for visual tokenizers in autoregressive image generation. ArXiv, 2024.

  11. [11] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  12. [12] Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. ArXiv, abs/2501.07730, 2025.

  13. [13] Diederik P Kingma. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2013.

  14. [14] Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, and Zhen Lei. Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  15. [15] Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan. Mage: Masked generative encoder to unify representation learning and image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  16. [16] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In Conference on Neural Information Processing Systems (NeurIPS), 2024.

  17. [17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

  18. [18] Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, and Gao Huang. Coda: Repurposing continuous vaes for discrete tokenization. ArXiv, abs/2503.17760, 2025.

  19. [19] Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, and Yinfei Yang. Atoken: A unified tokenizer for vision. ArXiv, abs/2509.14476, 2025. URL https://api.semanticscholar.org/CorpusID:281394841.

  20. [20] Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410,

  21. [21] Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. ArXiv, 2025.

  22. [22] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision (ECCV), 2024.

  23. [23] William S. Peebles and Saining Xie. Scalable diffusion models with transformers. In International Conference on Computer Vision (ICCV), pages 4172–4182, 2023.

  24. [24] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022.

  25. [25] Kyle Sargent, Kyle Hsu, Justin Johnson, Fei-Fei Li, and Jiajun Wu. Flow to the mode: Mode-seeking diffusion autoencoders for state-of-the-art image tokenization. In International Conference on Computer Vision (ICCV), 2025.

  26. [26] Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. arXiv preprint, 2024.

  27. [27] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. In Conference on Neural Information Processing Systems (NeurIPS), 2024.

  28. [28] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In Conference on Neural Information Processing Systems (NeurIPS), 2024.

  29. [29] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), pages 10347–10357, 2021.

  30. [30] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In ArXiv, 2017.

  31. [31] Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. arXiv preprint arXiv:2406.09399, 2024.

  32. [32] Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, and Xihui Liu. Bridging continuous and discrete tokens for autoregressive visual generation. In International Conference on Computer Vision (ICCV), 2025.

  33. [33] Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. In International Conference on Computer Vision (ICCV), 2025.

  34. [34] Xu et al. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. ArXiv, 2024.

  35. [35] Sicheng Yang, Xing Hu, Qiang Wu, and Dawei Yang. Vaevq: Enhancing discrete visual tokenization through variational modeling. ArXiv, abs/2511.06863, 2025.

  36. [36] Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15703–15712, 2025.

  37. [37] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.

  38. [38] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10459–10469, 2023.

  39. [39] Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming Yang, Kevin P. Murphy, Alexander G. Hauptmann, and Lu Jiang. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. In ArXiv, 2023.

  40. [40] Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. In Conference on Neural Information Processing Systems (NeurIPS), 2024.

  41. [41] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations (ICLR), 2025.

  42. [42] Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, and Biye Li. V2flow: Unifying visual tokenization and large language model vocabularies for autoregressive image generation. ArXiv, abs/2503.07493, 2025.

  43. [43] Zhao et al. Mingtok: A continuous unified tokenizer for autoregressive visual understanding and generation. ArXiv, 2024.

  44. [44] Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, and Xiaojuan Qi. Hita: Holistic tokenizer for autoregressive image generation. In International Conference on Computer Vision (ICCV), 2025.

  45. [45] Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation. In Conference on Neural Information Processing Systems (NeurIPS), 2025.

  46. [46] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. ArXiv, abs/2510.11690, 2025.

Appendix for MergeTok

Roadmap. The appendix is organized to follow the structure of the main paper, with each part providing deeper coverage of the corresponding main-text section. Part I – Extended Discussion (A...

  1. Evenly partition tokens into two sets A and B (e.g., by odd/even indices).

  2. For each token in A, find its most similar token in B according to the current-layer attention features.

  3. Select the top pairs for merging under a pre-defined schedule.

  4. Aggregate features within each selected pair (e.g., by averaging) to form merged tokens.

After one merge operation, the sequence length is reduced from L to K (K < L), producing a compressed sequence Z_K ∈ ℝ^{K×D} that is fed into the next layer. Token similarity is computed using attention features from the current layer, typically the self-attention keys K...
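The merge steps listed above can be sketched as a single bipartite matching pass. This is a simplified illustration under stated assumptions: cosine similarity is computed on the token vectors themselves rather than on attention keys, merged tokens are plain averages, and the function names `cosine` and `merge_once` and the toy tokens are hypothetical. The returned source map records which original positions each surviving token absorbed, which is the information an unmerge step would need.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_once(tokens, r):
    """Merge the r most similar (A, B) pairs. Returns (merged_tokens, source_map)."""
    a_idx = range(0, len(tokens), 2)  # step 1: even positions form set A
    b_idx = range(1, len(tokens), 2)  # step 1: odd positions form set B
    # Step 2: best partner in B for every token in A.
    matches = []
    for i in a_idx:
        best = max(b_idx, key=lambda b: cosine(tokens[i], tokens[b]))
        matches.append((cosine(tokens[i], tokens[best]), i, best))
    # Step 3: keep only the r highest-scoring pairs.
    matches.sort(reverse=True)
    merged_into = {i: b for _, i, b in matches[:r]}
    # Step 4: average each surviving token with the A tokens merged into it.
    members = {k: [k] for k in range(len(tokens)) if k not in merged_into}
    for i, b in merged_into.items():
        members[b].append(i)
    dim = len(tokens[0])
    merged, source_map = [], []
    for k in sorted(members):
        group = sorted(members[k])
        merged.append([sum(tokens[m][d] for m in group) / len(group) for d in range(dim)])
        source_map.append(group)
    return merged, source_map

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0], [-1.0, 0.5]]
merged, source_map = merge_once(tokens, r=2)
```

Here six tokens shrink to four: positions 0 and 1 merge, positions 2 and 3 merge, and the two dissimilar tokens pass through unchanged.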

  • K-means on encoder output: offline k-means clustering applied to the final encoder output Z_L, run once at each training step on the current batch.

  • K-means on frozen DINOv2 features: clustering on the teacher’s patch features rather than the student’s.

  • Spatial grid grouping: a fixed 16×16→8×8 spatial pooling that ignores content and groups tokens purely by image position.

  • Random partition: tokens are randomly assigned to K groups with a fixed seed per image.

All five variants share the same VAE and VQ branches, alignment loss, and group-aware losses L_div and L_cons. Only the source of S differs.

Table A7: Impact of alternative grouping strategies on MergeTok-SB. All variants share the same VAE/VQ branches, alignment loss, and ...
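Two of the content-agnostic grouping baselines above are simple enough to sketch directly. The code below builds group assignments for the fixed 16×16→8×8 spatial pooling and for a seeded random partition. The function names, the row-major token order, and the choice of 64 random groups are assumptions for illustration. How the paper encodes the grouping S is not shown in this excerpt.

```python
import random

def spatial_grid_groups(side=16, pooled=8):
    """Group id per token of a side x side grid (row-major), pooled to pooled x pooled."""
    block = side // pooled  # each group covers a block x block patch of tokens
    return [(row // block) * pooled + (col // block)
            for row in range(side) for col in range(side)]

def random_groups(num_tokens, num_groups, image_seed):
    """Assign every token to a uniformly random group, reproducibly per image."""
    rng = random.Random(image_seed)
    return [rng.randrange(num_groups) for _ in range(num_tokens)]

grid = spatial_grid_groups()
rand_a = random_groups(256, 64, image_seed=7)
rand_b = random_groups(256, 64, image_seed=7)  # same seed gives the same partition
```

The spatial variant always yields 64 groups of four neighboring tokens. The random variant gives no such guarantee: with uniform assignment some groups can end up empty or uneven in size.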