TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders
Pith reviewed 2026-05-10 17:39 UTC · model grok-4.3
The pith
Decomposing token-to-latent compression into two stages plus joint self-supervised training allows effective token scaling in ViT autoencoders without latent collapse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TC-AE identifies aggressive token-to-latent compression as the factor that limits token-number scaling and causes latent representation collapse. It decomposes this compression into two stages to reduce structural information loss, enabling effective token-number scaling. In addition, joint self-supervised training enhances the semantic structure of image tokens, producing more generative-friendly latents. As a result, the model achieves substantially better reconstruction and generative performance under deep compression ratios.
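The compression ratios at stake can be made concrete with simple patch arithmetic. The image size, patch sizes, and latent budget below are illustrative assumptions, not the paper's reported settings:

```python
# Illustrative arithmetic: ViT token count vs. patch size under a fixed latent
# budget. The 256x256 image, the patch sizes, and the 64-token latent budget
# are assumptions for illustration only.

def token_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches (tokens) a ViT produces."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

LATENT_BUDGET = 64  # fixed number of latent tokens, regardless of patch size

for patch in (32, 16, 8):
    tokens = token_count(256, patch)
    ratio = tokens / LATENT_BUDGET  # token-to-latent compression ratio
    print(f"patch {patch:2d}: {tokens:4d} tokens -> {LATENT_BUDGET} latents "
          f"({ratio:.0f}x token-to-latent compression)")
```

Halving the patch size quadruples the token count, so under a fixed latent budget the token-to-latent compression ratio grows quadratically — which is why a single aggressive compression step becomes the scaling bottleneck.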
What carries the argument
Two-stage decomposition of token-to-latent compression combined with joint self-supervised training to enhance token semantics.
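A minimal sketch of the decomposition idea, under assumed shapes (1024 input tokens, 768 channels, 64 latents), with plain average pooling and random projections standing in for the learned modules — this is not the paper's architecture, only an illustration of splitting one aggressive compression step into two gentler ones:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1024, 768))  # 1024 tokens, 768 channels (assumed)

# Single-stage baseline: one linear map collapsing 1024 tokens into 64 latents
# in a single aggressive 16x jump.
W_single = rng.standard_normal((1024, 64)) / np.sqrt(1024)
latents_single = W_single.T @ tokens       # (64, 768)

# Stage 1 (spatial): merge each 4x4 neighborhood of the 32x32 token grid,
# reducing 1024 tokens to 64 while keeping the channel dimension intact.
grid = tokens.reshape(32, 32, 768)
merged = grid.reshape(8, 4, 8, 4, 768).mean(axis=(1, 3)).reshape(64, 768)

# Stage 2 (channel): compress the channel dimension of the merged tokens.
W_chan = rng.standard_normal((768, 32)) / np.sqrt(768)
latents_two_stage = merged @ W_chan        # (64, 32)
```

The point of the sketch is structural: stage 1 preserves local spatial aggregation before stage 2 reduces channels, whereas the single-stage map must absorb both reductions at once.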
Load-bearing premise
That the primary bottleneck is in the token-to-latent compression stage and that splitting it plus self-supervision will reliably avoid collapse without introducing new failure modes.
What would settle it
Observing whether models using the two-stage approach show higher structural information retention or better FID scores in generation compared to single-stage baselines at the same compression ratio.
read the original abstract
We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Secondly, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TC-AE, a ViT-based deep compression autoencoder that targets latent representation collapse by shifting focus to the token space. It studies token-number scaling via patch-size adjustment under a fixed latent budget, identifies aggressive single-stage token-to-latent compression as the bottleneck, and introduces a two-stage decomposition to reduce structural information loss while enabling scaling. A second innovation adds joint self-supervised training to strengthen the semantic structure of image tokens, yielding more generative-friendly latents. The central claim is that these changes deliver substantially better reconstruction and generative performance under deep compression.
Significance. If the empirical gains hold, the work provides a practical route to scale token capacity in ViT autoencoders without channel inflation or multi-stage training, which could improve tokenizers used in downstream visual generation pipelines and reduce the efficiency penalty of aggressive compression.
minor comments (2)
- Abstract: the claim of 'substantially improved reconstruction and generative performance' is stated without any numerical results, baseline comparisons, or compression ratios, which weakens the immediate force of the contribution even though the full manuscript reportedly contains the supporting experiments.
- Method section (two-stage decomposition): while the high-level rationale is clear, the precise interface between the two stages, the loss weighting between reconstruction and SSL objectives, and the exact patch-size schedules used in the token-scaling study should be stated with equations or a diagram to ensure reproducibility.
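For illustration, one plausible shape such a joint objective could take. The MSE reconstruction term, the cosine-similarity SSL term, and the weight `lambda_ssl` are all assumptions, since the review does not specify the paper's losses:

```python
import numpy as np

# Hedged sketch of a joint reconstruction + self-supervised (SSL) objective.
# The loss forms and the default lambda_ssl = 0.1 are illustrative assumptions.

def cosine_ssl_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """1 - mean cosine similarity between student and teacher token features."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(1.0 - (s * t).sum(axis=-1).mean())

def joint_loss(recon, target, student_feats, teacher_feats, lambda_ssl=0.1):
    """Weighted sum of pixel reconstruction and token-level SSL alignment."""
    recon_loss = float(np.mean((recon - target) ** 2))      # pixel MSE
    ssl_loss = cosine_ssl_loss(student_feats, teacher_feats)
    return recon_loss + lambda_ssl * ssl_loss
```

Making `lambda_ssl` (and any learning-rate reduction applied to the SSL branch) explicit in the manuscript would address the reproducibility concern above.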
Simulated Author's Rebuttal
We thank the referee for the positive summary of our work and the recommendation for minor revision. The referee accurately captures the core contributions of TC-AE in addressing token-to-latent compression bottlenecks and latent collapse via two-stage decomposition and joint self-supervised training. We will prepare a revised manuscript to incorporate any minor suggestions.
Circularity Check
No significant circularity detected
full rationale
The paper's central claims rest on two explicit architectural proposals (two-stage token-to-latent decomposition under fixed latent budget, plus joint self-supervised training on tokens) whose benefits are asserted via comparative experiments rather than by re-deriving or fitting the same quantities that were used to motivate them. No equations, parameters, or uniqueness theorems are shown to reduce to the inputs by construction, and no load-bearing premise is justified solely by self-citation. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973.
- [2] Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, and Chunhua Shen. HieraTok: Multi-scale visual tokenizer improves image reconstruction and generation. arXiv preprint arXiv:2509.23736, 2025a. Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha R...
- [3] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [4] Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. arXiv preprint arXiv:2501.09755.
- [5] Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, et al. DC-Gen: Post-training diffusion acceleration with deeply compressed latent space. arXiv preprint arXiv:2509.25180.
- [6] Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, et al. Ming-UniVision: Joint image understanding and generation with a unified continuous tokenizer. arXiv preprint arXiv:2510.06590.
- [7] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
- [8] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025.
- [9] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794.
- [10] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.
- [11] Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, et al. SANA 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427.
- [12] Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. GigaTok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. arXiv preprint arXiv:2504.08736, 2025.
- [13] Jingfeng Yao, Yuda Song, Yucong Zhou, and Xinggang Wang. Towards scalable pre-training of visual tokenizers for generation. arXiv preprint arXiv:2512.13687, 2025a. Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recogniti...
- [14] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690.
- [15] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.
- [16] Self-supervision training. We adopt the same data augmentation pipelines as in the corresponding self-supervised learning methods (Caron et al., 2021; Zhou et al., 2021; Oquab et al., 2023). To ensure stable joint optimization with the reconstruction objective, we reduce the learning rate. Adversarial training. We adopt the adversarial training setup from R...