pith. machine review for the scientific record.

arxiv: 2604.24885 · v1 · submitted 2026-04-27 · 💻 cs.CV · cs.LG

Recognition: unknown

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:09 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG
keywords autoregressive image generation · image tokenization · dynamic resolution · transformer tokenizer · efficient generation · class-conditioned generation · arbitrary aspect ratios

The pith

VibeToken encodes images into 32-256 dynamic tokens, enabling autoregressive generation at any resolution with fixed low compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors seek to demonstrate that autoregressive image models can approach the quality of diffusion models while supporting arbitrary resolutions and aspect ratios without a corresponding rise in cost. Their approach replaces rigid token grids with a 1D transformer tokenizer that outputs a short, user-chosen sequence of tokens for any input size. If the tokenizer preserves the necessary detail, downstream autoregressive sampling becomes practical for high-resolution work because both token count and FLOPs stay bounded. The reported outcome is 1024×1024 synthesis from 64 tokens at a constant 179G FLOPs, versus much higher costs for prior fixed-resolution autoregressive systems.

Core claim

VibeToken is a resolution-agnostic 1D Transformer-based image tokenizer that converts images of any size into a controllable sequence of 32 to 256 tokens. VibeToken-Gen, the class-conditioned autoregressive model built on it, generates images at arbitrary resolutions and aspect ratios. It reaches 3.94 gFID on 1024x1024 outputs using only 64 tokens, compared with a diffusion baseline that requires 1024 tokens for 5.87 gFID, while holding inference cost fixed at 179G FLOPs regardless of resolution.
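
These numbers are easy to sanity-check with napkin arithmetic. A minimal sketch, assuming 16-pixel patches for the 2D baseline (an assumption; the text here does not state LlamaGen's exact patching):

```python
# Token-count arithmetic only; none of these ratios are the paper's
# measurements. A 2D grid tokenizer emits (H/16) * (W/16) tokens, so the
# sequence the AR model attends over grows quadratically with side length.
def grid_tokens(h, w, patch=16):
    return (h // patch) * (w // patch)

for side in (256, 512, 1024):
    print(f"{side}x{side}: grid={grid_tokens(side, side)} tokens vs fixed 64")
# 256x256 -> 256 tokens, 512x512 -> 1024, 1024x1024 -> 4096. The plain
# token-count ratio at 1024x1024 is 4096 / 64 = 64x, close to the paper's
# measured end-to-end 63.4x (11T vs 179G FLOPs), consistent with per-token
# compute dominating; attention's quadratic term would widen the gap further.
```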

What carries the argument

VibeToken, the 1D Transformer tokenizer that produces a dynamic, user-controllable sequence of 32-256 tokens from images of varying resolution and aspect ratio.
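
As a concrete illustration of how a fixed-length 1D bottleneck can sit over a variable patch grid, here is a minimal PyTorch sketch assuming a learned-query cross-attention design. The class and its internals are hypothetical; the paper's actual components (dynamic grid position embedding, dynamic patch embedding, MVQ quantization) are omitted:

```python
import torch
import torch.nn as nn

class Latent1DTokenizer(nn.Module):
    """Illustrative fixed-output tokenizer: any (H, W) in, exactly k tokens out."""

    def __init__(self, patch=16, dim=256, max_tokens=256, heads=8):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Linear(3 * patch * patch, dim)
        # Learned latent queries; slicing the first k gives a user-chosen budget.
        self.latents = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img, k=64):
        # img: (B, 3, H, W) with H and W divisible by the patch size.
        B, C, H, W = img.shape
        p = self.patch
        # Flatten the image into a variable-length patch sequence.
        x = img.unfold(2, p, p).unfold(3, p, p)              # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.patch_embed(x)                              # (B, N, dim), N varies
        # Cross-attend k learned queries over the N patches: the output length
        # is k no matter what N is.
        q = self.latents[:k].expand(B, -1, -1)
        tokens, _ = self.cross_attn(q, x, x)
        return tokens                                        # (B, k, dim)

# Two very different input sizes, identical output shape:
tok = Latent1DTokenizer()
assert tok(torch.randn(1, 3, 256, 256), k=64).shape == (1, 64, 256)
assert tok(torch.randn(1, 3, 1024, 768), k=64).shape == (1, 64, 256)
```

The assertions at the end are the property the whole argument rides on: the sequence the AR model sees is set by k, not by the input size or aspect ratio.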

If this is right

  • Autoregressive generators can now operate at any resolution without quadratic growth in inference FLOPs.
  • High-resolution synthesis requires far fewer tokens than current diffusion pipelines while matching or exceeding their quality.
  • Constant compute cost across scales makes production deployment of autoregressive visual models feasible.
  • Class-conditioned generation works directly for varying aspect ratios without retraining or padding tricks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dynamic tokenization idea could be tested on video sequences where frame count and spatial size both vary.
  • Training the tokenizer on a broad mix of resolutions might improve robustness beyond the resolutions evaluated here.
  • The efficiency advantage could allow autoregressive models to reach resolutions where diffusion methods become computationally prohibitive.

Load-bearing premise

The 1D tokenizer can encode and decode images at any resolution and aspect ratio using only a short dynamic token sequence without losing fine details required for high-quality autoregressive generation.
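
The decode direction makes the premise concrete. Below is a hedged sketch of one way an adaptive-resolution decoder could work, with patch queries built from continuous grid coordinates cross-attending over the latent tokens; this illustrates the idea under assumed design choices (coordinate-conditioned queries, a single cross-attention layer), not the paper's implementation:

```python
import torch
import torch.nn as nn

class AdaptiveResolutionDecoder(nn.Module):
    """Illustrative decoder: k latent tokens in, any (out_h, out_w) image out."""

    def __init__(self, patch=16, dim=256, heads=8):
        super().__init__()
        self.patch = patch
        self.pos = nn.Linear(2, dim)                   # continuous (y, x) -> query
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)

    def forward(self, tokens, out_h, out_w):
        # tokens: (B, k, dim); out_h, out_w: any multiples of the patch size.
        B, p = tokens.shape[0], self.patch
        gh, gw = out_h // p, out_w // p
        ys, xs = torch.linspace(0, 1, gh), torch.linspace(0, 1, gw)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        q = self.pos(grid.reshape(1, -1, 2)).expand(B, -1, -1)  # one query per patch
        out, _ = self.cross_attn(q, tokens, tokens)    # (B, gh*gw, dim)
        x = self.to_pixels(out).reshape(B, gh, gw, 3, p, p)
        return x.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, out_h, out_w)
```

Whether one read of 64 tokens can carry enough high-frequency detail to fill every patch at 1024×1024 is exactly the premise under test. Note also that the decoder's query count necessarily grows with output size, so the constant-FLOPs property can only be a statement about the AR stage.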

What would settle it

High-resolution images produced from 64 tokens exhibit visible artifacts or higher perceptual error rates than diffusion models that use many more tokens on the same benchmarks.

Figures

Figures reproduced from arXiv: 2604.24885 by Jingtao Li, Lingjuan Lv, Maitreya Patel, Weiming Zhuang, Yezhou Yang.

Figure 1
Figure 1: VibeToken-Gen image generation examples. A single resolution-generalist visual autoregressive model trained on ImageNet1k that can synthesize arbitrary resolution and aspect ratio images. The examples shown are from various resolutions between 256×256 and 1024×1024. VibeToken-Gen leverages our novel resolution-agnostic 1D visual tokenizer, VibeToken, that can efficiently encode any resolution into as few a… view at source ↗
Figure 2
Figure 2: Compute comparisons. Across resolution and metrics, VibeToken achieves strong efficiency (fewer tokens/FLOPs) compared to 2D baselines. This, in turn, leads to VibeToken-Gen being significantly more efficient with a constant 179 GFLOPs. view at source ↗
Figure 3
Figure 3: Overview of VibeToken tokenizer. VibeToken introduces four key components: 1) Dynamic grid position embedding, 2) Dynamic patch embedding, 3) Adaptive decoder resolution, and 4) Variable latent token-based resolution-agnostic encoding. These components provide full flexibility to 1D tokenizers, supporting arbitrary resolutions and compute controls. view at source ↗
Figure 4
Figure 4: Dynamic token length. rFID vs. latent length L at 256². VibeToken is the only model that pairs competitive quality with resolution generalization. view at source ↗
read the original abstract

We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VibeToken, a resolution-agnostic 1D Transformer-based image tokenizer that encodes arbitrary-resolution and aspect-ratio images into a dynamic, user-controllable sequence of 32-256 tokens. It then presents VibeToken-Gen, a class-conditioned autoregressive generator built on this tokenizer that supports out-of-the-box arbitrary resolutions, achieves 3.94 gFID on 1024x1024 images using only 64 tokens (versus 5.87 gFID for a diffusion baseline using 1024 tokens), and maintains constant 179G FLOPs independent of resolution (63.4x more efficient than LlamaGen at 1024x1024).

Significance. If the central claims hold after verification, the work would meaningfully advance autoregressive image synthesis by removing the fixed-resolution and quadratic-complexity barriers that have kept AR models behind diffusion approaches at high resolutions. The dynamic tokenization and constant-FLOPs property could enable practical high-resolution generation in production settings where compute budgets are constrained.

major comments (2)
  1. [Abstract] Abstract: The headline efficiency and quality claims (3.94 gFID at 1024x1024 with 64 tokens, constant 179G FLOPs) rest on the unverified premise that the 1D Transformer tokenizer faithfully reconstructs high-frequency detail across resolutions without artifacts; no PSNR, LPIPS, or token-count-vs-resolution ablation is reported to support this.
  2. [VibeToken] VibeToken architecture section: The constant-FLOPs claim for VibeToken-Gen requires that the tokenizer's patch embedding and positional scheme produce a fixed-length sequence independent of input resolution and aspect ratio; if any component (e.g., adaptive pooling or resolution-dependent downsampling) scales with image size, the reported 63.4x efficiency gain over LlamaGen would not hold at 1024x1024.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly stated the training dataset size, model parameter counts for both tokenizer and generator, and the exact diffusion baseline used for the gFID comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of our claims that warrant clarification and additional support. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline efficiency and quality claims (3.94 gFID at 1024x1024 with 64 tokens, constant 179G FLOPs) rest on the unverified premise that the 1D Transformer tokenizer faithfully reconstructs high-frequency detail across resolutions without artifacts; no PSNR, LPIPS, or token-count-vs-resolution ablation is reported to support this.

    Authors: We acknowledge that the abstract emphasizes generation results and does not include explicit reconstruction metrics for VibeToken. To address this, the revised manuscript will add a dedicated subsection on tokenizer reconstruction quality. This will report PSNR and LPIPS scores across multiple resolutions and token counts (32-256), along with an ablation studying the trade-off between token count, reconstruction fidelity, and downstream generation performance. These additions will directly substantiate the tokenizer's ability to preserve high-frequency details. revision: yes

  2. Referee: [VibeToken] VibeToken architecture section: The constant-FLOPs claim for VibeToken-Gen requires that the tokenizer's patch embedding and positional scheme produce a fixed-length sequence independent of input resolution and aspect ratio; if any component (e.g., adaptive pooling or resolution-dependent downsampling) scales with image size, the reported 63.4x efficiency gain over LlamaGen would not hold at 1024x1024.

    Authors: VibeToken is designed such that the output sequence length is user-specified and independent of input resolution or aspect ratio. The architecture processes a variable number of patches from the input image but employs a fixed-capacity 1D Transformer bottleneck that always produces exactly K tokens (K chosen in [32, 256]), with resolution-agnostic positional encodings. This ensures VibeToken-Gen always operates on a fixed token sequence (e.g., 64 tokens), yielding constant 179G FLOPs regardless of resolution. We will expand the architecture section with a clearer description, pseudocode, and a diagram illustrating the fixed-output mechanism to eliminate any ambiguity. revision: yes
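
Pending the promised pseudocode, the mechanism the authors describe reduces to a few lines. This is a hedged sketch, not the paper's sampler: ar_model and decoder are placeholder callables (the cross-attention decoder sketched earlier would fit the second), and real sampling would add the quantized codebooks and any guidance.

```python
import torch

@torch.no_grad()
def generate(ar_model, decoder, class_id, k=64, out_h=1024, out_w=1024, dim=256):
    """Sample exactly k latent tokens, then decode once at the target size."""
    tokens = torch.zeros(1, 0, dim)                  # grows to exactly k tokens
    for _ in range(k):
        nxt = ar_model(tokens, class_id)             # next-token prediction -> (1, 1, dim)
        tokens = torch.cat([tokens, nxt], dim=1)     # length never exceeds k
    # Every AR step sees at most k tokens, so the loop's cost is set by k and
    # the model width alone; out_h and out_w enter only at the decode call.
    return decoder(tokens, out_h, out_w)
```

If the architecture matches this shape, the rebuttal's claim follows directly: 179G FLOPs is a function of k = 64 and the generator's size, not of the output resolution.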

Circularity Check

0 steps flagged

No circularity; efficiency and performance claims are empirical design outcomes, not reductions by construction.

full rationale

The abstract and provided text introduce VibeToken as a 1D Transformer tokenizer producing dynamic 32-256 token sequences for arbitrary resolutions, followed by VibeToken-Gen as a class-conditioned AR model. Reported results (3.94 gFID at 1024x1024 with 64 tokens, constant 179G FLOPs vs. LlamaGen's 11T) are presented as measured comparisons to baselines, not as outputs of any derivation or equation chain. Constant FLOPs follow directly from the fixed-length token sequence design choice, which is the stated innovation rather than a fitted or renamed input. No equations, self-citations, uniqueness theorems, or ansatzes appear that would make any claim equivalent to its inputs by construction. The chain is self-contained via architecture and external empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the central claim implicitly rests on the unstated assumption that a 1D transformer can serve as a lossless-enough image compressor for variable token budgets.

pith-pipeline@v0.9.0 · 5551 in / 1241 out tokens · 48872 ms · 2026-05-08T04:09:48.029400+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 27 canonical work pages · 8 internal anchors

  1. [1]

    Flextok: Resampling images into 1d token sequences of flexible length

    Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, and Afshin Dehghan. Flextok: Resampling images into 1d token sequences of flexible length. In Forty-second International Conference on Machine Learning, 2025.

  2. [2]

    Flexivit: One model for all patch sizes

    Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. Flexivit: One model for all patch sizes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14496–14506, 2023.

  3. [3]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022.

  4. [4]

    Masked autoencoders are effective tokenizers for diffusion models

    Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. ArXiv, abs/2502.03444, 2025.

  5. [5]

    Softvq-vae: Efficient 1-dimensional continuous tokenizer

    Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28358–28370, 2025.

  6. [6]

    Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  9. [9]

    Adaptive length image tokenization via recurrent allocation

    Shivam Duggal, Phillip Isola, Antonio Torralba, and William T Freeman. Adaptive length image tokenization via recurrent allocation. In First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models, 2024.

  10. [10]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.

  11. [11]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  12. [12]

    Vitar: Vision transformer with any resolution

    Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. Vitar: Vision transformer with any resolution. arXiv preprint arXiv:2403.18361, 2024.

  13. [13]

    Rotary position embedding for vision transformer

    Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In European Conference on Computer Vision, pages 289–305. Springer, 2024.

  14. [14]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  15. [15]

    Analyzing and improving the training dynamics of diffusion models

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024.

  16. [16]

    Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens

    Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. arXiv preprint arXiv:2501.07730, 2025.

  17. [17]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11523–11532, 2022.

  18. [18]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems, 37:56424–56445, 2024.

  19. [19]

    Imagefolder: Autoregressive image generation with folded tokens

    Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens. arXiv preprint arXiv:2410.01756, 2024.

  20. [20]

    Flow matching for generative modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

  21. [21]

    Detailflow: 1d coarse-to-fine autoregressive image generation via next-detail prediction

    Yiheng Liu, Liao Qu, Huichao Zhang, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Xian Li, Shuai Wang, Daniel K Du, et al. Detailflow: 1d coarse-to-fine autoregressive image generation via next-detail prediction. arXiv preprint arXiv:2505.21473, 2025.

  22. [22]

    Atoken: A unified tokenizer for vision

    Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, and Yinfei Yang. Atoken: A unified tokenizer for vision. ArXiv, abs/2509.14476, 2025.

  23. [23]

    Fit: Flexible vision transformer for diffusion model

    Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, and Lei Bai. Fit: Flexible vision transformer for diffusion model. arXiv preprint arXiv:2402.12376, 2024.

  24. [24]

    Open-magvit2: An open-source project toward democratizing auto-regressive visual generation

    Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024.

  25. [25]

    Unitok: A unified tokenizer for visual generation and understanding

    Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321, 2025.

  26. [26]

    One-d-piece: Image tokenizer meets quality-controllable compression

    Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, and Yu Yamaguchi. One-d-piece: Image tokenizer meets quality-controllable compression. arXiv preprint arXiv:2501.10064, 2025.

  27. [27]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

  28. [28]

    Eclipse: A resource-efficient text-to-image prior for image generations

    Maitreya Patel, Changhoon Kim, Sheng Cheng, Chitta Baral, and Yezhou Yang. Eclipse: A resource-efficient text-to-image prior for image generations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9069–9078, 2024.

  29. [29]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

  30. [30]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

  31. [31]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.

  32. [32]

    Image tokenizer needs post-training

    Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, and Marios Savvides. Image tokenizer needs post-training. ArXiv, abs/2509.12474, 2025.

  33. [33]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.

  34. [34]

    Stretching each dollar: Diffusion training from scratch on a micro-budget

    Vikash Sehwag, Xianghao Kong, Jingtao Li, Michael Spranger, and Lingjuan Lyu. Stretching each dollar: Diffusion training from scratch on a micro-budget. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28596–28608, 2025.

  35. [35]

    Scalable image tokenization with index backpropagation quantization

    Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025.

  36. [36]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.

  37. [37]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

  38. [38]

    Nextstep-1: Toward autoregressive image generation with continuous tokens at scale

    NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale. arXiv preprint arXiv:2508.10711, 2025.

  39. [39]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2024.

  40. [40]

    Resformer: Scaling vits with multi-resolution training

    Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, and Yu-Gang Jiang. Resformer: Scaling vits with multi-resolution training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22721–22731, 2023.

  41. [41]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.

  42. [42]

    Fitv2: Scalable and improved flexible vision transformer for diffusion model

    ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, et al. Fitv2: Scalable and improved flexible vision transformer for diffusion model. arXiv preprint arXiv:2410.13925, 2024.

  43. [43]

    Native-resolution image synthesis

    Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, and Yiyuan Zhang. Native-resolution image synthesis. arXiv preprint arXiv:2506.03131, 2025.

  44. [44]

    Instella-t2i: Pushing the limits of 1d discrete latent space image generation

    Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, and Zicheng Liu. Instella-t2i: Pushing the limits of 1d discrete latent space image generation. arXiv preprint arXiv:2506.21022, 2025.

  45. [45]

    Maskbit: Embedding-free image generation via bit tokens

    Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. Maskbit: Embedding-free image generation via bit tokens. arXiv preprint arXiv:2409.16211, 2024.

  46. [46]

    Vila-u: a unified foundation model integrating visual understanding and generation

    Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024.

  47. [47]

    Dc-ar: Efficient masked autoregressive image generation with deep compression hybrid tokenizer

    Yecheng Wu, Han Cai, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, and Song Han. Dc-ar: Efficient masked autoregressive image generation with deep compression hybrid tokenizer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18034–18045, 2025.

  48. [48]

    Layton: Latent consistency tokenizer for 1024-pixel image reconstruction and generation by 256 tokens

    Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, and Zhenyu Yang. Layton: Latent consistency tokenizer for 1024-pixel image reconstruction and generation by 256 tokens. ArXiv, abs/2503.08377, 2025.

  49. [49]

    Vector-quantized image modeling with improved vqgan

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.

  50. [50]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.

  51. [51]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion -- tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.

  52. [52]

    An image is worth 32 tokens for reconstruction and generation

    Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems, 37:128940–128966, 2024.

  53. [53]

    Randomized autoregressive visual generation

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18431–18441, 2025.

  54. [54]

    Argus: A compact and versatile foundation model for vision

    Weiming Zhuang, Chen Chen, Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Jiabo Huang, Vikash Sehwag, Vivek Sharma, Hirotaka Shinozaki, Felan Carlo Garcia, et al. Argus: A compact and versatile foundation model for vision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4418–4429, 2025.
