pith. machine review for the scientific record.

arxiv: 2605.00503 · v2 · submitted 2026-05-01 · 💻 cs.CV · cs.LG

Recognition: unknown

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Bingliang Zhang, Jiaqi Han, Linjie Yang, Qiushan Guo, Wenda Chu, Yisong Yue, Yizhuo Li

Pith reviewed 2026-05-09 19:37 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords autoregressive image generation · 1D semantic tokenizer · end-to-end training · visual tokenization · ImageNet generation · FID score · joint optimization · generative models

The pith

Jointly optimizing a 1D semantic tokenizer with autoregressive generation yields state-of-the-art image quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an end-to-end training pipeline for autoregressive image models that jointly optimizes the visual tokenizer and the generator. This setup lets the generation loss directly shape the tokenizer, unlike prior two-stage methods that train each component separately. The authors also explore using vision foundation models to strengthen the 1D token representations. If the joint approach holds up, it produces latent sequences better suited to sequential prediction, as evidenced by an FID of 1.48 on ImageNet 256×256 without guidance. Readers care because the method streamlines training while improving synthesis fidelity on standard benchmarks.

Core claim

The authors establish that an end-to-end pipeline jointly optimizing reconstruction and generation losses for a 1D semantic tokenizer allows direct supervision from generation results back to the tokenizer. They argue this produces latent spaces better suited to autoregressive modeling than independently trained alternatives, delivering an FID score of 1.48 without guidance on ImageNet 256×256 generation.

What carries the argument

The end-to-end training pipeline that jointly optimizes reconstruction and generation losses on the 1D semantic tokenizer, which compresses images into compact sequences for autoregressive modeling.
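
To make the mechanism concrete, here is a minimal sketch of such a joint objective in PyTorch, assuming a VQ-style tokenizer whose encode returns both straight-through embeddings and discrete ids; the interfaces, the MSE reconstruction term, and the weight lambda_gen are illustrative assumptions, not the paper's actual implementation.

    import torch.nn.functional as F

    def joint_step(tokenizer, ar_model, images, optimizer, lambda_gen=1.0):
        # Hypothetical interface: a VQ tokenizer returning straight-through
        # embeddings z (B, L, D) and discrete codebook ids (B, L).
        z, ids = tokenizer.encode(images)
        # Reconstruction loss: decode the 1D sequence back to pixels.
        loss_recon = F.mse_loss(tokenizer.decode(z), images)
        # Generation loss: next-token prediction over the same sequence.
        # Because z carries gradients (straight-through estimator), this
        # loss also supervises the tokenizer's encoder, which is exactly
        # the end-to-end signal that two-stage pipelines lack.
        logits = ar_model(z[:, :-1])                      # (B, L-1, vocab)
        loss_gen = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   ids[:, 1:].reshape(-1))
        loss = loss_recon + lambda_gen * loss_gen
        optimizer.zero_grad()
        loss.backward()                                   # one backward pass trains both
        optimizer.step()
        return loss_recon.item(), loss_gen.item()
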

If this is right

  • The tokenizer receives direct feedback from generation quality, aligning its representations more closely with autoregressive prediction needs.
  • Incorporating vision foundation models further refines the 1D token sequences for modeling.
  • The unified pipeline reaches an FID of 1.48 on ImageNet 256×256 without external guidance (the FID computation itself is sketched after this list).
  • Training simplifies relative to two-stage approaches that separate tokenizer and generator optimization.
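
For readers checking the headline number: FID is the Fréchet distance between Gaussians fitted to Inception-v3 features of real and generated images [13]. A minimal computation of the closed form, with feature extraction omitted:

    import numpy as np
    from scipy.linalg import sqrtm

    def fid(feats_real, feats_fake):
        # Each input: (N, 2048) array of Inception-v3 pool features.
        mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_f = np.cov(feats_fake, rowvar=False)
        covmean = sqrtm(cov_r @ cov_f)           # matrix square root
        if np.iscomplexobj(covmean):             # tiny imaginary parts are numerical noise
            covmean = covmean.real
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
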

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same joint-optimization pattern could extend to video or 3D generation where sequence length grows rapidly.
  • One could test whether the resulting tokenizers transfer better to downstream tasks like editing or inpainting.
  • Scaling the approach to higher resolutions might reveal whether the 1D compression remains efficient as image detail increases.

Load-bearing premise

Joint optimization of reconstruction and generation losses produces a tokenizer whose latent space is meaningfully better for autoregressive modeling than independently trained tokenizers without introducing new failure modes in the 1D sequence modeling.

What would settle it

A controlled experiment in which an independently trained tokenizer of identical architecture achieves equal or lower FID when paired with the same autoregressive model would falsify the claimed benefit of the joint optimization.
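
A sketch of how that control could be run, with everything except the tokenizer's training signal held fixed; train_tokenizer, train_ar, and evaluate_fid are hypothetical helpers, not functions from the paper:

    def settle_it(cfg, data, budget, seed):
        # Arm A: two-stage baseline. Tokenizer trained on reconstruction
        # alone, then frozen while the AR model trains on its tokens.
        tok_a = train_tokenizer(cfg, data, budget, seed)
        ar_a = train_ar(cfg, tok_a, data, budget, seed, freeze_tokenizer=True)
        # Arm B: end-to-end. Identical architecture, data, and budget,
        # but the generation loss backpropagates into the tokenizer.
        tok_b = train_tokenizer(cfg, data, budget, seed)
        ar_b = train_ar(cfg, tok_b, data, budget, seed, freeze_tokenizer=False)
        fid_two_stage = evaluate_fid(ar_a, tok_a, data)
        fid_end_to_end = evaluate_fid(ar_b, tok_b, data)
        # fid_two_stage <= fid_end_to_end would falsify the claimed benefit.
        return {"two_stage": fid_two_stage, "end_to_end": fid_end_to_end}
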

read the original abstract

Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256x256 generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes an end-to-end autoregressive image generation pipeline centered on a 1D semantic tokenizer. The tokenizer is trained jointly by optimizing both a reconstruction loss and a generation loss, allowing direct supervision from the downstream autoregressive model back to the tokenizer. This is contrasted with prior two-stage pipelines that train the tokenizer and generator independently. The approach additionally leverages vision foundation models to improve the 1D tokenizer. The central empirical claim is a state-of-the-art FID score of 1.48 on ImageNet 256×256 generation without classifier-free guidance.

Significance. If the reported FID improvement is attributable to the joint optimization rather than scale or data choices, the work would be significant for simplifying the training of autoregressive image models and for demonstrating that generation-aware tokenization can yield better latent spaces for next-token prediction. The 1D formulation and foundation-model integration are concrete design choices that could be adopted more broadly.

major comments (2)
  1. Abstract and §4 (Experiments): The claim of a state-of-the-art FID of 1.48 is presented without accompanying ablations that isolate the contribution of joint reconstruction+generation training from other factors such as model capacity, training data volume, or the specific vision foundation model used. Without these controls, it is impossible to verify the central hypothesis that end-to-end optimization is the load-bearing reason for the reported performance.
  2. §3 (Method): The joint loss is described as combining reconstruction and generation objectives, yet no quantitative analysis is provided showing that the resulting latent space improves autoregressive modeling (e.g., lower perplexity, better codebook utilization, or reduced mode collapse) compared with an independently trained tokenizer of identical architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to include the requested analyses.

read point-by-point responses
  1. Referee: Abstract and §4 (Experiments): The claim of a state-of-the-art FID of 1.48 is presented without accompanying ablations that isolate the contribution of joint reconstruction+generation training from other factors such as model capacity, training data volume, or the specific vision foundation model used. Without these controls, it is impossible to verify the central hypothesis that end-to-end optimization is the load-bearing reason for the reported performance.

    Authors: We agree that isolating the effect of joint optimization requires explicit controls. While the manuscript compares against prior two-stage methods, it does not include a matched ablation with an independently trained tokenizer under identical capacity, data volume, and foundation-model usage. In the revision we will add such an ablation study, reporting FID and other metrics for both the end-to-end and two-stage settings with all other factors held fixed, thereby directly testing the contribution of joint training. revision: yes

  2. Referee: §3 (Method): The joint loss is described as combining reconstruction and generation objectives, yet no quantitative analysis is provided showing that the resulting latent space improves autoregressive modeling (e.g., lower perplexity, better codebook utilization, or reduced mode collapse) compared with an independently trained tokenizer of identical architecture.

    Authors: We acknowledge the value of direct quantitative evidence on latent-space quality. The current manuscript relies on the downstream FID improvement as indirect support. To address this, the revised version will include side-by-side measurements—autoregressive perplexity, codebook utilization statistics, and mode-coverage indicators—between the jointly trained tokenizer and an independently trained counterpart of identical architecture and training budget. revision: yes
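
The latent-space diagnostics promised here are cheap to define. A minimal sketch of codebook utilization and code-usage perplexity, assuming access to the discrete token ids the tokenizer emits on a held-out set; AR perplexity would simply be the exponential of the model's next-token cross-entropy on the same ids:

    import numpy as np

    def codebook_stats(token_ids, codebook_size):
        # token_ids: flat int array of codes emitted on a held-out set.
        counts = np.bincount(token_ids, minlength=codebook_size).astype(np.float64)
        utilization = float((counts > 0).mean())      # fraction of codes ever used
        p = counts / counts.sum()
        entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
        usage_perplexity = float(np.exp(entropy))     # effective codebook size
        return utilization, usage_perplexity
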

Circularity Check

0 steps flagged

No significant circularity in empirical pipeline

full rationale

The paper's central claim is an empirical FID score of 1.48 on ImageNet 256x256 from an end-to-end joint optimization of reconstruction and generation losses for a 1D semantic tokenizer plus autoregressive model. No derivation chain, equations, or first-principles results are presented that reduce a claimed prediction or uniqueness result to fitted inputs or self-citations by construction. The tokenizer design choices and joint training are concrete, externally evaluable mechanisms whose performance is measured directly rather than asserted via internal redefinition or renaming of known patterns. This is a standard empirical ML paper whose results stand or fall on the reported experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions of autoregressive modeling and tokenizer reconstruction losses. No new physical entities or unproven mathematical axioms are introduced; the main additions are architectural choices and the joint training objective.

free parameters (2)
  • tokenizer codebook size and embedding dimension
    Chosen to balance reconstruction quality and sequence length for AR modeling.
  • joint training loss weights between reconstruction and generation
    Hyperparameters that control how much the generation signal affects the tokenizer.
axioms (1)
  • domain assumption: Autoregressive factorization of the joint token distribution is a valid generative model for images
    Standard assumption in AR image modeling literature.
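
Spelled out, the assumption is the standard chain-rule factorization over the tokenizer's 1D sequence t_1, ..., t_L, and the generation loss is its negative log-likelihood:

    p_\theta(t_1, \dots, t_L) = \prod_{i=1}^{L} p_\theta(t_i \mid t_{<i}),
    \qquad \mathcal{L}_{\text{gen}} = -\sum_{i=1}^{L} \log p_\theta(t_i \mid t_{<i})
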

pith-pipeline@v0.9.0 · 5404 in / 1296 out tokens · 17919 ms · 2026-05-09T19:37:48.684800+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 25 canonical work pages · 11 internal anchors

  1. [1]

    Flextok: Resampling images into 1d token sequences of flexible length

    Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, and Afshin Dehghan. Flextok: Resampling images into 1d token sequences of flexible length. In Forty-Second International Conference on Machine Learning, 2025

  2. [2]

    VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

    Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457, 2025

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  4. [4]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  5. [5]

    Aligning visual foundation encoders to tokenizers for diffusion models

    Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162, 2025

  6. [6]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  8. [8]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https:/...

  10. [10]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  11. [11]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023

  12. [12]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014

  13. [13]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  14. [14]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  15. [15]

    Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization

    Mengqi Huang, Zhendong Mao, Zhuowei Chen, and Yongdong Zhang. Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22596–22605, 2023

  16. [16]

    Spectralar: Spectral autoregressive visual generation

    Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, and Jiwen Lu. Spectralar: Spectral autoregressive visual generation. arXiv preprint arXiv:2506.10962, 2025

  17. [17]

    Guiding a diffusion model with a bad version of itself

    Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37:52996–53021, 2024

  18. [18]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  19. [19]

    Auto-encoding variational bayes, 2013

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013

  20. [20]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022

  21. [21]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025

  22. [22]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems, 37:56424–56445, 2024

  23. [23]

    Finite scalar quantization: VQ-VAE made simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023

  24. [24]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  25. [25]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  26. [26]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021

  27. [27]

    Generating diverse high-fidelity images with vq-vae-2

    Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019

  28. [28]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  29. [29]

    Improved techniques for training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016

  30. [30]

    Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis

    Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. In International conference on machine learning, pages 30105–30118. PMLR, 2023

  31. [31]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  32. [32]

    Scalable image tokenization with index backpropagation quantization

    Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025

  33. [33]

    Latent diffusion model without variational autoencoder

    Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025

  34. [34]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien ...

  35. [35]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  36. [36]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  37. [37]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2024

  38. [38]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features....

  39. [39]

    Regularizing generative adversarial networks under limited data

    Hung-Yu Tseng, Lu Jiang, Ce Liu, Ming-Hsuan Yang, and Weilong Yang. Regularizing generative adversarial networks under limited data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7921–7931, 2021

  40. [40]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  41. [41]

    Semanticist: PCA-guided visual tokenization

    Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, and Xiaojuan Qi. "Principal components" enable a new language of images. arXiv preprint arXiv:2503.08685, 2025

  42. [42]

    AliTok: Towards sequence modeling alignment between tokenizer and autoregressive model

    Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, and Zheng-Jun Zha. Alitok: Towards sequence modeling alignment between tokenizer and autoregressive model. arXiv preprint arXiv:2506.05289, 2025

  43. [43]

    Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation

    Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. arXiv preprint arXiv:2504.08736, 2025

  44. [44]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models, 2025. URL https://arxiv.org/abs/2501.01423

  45. [45]

    Vector-quantized image modeling with improved VQGAN

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021

  46. [46]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

  47. [47]

    Randomized autoregressive visual generation

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18431–18441, 2025

  48. [48]

    An image is worth 32 tokens for reconstruction and generation

    Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems, 37:128940–128966, 2025

  49. [49]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025

  50. [50]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019

  51. [51]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  52. [52]

    Restok: Learning hierarchical residuals in 1d visual tokenizers for autoregressive image generation

    Xu Zhang, Cheng Da, Huan Yang, Kun Gai, Ming Lu, and Zhan Ma. Restok: Learning hierarchical residuals in 1d visual tokenizers for autoregressive image generation. arXiv preprint arXiv:2601.03955, 2026

  53. [53]

    Vision foundation models as effective visual tokenizers for autoregressive image generation

    Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation. arXiv preprint arXiv:2507.08441, 2025

  54. [54]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

  55. [55]

    Fast training of diffusion models with masked transformers

    Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023

  56. [56]

    Scaling the codebook size of VQ-GAN to 100,000 with a utilization rate of 99%

    Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vq-gan to 100,000 with a utilization rate of 99%. Advances in Neural Information Processing Systems, 37:12612–12635, 2024