pith. machine review for the scientific record.

arxiv: 2605.00503 · v2 · submitted 2026-05-01 · 💻 cs.CV · cs.LG

Recognition: unknown

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Bingliang Zhang, Jiaqi Han, Linjie Yang, Qiushan Guo, Wenda Chu, Yisong Yue, Yizhuo Li

Pith reviewed 2026-05-09 19:37 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords autoregressive image generation · 1D semantic tokenizer · end-to-end training · visual tokenization · ImageNet generation · FID score · joint optimization · generative models

The pith

Jointly optimizing a 1D semantic tokenizer with autoregressive generation yields state-of-the-art image quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an end-to-end training pipeline for autoregressive image models that jointly optimizes the visual tokenizer and the generator. This setup lets the generation loss directly shape the tokenizer, unlike prior two-stage methods that train each component separately. The authors also explore using vision foundation models to strengthen the 1D token representations. If the joint approach holds up, it produces latent sequences better suited to sequential prediction, as evidenced by an FID of 1.48 on ImageNet 256×256 without guidance. Readers care because the method streamlines training while improving synthesis fidelity on standard benchmarks.

Core claim

The authors establish that an end-to-end pipeline jointly optimizing reconstruction and generation losses for a 1D semantic tokenizer allows direct supervision from generation results back to the tokenizer. They argue this produces latent spaces better suited to autoregressive modeling than independently trained alternatives, delivering an FID score of 1.48 without guidance on ImageNet 256×256 generation.

What carries the argument

The end-to-end training pipeline that jointly optimizes reconstruction and generation losses on the 1D semantic tokenizer, which compresses images into compact sequences for autoregressive modeling.
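
To make the mechanism concrete, here is a minimal sketch of such a joint objective in PyTorch, assuming a VQ-style tokenizer whose encode returns both straight-through embeddings and discrete ids; the interfaces, the MSE reconstruction term, and the weight lambda_gen are illustrative assumptions, not the paper's actual implementation.

    import torch.nn.functional as F

    def joint_step(tokenizer, ar_model, images, optimizer, lambda_gen=1.0):
        # Hypothetical interface: a VQ tokenizer returning straight-through
        # embeddings z (B, L, D) and discrete codebook ids (B, L).
        z, ids = tokenizer.encode(images)
        # Reconstruction loss: decode the 1D sequence back to pixels.
        loss_recon = F.mse_loss(tokenizer.decode(z), images)
        # Generation loss: next-token prediction over the same sequence.
        # Because z carries gradients (straight-through estimator), this
        # loss also supervises the tokenizer's encoder, which is exactly
        # the end-to-end signal that two-stage pipelines lack.
        logits = ar_model(z[:, :-1])                      # (B, L-1, vocab)
        loss_gen = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   ids[:, 1:].reshape(-1))
        loss = loss_recon + lambda_gen * loss_gen
        optimizer.zero_grad()
        loss.backward()                                   # one backward pass trains both
        optimizer.step()
        return loss_recon.item(), loss_gen.item()
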

If this is right

  • The tokenizer receives direct feedback from generation quality, aligning its representations more closely with autoregressive prediction needs.
  • Incorporating vision foundation models further refines the 1D token sequences for modeling.
  • The unified pipeline reaches an FID of 1.48 on ImageNet 256×256 without external guidance (the FID computation itself is sketched after this list).
  • Training simplifies relative to two-stage approaches that separate tokenizer and generator optimization.
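
For readers checking the headline number: FID is the Fréchet distance between Gaussians fitted to Inception-v3 features of real and generated images [13]. A minimal computation of the closed form, with feature extraction omitted:

    import numpy as np
    from scipy.linalg import sqrtm

    def fid(feats_real, feats_fake):
        # Each input: (N, 2048) array of Inception-v3 pool features.
        mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_f = np.cov(feats_fake, rowvar=False)
        covmean = sqrtm(cov_r @ cov_f)           # matrix square root
        if np.iscomplexobj(covmean):             # tiny imaginary parts are numerical noise
            covmean = covmean.real
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
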

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same joint-optimization pattern could extend to video or 3D generation where sequence length grows rapidly.
  • One could test whether the resulting tokenizers transfer better to downstream tasks like editing or inpainting.
  • Scaling the approach to higher resolutions might reveal whether the 1D compression remains efficient as image detail increases.

Load-bearing premise

Joint optimization of reconstruction and generation losses produces a tokenizer whose latent space is meaningfully better for autoregressive modeling than independently trained tokenizers without introducing new failure modes in the 1D sequence modeling.

What would settle it

A controlled experiment in which an independently trained tokenizer of identical architecture achieves equal or lower FID when paired with the same autoregressive model would falsify the claimed benefit of the joint optimization.
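
A sketch of how that control could be run, with everything except the tokenizer's training signal held fixed; train_tokenizer, train_ar, and evaluate_fid are hypothetical helpers, not functions from the paper:

    def settle_it(cfg, data, budget, seed):
        # Arm A: two-stage baseline. Tokenizer trained on reconstruction
        # alone, then frozen while the AR model trains on its tokens.
        tok_a = train_tokenizer(cfg, data, budget, seed)
        ar_a = train_ar(cfg, tok_a, data, budget, seed, freeze_tokenizer=True)
        # Arm B: end-to-end. Identical architecture, data, and budget,
        # but the generation loss backpropagates into the tokenizer.
        tok_b = train_tokenizer(cfg, data, budget, seed)
        ar_b = train_ar(cfg, tok_b, data, budget, seed, freeze_tokenizer=False)
        fid_two_stage = evaluate_fid(ar_a, tok_a, data)
        fid_end_to_end = evaluate_fid(ar_b, tok_b, data)
        # fid_two_stage <= fid_end_to_end would falsify the claimed benefit.
        return {"two_stage": fid_two_stage, "end_to_end": fid_end_to_end}
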

read the original abstract

Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256x256 generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes an end-to-end autoregressive image generation pipeline centered on a 1D semantic tokenizer. The tokenizer is trained jointly by optimizing both a reconstruction loss and a generation loss, allowing direct supervision from the downstream autoregressive model back to the tokenizer. This is contrasted with prior two-stage pipelines that train the tokenizer and generator independently. The approach additionally leverages vision foundation models to improve the 1D tokenizer. The central empirical claim is a state-of-the-art FID score of 1.48 on ImageNet 256×256 generation without classifier-free guidance.

Significance. If the reported FID improvement is attributable to the joint optimization rather than scale or data choices, the work would be significant for simplifying the training of autoregressive image models and for demonstrating that generation-aware tokenization can yield better latent spaces for next-token prediction. The 1D formulation and foundation-model integration are concrete design choices that could be adopted more broadly.

major comments (2)
  1. Abstract and §4 (Experiments): The claim of a state-of-the-art FID of 1.48 is presented without accompanying ablations that isolate the contribution of joint reconstruction+generation training from other factors such as model capacity, training data volume, or the specific vision foundation model used. Without these controls, it is impossible to verify the central hypothesis that end-to-end optimization is the load-bearing reason for the reported performance.
  2. §3 (Method): The joint loss is described as combining reconstruction and generation objectives, yet no quantitative analysis is provided showing that the resulting latent space improves autoregressive modeling (e.g., lower perplexity, better codebook utilization, or reduced mode collapse) compared with an independently trained tokenizer of identical architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to include the requested analyses.

read point-by-point responses
  1. Referee: Abstract and §4 (Experiments): The claim of a state-of-the-art FID of 1.48 is presented without accompanying ablations that isolate the contribution of joint reconstruction+generation training from other factors such as model capacity, training data volume, or the specific vision foundation model used. Without these controls, it is impossible to verify the central hypothesis that end-to-end optimization is the load-bearing reason for the reported performance.

    Authors: We agree that isolating the effect of joint optimization requires explicit controls. While the manuscript compares against prior two-stage methods, it does not include a matched ablation with an independently trained tokenizer under identical capacity, data volume, and foundation-model usage. In the revision we will add such an ablation study, reporting FID and other metrics for both the end-to-end and two-stage settings with all other factors held fixed, thereby directly testing the contribution of joint training. revision: yes

  2. Referee: §3 (Method): The joint loss is described as combining reconstruction and generation objectives, yet no quantitative analysis is provided showing that the resulting latent space improves autoregressive modeling (e.g., lower perplexity, better codebook utilization, or reduced mode collapse) compared with an independently trained tokenizer of identical architecture.

    Authors: We acknowledge the value of direct quantitative evidence on latent-space quality. The current manuscript relies on the downstream FID improvement as indirect support. To address this, the revised version will include side-by-side measurements—autoregressive perplexity, codebook utilization statistics, and mode-coverage indicators—between the jointly trained tokenizer and an independently trained counterpart of identical architecture and training budget. revision: yes
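
The latent-space diagnostics promised here are cheap to define. A minimal sketch of codebook utilization and code-usage perplexity, assuming access to the discrete token ids the tokenizer emits on a held-out set; AR perplexity would simply be the exponential of the model's next-token cross-entropy on the same ids:

    import numpy as np

    def codebook_stats(token_ids, codebook_size):
        # token_ids: flat int array of codes emitted on a held-out set.
        counts = np.bincount(token_ids, minlength=codebook_size).astype(np.float64)
        utilization = float((counts > 0).mean())      # fraction of codes ever used
        p = counts / counts.sum()
        entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
        usage_perplexity = float(np.exp(entropy))     # effective codebook size
        return utilization, usage_perplexity
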

Circularity Check

0 steps flagged

No significant circularity in empirical pipeline

full rationale

The paper's central claim is an empirical FID score of 1.48 on ImageNet 256x256 from an end-to-end joint optimization of reconstruction and generation losses for a 1D semantic tokenizer plus autoregressive model. No derivation chain, equations, or first-principles results are presented that reduce a claimed prediction or uniqueness result to fitted inputs or self-citations by construction. The tokenizer design choices and joint training are concrete, externally evaluable mechanisms whose performance is measured directly rather than asserted via internal redefinition or renaming of known patterns. This is a standard empirical ML paper whose results stand or fall on the reported experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions of autoregressive modeling and tokenizer reconstruction losses. No new physical entities or unproven mathematical axioms are introduced; the main additions are architectural choices and the joint training objective.

free parameters (2)
  • tokenizer codebook size and embedding dimension
    Chosen to balance reconstruction quality and sequence length for AR modeling.
  • joint training loss weights between reconstruction and generation
    Hyperparameters that control how much the generation signal affects the tokenizer.
axioms (1)
  • domain assumption: Autoregressive factorization of the joint token distribution is a valid generative model for images
    Standard assumption in AR image modeling literature.
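
Spelled out, the assumption is the standard chain-rule factorization over the tokenizer's 1D sequence t_1, ..., t_L, and the generation loss is its negative log-likelihood:

    p_\theta(t_1, \dots, t_L) = \prod_{i=1}^{L} p_\theta(t_i \mid t_{<i}),
    \qquad \mathcal{L}_{\text{gen}} = -\sum_{i=1}^{L} \log p_\theta(t_i \mid t_{<i})
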

pith-pipeline@v0.9.0 · 5404 in / 1296 out tokens · 17919 ms · 2026-05-09T19:37:48.684800+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 25 canonical work pages · 11 internal anchors

  1. [1]

    Flextok: Resampling images into 1d token sequences of flexible length

    Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, and Afshin Dehghan. Flextok: Resampling images into 1d token sequences of flexible length. In Forty-Second International Conference on Machine Learning, 2025

  2. [2]

    VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

    Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457, 2025

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  4. [4]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  5. [5]

    Aligning visual foundation encoders to tokenizers for diffusion models

    Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162, 2025

  6. [6]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  8. [8]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https:/...

  10. [10]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  11. [11]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023

  12. [12]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014

  13. [13]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  14. [14]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  15. [15]

    Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization

    Mengqi Huang, Zhendong Mao, Zhuowei Chen, and Yongdong Zhang. Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22596–22605, 2023

  16. [16]

    Spectralar: Spectral autoregressive visual generation

    Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, and Jiwen Lu. Spectralar: Spectral autoregressive visual generation. arXiv preprint arXiv:2506.10962, 2025

  17. [17]

    Guiding a diffusion model with a bad version of itself

    Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37:52996–53021, 2024

  18. [18]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  19. [19]

    Auto-encoding variational bayes, 2013

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013

  20. [20]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022

  21. [21]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025

  22. [22]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems, 37:56424–56445, 2024

  23. [23]

    Finite scalar quantization: VQ-VAE made simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023

  24. [24]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  25. [25]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  26. [26]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021

  27. [27]

    Generating diverse high-fidelity images with vq-vae-2

    Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019

  28. [28]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  29. [29]

    Improved techniques for training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016

  30. [30]

    Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis

    Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. In International conference on machine learning, pages 30105–30118. PMLR, 2023

  31. [31]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  32. [32]

    Scalable image tokenization with index backpropagation quantization

    Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025

  33. [33]

    Latent diffusion model without variational autoencoder

    Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025

  34. [34]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien ...

  35. [35]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  36. [36]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  37. [37]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2024

  38. [38]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features....

  39. [39]

    Regularizing generative adversarial networks under limited data

    Hung-Yu Tseng, Lu Jiang, Ce Liu, Ming-Hsuan Yang, and Weilong Yang. Regularizing generative adversarial networks under limited data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7921–7931, 2021

  40. [40]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  41. [41]

    Semanticist: PCA-guided visual tokenization

    Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, and Xiaojuan Qi. "Principal components" enable a new language of images. arXiv preprint arXiv:2503.08685, 2025

  42. [42]

    AliTok: Towards sequence modeling alignment between tokenizer and autoregressive model

    Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, and Zheng-Jun Zha. Alitok: Towards sequence modeling alignment between tokenizer and autoregressive model. arXiv preprint arXiv:2506.05289, 2025

  43. [43]

    Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation

    Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. arXiv preprint arXiv:2504.08736, 2025

  44. [44]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models, 2025. URL https://arxiv.org/abs/2501.01423

  45. [45]

    Vector-quantized image modeling with improved VQGAN

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021

  46. [46]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

  47. [47]

    Randomized autoregressive visual generation

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18431–18441, 2025

  48. [48]

    An image is worth 32 tokens for reconstruction and generation

    Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems, 37:128940–128966, 2025

  49. [49]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025

  50. [50]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019

  51. [51]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  52. [52]

    Restok: Learning hierarchical residuals in 1d visual tokenizers for autoregressive image generation

    Xu Zhang, Cheng Da, Huan Yang, Kun Gai, Ming Lu, and Zhan Ma. Restok: Learning hierarchical residuals in 1d visual tokenizers for autoregressive image generation. arXiv preprint arXiv:2601.03955, 2026

  53. [53]

    Vision foundation models as effective visual tokenizers for autoregressive image generation

    Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation. arXiv preprint arXiv:2507.08441, 2025

  54. [54]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

  55. [55]

    Fast training of diffusion models with masked transformers

    Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023

  56. [56]

    Scaling the codebook size of VQ-GAN to 100,000 with a utilization rate of 99%

    Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vq-gan to 100,000 with a utilization rate of 99%. Advances in Neural Information Processing Systems, 37:12612–12635, 2024