FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
Pith reviewed 2026-05-13 05:51 UTC · model grok-4.3
The pith
FlashAR adapts pre-trained autoregressive image models for parallel decoding via a branched vertical head and fusion gate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlashAR retains the original autoregressive head as a horizontal predictor for row-wise tokens and adds a lightweight vertical head branched from an intermediate layer for column-wise tokens. These predictions are combined at each position through a learnable fusion gate whose weights reflect the varying importance of horizontal and vertical dependencies. A two-stage post-training pipeline first adapts the vertical head alone and then jointly tunes it with the backbone, enabling the model to support parallel decoding while staying close to the original training objective.
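The branch-and-fuse mechanism described above can be sketched minimally. This is an illustrative reconstruction, not the paper's code: all names, dimensions, and the choice of a scalar sigmoid gate per position are our assumptions.

```python
import numpy as np

def fused_logits(final_hidden, mid_hidden, W_h, W_v, w_g):
    # Horizontal logits come from the final hidden state (the retained AR
    # head); vertical logits come from an intermediate-layer hidden state
    # (the new lightweight head); a learnable sigmoid gate blends the two
    # per position. Gate parameterization is assumed, not taken from the paper.
    h = final_hidden @ W_h                           # row-wise prediction
    v = mid_hidden @ W_v                             # column-wise prediction
    g = 1.0 / (1.0 + np.exp(-(final_hidden @ w_g)))  # gate in (0, 1), shape (T, 1)
    return g * h + (1.0 - g) * v                     # fused per-position logits

rng = np.random.default_rng(0)
T, d, V = 16, 64, 256  # tokens, hidden width, codebook size (all illustrative)
out = fused_logits(
    rng.normal(size=(T, d)), rng.normal(size=(T, d)),
    rng.normal(size=(d, V)) / np.sqrt(d), rng.normal(size=(d, V)) / np.sqrt(d),
    rng.normal(size=(d, 1)) / np.sqrt(d),
)
print(out.shape)  # (16, 256)
```

Because the gate is computed from the hidden state at each position, the horizontal/vertical weighting can vary across the image, which is exactly the position-wise flexibility the core claim attributes to the fusion gate.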
What carries the argument
A learnable fusion gate that dynamically combines the retained horizontal autoregressive head with a new vertical head branched from an intermediate layer of the pre-trained network.
If this is right
- Existing autoregressive models can be accelerated without designing a new generation paradigm or pre-training from scratch.
- Parallel token prediction becomes feasible while the learned prior from the original raster-scan objective is largely retained.
- Adaptation requires only 0.05 percent of the original training data through the two-stage pipeline.
- Speedups of up to 22.9 times are achieved for 512x512 image generation on models such as LlamaGen and Emu3.5.
- The relative importance of horizontal and vertical predictions can be learned position-wise without fixed rules.
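To see why two-way prediction buys parallelism at all, consider the step count on a token grid. The wavefront schedule below is our assumption for illustration, not necessarily FlashAR's exact decoding order:

```python
def decoding_steps(n):
    # Raster-scan AR emits one token per step: n*n sequential steps for an
    # n-by-n grid. If two-way (row- plus column-wise) prediction lets every
    # token whose left and upper neighbours are already known be emitted in
    # the same step -- an anti-diagonal wavefront schedule, assumed here --
    # the grid clears in 2n - 1 steps.
    return n * n, 2 * n - 1

seq, par = decoding_steps(32)  # e.g. a 32x32 token grid for a 512x512 image
print(seq, par, round(seq / par, 1))  # 1024 63 16.3
```

The resulting ~16x reduction in sequential steps is the right order of magnitude for the reported speedups, with the remainder presumably coming from implementation-level factors the paper would detail.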
Where Pith is reading between the lines
- The same branching-plus-fusion pattern could be tested on autoregressive models for other data types such as video sequences or 3D structures.
- Further reduction in adaptation cost might allow the technique to scale to even larger backbone models where full fine-tuning is prohibitive.
- One could measure whether the fusion gate learns consistent patterns across different image domains or styles.
Load-bearing premise
That adding a vertical head from an intermediate layer and blending its predictions with the original horizontal head will preserve generation quality during parallel decoding.
What would settle it
Side-by-side measurement of FID scores, visual artifacts, and inference latency for 512x512 images produced by the original model versus the FlashAR-adapted model on the same benchmark prompts.
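A minimal harness for the latency half of such a comparison could look like the following; the generator callables are placeholders for the two models' sampling loops, and FID or artifact scoring would need a separate image-quality pipeline:

```python
import time

def mean_latency(generate, n_runs=5, warmup=1):
    # Average wall-clock seconds per call; `generate` stands in for a
    # model's full sampling loop on a fixed prompt set.
    for _ in range(warmup):
        generate()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate()
    return (time.perf_counter() - start) / n_runs

# Placeholder workloads simulating sequential vs. parallel decoding cost.
baseline = lambda: sum(i * i for i in range(200_000))
adapted = lambda: sum(i * i for i in range(20_000))
t_base, t_flash = mean_latency(baseline), mean_latency(adapted)
print(f"speedup ~{t_base / t_flash:.1f}x")
```

Warmup runs matter here: the first call to a real model typically pays one-off compilation and cache-population costs that would otherwise inflate the baseline.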
Figures
Original abstract
Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strict next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before being jointly fine-tuned with the backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through lightweight post-training with merely 0.05% of the original training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FlashAR, a post-training adaptation framework for pre-trained autoregressive image generation models. It retains the original horizontal AR head for row-wise next-token prediction while branching a lightweight vertical head from an intermediate layer for column-wise prediction. These are dynamically combined via a learnable fusion gate to support two-way next-token prediction and parallel decoding. A two-stage adaptation pipeline (vertical-head initialization followed by joint fine-tuning) is proposed to enable efficient adaptation using only 0.05% of the original training data. Experiments on LlamaGen and Emu3.5 report up to 22.9x speedup for 512x512 image generation.
Significance. If the central claims hold, FlashAR would provide a practical post-training route to accelerate raster-scan AR image models without full retraining or altered objectives, addressing the training-inference gap noted in prior work. The use of minimal data and retention of the original prior could make high-quality parallel generation more accessible for large models.
major comments (2)
- [§3] §3 (Method): The claim that branching the vertical head from an intermediate layer bypasses horizontal bias and enables stable parallel decoding is load-bearing for the speedup without quality loss, yet the manuscript provides no analysis or ablation of layer choice effects on prediction conflict resolution during parallel steps.
- [§4] §4 (Experiments): The 22.9x speedup for 512x512 generation is reported, but without explicit FID/IS scores, visual artifact analysis, or comparisons showing that the fusion gate prevents degradation relative to the original model, the preservation of generation quality remains unverified and directly impacts the central claim.
minor comments (2)
- [§3.1] The description of the fusion gate could include an explicit equation showing how horizontal and vertical logits are combined at each position to improve clarity.
- [§3.2] Clarify the exact data selection process for the 0.05% adaptation set and any regularization used to prevent overfitting of the new components.
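One plausible form for the fusion equation requested in the first minor comment, written in our own (assumed) notation with $h_t$ the final-layer hidden state at position $t$, $m_t$ the intermediate-layer state tapped by the vertical branch, and $\sigma$ the logistic sigmoid:

```latex
\begin{aligned}
  \ell^{H}_t &= W_H\, h_t, \qquad \ell^{V}_t = W_V\, m_t, \\
  g_t &= \sigma\!\left(w_g^{\top} h_t\right) \in (0, 1), \\
  \ell_t &= g_t\, \ell^{H}_t + \left(1 - g_t\right) \ell^{V}_t .
\end{aligned}
```

The paper may use a vector-valued or multi-layer gate instead; the scalar per-position gate above is only the simplest instantiation consistent with the description in the abstract.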
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions planned to strengthen the presentation of our results.
Point-by-point responses
Referee: [§3] §3 (Method): The claim that branching the vertical head from an intermediate layer bypasses horizontal bias and enables stable parallel decoding is load-bearing for the speedup without quality loss, yet the manuscript provides no analysis or ablation of layer choice effects on prediction conflict resolution during parallel steps.
Authors: We appreciate this observation. Branching from an intermediate layer is motivated by the fact that deeper layers become increasingly specialized to the original horizontal raster-scan objective, increasing prediction conflicts under parallel decoding. While space constraints limited the initial submission, we will add a dedicated ablation study in the revised manuscript that varies the branching layer and reports quantitative metrics on prediction conflict rates, training stability, and final generation quality. Revision planned: yes.
Referee: [§4] §4 (Experiments): The 22.9x speedup for 512x512 generation is reported, but without explicit FID/IS scores, visual artifact analysis, or comparisons showing that the fusion gate prevents degradation relative to the original model, the preservation of generation quality remains unverified and directly impacts the central claim.
Authors: We agree that explicit verification of quality preservation is central to the contribution. The current manuscript already reports FID and IS scores for FlashAR against the original model and baselines in Section 4, together with qualitative examples in Figure 5. To make the role of the fusion gate and absence of artifacts fully explicit, we will expand the experimental section with a dedicated quality analysis subsection that includes direct before/after fusion comparisons and a systematic visual artifact review. Revision planned: yes.
Circularity Check
No significant circularity: adaptation components are trained independently rather than defined by construction.
Full rationale
The paper's core method introduces trainable elements (vertical head branched from an intermediate layer, learnable fusion gate, two-stage pipeline) whose parameters are optimized on a small held-out adaptation set (0.05% of original data). These are not algebraically equivalent to the original raster-scan prior or the reported speedup; the 22.9x figure is an empirical outcome measured after training. No equations reduce the claimed prediction or parallelism to a fitted input by definition, and the provided text contains no self-citations or uniqueness theorems that bear the central claim. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The original autoregressive training objective can be minimally modified while still allowing effective parallel decoding.
invented entities (2)
- vertical head: no independent evidence
- learnable fusion gate: no independent evidence