pith. machine review for the scientific record.

arxiv: 2605.09430 · v2 · submitted 2026-05-10 · 💻 cs.CV

Recognition: unknown

FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive image generation · post-training acceleration · parallel decoding · two-way next-token prediction · fusion gate · vertical head · lightweight adaptation

The pith

FlashAR adapts pre-trained autoregressive image models for parallel decoding via a branched vertical head and fusion gate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a pre-trained raster-scan autoregressive model can be turned into a parallel generator by adding a complementary vertical prediction head without retraining from scratch. This matters because sequential next-token prediction makes high-resolution image generation too slow for many uses, while full new paradigms require expensive pre-training. FlashAR keeps the original horizontal head intact, branches a lightweight vertical head from an intermediate layer to avoid bias, and uses a learnable fusion gate to blend the two at each position. A two-stage adaptation process then fine-tunes this setup on just 0.05 percent of the original data. The result is stable two-way next-token prediction that supports much faster inference.
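The decoding math behind that speedup can be sketched quickly. Assuming a token becomes predictable once either its left neighbor (horizontal head) or its top neighbor (vertical head) has been decoded — one plausible reading of two-way next-token prediction; the paper's exact schedule may differ — an n × n token grid collapses from n² sequential steps to an anti-diagonal wavefront of 2n − 2 steps:

```python
def raster_scan_steps(n: int) -> int:
    """Sequential next-token prediction: one token per decoding step."""
    return n * n

def two_way_steps(n: int) -> int:
    """Wavefront decoding: every token whose left OR top neighbor is
    already decoded can be predicted in the same step."""
    decoded = {(0, 0)}            # assume a seeded first token
    steps = 0
    while len(decoded) < n * n:
        frontier = {
            (i, j)
            for i in range(n) for j in range(n)
            if (i, j) not in decoded
            and ((i, j - 1) in decoded or (i - 1, j) in decoded)
        }
        decoded |= frontier       # a whole anti-diagonal lands at once
        steps += 1
    return steps

for n in (8, 32):
    seq, par = raster_scan_steps(n), two_way_steps(n)
    print(f"{n}x{n} grid: {seq} sequential steps vs {par} parallel steps "
          f"(~{seq / par:.1f}x fewer)")
```

This counts decoding steps only; the realized wall-clock speedup also depends on per-step batching cost, which is why the paper reports empirical latency rather than this idealized ratio.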

Core claim

FlashAR retains the original autoregressive head as a horizontal predictor for row-wise tokens and adds a lightweight vertical head branched from an intermediate layer for column-wise tokens. These predictions are combined at each position through a learnable fusion gate whose weights reflect the varying importance of horizontal and vertical dependencies. A two-stage post-training pipeline first adapts the vertical head alone and then jointly tunes it with the backbone, enabling the model to support parallel decoding while staying close to the original training objective.

What carries the argument

A learnable fusion gate that dynamically combines the retained horizontal autoregressive head with a new vertical head branched from an intermediate layer of the pre-trained network.
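That mechanism can be made concrete with a minimal NumPy sketch. The layer count, dimensions, branch position, and the sigmoid-scalar form of the gate are all illustrative assumptions — the paper's exact parameterization is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy 4-layer "backbone"; each layer is a linear map over a d-dim hidden state.
d, vocab, n_layers, branch_at = 16, 32, 4, 2    # branch_at is hypothetical
layers = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(n_layers)]
W_h = rng.normal(size=(d, vocab))    # retained horizontal (original AR) head
W_v = rng.normal(size=(d, vocab))    # new lightweight vertical head
w_gate = rng.normal(size=d)          # learnable fusion-gate weights

def forward(h):
    """Run the backbone, tapping the intermediate state for the vertical head."""
    mid = None
    for k, W in enumerate(layers):
        h = np.tanh(h @ W)
        if k + 1 == branch_at:
            mid = h                  # vertical head branches here, not at the top,
                                     # bypassing the final layers' horizontal bias
    p_h = softmax(h @ W_h)           # row-wise (horizontal) prediction
    p_v = softmax(mid @ W_v)         # column-wise (vertical) prediction
    g = 1.0 / (1.0 + np.exp(-(h @ w_gate)))   # position-dependent gate in (0, 1)
    return g * p_h + (1.0 - g) * p_v          # convex blend: still a distribution

p = forward(rng.normal(size=d))
print(p.shape, float(p.sum()))
```

Because the gate produces a convex combination of two softmax outputs, the fused prediction remains a valid probability distribution at every position, whatever mixing weight the gate learns.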

If this is right

  • Existing autoregressive models can be accelerated without designing a new generation paradigm or pre-training from scratch.
  • Parallel token prediction becomes feasible while the learned prior from the original raster-scan objective is largely retained.
  • Adaptation requires only 0.05 percent of the original training data through the two-stage pipeline.
  • Speedups of up to 22.9 times are achieved for 512x512 image generation on models such as LlamaGen and Emu3.5.
  • The relative importance of horizontal and vertical predictions can be learned position-wise without fixed rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same branching-plus-fusion pattern could be tested on autoregressive models for other data types such as video sequences or 3D structures.
  • Further reduction in adaptation cost might allow the technique to scale to even larger backbone models where full fine-tuning is prohibitive.
  • One could measure whether the fusion gate learns consistent patterns across different image domains or styles.

Load-bearing premise

That adding a vertical head from an intermediate layer and blending its predictions with the original horizontal head will preserve generation quality during parallel decoding.

What would settle it

Side-by-side measurement of FID scores, visual artifacts, and inference latency for 512x512 images produced by the original model versus the FlashAR-adapted model on the same benchmark prompts.
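The latency half of that comparison can be sketched as a small timing harness. `baseline_decode` and `parallel_decode` below are stand-in stubs for the real decoders; in an actual study they would be the original model and its FlashAR-adapted version, with FID computed separately on the images each one produces:

```python
import time

def measure_latency(decode_fn, prompts, warmup=1, repeats=3):
    """Median wall-clock seconds per image for a given decode function."""
    for p in prompts[:warmup]:          # warm caches / JIT before timing
        decode_fn(p)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for p in prompts:
            decode_fn(p)
        times.append((time.perf_counter() - t0) / len(prompts))
    return sorted(times)[len(times) // 2]

# Hypothetical stand-ins: sleeps mimic sequential vs parallel decoding cost.
def baseline_decode(prompt):  time.sleep(0.002)
def parallel_decode(prompt):  time.sleep(0.0005)

prompts = ["a photo"] * 5               # same prompts for both decoders
t_base = measure_latency(baseline_decode, prompts)
t_fast = measure_latency(parallel_decode, prompts)
print(f"speedup: {t_base / t_fast:.1f}x on identical prompts")
```

Holding the prompt set, resolution, and hardware fixed across both decoders is what makes the latency ratio and the quality metrics directly comparable.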

Figures

Figures reproduced from arXiv: 2605.09430 by Bohan Zhuang, Feng Chen, Junkang Zhou, Weijie Wang, Yefei He.

Figure 1. Generated samples from FlashAR. The first row shows 512 × 512 text-guided generation results, while the second row presents class-conditional generation samples at 384 × 384 and 256 × 256 resolutions.
Figure 2. Overview of the FlashAR framework. Initialized from a pre-trained raster-scan autoregressive model …
Figure 3. Analysis of linear probing experiments. (a) Schematic illustrating the aggregation of …
Figure 4. Ablation studies on LlamaGen-L. (a) FID convergence trajectories across training epochs. (b) Final FID comparisons across component variants.
Figure 5. Complex text-guided image generation samples by Emu3.5-Image-FlashAR.
Figure 6. Class-conditional image generation samples produced by FlashAR-XXL on ImageNet 256 × 256.
Original abstract

Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strictly next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before being jointly fine-tuned with the backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through a lightweight post-training with merely 0.05% of the original training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FlashAR, a post-training adaptation framework for pre-trained autoregressive image generation models. It retains the original horizontal AR head for row-wise next-token prediction while branching a lightweight vertical head from an intermediate layer for column-wise prediction. These are dynamically combined via a learnable fusion gate to support two-way next-token prediction and parallel decoding. A two-stage adaptation pipeline (vertical-head initialization followed by joint fine-tuning) is proposed to enable efficient adaptation using only 0.05% of the original training data. Experiments on LlamaGen and Emu3.5 report up to 22.9x speedup for 512x512 image generation.

Significance. If the central claims hold, FlashAR would provide a practical post-training route to accelerate raster-scan AR image models without full retraining or altered objectives, addressing the training-inference gap noted in prior work. The use of minimal data and retention of the original prior could make high-quality parallel generation more accessible for large models.

major comments (2)
  1. [§3] §3 (Method): The claim that branching the vertical head from an intermediate layer bypasses horizontal bias and enables stable parallel decoding is load-bearing for the speedup without quality loss, yet the manuscript provides no analysis or ablation of layer choice effects on prediction conflict resolution during parallel steps.
  2. [§4] §4 (Experiments): The 22.9x speedup for 512x512 generation is reported, but without explicit FID/IS scores, visual artifact analysis, or comparisons showing that the fusion gate prevents degradation relative to the original model, the preservation of generation quality remains unverified and directly impacts the central claim.
minor comments (2)
  1. [§3.1] The description of the fusion gate could include an explicit equation showing how horizontal and vertical logits are combined at each position to improve clarity.
  2. [§3.2] Clarify the exact data selection process for the 0.05% adaptation set and any regularization used to prevent overfitting of the new components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions planned to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The claim that branching the vertical head from an intermediate layer bypasses horizontal bias and enables stable parallel decoding is load-bearing for the speedup without quality loss, yet the manuscript provides no analysis or ablation of layer choice effects on prediction conflict resolution during parallel steps.

    Authors: We appreciate this observation. Branching from an intermediate layer is motivated by the fact that deeper layers become increasingly specialized to the original horizontal raster-scan objective, increasing prediction conflicts under parallel decoding. While space constraints limited the initial submission, we will add a dedicated ablation study in the revised manuscript that varies the branching layer and reports quantitative metrics on prediction conflict rates, training stability, and final generation quality. revision: yes

  2. Referee: [§4] §4 (Experiments): The 22.9x speedup for 512x512 generation is reported, but without explicit FID/IS scores, visual artifact analysis, or comparisons showing that the fusion gate prevents degradation relative to the original model, the preservation of generation quality remains unverified and directly impacts the central claim.

    Authors: We agree that explicit verification of quality preservation is central to the contribution. The current manuscript already reports FID and IS scores for FlashAR against the original model and baselines in Section 4, together with qualitative examples in Figure 5. To make the role of the fusion gate and absence of artifacts fully explicit, we will expand the experimental section with a dedicated quality analysis subsection that includes direct before/after fusion comparisons and a systematic visual artifact review. revision: yes

Circularity Check

0 steps flagged

No significant circularity: adaptation components are trained independently rather than defined by construction.

full rationale

The paper's core method introduces trainable elements (a vertical head branched from an intermediate layer, a learnable fusion gate, and a two-stage pipeline) whose parameters are optimized on a small adaptation set (0.05% of the original data). These components are not algebraically equivalent to the original raster-scan prior, and the reported 22.9x speedup is an empirical outcome measured after training rather than a quantity fixed by construction. No equation reduces the claimed prediction or parallelism to its own inputs by definition, and the provided text contains no self-citations or uniqueness theorems that bear on the central claim. The method is therefore validated against external benchmarks rather than by circular definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that the pre-trained model's learned prior can be preserved with minimal objective change and on the introduction of new architectural components whose effectiveness is demonstrated only through the reported adaptation experiments.

axioms (1)
  • domain assumption The original autoregressive training objective can be minimally modified while still allowing effective parallel decoding.
    Invoked in the key insight that effective adaptation should minimize modifications to preserve the learned prior.
invented entities (2)
  • vertical head no independent evidence
    purpose: Complementary column-wise next-token prediction
    New lightweight head branched from an intermediate layer to enable two-way prediction.
  • learnable fusion gate no independent evidence
    purpose: Dynamically combine horizontal and vertical predictions at each position
    New component to handle varying relative importance of row-wise and column-wise dependencies.

pith-pipeline@v0.9.0 · 5600 in / 1529 out tokens · 33557 ms · 2026-05-13T05:51:16.852961+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 9 internal anchors

  1. [1] Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, et al. Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models. arXiv:2503.09573.
  2. [2] Jinze Bai, Shuai Bai, Yunfei Chu, et al. Qwen Technical Report. arXiv:2309.16609.
  3. [3] Junying Chen, Zhenyang Cai, Pengcheng Chen, et al. ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation. arXiv:2506.18095.
  4. [4] Zhihong Chen, Xuehai Bai, Yang Shi, et al. OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing. arXiv:2509.24900.
  5. [5] Haoge Deng, Ting Pan, Fan Zhang, et al. Uniform Discrete Diffusion with Metric Path for Video Generation. arXiv:2510.24717.
  6. [6] Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. FlexAttention: A Programming Model for Generating Optimized Attention Kernels. arXiv:2412.05496.
  7. [7] Zigang Geng, Yibing Wang, Yeyao Ma, et al. X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again. arXiv:2507.22058.
  8. [8] Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. arXiv:2207.12598.
  9. [9] Doohyuk Jang, Sihwan Park, June Yong Yang, et al. LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding. arXiv:2410.03355.
  10. [10] Haopeng Li, Jinyue Yang, Guoqi Li, and Huan Wang. Autoregressive Image Generation with Randomized Parallel Decoding. arXiv:2503.10568.
  11. [11] Dongyang Liu, Shitian Zhao, Le Zhuo, et al. Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining. arXiv:2408.02657.
  12. [12] Qingyu Shi, Jinbin Bai, Zhuoran Zhao, et al. Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model. arXiv:2505.23606.
  13. [13] Oscar Skean, Md Rifat Arefin, Dan Zhao, et al. Layer by Layer: Uncovering Hidden Representations in Language Models. arXiv:2502.02013.
  14. [14] Peize Sun, Yi Jiang, Shoufa Chen, et al. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. arXiv:2406.06525.
  15. [15] Haotian Tang, Yecheng Wu, Shang Yang, et al. HART: Efficient Visual Generation with Hybrid Autoregressive Transformer. arXiv:2410.10812.
  16. [16] Gemini Team, Rohan Anil, Sebastian Borgeaud, et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805.
  17. [17] NextStep Team, Chunrui Han, Guopeng Li, et al. NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale. arXiv:2508.10711.
  18. [18] Yao Teng, Han Shi, Xian Liu, et al. Accelerating Auto-Regressive Text-to-Image Generation with Training-Free Speculative Jacobi Decoding. arXiv:2410.01699.
  19. [19] Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
  20. [20] Junke Wang, Zhi Tian, Xun Wang, et al. SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL. arXiv:2504.11455.
  21. [21] Yi Xin, Juncheng Yan, Qi Qin, et al. Lumina-mGPT 2.0: Stand-Alone Autoregressive Image Modeling. arXiv:2507.17801.
  22. [22] An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 Technical Report. arXiv:2505.09388.
  23. [23] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, et al. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. arXiv:2206.10789.
  24. [24] Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, et al. Locality-Aware Parallel Decoding for Efficient Autoregressive Image Generation. arXiv:2507.01957.