
arxiv: 2606.05703 · v1 · pith:IKTQ4KXHnew · submitted 2026-06-04 · 💻 cs.CV

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

Pith reviewed 2026-06-28 02:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive image generation · parallel decoding · Jacobi decoding · inference acceleration · spatial correlations · attention mask · training-free method

The pith

Parallel Jacobi Decoding expands draft tokens into 2D space to accelerate autoregressive image generation 4.8x-6.4x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive models generate images by predicting tokens sequentially, which creates long inference times. The paper shows that extending Jacobi decoding from one-dimensional sequences to the two-dimensional layout of an image allows many tokens to be refined in parallel. An adjusted attention mask limits error buildup during this parallel step. Experiments across models and datasets confirm the speedup holds while image quality stays competitive.

Core claim

The authors introduce Parallel Jacobi Decoding (PJD), a training-free method that grows draft tokens across the two-dimensional spatial grid of an image rather than along a single sequence. This spatial expansion, combined with a modified attention mask, enables simultaneous refinement of multiple positions and reduces the convergence problems that appear in standard Jacobi decoding of images.
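
The Jacobi-style decoding that PJD builds on can be illustrated with a toy sketch. This is not the paper's algorithm: `toy_next_token` is a made-up deterministic stand-in for a greedy model, the draft is one-dimensional, and there is no probabilistic acceptance rule. Under those assumptions, the sketch only shows that recomputing every draft position in parallel until nothing changes reproduces the sequential output.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical deterministic 'model': depends on the last two tokens only."""
    last = prefix[-1] if prefix else 1
    prev = prefix[-2] if len(prefix) > 1 else 0
    return (3 * last + prev + 1) % 17


def sequential_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Ordinary autoregressive decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(next_token: NextToken, prompt: List[int], n: int) -> Tuple[List[int], int]:
    """Refine an n-token draft until it is a fixed point of the model."""
    draft = [0] * n  # arbitrary initial guess
    sweeps = 0
    while True:
        sweeps += 1
        # One sweep: every position is recomputed from the previous draft.
        # With a real model this would be a single batched forward pass.
        new = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new == draft:  # nothing changed, so the draft is self-consistent
            return draft, sweeps
        draft = new


prompt = [1, 2]
reference = sequential_decode(toy_next_token, prompt, 8)
refined, sweeps = jacobi_decode(toy_next_token, prompt, 8)
print(refined == reference, sweeps)
```

Position i is guaranteed correct after i + 1 sweeps, so the loop ends within n + 1 sweeps. Any actual speedup depends on how many draft tokens settle early, which this toy function says nothing about.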

What carries the argument

Parallel Jacobi Decoding (PJD), which performs spatially parallel refinement of draft tokens in the 2D image domain together with an adjusted attention mask to control error propagation.
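
The summary above does not state how the adjusted mask is built, so the following is only one hypothetical reading, not the paper's construction. It assumes a flat sequence made of an accepted prefix followed by several active rows of draft tokens, and it assumes that a draft token may see the whole prefix plus earlier drafts in its own row but no drafts from other active rows. The function name, the layout, and the visibility rule are all illustrative assumptions.

```python
from typing import List


def row_blocked_mask(prefix_len: int, rows: int, drafts_per_row: int) -> List[List[bool]]:
    """Hypothetical mask: mask[q][k] is True when query q may attend to key k.

    Assumed layout: [accepted prefix | row 0 drafts | row 1 drafts | ...].
    """
    total = prefix_len + rows * drafts_per_row
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < prefix_len:
                # Accepted tokens keep ordinary causal attention.
                mask[q][k] = k <= q
            elif k < prefix_len:
                # Every draft token sees the full accepted prefix.
                mask[q][k] = True
            else:
                q_row, q_col = divmod(q - prefix_len, drafts_per_row)
                k_row, k_col = divmod(k - prefix_len, drafts_per_row)
                # Assumption: drafts attend causally within their own row only.
                mask[q][k] = q_row == k_row and k_col <= q_col
    return mask


mask = row_blocked_mask(prefix_len=2, rows=3, drafts_per_row=3)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Blocking drafts in other rows is one way an unverified guess could be kept from spreading between rows; whether PJD does this, or lets rows see the rows above them, has to be checked against the paper's Figure 4.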

If this is right

  • Inference latency drops by factors between 4.8x and 6.4x on existing autoregressive image models.
  • No retraining is required, so the method can be applied directly to already-trained models.
  • Generation quality remains competitive with the original sequential decoding.
  • The approach works across multiple datasets and model architectures tested in the experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spatial-parallel idea could be tested on other grid-structured generation tasks such as video frames or 3D voxel data.
  • If local correlations prove weaker at very high resolutions, the acceleration factor may shrink unless the mask schedule is further tuned.
  • Combining PJD with existing speculative decoding techniques might yield still larger speedups.

Load-bearing premise

Images contain strong local spatial correlations that support efficient parallel token refinement without large error spread once the attention mask is adjusted.

What would settle it

Measure whether generation quality collapses on image datasets deliberately constructed to lack local spatial correlations, such as scattered random patterns or highly abstract non-natural scenes.
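
A rough way to check that a candidate test set really lacks local spatial correlation is to measure how strongly neighbouring values co-vary. The sketch below is an illustration under stated assumptions: it works on raw pixel intensities rather than the discrete tokens an autoregressive model sees, it looks only at horizontal neighbours, and the synthetic images and helper names are invented for the example.

```python
import random
from typing import List

Image = List[List[float]]


def neighbor_correlation(img: Image) -> float:
    """Pearson correlation between each pixel and its right-hand neighbour."""
    xs: List[float] = []
    ys: List[float] = []
    for row in img:
        for left, right in zip(row, row[1:]):
            xs.append(left)
            ys.append(right)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def shuffled_copy(img: Image, seed: int = 0) -> Image:
    """Same pixel values, random positions: local structure is destroyed."""
    rng = random.Random(seed)
    flat = [p for row in img for p in row]
    rng.shuffle(flat)
    width = len(img[0])
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# Synthetic smooth image (a diagonal gradient) and its scrambled counterpart.
smooth = [[float(r + c) for c in range(32)] for r in range(32)]
scrambled = shuffled_copy(smooth)

print(neighbor_correlation(smooth), neighbor_correlation(scrambled))
```

The gradient image scores close to 1 and the scrambled copy close to 0. A dataset meant to stress the load-bearing premise should sit near the scrambled end of that range.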

Figures

Figures reproduced from arXiv: 2606.05703 by Boya Liao, Huan Wang, Siyong Jian, Ying Li.

Figure 1: Images generated by Lumina-mGPT [27] using vanilla autoregressive decoding (left) and our Parallel Jacobi Decoding (right). Our method reduces the required autoregressive steps by up to 6.8× while maintaining visual fidelity. Prompts for all examples are provided in the supplementary material.
Figure 2: Visualization of the attention maps of Lumina
Figure 3: An illustration of one PJD iteration. (Left) Three rows become simultaneously active, each initializing three draft tokens.
Figure 4: Row-causal attention mask in PJD. For every active row,
Figure 5: Qualitative comparisons of 768×768 image generation with Lumina-mGPT [
Figure 6: Effect of Context Token Count c. A larger c improves image quality (lower FID) but reduces acceleration. [Adjacent plot: Top K on the x-axis; Step Compression and FID on the y-axes.]
Figure 9: Comparison of step compression ratios across different
Figure 15: Draft token count (lines) and accepted token count
Figure 10: Qualitative results generated by our method on Lumina-mGPT [
Figure 11: Qualitative comparison of 512×512 image generation results on LlamaGen-XL [49] using four decoding strategies: Vanilla AR, SJD, GSD, and our PJD method. Across all prompts, our approach achieves the fastest generation with the fewest sampling steps
Figure 12: Qualitative comparisons of 384×384 image generation on Janus-Pro [5] across multiple prompts. For each pair, the left image is generated by Vanilla AR and the right image is generated by our method. Our approach significantly reduces the number of sampling steps while preserving comparable image quality
Figure 13: Qualitative weak-locality image generation results of our method on Lumina-mGPT.
Figure 14: Qualitative comparison with and without the proposed Row-Causal Mask (RCM). The top row shows results with RCM, while
Original abstract

Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slower inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that Parallel Jacobi Decoding (PJD) is a training-free decoding method for autoregressive image generation models. It expands draft tokens in the 2D spatial domain (rather than 1D) to exploit local spatial correlations in images, adjusts the attention mask to mitigate error accumulation, and thereby achieves 4.8x–6.4x acceleration while retaining competitive generation quality on diverse datasets.

Significance. If the central empirical claim holds, the work would offer a practical, training-free route to faster inference for existing AR image generators by converting sequential decoding into spatially parallel refinement; the absence of free parameters and the focus on a decoding-only change are positive features.

major comments (1)
  1. [Abstract] Abstract and method description: the attention-mask adjustment that is asserted to 'mitigate error accumulation' is never given an explicit construction rule (e.g., which future positions remain visible, whether raster-order causality is preserved, or how the 2-D neighborhood is masked). Because this mask is the mechanism that is supposed to prevent the error-propagation saturation observed in prior 1-D Jacobi decoding, the 4.8x–6.4x speedup claim cannot be verified from the given text.
minor comments (1)
  1. The abstract reports 4.8x–6.4x acceleration from 'extensive experiments on diverse datasets' yet supplies no quality metrics, named baselines, or error bars; these must be added for the empirical claim to be inspectable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The concern regarding the lack of an explicit construction rule for the attention-mask adjustment is valid and will be addressed through revision.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: the attention-mask adjustment that is asserted to 'mitigate error accumulation' is never given an explicit construction rule (e.g., which future positions remain visible, whether raster-order causality is preserved, or how the 2-D neighborhood is masked). Because this mask is the mechanism that is supposed to prevent the error-propagation saturation observed in prior 1-D Jacobi decoding, the 4.8x–6.4x speedup claim cannot be verified from the given text.

    Authors: We agree that the current manuscript provides only a high-level description of the attention-mask adjustment and does not supply an explicit construction rule. This omission prevents full verification of the mechanism. In the revised manuscript we will add a dedicated subsection that states the precise mask construction: which future positions remain visible to each draft token, confirmation that raster-order causality is strictly preserved outside the 2-D neighborhood, and the exact rule used to mask the 2-D spatial neighborhood. These additions will make the error-mitigation strategy reproducible and will directly support the reported speed-up claims.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; training-free method rests on empirical observation and experimental validation

full rationale

The paper frames PJD as a training-free decoding change driven by the observation of local spatial correlations in images, with draft expansion in 2D and an attention-mask adjustment. No equations, fitted parameters, or self-citations are shown that reduce the reported 4.8x-6.4x acceleration or convergence claim to a quantity defined by the method's own inputs. The acceleration is presented as an empirical outcome measured across models and datasets, so the claim stands or falls on external benchmarks rather than on the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that local spatial correlations in natural images are strong enough to support parallel refinement; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Images exhibit strong local spatial correlations.
    Invoked to justify expanding draft tokens in the two-dimensional spatial domain rather than along a one-dimensional sequence.



Reference graph

Works this paper leans on

64 extracted references · 21 linked inside Pith

  1. [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.

  3. [3] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.

  4. [4] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. In ICML, 2018.

  5. [5] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811,

  6. [6] Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135, 2024.

  7. [7] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR,

  8. [8] Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, and Huan Wang. Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage reinforcement learning. arXiv preprint arXiv:2510.02240, 2025.

  9. [9] Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps. arXiv preprint arXiv:2505.18675, 2025.

  10. [10] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  11. [11] Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Zipar: Accelerating autoregressive image generation through spatial locality. arXiv preprint arXiv:2412.04062, 2024.

  12. [12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.

  13. [13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

  14. [14] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In CVPR, 2023.

  15. [15] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.

  16. [16] Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sungyub Kim, and Eunho Yang. Lantern: Accelerating visual autoregressive models with relaxed speculative decoding. In ICLR, 2025.

  17. [17] Siyong Jian and Huan Wang. Ssd: Spatial-semantic head decoupling for efficient autoregressive image generation. arXiv preprint arXiv:2510.18716, 2025.

  18. [18] Alexander Kolesnikov and Christoph H Lampert. Pixelcnn models with auxiliary variables for natural image modeling. In ICML, 2017.

  19. [19] Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. Cllms: Consistency large language models. In ICML,

  20. [20] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In ICML, 2023.

  21. [21] Haopeng Li, Jinyue Yang, Guoqi Li, and Huan Wang. Autoregressive image generation with randomized parallel decoding. arXiv preprint arXiv:2503.10568, 2025.

  22. [22] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. In NeurIPS, 2024.

  23. [23] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024.

  24. [24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

  25. [25] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In ECCV, 2024.

  26. [26] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.

  27. [27] Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining. arXiv preprint arXiv:2408.02657, 2024.

  28. [28] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024.

  29. [29] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. 2022.

  30. [30] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.

  31. [31] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

  32. [32] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

  33. [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  34. [34] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.

  35. [35] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125,

  36. [36] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. In NeurIPS,

  37. [37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  38. [38] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.

  39. [39] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NeurIPS, 2016.

  40. [40] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

  41. [41] Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodola. Accelerating transformer inference for translation via parallel decoding. In ACL, 2023.

  42. [42] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In ECCV, 2024.

  43. [43] Junhyuk So, Juncheol Shin, Hyunho Kook, and Eunhyeok Park. Grouped speculative decoding for autoregressive image generation. In ICCV, 2025.

  44. [44] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

  45. [45] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  46. [46] Yang Song, Chenlin Meng, Renjie Liao, and Stefano Ermon. Accelerating feedforward computation via parallel nonlinear equation solving. In ICML, 2021.

  47. [47] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv e-prints, 2023.

  48. [48] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. 2018.

  49. [49] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

  50. [50] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

  51. [51] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  52. [52] Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Accelerating auto-regressive text-to-image generation with training-free speculative jacobi decoding. In ICLR, 2025.

  53. [53] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In NeurIPS, 2016.

  54. [54] Aaron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML,

  55. [55] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.

  56. [56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.

  57. [57] Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, and Xihui Liu. Parallelized autoregressive visual generation. In CVPR,

  58. [58] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

  59. [59] Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, et al. Lumina-mgpt 2.0: Stand-alone autoregressive image modeling. arXiv preprint arXiv:2507.17801, 2025.

  60. [60] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023.

  61. [61] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  62. [62] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.

  63. [63] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. In NeurIPS, 2023.

  64. [64] Zhuoyang Zhang, Luke J Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, and Song Han. Locality-aware parallel decoding for efficient autoregressive image generation. arXiv preprint arXiv:2507.01957, 2025.