Multi-Scale Local Speculative Decoding for Image Generation

2026 · cs.CV · arXiv 2601.05149

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups -- up to $\mathbf{5\times}$ -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity. Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage/ .
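
The local rejection and resampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the up-sampled draft is an H x W grid of token ids, that the drafter and target each expose a per-position distribution over V tokens (arrays `q` and `p`), that acceptance uses the standard speculative test min(1, p/q) per position, and that "neighborhood expansion" is a square dilation of the rejection mask. All function names are hypothetical, and resampling directly from the target distribution is a simplification.

```python
import numpy as np

def accept_mask(p, q, draft, rng):
    """Per-position speculative acceptance test (assumed form: min(1, p/q)).

    p, q  : (H, W, V) target and draft token distributions
    draft : (H, W) integer token ids proposed by the drafter
    Returns a boolean (H, W) array, True where the draft token is accepted.
    """
    h, w, _ = p.shape
    rows, cols = np.indices((h, w))
    p_tok = p[rows, cols, draft]  # target probability of each draft token
    q_tok = q[rows, cols, draft]  # draft probability of each draft token
    ratio = np.minimum(1.0, p_tok / np.maximum(q_tok, 1e-12))
    return rng.random((h, w)) < ratio

def expand_rejections(rejected, radius):
    """Dilate the rejection mask by `radius` cells in every direction
    (a square neighborhood; the shape is an assumption of this sketch)."""
    h, w = rejected.shape
    padded = np.pad(rejected, radius, mode="constant", constant_values=False)
    out = np.zeros_like(rejected)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def local_resample(p, draft, resample, rng):
    """Redraw only the masked positions from the target distribution,
    leaving every other draft token untouched."""
    out = draft.copy()
    for y, x in zip(*np.nonzero(resample)):
        out[y, x] = rng.choice(p.shape[-1], p=p[y, x])
    return out

# Usage on random toy data (no real model involved).
rng = np.random.default_rng(0)
H, W, V = 4, 4, 8
p = rng.dirichlet(np.ones(V), size=(H, W))
q = rng.dirichlet(np.ones(V), size=(H, W))
draft = rng.integers(0, V, size=(H, W))
rejected = ~accept_mask(p, q, draft, rng)
tokens = local_resample(p, draft, expand_rejections(rejected, radius=1), rng)
```

The point of the sketch is the contrast with raster-scan speculative decoding: a rejection invalidates only a spatial neighborhood, so accepted tokens elsewhere in the grid are kept instead of being discarded after the first rejected position.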

representative citing papers

Knowledge Distillation for Visual Autoregressive Models

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

VarKD is a distillation framework for visual AR models that uses student samples and selective teacher supervision to reduce token ambiguity, outperforming prior baselines on ImageNet.

citing papers explorer

Showing 1 of 1 citing paper.

Knowledge Distillation for Visual Autoregressive Models

cs.CV · 2026-06-04 · unverdicted · none · ref 24 · internal anchor
