super hub Mixed citations

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Bingyue Peng, Peize Sun, Ping Luo, Shilong Zhang, Shoufa Chen, Yi Jiang · 2024 · cs.CV · arXiv 2406.06525

Mixed citation behavior. Most common role is background (47%).

101 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 101 citing papers more from Bingyue Peng arXiv PDF

abstract

We introduce LlamaGen, a new family of image generation models that apply original ``next-token prediction'' paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve 326% - 414% speedup. We release all models and codes to facilitate open-source community of visual generation and multimodal foundation models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 baseline 6 method 3

citation-polarity summary

background 9 baseline 6 use method 3 unclear 1

claims ledger

abstract We introduce LlamaGen, a new family of image generation models that apply original ``next-token prediction'' paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio o

authors

Bingyue Peng Peize Sun Ping Luo Shilong Zhang Shoufa Chen Yi Jiang

co-cited works

representative citing papers

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

Generative Lane Topology Reasoning via Autoregressive Model with Geometry Prior

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

TopoGPT pre-trains an autoregressive transformer on serialized lane graphs from 3.3M scenes to learn geometry priors and uses a perception adapter to apply it to BEV features for improved lane graph prediction on OpenLane-V2.

Obliviate: Erasing Concepts from Autoregressive Image Generation Models

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

Obliviate erases targeted concepts from autoregressive image generators via KL supervision on visual tokens over full trajectories, cutting nudity rates sharply on benchmarks while keeping general performance.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.

Balancing Image Compression and Generation with Bootstrapped Tokenization

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.

ChannelTok: Efficient Flexible-Length Vision Tokenization

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

ChannelTok introduces channel-wise tokenization with stochastic tail-dropping to achieve rFID 2.92 on ImageNet at 8.6x faster decoding and 2.1x smaller size than prior flexible tokenizers.

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

LD-Pruning applies latent discrepancy to prune tokens and adaptively skip unconditional branches in VAR models for up to 2.35x faster inference with preserved quality.

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

Mind-Omni unifies seven brain-vision-language tasks in one discrete-diffusion framework with a brain tokenizer and a new BQA dataset, claiming SOTA multi-task performance competitive with larger single-task models.

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

cs.CV · 2026-05-20 · conditional · novelty 7.0

RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

cs.CV · 2026-05-20 · conditional · novelty 7.0

HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

cs.CR · 2026-05-19 · conditional · novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

HIT with token overlap plus DPO regularization lets a 300M-param VAR model deliver state-of-the-art multi-scale ISR in one forward pass without external data.

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

cs.CV · 2026-05-14 · conditional · novelty 7.0 · 2 refs

HeatKV doubles KV-cache compression ratios over prior methods for VAR models by creating static head-specific pruning schedules from attention rankings on a calibration set, while preserving image quality on Infinity-2B.

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

Normalizing Trajectory Models

cs.CV · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

Autoregressive Visual Generation Needs a Prologue

cs.CV · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Prologue adds a small set of learnable tokens trained exclusively with AR cross-entropy loss to decouple generation from reconstruction in autoregressive visual models, yielding lower gFID on ImageNet 256x256.

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstructions on COCO and OpenImages while running faster than diffusion approaches.

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

cs.CV · 2026-04-06 · conditional · novelty 7.0

Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

cs.CV · 2026-03-01 · unverdicted · novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

cs.CV · 2026-01-04 · unverdicted · novelty 7.0

GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.

VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

cs.CV · 2025-11-17 · conditional · novelty 7.0

VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.

citing papers explorer

Showing 50 of 101 citing papers.

GEAR: Guided End-to-End AutoRegression for Image Synthesis cs.CV · 2026-06-30 · unverdicted · none · ref 40 · internal anchor
GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.
Generative Lane Topology Reasoning via Autoregressive Model with Geometry Prior cs.CV · 2026-06-30 · unverdicted · none · ref 41 · internal anchor
TopoGPT pre-trains an autoregressive transformer on serialized lane graphs from 3.3M scenes to learn geometry priors and uses a perception adapter to apply it to BEV features for improved lane graph prediction on OpenLane-V2.
Obliviate: Erasing Concepts from Autoregressive Image Generation Models cs.CV · 2026-06-26 · unverdicted · none · ref 38 · internal anchor
Obliviate erases targeted concepts from autoregressive image generators via KL supervision on visual tokens over full trajectories, cutting nudity rates sharply on benchmarks while keeping general performance.
Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation cs.CV · 2026-06-26 · unverdicted · none · ref 11 · internal anchor
PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.
Balancing Image Compression and Generation with Bootstrapped Tokenization cs.LG · 2026-06-04 · unverdicted · none · ref 63 · internal anchor
SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.
ChannelTok: Efficient Flexible-Length Vision Tokenization cs.CV · 2026-06-03 · unverdicted · none · ref 22 · internal anchor
ChannelTok introduces channel-wise tokenization with stochastic tail-dropping to achieve rFID 2.92 on ImageNet at 8.6x faster decoding and 2.1x smaller size than prior flexible tokenizers.
Imagine Before You Draw: Visual Prompt Engineering for Image Generation cs.CV · 2026-06-03 · unverdicted · none · ref 15 · internal anchor
VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.
Diffusing in the Right Space: A Systematic Study of Latent Diffusability cs.CV · 2026-06-02 · unverdicted · none · ref 60 · internal anchor
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation cs.CV · 2026-05-29 · unverdicted · none · ref 10 · internal anchor
LD-Pruning applies latent discrepancy to prune tokens and adaptively skip unconditional branches in VAR models for up to 2.35x faster inference with preserved quality.
Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion cs.AI · 2026-05-28 · unverdicted · none · ref 73 · internal anchor
Mind-Omni unifies seven brain-vision-language tasks in one discrete-diffusion framework with a brain tokenizer and a new BQA dataset, claiming SOTA multi-task performance competitive with larger single-task models.
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution cs.CV · 2026-05-20 · conditional · none · ref 57 · internal anchor
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation cs.CV · 2026-05-20 · conditional · none · ref 31 · internal anchor
HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models cs.CR · 2026-05-19 · conditional · none · ref 62 · internal anchor
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
Hierarchical Image Tokenization for Multi-Scale Image Super Resolution cs.CV · 2026-05-14 · unverdicted · none · ref 6 · internal anchor
HIT with token overlap plus DPO regularization lets a 300M-param VAR model deliver state-of-the-art multi-scale ISR in one forward pass without external data.
HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling cs.CV · 2026-05-14 · conditional · none · ref 14 · 2 links · internal anchor
HeatKV doubles KV-cache compression ratios over prior methods for VAR models by creating static head-specific pruning schedules from attention rankings on a calibration set, while preserving image quality on Infinity-2B.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models cs.CV · 2026-05-12 · unverdicted · none · ref 58 · internal anchor
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models cs.CV · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
Normalizing Trajectory Models cs.CV · 2026-05-08 · unverdicted · none · ref 18 · 2 links · internal anchor
NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
Autoregressive Visual Generation Needs a Prologue cs.CV · 2026-05-07 · unverdicted · none · ref 40 · 2 links · internal anchor
Prologue adds a small set of learnable tokens trained exclusively with AR cross-entropy loss to decouple generation from reconstruction in autoregressive visual models, yielding lower gFID on ImageNet 256x256.
Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models cs.CV · 2026-04-16 · unverdicted · none · ref 39 · internal anchor
Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstructions on COCO and OpenImages while running faster than diffusion approaches.
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens cs.CV · 2026-04-06 · conditional · none · ref 64 · internal anchor
Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards cs.CV · 2026-03-01 · unverdicted · none · ref 68 · internal anchor
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation cs.CV · 2026-01-04 · unverdicted · none · ref 52 · internal anchor
GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping cs.CV · 2025-11-17 · conditional · none · ref 26 · internal anchor
VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching cs.LG · 2025-09-26 · conditional · none · ref 72 · internal anchor
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling cs.CV · 2025-09-15 · unverdicted · none · ref 21 · internal anchor
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models cs.CV · 2025-05-28 · unverdicted · none · ref 38 · internal anchor
PacTure uses view packing and next-scale autoregressive prediction to generate consistent multi-view PBR textures faster than prior sequential or cross-attention methods.
Distilling Specialized Orders for Visual Generation cs.CV · 2025-04-23 · unverdicted · none · ref 10 · internal anchor
OAR distills specialized generation orders from any-order AR models via self-distillation, improving FID from 2.39 to 2.17 on ImageNet 256x256 while preserving multi-task flexibility.
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts cs.CV · 2024-12-05 · unverdicted · none · ref 46 · internal anchor
T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 73 · internal anchor
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation cs.CV · 2026-06-29 · unverdicted · none · ref 50 · internal anchor
AVTok is a unified tokenizer that converts audio-video pairs into a compact 1D latent representation via dual-stream transformer and hierarchical training for improved reconstruction and cross-modal generation.
SciIR: A Large-scale Training Dataset and Benchmark for Scientific Image Reasoning Generation cs.CV · 2026-06-29 · unverdicted · none · ref 38 · internal anchor
Introduces SciIR-82k dataset and SciIR-Bench for scientific image reasoning generation organized by Peirce's semiotic triad, with fine-tuning raising model score from 35% to 43%.
Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis cs.CV · 2026-06-29 · unverdicted · none · ref 31 · internal anchor
A masked discrete diffusion model adds token editing at inference and grouped cross-entropy training to reach 0.90 GenEval, 86.9 DPG, and 10.76 HPSv3 scores.
Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers cs.CV · 2026-06-27 · unverdicted · none · ref 28 · internal anchor
Mural transfers knowledge from a frozen LLM to text-to-image synthesis via MoT shared attention, achieving 0.85 GenEval, 86.75 DPG-Bench, and 0.66 WISE while exhibiting emergent behaviors without multimodal or reasoning supervision.
ViQ: Text-Aligned Visual Quantized Representations at Any Resolution cs.CV · 2026-06-25 · unverdicted · none · ref 12 · internal anchor
ViQ is a new two-stage text-aligned quantization method for visual features supporting arbitrary resolutions that claims competitive multimodal performance with efficiency gains of 20-70%.
Data Provenance for Image Auto-Regressive Generation cs.CV · 2026-06-22 · unverdicted · none · ref 180 · internal anchor
A post-hoc detection framework exploits generation-induced patterns in autoregressive image outputs to enable provenance tracing across multiple IAR models without altering the generation process.
Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models cs.CV · 2026-05-29 · unverdicted · none · ref 38 · internal anchor
Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.
VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation cs.CV · 2026-05-28 · unverdicted · none · ref 37 · internal anchor
VPG is a training-free inference-time guidance technique that improves autoregressive image and video generation by contrasting model outputs under generated versus corrupted prefixes to strengthen next-step support for the prefix.
Channel-wise Vector Quantization cs.CV · 2026-05-25 · unverdicted · none · ref 33 · internal anchor
CVQ replaces patch-wise vector quantization with channel-wise quantization of feature maps, enabling a next-channel autoregressive model that reports 100% codebook utilization and text-to-image scores of DPG 86.7 and GenEval 0.79.
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion cs.CV · 2026-05-22 · unverdicted · none · ref 34 · internal anchor
PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claimed better fidelity than cascaded baselines.
Vision Foundation Models as Generalist Tokenizers for Image Generation cs.CV · 2026-05-18 · unverdicted · none · ref 72 · internal anchor
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing cs.CV · 2026-05-17 · unverdicted · none · ref 51 · internal anchor
HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization cs.CV · 2026-05-15 · unverdicted · none · ref 36 · 2 links · internal anchor
FashionChameleon achieves interactive multi-garment video customization at 23.8 FPS via in-context teacher models, streaming distillation, and training-free KV cache rescheduling while using only single-garment data.
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation cs.CV · 2026-05-14 · conditional · none · ref 37 · internal anchor
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
Do multimodal models imagine electric sheep? cs.CV · 2026-05-10 · conditional · none · ref 61 · internal anchor
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation cs.CV · 2026-05-10 · unverdicted · none · ref 14 · 2 links · internal anchor
FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models cs.LG · 2026-05-10 · unverdicted · none · ref 66 · internal anchor
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding cs.CV · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality cs.CV · 2026-05-07 · unverdicted · none · ref 136 · internal anchor
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer cs.CV · 2026-05-01 · unverdicted · none · ref 36 · internal anchor
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer