Autoregressive Image Generation Without Vector Quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He · 2024 · arXiv 2406.11838

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

representative citing papers

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

CASCADE: Context-Aware Relaxation for Speculative Image Decoding

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.

ELT: Elastic Looped Transformers for Visual Generation

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

Unified Video Action Model

cs.RO · 2025-02-28 · unverdicted · novelty 6.0

UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.

CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

physics.ins-det · 2026-05-12 · unverdicted · novelty 5.0

CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditional flow matching.

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

citing papers explorer

Showing 6 of 6 citing papers.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization cs.LG · 2026-04-20 · unverdicted · none · ref 40
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding cs.CV · 2026-05-08 · unverdicted · none · ref 23
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
ELT: Elastic Looped Transformers for Visual Generation cs.CV · 2026-04-10 · unverdicted · none · ref 46
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
Unified Video Action Model cs.RO · 2025-02-28 · unverdicted · none · ref 28
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation physics.ins-det · 2026-05-12 · unverdicted · none · ref 61
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditional flow matching.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 61
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Autoregressive Image Generation Without Vector Quantization

fields

years

verdicts

representative citing papers

citing papers explorer