Visual autoregressive mod- eling: Scalable image generation via next-scale prediction.arXiv preprint arXiv:2404.02905, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang · 2024 · arXiv 2404.02905

19 Pith papers cite this work. Polarity classification is still indexing.

19 Pith papers citing it

read on arXiv browse 19 citing papers

citation-role summary

background 2 other 1

citation-polarity summary

unclear 2 background 1

representative citing papers

Structure over Pixels: Learning Variable-Length Visual Programs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

STROP learns variable-length discrete visual programs for images by training a length head against frozen DINOv3 features in a four-phase curriculum while bypassing pixel reconstruction.

Normalizing Trajectory Models

cs.CV · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

cs.CV · 2024-10-17 · unverdicted · novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

Amplifying Membership Signal Through Chained Regeneration

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

MADreMIA amplifies membership inference signals by showing that memorized samples maintain higher coherence and slower degradation in chained regeneration trajectories than non-members.

SubdivAR: Autoregressive Next-Scale Prediction for Neural Mesh Subdivision

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

SubdivAR reformulates neural mesh subdivision as autoregressive next-scale coordinate prediction with a topology-aware transformer and reports 18.8% and 14.2% reductions in Hausdorff and Chamfer distance over baselines on a new 40K-mesh dataset.

OmniGen-AR: AutoRegressive Any-to-Image Generation

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB

GPIC: A Giant Permissive Image Corpus for Visual Generation

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

GPIC is a new 28-trillion-pixel permissively licensed image corpus with 100M training examples for visual generative modeling.

Vision Foundation Models as Generalist Tokenizers for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

SRC-Flow compresses RAE features via a Semantic Representation Compressor into a low-dimensional space, enabling normalizing flows to reach gFID 1.65 on ImageNet 256x256 and 2.07 on 512x512 while retaining exact likelihoods.

TRACE: Temporal Routing with Autoregressive Cross-channel Experts for EEG Representation Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

TRACE is an autoregressive EEG pre-training framework using temporally adaptive cross-channel expert routing to learn transferable representations, achieving best results on several of eight downstream benchmarks.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

eess.AS · 2026-04-21 · unverdicted · novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

Generative Refinement Networks for Visual Synthesis

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

Autoregressive Video Generation without Vector Quantization

cs.CV · 2024-12-18 · unverdicted · novelty 6.0

NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.

Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models

cs.RO · 2026-06-05 · unverdicted · novelty 5.0

Coarse-to-Control adds planning via coarse action tokens in the same vocabulary as control actions, improving VLA performance on long-horizon manipulation tasks.

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

cs.CV · 2026-01-29 · unverdicted · novelty 5.0

CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

cs.CV · 2026-04-21 · unverdicted · novelty 4.0

MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

cs.CV · 2025-03-10

citing papers explorer

Showing 1 of 1 citing paper after filters.

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings cs.CV · 2026-04-21 · unverdicted · none · ref 34
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

Visual autoregressive mod- eling: Scalable image generation via next-scale prediction.arXiv preprint arXiv:2404.02905, 2024

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer