hub Mixed citations

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

· 2023 · cs.CV · arXiv 2310.05737

Mixed citation behavior. Most common role is background (50%).

55 Pith papers citing it

Background 50% of classified citations

open full Pith review browse 55 citing papers arXiv PDF

abstract

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 method 5 baseline 2

citation-polarity summary

background 7 use method 5 baseline 2

representative citing papers

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.

Balancing Image Compression and Generation with Bootstrapped Tokenization

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.

ChannelTok: Efficient Flexible-Length Vision Tokenization

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

ChannelTok introduces channel-wise tokenization with stochastic tail-dropping to achieve rFID 2.92 on ImageNet at 8.6x faster decoding and 2.1x smaller size than prior flexible tokenizers.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstructions on COCO and OpenImages while running faster than diffusion approaches.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling

cs.CV · 2025-09-15 · unverdicted · novelty 7.0

Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

cs.CV · 2025-03-18 · unverdicted · novelty 7.0

DualToken disentangles semantics and appearance via separate codebooks in one tokenizer, reporting 0.25 rFID, 82% ImageNet zero-shot accuracy, and gains over VILA-U on understanding and generation benchmarks.

History-Guided Video Diffusion

cs.LG · 2025-02-10 · unverdicted · novelty 7.0

DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

A masked discrete diffusion model adds token editing at inference and grouped cross-entropy training to reach 0.90 GenEval, 86.9 DPG, and 10.76 HPSv3 scores.

Lighting-Consistent Object Transfer Across Radiance Fields

cs.GR · 2026-06-21 · unverdicted · novelty 6.0

Diffusion-based per-view harmonization for lighting-consistent object transfer between 3DGS scenes, using heterogeneous training data and final 3D consolidation.

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

cs.SD · 2026-06-11 · unverdicted · novelty 6.0

Self-guidance adds a lightweight feature-mapping loss to align decoder manifolds in VQ-VAE speech codecs, raising reconstruction metrics and allowing 4x codebook reduction with no fidelity loss.

Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

UniTok tokenizes time series for an off-the-shelf LLM foundation model that unifies forecasting, generation, and classification through next-token prediction and training-free inference.

VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

VPG is a training-free inference-time guidance technique that improves autoregressive image and video generation by contrasting model outputs under generated versus corrupted prefixes to strengthen next-step support for the prefix.

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claimed better fidelity than cascaded baselines.

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

SRC-Flow compresses RAE features via a Semantic Representation Compressor into a low-dimensional space, enabling normalizing flows to reach gFID 1.65 on ImageNet 256x256 and 2.07 on 512x512 while retaining exact likelihoods.

Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

MSCoT uses multi-scale hierarchical token prediction, multi-scale guidance, and a token refiner to deliver SOTA text-to-motion control with 48% FID gain, 61% lower error, and 10x faster inference on HumanML3D.

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

cs.CV · 2026-05-14 · conditional · novelty 6.0

InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation

q-bio.BM · 2026-05-11 · unverdicted · novelty 6.0

Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10x fewer parameters than ESM3.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation q-bio.BM · 2026-05-11 · unverdicted · none · ref 15 · internal anchor
Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10x fewer parameters than ESM3.
MMaDA: Multimodal Large Diffusion Language Models cs.CV · 2025-05-21 · unverdicted · none · ref 24 · internal anchor
MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.
Open-Sora: Democratizing Efficient Video Production for All cs.CV · 2024-12-29 · unverdicted · none · ref 37 · internal anchor
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.
Seedance 1.0: Exploring the Boundaries of Video Generation Models cs.CV · 2025-06-10 · unverdicted · none · ref 31 · internal anchor
Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving cs.CV · 2026-04-03 · unreviewed · ref 49 · internal anchor

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer