Selftok: Discrete visual tokens of autoregression, by diffusion, and for reasoning

Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, for Reasoning · 2025 · arXiv 2505.07538

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Autoregressive Visual Generation Needs a Prologue

cs.CV · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Prologue adds a small set of learnable tokens trained exclusively with AR cross-entropy loss to decouple generation from reconstruction in autoregressive visual models, yielding lower gFID on ImageNet 256x256.

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.

(1D) Ordered Tokens Enable Efficient Test-Time Search

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

Coarse-to-fine 1D token sequences in autoregressive models enable stronger test-time search and even training-free text-to-image generation guided by verifiers, outperforming traditional 2D grid tokenization.

Your Data Manifold is Secretly a Reward Model: Shell-LCC for Text-to-Video Generation

cs.CV · 2026-06-29 · unverdicted · novelty 5.0

Shell-LCC models the high-quality data manifold as an isotropic shell to derive cost-free reward signals that improve realism and high-frequency details in text-to-video generation.

citing papers explorer

Showing 4 of 4 citing papers.

Autoregressive Visual Generation Needs a Prologue · cs.CV · 2026-05-07 · unverdicted · none · ref 46 · 2 links
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers · cs.CV · 2026-06-11 · unverdicted · none · ref 83
(1D) Ordered Tokens Enable Efficient Test-Time Search · cs.CV · 2026-04-16 · unverdicted · none · ref 3
Your Data Manifold is Secretly a Reward Model: Shell-LCC for Text-to-Video Generation · cs.CV · 2026-06-29 · unverdicted · none · ref 39
