AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.
Language model beats diffusion - tokenizer is key to visual generation
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.
Prologue adds a small set of learnable tokens trained exclusively with AR cross-entropy loss to decouple generation from reconstruction in autoregressive visual models, yielding lower gFID on ImageNet 256x256.
OAR distills specialized generation orders from any-order AR models via self-distillation, improving FID from 2.39 to 2.17 on ImageNet 256x256 while preserving multi-task flexibility.
MADreMIA amplifies membership inference signals by showing that memorized samples maintain higher coherence and slower degradation in chained regeneration trajectories than non-members.
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
citing papers explorer
-
AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens
AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.
-
Autoregressive Visual Generation Needs a Prologue
Prologue adds a small set of learnable tokens trained exclusively with AR cross-entropy loss to decouple generation from reconstruction in autoregressive visual models, yielding lower gFID on ImageNet 256x256.
-
Distilling Specialized Orders for Visual Generation
OAR distills specialized generation orders from any-order AR models via self-distillation, improving FID from 2.39 to 2.17 on ImageNet 256x256 while preserving multi-task flexibility.
-
Amplifying Membership Signal Through Chained Regeneration
MADreMIA amplifies membership inference signals by showing that memorized samples maintain higher coherence and slower degradation in chained regeneration trajectories than non-members.
-
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.