Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie · 2025 · cs.CV · arXiv 2510.11690

Canonical reference. 91 Pith papers cite this work, and 73% of classified citations cite it as background.

abstract

Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.

citation-role summary

background 17 · baseline 3 · method 1 · other 1

citation-polarity summary

background 16 · baseline 3 · support 1 · unclear 1 · use method 1

representative citing papers

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains a VQ tokenizer and an AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, a new SOTA among pixel-space AR models.

STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

STREAM applies stochastic Riemannian flow matching on VFM-derived unit hypersphere latents with a novel anisotropic decoder to achieve SOTA reconstruction and generation on breast and colorectal cancer histopathology datasets.

Diff-CA: Separating Common and Salient Factors with Diffusion Models

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Diff-CA is a diffusion-based contrastive analysis method that decomposes conditioning into common and salient factors under weak supervision; the paper also proves identifiability of the additive model.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

Mind-Omni unifies seven brain-vision-language tasks in one discrete-diffusion framework with a brain tokenizer and a new BQA dataset, claiming SOTA multi-task performance competitive with larger single-task models.

Let EEG Models Learn EEG

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

JET is a conditional flow matching framework that generates EEG as continuous raw sequences with added constraints for spectral and temporal properties, achieving over 40% lower TS-FID than prior discrete denoising methods on three benchmarks.

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

cs.CV · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning

cs.LG · 2026-03-20 · unverdicted · novelty 7.0

SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.

Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models

cs.CV · 2026-03-15 · unverdicted · novelty 7.0

Matched benchmarking reveals FID misleads in few-step regimes under CFG, prompting CLIP-scaled and PickScore-scaled FID and IS variants for better semantic evaluation of one-step image generators.

Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

PointDiT is a from-scratch pixel-space Diffusion Transformer for monocular 3D point map estimation that outperforms latent diffusion models in sharpness and ambiguous regions while using a simpler architecture.

Representation Distribution Matching for One-Step Visual Generation

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

RDM trains one-step generators via MMD on large batches and multi-encoder representations, achieving SOTA SW_r14 of 1.30 on ImageNet and distilling FLUX.2 to one-step with gains on GenEval and PickScore.

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

The paper formalizes the Fashion Detail Generation task, releases the FDBench benchmark with 40K+ pairs, and proposes the CFAD distillation method plus an RL consistency reward that outperforms open-source baselines.

Flow Matching in Feature Space for Stochastic World Modeling

cs.CV · 2026-06-27 · unverdicted · novelty 6.0

FlowWM applies flow matching directly in pretrained feature space with a one-step projection mechanism, improving perception accuracy, mode coverage, and horizon robustness on synthetic and real-world benchmarks.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

IDEAL improves discrete representation autoencoders by jointly aligning quantized tokens with shallow and deep VFM features, reporting 0.61 rFID on ImageNet and 1.89 gFID for autoregressive image generation.

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Echo-DM proposes a mask-free conditional latent diffusion framework with region-aware fusion for removing markers from ultrasound images while preserving anatomical fidelity.

Generate in Reconstruction Space, Match in Semantic Space: Transport Geometry for One-Step Generation

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

Matching in semantic SSL feature space via Sinkhorn divergence enables effective one-step generation on ImageNet by inducing compact geometry for distribution matching, with training and evaluation features best kept distinct.

Representation Forcing for Bottleneck-Free Unified Multimodal Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Representation Forcing enables end-to-end pixel-space unified multimodal models by making visual representation prediction a native autoregressive generation target that guides subsequent pixel diffusion in the same backbone.

citing papers explorer

PixelGen: Improving Pixel Diffusion with Perceptual Supervision

cs.CV · 2026-02-02 · accept · none · ref 28 · internal anchor

PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.
