Representation entanglement for generation: Training diffusion transformers is much easier than you think

Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li · 2025 · arXiv 2507.01467

Canonical reference. 88% of citing Pith papers cite this work as background.

23 Pith papers citing it

citation-role summary

background 7 · other 1

citation-polarity summary

background 7 · unclear 1

representative citing papers

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

MaskAlign uses random token-subset alignment and pre-mask mixing to reduce diffusion models' reliance on complete clean-image token sets during representation alignment.

Balancing Image Compression and Generation with Bootstrapped Tokenization

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.

Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Layer-wise representation alignment lets diffusion language models reuse semantic structures from frozen autoregressive models, accelerating training by up to 4x without architectural changes beyond the attention mask.

AHPA: Adaptive Hierarchical Prior Alignment for Diffusion Transformers

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

AHPA adaptively aligns diffusion transformers to hierarchical VAE priors via a dynamic router that matches supervision granularity to the current noise level, improving convergence and quality.

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI and Gen3DSR while keeping diffusion efficiency.

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

IDEAL improves discrete representation autoencoders by jointly aligning quantized tokens with shallow and deep VFM features, reporting 0.61 rFID on ImageNet and 1.89 gFID for autoregressive image generation.

Representation-Guided Discrete Molecular Graph Retrosynthesis

cs.LG · 2026-05-23 · unverdicted · novelty 6.0

GRG achieves 58.6/77.2/83.4/87.1 top-1/3/5/10 accuracy and 15.5 diversity on USPTO-50k retrosynthesis, outperforming the base generator while reducing training time by 30%.

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

cs.CV · 2026-05-21 · conditional · novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

Improved Baselines with Representation Autoencoders

cs.CV · 2026-05-18 · conditional · novelty 6.0

RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.

Efficient Image Synthesis with Sphere Latent Encoder

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

cs.CV · 2026-05-06 · unverdicted · novelty 6.0 · 3 refs

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by having the model act as both teacher (with multimodal context) and student (with text-only context) on its own roll-outs.

Stage-adaptive audio diffusion modeling

cs.SD · 2026-05-06 · unverdicted · novelty 6.0

A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

cs.CV · 2025-10-21 · unverdicted · novelty 6.0

VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.

From SRA to Self-Flow: Data Augmentation or Self-Supervision?

cs.CV · 2026-07-02 · unverdicted · novelty 5.0

Attention Separation ablations show that gains from SRA to Self-Flow in diffusion transformers arise mainly from noise-dimension data augmentation rather than token-level self-supervision.

Improving Visual Representation Alignment Generation with GRPO

cs.CV · 2026-05-30 · unverdicted · novelty 5.0

VRPO applies generative representation policy optimization to dynamically align diffusion features with pretrained visual encoders, claiming +1.8 FID gains and 2.3x faster training versus REPA.

Semantic Generative Tuning for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

Semantic Generative Tuning applies segmentation-based generative proxies during post-training to align and improve both understanding and generation in unified multimodal models.

Elucidating Representation Degradation Problem in Diffusion Model Training

cs.LG · 2026-05-11 · unverdicted · novelty 4.0

Diffusion models suffer representation degradation at high noise due to recoverability mismatch; ERD mitigates this by dynamic optimization reallocation, accelerating convergence across backbones.

TORA: Topological Representation Alignment for 3D Shape Assembly

cs.CV · 2026-04-05

citing papers explorer

Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment

cs.LG · 2026-05-07 · unverdicted · none · ref 5

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · none · ref 47

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · none · ref 75

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · none · ref 88

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

cs.CV · 2026-05-06 · unverdicted · none · ref 102 · 3 links

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

cs.CV · 2026-04-13 · unverdicted · none · ref 80

TORA: Topological Representation Alignment for 3D Shape Assembly

cs.CV · 2026-04-05 · unreviewed · ref 51
