
arxiv: 2605.21981 · v1 · pith:NOUPZU5Snew · submitted 2026-05-21 · 💻 cs.CV

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Pith reviewed 2026-05-22 07:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · flow matching · DINOv2 · image generation · representation space · ImageNet · FID · Diffusion Transformer

The pith

A vanilla Diffusion Transformer on DINOv2 features achieves state-of-the-art image generation on ImageNet using x-prediction in flow matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that pretrained representation spaces like DINOv2 offer a more favorable distribution for flow-matching models than raw pixels or VAE latents. Comparing geometric properties, it reports that DINOv2 features have nearly the same intrinsic dimensionality as pixels (both about 33) but 7.3× higher effective rank, 35× better covariance conditioning, 11.5× lower excess kurtosis, and 1.7× lower on-manifold interpolation error. According to the paper, these properties let a vanilla Diffusion Transformer regress clean data points effectively without specialized prediction heads. The resulting RiT model reaches FID 1.45 without guidance on ImageNet 256×256 with 19% fewer parameters than DiT^DH-XL, and with classifier-free guidance its sampling ODE can be solved in as few as 5 to 10 Heun steps.
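To make three of the geometric quantities above concrete, here is a minimal sketch. It assumes the effective-rank definition of Roy and Vetterli [24] applied to a covariance eigenvalue spectrum (exponential of the Shannon entropy of the normalized spectrum), the ratio of largest to smallest eigenvalue as the condition number, and the fourth standardized moment minus 3 as excess kurtosis. The function names and toy spectra are hypothetical; this is not the paper's measurement code, and the paper's sample counts and any regularization are not given here.

```python
import math

def effective_rank(eigenvalues):
    """exp(Shannon entropy) of the spectrum normalized to sum to 1."""
    total = sum(eigenvalues)
    probs = [v / total for v in eigenvalues if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def condition_number(eigenvalues):
    """Largest over smallest eigenvalue of a covariance spectrum."""
    return max(eigenvalues) / min(eigenvalues)

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    fourth = sum((s - mean) ** 4 for s in samples) / n
    return fourth / var ** 2 - 3.0

# Toy spectra (hypothetical): a flat one and a fast-decaying one.
flat = [1.0] * 8
peaked = [2.0 ** -i for i in range(8)]
print(effective_rank(flat), effective_rank(peaked))
print(condition_number(flat), condition_number(peaked))
print(excess_kurtosis([-1.0, 1.0] * 50))
```

A flat spectrum has effective rank equal to the number of components and condition number 1, while a fast-decaying spectrum scores lower on the first and higher on the second, which is the direction of the pixel versus DINOv2 contrast the paper reports.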

Core claim

By training a vanilla Diffusion Transformer with x-prediction on frozen DINOv2 features, augmented only with a dimension-aware noise schedule and joint [CLS]-patch modeling, the model attains an FID of 1.45 without guidance and 1.14 with classifier-free guidance on ImageNet 256×256. It outperforms DiT^DH-XL with 19% fewer parameters (676M vs. 839M), and its ODE can be solved efficiently at coarse discretizations.

What carries the argument

The Representation Image Transformer (RiT): a standard Diffusion Transformer applied directly in the frozen DINOv2 representation space and trained with x-prediction for flow matching.
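As an illustration of what x-prediction means here, the following sketch uses the interpolant z_t = t z_0 + (1 − t) ε quoted in the Figure 5 caption. It assumes a plain squared error on the clean point, which may differ from the loss weighting the paper actually uses, and it replaces the network with a hypothetical perfect denoiser. All function names are illustrative.

```python
import random

def interpolate(z0, eps, t):
    """Noisy point z_t = t * z0 + (1 - t) * eps (t=1 is clean data)."""
    return [t * a + (1.0 - t) * b for a, b in zip(z0, eps)]

def x_prediction_loss(predict_z0, z0, t, rng):
    """Mean squared error between the predicted and true clean point."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = interpolate(z0, eps, t)
    z0_hat = predict_z0(zt, t)
    return sum((p - a) ** 2 for p, a in zip(z0_hat, z0)) / len(z0)

def velocity_from_x(z0_hat, zt, t):
    """Velocity implied by a clean-point estimate: (z0_hat - z_t) / (1 - t)."""
    return [(p - z) / (1.0 - t) for p, z in zip(z0_hat, zt)]

rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]        # hypothetical clean feature vector
oracle = lambda zt, t: z0    # stand-in for the network: a perfect denoiser
print(x_prediction_loss(oracle, z0, 0.3, rng))

eps = [1.0, 0.0, -1.0]
zt = interpolate(z0, eps, 0.3)
print(velocity_from_x(z0, zt, 0.3))  # matches z0 - eps up to rounding
```

The last line checks the identity that makes x-prediction usable for sampling: substituting ε = (z_t − t z_0)/(1 − t) into the true velocity z_0 − ε gives (z_0 − z_t)/(1 − t), so a clean-point estimate can always be converted into a velocity for t < 1.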

If this is right

  • With classifier-free guidance, 5 Heun steps reach FID 2.0 and 10 steps reach FID 1.25, without distillation or consistency training.
  • Classifier-free guidance improves FID from 1.45 to 1.14.
  • Specialized prediction heads or Riemannian transport are unnecessary due to the favorable geometry.
  • Representation learning objectives provide advantages over mere compression in VAE latents.
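The few-step sampling claim in the list above can be pictured with a small Heun integrator. This is a sketch under stated assumptions: a uniform time grid from t=0 (noise) to t=1 (data), guidance applied to the clean-point estimate (equivalent to guiding the velocity, since the two are related linearly for fixed z and t), and a plain Euler update on the final step because the x-prediction velocity divides by 1 − t. The `predict_z0` callable and its signature are hypothetical stand-ins for the trained network, not the authors' sampler.

```python
def guided(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def velocity(predict_z0, z, t, scale):
    """Velocity (z0_hat - z) / (1 - t) from a guided clean-point estimate."""
    z0_hat = guided(predict_z0(z, t, True), predict_z0(z, t, False), scale)
    return [(p - a) / (1.0 - t) for p, a in zip(z0_hat, z)]

def heun_sample(predict_z0, noise, steps, scale=1.0):
    """Integrate from t=0 (noise) to t=1 (data) on a uniform grid."""
    z = list(noise)
    h = 1.0 / steps
    for i in range(steps):
        t, t_next = i * h, (i + 1) * h
        v1 = velocity(predict_z0, z, t, scale)
        z_euler = [a + h * v for a, v in zip(z, v1)]
        if i == steps - 1:
            # The velocity is singular at t=1, so the last step is plain Euler.
            z = z_euler
        else:
            # Heun: average the slope at t and at the Euler-predicted point.
            v2 = velocity(predict_z0, z_euler, t_next, scale)
            z = [a + 0.5 * h * (x + y) for a, x, y in zip(z, v1, v2)]
    return z

target = [1.0, -2.0]
toy = lambda z, t, conditional: target  # hypothetical perfect denoiser
print(heun_sample(toy, [0.3, 0.7], steps=5))
```

With this toy denoiser the trajectory is a straight line and any step count lands on the target. A real network gives a curved trajectory, and the number of steps needed then depends on how well-behaved that trajectory is, which is what the paper's step-count results measure.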

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar geometric benefits might appear in other self-supervised representations for generative tasks.
  • Freezing the feature extractor may not be necessary; jointly optimizing it with the generator could further improve results.
  • This could enable faster generation pipelines in applications requiring quick sampling.
  • The findings suggest prioritizing representation quality in designing future diffusion models.

Load-bearing premise

That the observed geometric properties of DINOv2 features directly cause the effectiveness of vanilla x-prediction rather than merely correlating with the low FID scores.

What would settle it

Train an identical vanilla Diffusion Transformer with x-prediction on pixel space or SD-VAE latents and compare its FID with RiT's. If it matched RiT, the causal role of the representation geometry would be falsified; a clear gap would support it.

Figures

Figures reproduced from arXiv: 2605.21981 by Aishwarya Agrawal, Le Zhang, Ning Mang.

Figure 1: Manifold analysis across Pixel, SD-VAE, and DINOv2. (a) PCA spectrum: cumulative variance (top) and per-component variance on log scale (bottom); flatter decay indicates more uniform spread. (b) Condition number $\kappa(\Sigma_t)$ along the transport path; DINOv2 stays 35× better conditioned than Pixel at t=0.9. (c) Interpolation reconstruction MSE; Pixel stays off-manifold while DINOv2 remains close throughout. The ma…
Figure 3: Cross-class interpolation. Top row of each pair: pixel-space blending $x_t=(1-t)x_a+t x_b$ (ghosting artifacts). Bottom row: interpolation in DINOv2 representation space $z_t=(1-t)z_a+t z_b$, then decoded back to pixels via the RAE decoder (smooth semantic transitions). [Adjacent body text:] …used as the latent space of latent diffusion models [23]. Pixels and DINOv2 share the same ambient dimensionality, enabling direct geometric comparis…
Figure 4: RiT architecture. Frozen RAE encoder/decoder (gray) bracket a vanilla DiT trained by x-prediction; [CLS] and patch tokens share self-attention, with separate heads for $\hat{z}_0$ and $\hat{z}_{\mathrm{cls},0}$. [Adjacent body text:] Guided by the geometry of Section 2, we instantiate the Representation Image Transformer (RiT) …
Figure 5: Time sampling p(t) (top) and per-token SNR (bottom). The SNR of $z_t = t z_0 + (1-t)\epsilon$ is $\mathrm{SNR}(t) = t^2/(1-t)^2$, but the effective per-token SNR scales with the per-token dimension d [13]: for a d-dimensional token, the $\ell_2$ noise magnitude grows as $\sqrt{d}$ while the signal stays at unit scale, so higher-d tokens need lower t (more noise) to reach the same relative corruption. A DINOv2-Small token has d=384, 128× th…
Figure 6: Convergence comparison on ImageNet 256². FID-50K vs training epochs. [Plot: x-axis Training Epochs (80 to 800); y-axis FID (w/o guidance); annotations "7× faster" and "4× faster"; legend: REPA-XL, RAE-XL (DINOv2-B), REG-XL, RJF (DINOv2-B), RAE-XL (DINOv2-S), RiT-XL (DINOv2-S).]
Figure 7: Few-step FID at matched NFE. [Plot: FID-50K (log scale) at 10 NFE and 20 NFE for JiT-H (pixel), DiT^DH-XL (DINOv2-B), and RiT-XL (DINOv2-S); extracted bar labels, unmapped: 26.2, 3.29, 2.38, 16.4, 1.87, 1.58. Adjacent truncation-error plot (cf. Figure 8): x-axis sampling steps K (Heun, $K_{\mathrm{ref}} = 125$); y-axis truncation error $\|x^{(K)} - x^{(\mathrm{ref})}\|_F$; annotations 2.7×, 12.9× decay, 3.6× decay, $K^{-2}$ (Heun 2nd-order); legend: RiT-XL (DINOv2-S), JiT-H/16 (pixel).]
Figure 8: DINOv2 ODE converges in few Heun steps. Pixel-space truncation error $\|x^{(K)} - x^{(\mathrm{ref})}\|_F$ vs step count K (mean ±1σ over 128 trajectories; each model vs its own $K_{\mathrm{ref}}=125$). RiT decays 12.9× (K=2→50) vs JiT's 3.6×; late-K slope −1.33 vs −0.91, with the dashed line showing the Heun $\propto K^{-2}$ asymptote. [Adjacent body text:] Pixel-space truncation error measurement. For each model we generate 128 trajectories per space from matched (ϵ, y) …
Figure 9: Curated RiT-XL samples on ImageNet 256², selected to span ImageNet categories.
Figure 10: PyTorch-style pseudocode for RiT training (left) and sampling (right). The sampling code shows Euler for clarity; in all main experiments we use a 2nd-order Heun solver that additionally averages the velocity at t and the predicted next step. Key differences from standard flow matching: x-prediction and joint CLS modeling. [Adjacent body text:] E Encoder Size Ablation. See Section 2 (main text) for the encoder size ablation and…
Figure 11: Uncurated RiT-XL samples on ImageNet 256². 28 samples across diverse categories: macaw, jellyfish, flamingo, king penguin, golden retriever, Siberian husky, arctic fox, lion, monarch butterfly, red panda, giant panda, balloon, space shuttle, ice cream, cheeseburger, pizza, cliff, coral reef, volcano, and daisy. Generated with Heun 100 steps, CFG scale 3.7.
Figure 12: Sampling schedule comparison. (a) Schedule functions t(i/K) for K=50 steps. (b, c) FID-50K vs. Heun step count without and with guidance. Coupled noise, RiT-XL on DINOv2-S, 800 epochs.
Figure 13: Random (non-curated) RiT-XL samples on ImageNet 256². Each row shows 8 independently generated samples for a single class. 24 classes shown: macaw, golden retriever, Siberian husky, arctic fox, lion, monarch butterfly, giant panda, balloon, ice cream, cheeseburger, pizza, cliff, coral reef, volcano, daisy, flamingo, king penguin, jellyfish, otter, red panda, cheetah, space shuttle, fountain, and loggerhe…
Figure 14: Layer-wise CLS–patch communication in RiT. Left: [CLS]-to-patch attention transitions from coarse scene aggregation to semantically salient regions. Right: patch-to-[CLS] attention shows global information exchange followed by focused refinement.
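The SNR relation quoted in the Figure 5 caption can be checked numerically. The sketch below evaluates SNR(t) = t²/(1 − t)² for the interpolant z_t = t z_0 + (1 − t) ε, inverts it to find the time that gives a target SNR, and computes the root-mean-square norm √d of d-dimensional standard Gaussian noise. It does not implement the paper's dimension-aware schedule, whose functional form is not given in the material above; the function names are hypothetical.

```python
import math

def snr(t):
    """SNR of z_t = t * z0 + (1 - t) * eps for unit-variance z0 and eps."""
    return t ** 2 / (1.0 - t) ** 2

def t_for_snr(target):
    """Invert snr(t): t = sqrt(s) / (1 + sqrt(s))."""
    root = math.sqrt(target)
    return root / (1.0 + root)

def noise_rms_norm(d):
    """Root-mean-square l2 norm of a d-dimensional standard Gaussian."""
    return math.sqrt(d)

for t in (0.1, 0.5, 0.9):
    print(t, snr(t), t_for_snr(snr(t)))
print(noise_rms_norm(384))  # d=384 is the DINOv2-Small token width
```

The inversion is the piece a dimension-aware schedule would build on: once a target signal-to-noise level is chosen for a given token width, `t_for_snr` gives the time at which the interpolant reaches it.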
Original abstract

Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d}\!\approx\!33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the \emph{Representation Image Transformer} (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint \texttt{[CLS]}-patch modeling. On ImageNet $256{\times}256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs.\ 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes RiT, a vanilla Diffusion Transformer trained with x-prediction flow matching directly on frozen DINOv2 features for class-conditional image generation. It compares geometric properties (intrinsic dimension, effective rank, covariance conditioning, excess kurtosis, and on-manifold interpolation error) across pixel space, SD-VAE latents, and DINOv2 features, arguing that DINOv2's statistics render the regression well-conditioned and obviate specialized heads or Riemannian methods. On ImageNet 256×256, RiT reports FID 1.45 (unguided) and 1.14 (CFG), outperforming DiT^DH-XL with 19% fewer parameters, and achieves low FID with only 5–10 Heun steps.

Significance. If the performance numbers hold under full verification, the result indicates that pretrained representation spaces can simplify diffusion transformer design by supplying better-conditioned targets for standard x-prediction, reducing reliance on architectural specialization or heavy sampling tricks. The concrete FID values, parameter efficiency, and coarse-discretization sampling performance are strengths; the public code release further supports reproducibility of both the geometric measurements and the training pipeline.

major comments (1)
  1. [§3–4] §4 (Experiments) and §3 (Geometric Analysis): the central explanatory claim—that DINOv2's 7.3× higher effective rank, 35× better conditioning, 11.5× lower kurtosis, and 1.7× lower interpolation error causally enable vanilla x-prediction success—is not isolated from the dimension-aware noise schedule and joint [CLS]-patch modeling also introduced in RiT. No ablation holds the full RiT recipe fixed while swapping only the input representation (e.g., pixel or SD-VAE inputs under identical schedule and modeling choices) to test whether the reported FID 1.45/1.14 and 5-step Heun performance require DINOv2 geometry specifically. The current comparisons treat the four axes as explanatory rather than correlative.
minor comments (2)
  1. [Table 1] Table 1: the exact procedure for computing effective rank and excess kurtosis on the feature sets should be stated (e.g., number of samples, regularization, or eigenvalue threshold) to allow direct replication of the 7.3× and 11.5× factors.
  2. [§4.3] §4.3: the baseline DiT^DH-XL implementation details (exact hyper-parameters, feature extraction pipeline, and whether the same DINOv2 encoder is used) are referenced only by citation; a short paragraph or appendix entry would clarify the fairness of the 676 M vs. 839 M parameter comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern that the geometric advantages of DINOv2 are not isolated from the dimension-aware schedule and [CLS]-patch modeling in our experiments.

Point-by-point responses
  1. Referee: [§3–4] §4 (Experiments) and §3 (Geometric Analysis): the central explanatory claim—that DINOv2's 7.3× higher effective rank, 35× better conditioning, 11.5× lower kurtosis, and 1.7× lower interpolation error causally enable vanilla x-prediction success—is not isolated from the dimension-aware noise schedule and joint [CLS]-patch modeling also introduced in RiT. No ablation holds the full RiT recipe fixed while swapping only the input representation (e.g., pixel or SD-VAE inputs under identical schedule and modeling choices) to test whether the reported FID 1.45/1.14 and 5-step Heun performance require DINOv2 geometry specifically. The current comparisons treat the four axes as explanatory rather than correlative.

    Authors: We agree that a controlled ablation holding the RiT architecture, dimension-aware noise schedule, and joint [CLS]-patch modeling fixed while varying only the input representation would more directly test causality. The geometric measurements in §3 are performed independently on the frozen representations and show that DINOv2 features exhibit markedly better conditioning and lower kurtosis than pixels or SD-VAE latents despite similar intrinsic dimensionality; these statistics are presented as supporting evidence for why standard x-prediction succeeds without specialized heads. Nevertheless, we acknowledge the current design does not fully decouple the representation from the schedule and modeling choices. In the revised manuscript we will add experiments that apply the complete RiT training recipe to pixel-space and SD-VAE inputs under identical schedule and [CLS] settings, allowing direct comparison of the resulting FID and sampling efficiency.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; results from direct training and evaluation

full rationale

The paper computes geometric statistics (effective rank, covariance conditioning, excess kurtosis, interpolation error) directly on pixel, SD-VAE, and DINOv2 feature spaces and reports them as observations. It then trains a vanilla Diffusion Transformer on frozen DINOv2 features using x-prediction plus two auxiliary components (dimension-aware noise schedule, joint [CLS]-patch modeling) and measures FID on the held-out ImageNet validation set. No equation reduces the reported FID values or the claim of well-conditioned regression to a fitted parameter defined inside the paper; the performance numbers are obtained by standard model training and benchmark evaluation rather than by construction from the geometric axes.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that DINOv2 provides a favorable manifold for flow matching and that standard diffusion training objectives transfer directly; no new entities are postulated.

free parameters (1)
  • dimension-aware noise schedule parameters
    The paper introduces a dimension-aware noise schedule whose exact functional form and any fitted constants are not detailed in the abstract.
axioms (1)
  • domain assumption: DINOv2 features contain a low-dimensional manifold of intrinsic dimensionality comparable to pixel space but with superior statistical conditioning for regression
    Invoked to explain why vanilla x-prediction suffices without specialized heads.

pith-pipeline@v0.9.0 · 5888 in / 1440 out tokens · 34215 ms · 2026-05-22T07:46:57.372335+00:00 · methodology



Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 14 internal anchors

  1. Shadab Ahamed, Eshed Gal, Simon Ghyselincks, Md Shahriar Rahim Siddiqui, Moshe Eliasof, and Eldad Haber. Preconditioned score and flow matching. arXiv preprint arXiv:2603.02337, 2026.

  2. Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries. arXiv preprint arXiv:2302.03660, 2023.

  3. Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963, 2025.

  4. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.

  5. Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.

  6. Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.

  7. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.

  8. Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports, 7(1):12140, 2017.

  9. Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.

  10. Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829, 2025.

  11. Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020.

  12. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

  13. Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.

  14. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022.

  15. Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling. arXiv preprint arXiv:2502.09509, 2025.

  16. Amandeep Kumar and Vishal M Patel. Learning on the manifold: Unlocking standard diffusion transformers with representation encoders. arXiv preprint arXiv:2602.10099, 2026.

  17. Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems, 17, 2004.

  18. Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025.

  19. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  20. Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.

  21. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.

  22. William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  23. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  24. Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference, pages 606–610. IEEE, 2007.

  25. Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.

  26. Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

  27. Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025.

  28. Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. arXiv preprint arXiv:2502.14831, 2025.

  29. Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.

  30. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  31. Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023.

  32. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  33. Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

  34. Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17001–17012, 2025.

  35. Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208, 2026.

  36. Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268, 2025.

  37. Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer, 2025.

  38. Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR, 2020.

  39. Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025.

  40. Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.

  41. Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

  42. Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Anima Anandkumar, and Arash Vahdat. Pixeldit: Pixel diffusion transformers for image generation. arXiv preprint arXiv:2511.20645, 2025.

  43. Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.

  44. Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

  45. Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. TMLR, 2023.

A Limitations

DINOv2 encoder bias. RiT inherits the inductive biases of the frozen DINOv2 encoder. DINOv2’s SSL objective emphasizes semantic content over photometric detail, and prior work has observed weaker feature reso...

Transformer block components (excerpt from the paper's architecture description)

  1. adaLN modulation: timestep and class embeddings are summed ($c = \mathrm{Emb}(t) + \mathrm{Emb}(y)$) and projected to per-layer scale/shift parameters via a shared SiLU–Linear layer.
  2. Multi-head self-attention with QK-normalization (RMSNorm on Q and K before attention) and VisionRoPE for 2D spatial position encoding. [CLS] and register tokens are excluded from RoPE.
  3. SwiGLU FFN: $\mathrm{FFN}(x) = (\mathrm{SiLU}(xW_1) \odot xW_3)W_2$.

The final layer uses adaLN-modulated RMSNorm followed by a linear projection to d output channels (384 for DINOv2-Small, 768 for D...
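The SwiGLU feed-forward formula quoted above, FFN(x) = (SiLU(xW_1) ⊙ xW_3)W_2, can be written out directly. The sketch below uses plain Python lists for a single token and tiny hypothetical weight matrices; bias terms, batching, and the adaLN modulation around the block are left out, and none of the names come from the paper's code.

```python
import math

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def swiglu_ffn(x, w1, w3, w2):
    """FFN(x) = (SiLU(x W1) * x W3) W2 with an elementwise product."""
    gate = [silu(v) for v in matvec(x, w1)]
    value = matvec(x, w3)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(hidden, w2)

# Tiny hypothetical weights: model width 2, hidden width 3.
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
w3 = [[0.5, 0.5, 0.0], [0.0, 1.0, 2.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(swiglu_ffn([1.0, 2.0], w1, w3, w2))
```

The gate branch (W_1 followed by SiLU) scales the value branch (W_3) elementwise before the output projection W_2, which is the only structural difference from a standard two-matrix feed-forward layer.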