RiT: Vanilla Diffusion Transformers Suffice in Representation Space
Pith reviewed 2026-05-22 07:46 UTC · model grok-4.3
The pith
A vanilla Diffusion Transformer on DINOv2 features achieves state-of-the-art image generation on ImageNet using x-prediction in flow matching.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training a vanilla Diffusion Transformer with x-prediction on frozen DINOv2 features, augmented only with a dimension-aware noise schedule and joint [CLS]-patch modeling, the model attains an FID of 1.45 without guidance and 1.14 with classifier-free guidance on ImageNet 256x256. It outperforms DiT^DH-XL with 19% fewer parameters (676M vs. 839M), and its sampling ODE can be solved efficiently at coarse discretizations.
What carries the argument
The Representation Image Transformer (RiT) applies a standard Diffusion Transformer architecture directly in the frozen DINOv2 representation space, using x-prediction for flow matching.
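As a rough illustration of what x-prediction means here, the sketch below regresses the clean feature tensor directly and converts that prediction to a velocity for sampling. It assumes a linear path z_t = t·x + (1 − t)·ε with t = 1 at clean data, a uniform timestep draw, and a plain mean-squared error on x; these conventions and the names `interpolate`, `x_prediction_loss`, and `velocity_from_x` are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def interpolate(x, eps, t):
    """Linear path z_t = t * x + (1 - t) * eps (assumed convention: t=1 is clean data)."""
    t = t.reshape(-1, *([1] * (x.ndim - 1)))
    return t * x + (1.0 - t) * eps

def x_prediction_loss(predict_x, x, rng):
    """Mean squared error between the predicted and the true clean features."""
    eps = rng.standard_normal(x.shape)
    t = rng.uniform(0.0, 1.0, size=x.shape[0])  # placeholder for the real timestep schedule
    z_t = interpolate(x, eps, t)
    x_hat = predict_x(z_t, t)                   # predict_x stands in for the transformer
    return np.mean((x_hat - x) ** 2)

def velocity_from_x(x_hat, z_t, t, t_eps=1e-3):
    """Under the path above, dz/dt = x - eps = (x - z_t) / (1 - t)."""
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (z_t.ndim - 1)))
    return (x_hat - z_t) / np.maximum(1.0 - t, t_eps)
```

The clamp on 1 − t is a hypothetical safeguard against division by zero near the data end of the path.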
If this is right
- With classifier-free guidance, the ODE solver needs only 5 Heun steps to reach FID 2.0 and 10 steps to reach FID 1.25, without distillation or consistency training.
- Classifier-free guidance improves FID from 1.45 to 1.14.
- Specialized prediction heads and Riemannian transport are unnecessary because DINOv2's feature statistics keep the flow-matching regression well-conditioned.
- Representation learning objectives provide advantages over mere compression in VAE latents.
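The few-step sampling claim can be pictured with a minimal Heun (second-order) integrator combined with classifier-free guidance. This is a sketch under assumptions: the same linear path with t = 1 at clean data, guidance applied to x-predictions, a uniform time grid, and a stop at `t_max` just short of 1. The function names and the `label=None` convention for the unconditional branch are hypothetical.

```python
import numpy as np

def cfg_x(predict_x, z, t, label, scale):
    """Classifier-free guidance on x-predictions; label=None means unconditional."""
    x_cond = predict_x(z, t, label)
    if scale == 1.0:
        return x_cond
    x_uncond = predict_x(z, t, None)
    return x_uncond + scale * (x_cond - x_uncond)

def heun_sample(predict_x, z0, label, num_steps=5, scale=1.0, t_max=0.999):
    """Integrate dz/dt = (x_hat - z) / (1 - t) from t=0 (noise) towards t=1 (data)."""
    def velocity(z, t):
        return (cfg_x(predict_x, z, t, label, scale) - z) / (1.0 - t)

    ts = np.linspace(0.0, t_max, num_steps + 1)
    z = z0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t_cur
        v_cur = velocity(z, t_cur)
        z_euler = z + h * v_cur                 # Euler predictor
        v_next = velocity(z_euler, t_next)
        z = z + 0.5 * h * (v_cur + v_next)      # trapezoidal corrector
    return z
```

In this sketch each Heun step costs two velocity evaluations and guidance doubles the network calls per evaluation, so 5 guided steps correspond to 20 forward passes.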
Where Pith is reading between the lines
- Similar geometric benefits might appear in other self-supervised representations for generative tasks.
- Freezing the feature extractor might not be necessary if joint optimization could further improve results.
- This could enable faster generation pipelines in applications requiring quick sampling.
- The findings suggest prioritizing representation quality in designing future diffusion models.
Load-bearing premise
The observed geometric properties of DINOv2 features directly cause the effectiveness of vanilla x-prediction, rather than merely correlating with the low FID scores.
What would settle it
Train an identical vanilla Diffusion Transformer with x-prediction on pixels or SD-VAE latents. If it matched RiT's FID, the causal role of the representation geometry would be falsified.
Original abstract
Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d}\!\approx\!33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the \emph{Representation Image Transformer} (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint \texttt{[CLS]}-patch modeling. On ImageNet $256{\times}256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs.\ 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
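For readers who want to probe the intrinsic-dimensionality figure, the sketch below implements the TwoNN estimator of Facco et al. [8] in its maximum-likelihood form, d = N / Σ log(r2/r1), where r1 and r2 are each point's distances to its first and second nearest neighbours. Whether the paper uses this variant, the linear-fit variant, or the estimator of [17] is not stated here, so treat the choice as an assumption. The dense distance matrix limits it to a few thousand points.

```python
import numpy as np

def twonn_intrinsic_dimension(points):
    """TwoNN maximum-likelihood estimate: d = N / sum(log(r2 / r1))."""
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)          # a point is not its own neighbour
    sq_dists = np.maximum(sq_dists, 0.0)        # clip tiny negatives from round-off
    two_smallest = np.partition(sq_dists, 1, axis=1)[:, :2]
    r1 = np.sqrt(two_smallest[:, 0])
    r2 = np.sqrt(two_smallest[:, 1])
    keep = r1 > 0                               # drop exact duplicates
    mu = r2[keep] / r1[keep]
    return keep.sum() / np.sum(np.log(mu))
```

A quick sanity check is to embed a 3-dimensional Gaussian cloud in a 10-dimensional space with an orthonormal map; the estimate should land near 3 rather than 10.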
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RiT, a vanilla Diffusion Transformer trained with x-prediction flow matching directly on frozen DINOv2 features for class-conditional image generation. It compares geometric properties (intrinsic dimension, effective rank, covariance conditioning, excess kurtosis, and on-manifold interpolation error) across pixel space, SD-VAE latents, and DINOv2 features, arguing that DINOv2's statistics render the regression well-conditioned and obviate specialized heads or Riemannian methods. On ImageNet 256×256, RiT reports FID 1.45 (unguided) and 1.14 (CFG), outperforming DiT^DH-XL with 19% fewer parameters, and achieves low FID with only 5–10 Heun steps.
Significance. If the performance numbers hold under full verification, the result indicates that pretrained representation spaces can simplify diffusion transformer design by supplying better-conditioned targets for standard x-prediction, reducing reliance on architectural specialization or heavy sampling tricks. The concrete FID values, parameter efficiency, and coarse-discretization sampling performance are strengths; the public code release further supports reproducibility of both the geometric measurements and the training pipeline.
major comments (1)
- [§3–4] §4 (Experiments) and §3 (Geometric Analysis): the central explanatory claim—that DINOv2's 7.3× higher effective rank, 35× better conditioning, 11.5× lower kurtosis, and 1.7× lower interpolation error causally enable vanilla x-prediction success—is not isolated from the dimension-aware noise schedule and joint [CLS]-patch modeling also introduced in RiT. No ablation holds the full RiT recipe fixed while swapping only the input representation (e.g., pixel or SD-VAE inputs under identical schedule and modeling choices) to test whether the reported FID 1.45/1.14 and 5-step Heun performance require DINOv2 geometry specifically. The current comparisons treat the four axes as explanatory rather than correlative.
minor comments (2)
- [Table 1] Table 1: the exact procedure for computing effective rank and excess kurtosis on the feature sets should be stated (e.g., number of samples, regularization, or eigenvalue threshold) to allow direct replication of the 7.3× and 11.5× factors.
- [§4.3] §4.3: the baseline DiT^DH-XL implementation details (exact hyper-parameters, feature extraction pipeline, and whether the same DINOv2 encoder is used) are referenced only by citation; a short paragraph or appendix entry would clarify the fairness of the 676M vs. 839M parameter comparison.
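The replication concern about Table 1 comes down to choices like the ones made explicit in the following sketch. It computes effective rank in the sense of Roy and Vetterli [24] (the exponential of the entropy of a normalized spectrum), a covariance condition number, and a mean per-dimension excess kurtosis. Applying effective rank to covariance eigenvalues rather than to singular values of the data matrix, the `floor` threshold, and averaging kurtosis over dimensions are all assumptions made for illustration; the paper's procedure may differ.

```python
import numpy as np

def covariance_spectrum(features):
    """Eigenvalues of the sample covariance, sorted in descending order."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    return np.linalg.eigvalsh(cov)[::-1]

def effective_rank(eigenvalues, floor=1e-12):
    """exp(entropy) of the spectrum normalized to sum to one."""
    lam = np.clip(eigenvalues, 0.0, None)
    p = lam / lam.sum()
    p = p[p > floor]                            # hypothetical threshold; 0 * log 0 := 0
    return float(np.exp(-np.sum(p * np.log(p))))

def condition_number(eigenvalues, floor=1e-12):
    """Largest over smallest eigenvalue; expects descending order."""
    lam = np.clip(eigenvalues, floor, None)
    return float(lam[0] / lam[-1])

def mean_excess_kurtosis(features):
    """Per-dimension excess kurtosis (0 for a Gaussian), averaged over dimensions."""
    centered = features - features.mean(axis=0, keepdims=True)
    var = np.mean(centered ** 2, axis=0)
    fourth = np.mean(centered ** 4, axis=0)
    return float(np.mean(fourth / var ** 2 - 3.0))
```

Reported ratios such as 7.3× or 11.5× can shift with the sample count and the threshold, which is why stating them matters for replication.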
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the concern that the geometric advantages of DINOv2 are not isolated from the dimension-aware schedule and [CLS]-patch modeling in our experiments.
Referee: [§3–4] §4 (Experiments) and §3 (Geometric Analysis): the central explanatory claim—that DINOv2's 7.3× higher effective rank, 35× better conditioning, 11.5× lower kurtosis, and 1.7× lower interpolation error causally enable vanilla x-prediction success—is not isolated from the dimension-aware noise schedule and joint [CLS]-patch modeling also introduced in RiT. No ablation holds the full RiT recipe fixed while swapping only the input representation (e.g., pixel or SD-VAE inputs under identical schedule and modeling choices) to test whether the reported FID 1.45/1.14 and 5-step Heun performance require DINOv2 geometry specifically. The current comparisons treat the four axes as explanatory rather than correlative.
Authors: We agree that a controlled ablation holding the RiT architecture, dimension-aware noise schedule, and joint [CLS]-patch modeling fixed while varying only the input representation would more directly test causality. The geometric measurements in §3 are performed independently on the frozen representations and show that DINOv2 features exhibit markedly better conditioning and lower kurtosis than pixels or SD-VAE latents despite similar intrinsic dimensionality; these statistics are presented as supporting evidence for why standard x-prediction succeeds without specialized heads. Nevertheless, we acknowledge the current design does not fully decouple the representation from the schedule and modeling choices. In the revised manuscript we will add experiments that apply the complete RiT training recipe to pixel-space and SD-VAE inputs under identical schedule and [CLS] settings, allowing direct comparison of the resulting FID and sampling efficiency.
Revision: yes
Circularity Check
No significant circularity; results from direct training and evaluation
The paper computes geometric statistics (effective rank, covariance conditioning, excess kurtosis, interpolation error) directly on pixel, SD-VAE, and DINOv2 feature spaces and reports them as observations. It then trains a vanilla Diffusion Transformer on frozen DINOv2 features using x-prediction plus two auxiliary components (dimension-aware noise schedule, joint [CLS]-patch modeling) and measures FID on the held-out ImageNet validation set. No equation reduces the reported FID values or the claim of well-conditioned regression to a fitted parameter defined inside the paper; the performance numbers are obtained by standard model training and benchmark evaluation rather than by construction from the geometric axes.
Axiom & Free-Parameter Ledger
free parameters (1)
- dimension-aware noise schedule parameters
axioms (1)
- domain assumption: DINOv2 features contain a low-dimensional manifold of intrinsic dimensionality comparable to pixel space but with superior statistical conditioning for regression
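The ledger names the dimension-aware noise schedule as a free parameter without giving its form. One widely used dimension-dependent adjustment is the timestep shift of Esser et al. [7], which remaps a noise level σ (σ = 1 is pure noise) by a factor α = sqrt(dim / base_dim). The sketch below shows that formula only as an example of the general idea; it is an assumption, not RiT's confirmed schedule, and `dim` and `base_dim` are hypothetical parameter names.

```python
import numpy as np

def shift_noise_level(sigma, dim, base_dim):
    """Shift of Esser et al. [7]: sigma' = a * sigma / (1 + (a - 1) * sigma), a = sqrt(dim / base_dim)."""
    alpha = np.sqrt(dim / base_dim)
    sigma = np.asarray(sigma, dtype=float)
    return alpha * sigma / (1.0 + (alpha - 1.0) * sigma)
```

The endpoints 0 and 1 are fixed points of the map, and when `dim` is four times `base_dim` a noise level of 0.5 moves to 2/3, so higher-dimensional inputs spend more of the schedule at high noise.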
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Flow matching with x-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space [18]."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "DINOv2 exhibits 7.3× higher effective rank, 35× better covariance conditioning, 11.5× lower excess kurtosis, and 1.7× lower on-manifold interpolation error"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shadab Ahamed, Eshed Gal, Simon Ghyselincks, Md Shahriar Rahim Siddiqui, Moshe Eliasof, and Eldad Haber. Preconditioned score and flow matching. arXiv preprint arXiv:2603.02337, 2026.
- [2] Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries. arXiv preprint arXiv:2302.03660, 2023.
- [3] Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963, 2025.
- [4] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.
- [5] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- [6] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
- [7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.
- [8] Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports, 7(1):12140, 2017.
- [9] Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.
- [10] Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829, 2025.
- [11] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020.
- [12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- [13] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.
- [14] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022.
- [15] Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling. arXiv preprint arXiv:2502.09509, 2025.
- [16] Amandeep Kumar and Vishal M Patel. Learning on the manifold: Unlocking standard diffusion transformers with representation encoders. arXiv preprint arXiv:2602.10099, 2026.
- [17] Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems, 17, 2004.
- [18] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025.
- [19] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
- [20] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.
- [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
- [22] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748
- [23] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [24] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference, pages 606–610. IEEE, 2007.
- [25] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
- [26] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- [27] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025.
- [28] Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. arXiv preprint arXiv:2502.14831, 2025.
- [29] Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.
- [30] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- [31] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023.
- [32] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [33] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- [34] Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17001–17012, 2025.
- [35] Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208, 2026.
- [36] Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268, 2025.
- [37] Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer, 2025.
- [38] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR, 2020.
- [39] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025.
- [40] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.
- [41] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.
- [42] Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Anima Anandkumar, and Arash Vahdat. Pixeldit: Pixel diffusion transformers for image generation. arXiv preprint arXiv:2511.20645, 2025.
- [43] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.
- [44] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.
- [45] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. TMLR, 2023.
A Limitations
DINOv2 encoder bias. RiT inherits the inductive biases of the frozen DINOv2 encoder. DINOv2’s SSL objective emphasizes semantic content over photometric detail, and prior work has observed weaker feature reso...
1. adaLN modulation: timestep and class embeddings are summed (c = Emb(t) + Emb(y)) and projected to per-layer scale/shift parameters via a shared SiLU–Linear layer.
2. Multi-head self-attention with QK-normalization (RMSNorm on Q and K before attention) and VisionRoPE for 2D spatial position encoding. [CLS] and register tokens are excluded from RoPE.
3. SwiGLU FFN: FFN(x) = (SiLU(x W_1) ⊙ x W_3) W_2.
The final layer uses adaLN-modulated RMSNorm followed by a linear projection to d output channels (384 for DINOv2-Small, 768 for D...
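The block components listed above can be sketched in a few lines of NumPy. The SwiGLU function follows the formula as given; the attention function is a simplified single-head version with RMSNorm on Q and K. The omission of learned gains, rotary embeddings, multiple heads, and adaLN modulation, as well as the 1/sqrt(d) logit scaling, are simplifying assumptions rather than details of the RiT implementation.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w1, w2, w3):
    """FFN(x) = (SiLU(x W1) * (x W3)) W2, with * the elementwise product."""
    return (silu(x @ w1) * (x @ w3)) @ w2

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (a gain vector would multiply the result)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    """Single-head attention with RMSNorm applied to Q and K before the dot product."""
    q, k = rms_norm(q), rms_norm(k)
    logits = q @ k.T / np.sqrt(q.shape[-1])     # assumed scaling
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Normalizing Q and K bounds the attention logits regardless of the scale of the incoming features, which is the usual motivation for QK-normalization [11].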