pith. sign in

arxiv: 2606.00514 · v1 · pith:JWGB6JB6new · submitted 2026-05-30 · 💻 cs.LG · cs.CV

Generate in Reconstruction Space, Match in Semantic Space: Transport Geometry for One-Step Generation

Pith reviewed 2026-06-28 18:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords one-step generationSinkhorn divergenceself-supervised learningsemantic featuresWasserstein distancedistribution matchingImageNetFID
0
0 comments X

The pith

One-step generators match data effectively when transport is performed in semantic SSL space

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that one-step generative models improve dramatically when the matching objective is computed in frozen self-supervised learning feature space rather than raw pixels. The Sinkhorn divergence serves as a practical stand-in for Wasserstein distance in this space. Semantic features help because they suppress pixel-level nuisance details and create a simpler geometry where distributions can be aligned more reliably. On ImageNet this yields a 39-fold drop in FID, and the gains persist even when the training features differ from those used in standard evaluation metrics. The work identifies matching stability as a practical way to choose which SSL representations work best for generation.

Core claim

The Sinkhorn divergence computed in a semantically structured SSL feature space provides an effective training objective for one-step generation; the effectiveness arises because semantic features suppress nuisance reconstruction details and thereby induce a more compact geometry that renders distribution matching more tractable.

What carries the argument

Sinkhorn divergence computed in frozen SSL feature space as a tractable surrogate for Wasserstein distance

Load-bearing premise

Semantic SSL features suppress nuisance reconstruction details and thereby induce a more compact geometry that renders distribution matching more tractable.

What would settle it

A controlled experiment in which one-step models trained with SSL features show no gain in matching stability or sample quality over pixel-space baselines, or in which the reported FID reduction disappears when evaluation uses the identical feature extractor as training.

Figures

Figures reproduced from arXiv: 2606.00514 by Aviv Regev, Edward De Brouwer, Gabriele Scalia, Hugues Van Assel, Saeed Saremi.

Figure 1
Figure 1. Figure 1: Overview. A prior state u0 ∼ p0 is mapped in one step by gθ to a generated state u1 in generation space M, then decoded to the ambient sample xˆ = Dec(u1). In parallel, a real datum x ∼ p is encoded into the generation-space target u r = Enc(x). Both states are embedded by the frozen SSL feature extractor ϕ into feature vectors h = ϕ(u1) and h r = ϕ(u r ), and the training loss is a Sinkhorn transport prob… view at source ↗
Figure 2
Figure 2. Figure 2: Sinkhorn coupling matrices π ε between generated samples (rows) and real samples (columns) for five ImageNet classes, including a confusable pair (golden retriever, Labrador). Generated samples are fixed across all panels (from the MAE 50% model); only the featurizer ϕ varies, defining the transport cost Cij = ∥ϕ(ui) − ϕ(u + j )∥ 2 . Red outlines indicate the expected same-class blocks along the diagonal. … view at source ↗
Figure 3
Figure 3. Figure 3: Featurizer–evaluation alignment R2 for the six featurizers in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Uncurated class-conditional ImageNet samples from our method introduced in Section 3.2 [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: UMAP visualization of reference and generated cells for six configurations, colored by cell [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional uncurated samples (classes: macaw, hummingbird, koala, Arctic fox, lion, [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Additional uncurated samples (classes: rooster, loggerhead turtle, tarantula, pelican, [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Additional uncurated samples (classes: sea slug, spoonbill, sorrel, cannon, container ship, [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
read the original abstract

Generative modeling and self-supervised representation learning (SSL) optimize structurally different objectives: generative training rewards distributional fidelity, while SSL rewards semantic coherence. Yet recent work repeatedly finds that SSL features improve generative training, though the mechanism of this synergy remains unclear. Here, we study the benefits of SSL in generative modeling in the framework of one-step generation where the role of representation is explicit: frozen SSL features are used to match generated samples to real data. We use the Sinkhorn divergence in that feature space, providing a tractable surrogate for the Wasserstein distance, the population-level discrepancy approximated by Fr\'echet-style evaluation metrics (such as FID). We find that this objective becomes highly effective when computed in a semantically structured SSL feature space (a 39$\times$ reduction in ImageNet FID). We trace this behavior primarily to matching estimation: semantic SSL features that suppress nuisance reconstruction details induce a more compact geometry, making distribution matching more tractable. As a consequence, the best training SSL features need not match the features used by the evaluation metric. In particular, we show that using Inception as the feature extractor can improve FID while degrading matching stability and sample quality, revealing a form of metric hacking. Using extensive experiments on ImageNet, we identify which SSL feature families lead to best generation performance and show that matching stability is a quantitative criterion for selecting them. Code is available at https://github.com/Genentech/semantic-transport-generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper studies one-step generative modeling where samples are generated in pixel space but matched to data via Sinkhorn divergence computed in a frozen SSL feature space. It reports that semantically structured SSL features yield a 39× reduction in ImageNet FID relative to other choices, attributes the gain to a more compact geometry that renders distribution matching tractable, and shows that the optimal training features differ from those used by standard FID (Inception), thereby exposing a form of metric hacking. Experiments on ImageNet identify effective SSL feature families and propose matching stability as a selection criterion; code is released.

Significance. If the reported gains and mechanistic account hold, the work supplies a concrete, reproducible recipe for improving one-step generators and a practical criterion for choosing feature spaces that is distinct from evaluation metrics. The explicit contrast between training-time matching and FID computation, together with the public code, strengthens the contribution for the one-step and transport-based generation literature.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the central 39× FID reduction is stated without error bars, standard deviations across random seeds, or an explicit statement of the exact baseline (e.g., which non-SSL feature extractor or pixel-space Sinkhorn). Because this number is the primary empirical support for the geometry claim, the absence of statistical characterization weakens the ability to judge whether the improvement is robust.
  2. [Discussion of mechanism] § on tracing behavior to matching estimation: the attribution of gains to 'more compact geometry' induced by semantic SSL features is presented as the primary explanation, yet no quantitative diagnostic (e.g., covariance condition number, effective dimension, or Sinkhorn convergence rate) is reported to distinguish this mechanism from alternatives such as improved semantic alignment or reduced sensitivity to nuisance factors.
  3. [Experiments] Experiments on feature-family comparison: while the manuscript identifies which SSL families perform best, the tables do not include an ablation that isolates the contribution of the Sinkhorn surrogate itself versus the choice of feature extractor (e.g., replacing Sinkhorn with a simpler moment-matching loss in the same SSL space). This comparison is load-bearing for the claim that the transport geometry, rather than the representation alone, drives the improvement.
minor comments (2)
  1. [Method] Notation for the Sinkhorn divergence and its entropic regularization parameter should be introduced once with a clear equation reference rather than re-defined inline in multiple sections.
  2. [Figures] Figure captions for the stability plots should explicitly state the number of Sinkhorn iterations and the batch size used, as these hyperparameters directly affect the reported stability criterion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central 39× FID reduction is stated without error bars, standard deviations across random seeds, or an explicit statement of the exact baseline (e.g., which non-SSL feature extractor or pixel-space Sinkhorn). Because this number is the primary empirical support for the geometry claim, the absence of statistical characterization weakens the ability to judge whether the improvement is robust.

    Authors: We agree that statistical characterization strengthens the claim. The 39× figure is computed against a pixel-space Sinkhorn baseline; in revision we will report standard deviations over multiple random seeds for the key FID values and explicitly restate the baseline in both the abstract and experiments section. revision: yes

  2. Referee: [Discussion of mechanism] § on tracing behavior to matching estimation: the attribution of gains to 'more compact geometry' induced by semantic SSL features is presented as the primary explanation, yet no quantitative diagnostic (e.g., covariance condition number, effective dimension, or Sinkhorn convergence rate) is reported to distinguish this mechanism from alternatives such as improved semantic alignment or reduced sensitivity to nuisance factors.

    Authors: The manuscript grounds the compact-geometry account in the observed matching-stability differences across feature families. We will add covariance condition numbers and effective-dimension estimates for the principal feature spaces to the appendix, providing a quantitative supplement to the stability criterion. revision: partial

  3. Referee: [Experiments] Experiments on feature-family comparison: while the manuscript identifies which SSL families perform best, the tables do not include an ablation that isolates the contribution of the Sinkhorn surrogate itself versus the choice of feature extractor (e.g., replacing Sinkhorn with a simpler moment-matching loss in the same SSL space). This comparison is load-bearing for the claim that the transport geometry, rather than the representation alone, drives the improvement.

    Authors: All reported comparisons hold the Sinkhorn objective fixed while varying only the feature extractor; this design isolates the representation's effect on transport matching. An ablation that swaps the divergence for moment matching would test a different modeling choice and lies outside the paper's scope. We will insert a clarifying sentence in the experiments section to make this experimental logic explicit. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper reports direct experimental comparisons of Sinkhorn matching in different fixed SSL feature spaces on ImageNet, yielding a 39x FID reduction. These gains are measured against standard public extractors and evaluation metrics without any fitted parameters or equations that loop back to the same experiment's inputs. The interpretation that semantic features induce compact geometry is presented as a post-hoc explanation of observed stability and quality metrics, not as a derivation that reduces to self-definition or self-citation. No load-bearing step invokes a uniqueness theorem or ansatz from prior author work that would force the result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard properties of optimal transport (Sinkhorn as surrogate) and the empirical behavior of existing SSL models; no new free parameters, axioms, or postulated entities are introduced.

axioms (1)
  • standard math Sinkhorn divergence provides a tractable surrogate for the Wasserstein distance
    Explicitly stated in the abstract as the justification for using it in feature space.

pith-pipeline@v0.9.1-grok · 5806 in / 1340 out tokens · 28044 ms · 2026-06-28T18:56:32.748530+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 24 canonical work pages · 7 internal anchors

  1. [1]

    arXiv preprint arXiv:2112.058142(3), 4 (2021)

    Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors.arXiv preprint arXiv:2112.05814, 2021. URL https://arxiv.org/abs/2112. 05814

  2. [2]

    Single-cell concept bottleneck generative models for interpretable and controllable cellular editing

    Alma Andersson, Aya Abdelsalam Ismail, Edward De Brouwer, Doron Haviv, Tommaso Bian- calani, Kyunghyun Cho, Gabriele Scalia, Aicha BenTaieb, and Hector Corrada Bravo. Single-cell concept bottleneck generative models for interpretable and controllable cellular editing. InICLR 2026 Workshop on Machine Learning for Genomics Explorations, 2026

  3. [3]

    Wasserstein generative adversarial networks

    Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. InInternational Conference on Machine Learning, pages 214–223. PMLR, 2017

  4. [4]

    stable-pretraining- v1: Foundation model research made simple.arXiv preprint arXiv:2511.19484, 2025

    Randall Balestriero, Hugues Van Assel, Sami BuGhanem, and Lucas Maes. stable-pretraining- v1: Foundation model research made simple.arXiv preprint arXiv:2511.19484, 2025. URL https://arxiv.org/abs/2511.19484

  5. [5]

    VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

    Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models.arXiv preprint arXiv:2510.18457, 2025. URL https://arxiv.org/abs/2510.18457

  6. [6]

    AlignTok: Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models.arXiv preprint arXiv:2509.25162, 2025

    Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. AlignTok: Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models.arXiv preprint arXiv:2509.25162, 2025. URLhttps://arxiv.org/abs/2509. 25162

  7. [7]

    Sinkhorn distances: Lightspeed computation of optimal transport

    Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. InAdvances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, pages 2292–2300, 2013. URLhttps://proceedings.neurips.cc/ paper/2013/hash/af21d0c97db2e27e13572cbf59eb343d-Abstract.html

  8. [8]

    Generative Modeling via Drifting

    Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting.arXiv preprint arXiv:2602.04770, 2026

  9. [9]

    Density estimation using real NVP

    Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URLhttps://openreview. net/forum?id=HkpbnH9lx

  10. [10]

    D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982

  11. [11]

    Interpolating between optimal transport and mmd using sinkhorn divergences

    Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and mmd using sinkhorn divergences. InThe 22nd international conference on artificial intelligence and statistics, pages 2681–2690. PMLR, 2019

  12. [12]

    One layer is enough: Adapting pretrained visual encoders for image generation.arXiv preprint arXiv:2512.07829, 2025

    Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation.arXiv preprint arXiv:2512.07829, 2025. URL https://arxiv.org/abs/2512.07829

  13. [13]

    On a formula for theL2 wasserstein metric between measures on euclidean and hilbert spaces.Mathematische Nachrichten, 147(1):185–203, 1990

    Matthias Gelbrich. On a formula for theL2 wasserstein metric between measures on euclidean and hilbert spaces.Mathematische Nachrichten, 147(1):185–203, 1990. doi: 10.1002/mana. 19901470121

  14. [14]

    Learning generative models with sinkhorn divergences

    Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. InInternational Conference on Artificial Intelligence and Statistics, pages 1608–1617. PMLR, 2018. 11

  15. [15]

    Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014. URL...

  16. [16]

    Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent – A new approach to self-supervised learning. InAdvances in Neural Information Processi...

  17. [17]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9726–9735, 2020

  18. [18]

    Otaduy, and Dan Casas

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15979– 15988. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01553. URLhttps://doi.org/10.1109/ CVPR52688...

  19. [19]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 6626–6637....

  20. [20]

    Denoising diffusion probabilistic mod- els

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic mod- els. InAdvances in Neural Information Processing Systems 33: Annual Confer- ence on Neural Information Processing Systems 2020, NeurIPS 2020, December 6- 12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html

  21. [21]

    Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.Nature Biotechnology, 36(1): 89–94, 2018

    Hyun Min Kang, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, Simon Wong, Lauren Byrnes, Cristina M Lanata, et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.Nature Biotechnology, 36(1): 89–94, 2018

  22. [22]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URLhttp://arxiv.org/abs/1312.6114

  23. [23]

    There is no VAE: End-to-end pixel-space generative modeling via self-supervised pre- training.arXiv preprint arXiv:2510.12586, 2025

    Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. There is no VAE: End-to-end pixel-space generative modeling via self-supervised pre- training.arXiv preprint arXiv:2510.12586, 2025. URLhttps://arxiv.org/abs/2510.12586

  24. [24]

    arXiv preprint arXiv:2504.10483 (2025)

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers.arXiv preprint arXiv:2504.10483, 2025. URLhttps://arxiv.org/abs/2504.10483

  25. [25]

    Generative moment matching networks

    Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. InInterna- tional Conference on Machine Learning, pages 1718–1727. PMLR, 2015. 12

  26. [26]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

  27. [27]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z

  28. [28]

    Deep generative modeling for single-cell transcriptomics.Nature Methods, 15(12):1053–1058, 2018

    Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics.Nature Methods, 15(12):1053–1058, 2018

  29. [29]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In7th International Conference on Learning Representations, ICLR 2019. OpenReview.net, 2019. URLhttps: //openreview.net/forum?id=Bkg6RiCqY7

  30. [30]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Lab...

  31. [31]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4172–4182, 2023

  32. [32]

    and Cuturi, M

    Gabriel Peyré and Marco Cuturi. Computational optimal transport.Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019. doi: 10.1561/2200000073. URLhttps://doi.org/ 10.1561/2200000073

  33. [33]

    Entropic estimation of optimal transport maps.arXiv preprint arXiv:2109.12004, 2021

    Aram-Alexandre Pooladian and Jonathan Niles-Weed. Entropic estimation of optimal transport maps.arXiv preprint arXiv:2109.12004, 2021. URLhttps://arxiv.org/abs/2109.12004

  34. [34]

    Modeling complex system dynamics with flow matching across time and conditions

    Martin Rohbeck, Edward De Brouwer, Charlotte Bunne, Jan-Christian Huetter, Anne Biton, Kelvin Chen, Aviv Regev, and Romain Lopez. Modeling complex system dynamics with flow matching across time and conditions. InInternational Conference on Learning Representations, volume 2025, pages 12164–12188, 2025

  35. [35]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models.arXiv preprint arXiv:2112.10752, 2022. URLhttps://arxiv.org/abs/2112.10752

  36. [37]

    URLhttps://arxiv.org/abs/2508.10104

  37. [38]

    What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025

    Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025. URLhttps://arxiv.org/abs/2512.10794

  38. [39]

    Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URLhttps://openreview.net/forum?id=PxTIG12RRHS. 13

  39. [40]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826. IEEE Computer Society, 2016. URL https://doi.org/10.1109/CVPR.2016.308

  40. [41]

    arXiv preprint arXiv:2505.12477 , year=

    Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, and Randall Balestriero. Joint embedding vs reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning. InAdvances in Neural Information Processing Systems, 2025. URL https://arxiv.org/abs/2505.12477

  41. [42]

    Optimal transport, volume 338 of Grundlehren der mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]

    Cédric Villani.Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008. doi: 10.1007/978-3-540-71050-9

  42. [43]

    Diffuse and Disperse: Image Generation with Representation Regularization.arXiv preprint arXiv:2506.09027, 2025

    Runqian Wang and Kaiming He. Diffuse and Disperse: Image Generation with Representation Regularization.arXiv preprint arXiv:2506.09027, 2025. URL https://arxiv.org/abs/2506. 09027

  43. [44]

    REPA works until it doesn’t: Early-stopped, holistic alignment supercharges diffusion training.arXiv preprint arXiv:2505.16792, 2025

    Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, and Yang You. REPA works until it doesn’t: Early-stopped, holistic alignment supercharges diffusion training.arXiv preprint arXiv:2505.16792, 2025. URLhttps://arxiv.org/abs/2505.16792

  44. [45]

    Video-bench: Human-aligned video generation benchmark

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 15703–15712. Computer Vision Foundation / IEEE, 2025. doi: 10.1109/CVPR52734.2025.01464. URL https://opena...

  45. [46]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think.arXiv preprint arXiv:2410.06940, 2024. URLhttps://arxiv.org/ abs/2410.06940

  46. [47]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion Transformers with Representation Autoencoders.arXiv preprint arXiv:2510.11690, 2025. URLhttps://arxiv. org/abs/2510.11690. 14 Table of Contents •Section A: Notation.Summary of symbols used in the main text. • Section B: Sinkhorn Algorithms and Gradient.Log-domain Sinkhorn iterations, symme...

  47. [48]

    Sample a batch ofNc class labels and per-class guidance weightswc

  48. [49]

    For each classc: generateNneg images, drawNpos positives andNunc unconditional samples from queues

  49. [50]

    Extract multi-scale features on all generated, positive, and unconditional samples. 17

  50. [51]

    Normalize features; compute the conditional cross-, unconditional cross-, and self-transport targets; normalize

  51. [52]

    Compute the Sinkhorn divergence loss summed over all classes and features

  52. [53]

    The remainder of this section elaborates on each component

    Run backpropagation and update parameters; update EMA. The remainder of this section elaborates on each component. Data loading, queues, and batching.At each training step, Nc class labels are sampled uniformly at random andNneg images are generated per class, yielding an effective batch size B =Nc×Nneg. Real samples are provided by a standard DataLoader ...