pith. machine review for the scientific record.

arxiv: 2604.02751 · v1 · submitted 2026-04-03 · 💻 cs.LG

Recognition: no theorem link

Understanding Latent Diffusability via Fisher Geometry

Dongmian Zou, Gilad Lerman, Jing Gu, Morteza Mardani, Wonjun Lee

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords latent diffusion · fisher information · fisher information rate · autoencoders · geometric distortion · diffusion models · VAE · MMSE

The pith

Diffusion models in latent spaces succeed when the encoder preserves Fisher information rate through controlled local geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusability in latent spaces can be quantified by tracking how the minimum mean squared error changes along the diffusion path. This rate splits into a Fisher information term, which stays aligned across spaces when the encoder is globally isometric, and a Fisher information rate term governed by the encoder's local geometry. By decoupling latent geometric distortion into dimensional compression, tangential distortion, and curvature injection, the authors provide measurable penalties that predict when latent diffusion will fail. A sympathetic reader would care because this turns the black-box problem of choosing a good VAE or autoencoder for diffusion into a set of diagnostic calculations that can be checked before training.

Core claim

We quantify latent-space diffusability through the rate of change of the Minimum Mean Squared Error (MMSE) along the diffusion trajectory. Our framework decomposes this MMSE rate into contributions from Fisher Information (FI) and Fisher Information Rate (FIR). We demonstrate that while global isometry ensures FI alignment, FIR is governed by the encoder's local geometric properties. Our analysis explicitly decouples latent geometric distortion into three measurable penalties: dimensional compression, tangential distortion, and curvature injection. We derive theoretical conditions for FIR preservation across spaces, ensuring maintained diffusability.
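
The abstract states the decomposition only in prose. A plausible formal rendering, assuming the variance-exploding convention y_τ = x + √τ·ε suggested by the figure captions, together with Tweedie's formula and the I-MMSE/de Bruijn-style identities developed in the paper's references (Costa 1985; Guo, Shamai, Verdú 2005; Efron 2011), is:

```latex
% Hedged reconstruction; the paper's exact normalization may differ.
% Convention: y_tau = x + sqrt(tau) * eps, eps ~ N(0, I_D); mu_tau = law of y_tau.
\mathrm{mmse}(\tau)
  = \mathbb{E}\,\bigl\lVert x - \mathbb{E}[x \mid y_\tau] \bigr\rVert^2
  = \tau D - \tau^2 \,\mathcal{I}(\mu_\tau),
\qquad
\mathcal{I}(\mu_\tau) := \mathbb{E}\,\bigl\lVert \nabla_y \log \mu_\tau(y_\tau) \bigr\rVert^2 ,

% so the MMSE rate splits into an FI term and an FIR term:
\frac{d}{d\tau}\,\mathrm{mmse}(\tau)
  = D - 2\tau\,\mathcal{I}(\mu_\tau) - \tau^2\,\mathcal{R}(\mu_\tau),
\qquad
\mathcal{R}(\mu_\tau) := \frac{d}{d\tau}\,\mathcal{I}(\mu_\tau).
```

Under this reading, global isometry of the encoder matches I(µ_τ) across spaces, while the FIR term R tracks the τ-derivative that the encoder's local geometry perturbs.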

What carries the argument

The Fisher Information Rate (FIR), which measures the local rate of change in information along diffusion trajectories and is determined by the encoder's geometric properties in the latent space.
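
To make I and R concrete: the paper estimates them from trained diffusion models (Figures 2, 4, 6, 11, 12). The minimal sketch below instead uses a Gaussian toy with a linear encoder, so the score, and hence I(µ_τ) = E‖∇ log µ_τ‖², is exact. The y_τ = x + √τ·ε convention, the finite-difference R, and all function names are illustrative assumptions, not the paper's estimators.

```python
# Minimal sketch (not the paper's code): Fisher information I(tau) and its
# rate R(tau) = dI/dtau for Gaussian data under y_tau = x + sqrt(tau) * eps.
# For mu_tau = N(0, Sigma + tau*I) the score is -(Sigma + tau*I)^{-1} y, so
# I(mu_tau) = E||score||^2 = tr((Sigma + tau*I)^{-1}).
import numpy as np

def fisher_info_gaussian(Sigma: np.ndarray, tau: float) -> float:
    """I(mu_tau) for mu_tau = N(0, Sigma + tau*I)."""
    C = Sigma + tau * np.eye(Sigma.shape[0])
    return float(np.trace(np.linalg.inv(C)))

def fisher_info_rate(Sigma: np.ndarray, tau: float, h: float = 1e-4) -> float:
    """R(mu_tau) = dI/dtau via a central finite difference."""
    return (fisher_info_gaussian(Sigma, tau + h)
            - fisher_info_gaussian(Sigma, tau - h)) / (2 * h)

# Pixel space: x ~ N(0, I_2). Latent space: z = A x with the tangential
# distortion knob delta0 from the paper's toy figures (Figure 7).
Sigma_x = np.eye(2)
delta0 = 0.3
A = np.diag([np.sqrt(1 + delta0), np.sqrt(1 - delta0)])
Sigma_z = A @ Sigma_x @ A.T

for tau in [0.01, 0.1, 1.0, 10.0]:
    print(f"tau={tau:5.2f}  "
          f"I_pixel={fisher_info_gaussian(Sigma_x, tau):8.3f}  "
          f"I_latent={fisher_info_gaussian(Sigma_z, tau):8.3f}  "
          f"R_pixel={fisher_info_rate(Sigma_x, tau):9.4f}  "
          f"R_latent={fisher_info_rate(Sigma_z, tau):9.4f}")
```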

If this is right

  • If the encoder meets the theoretical conditions for FIR preservation, latent diffusion models should avoid the performance degradation otherwise seen relative to data-space training.
  • Computing the three distortion penalties gives a practical way to diagnose and select suitable latent spaces before diffusion training (a toy sketch follows this list).
  • Global isometry alone is insufficient; the encoder's local geometry must also be controlled to maintain diffusability.
  • The FI and FIR metrics serve as efficient diagnostics for identifying latent diffusion failures across autoencoding architectures.
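
As the toy sketch referenced in the second bullet: the following reproduces the dimensional-compression setting of Figure 8 under stated assumptions. D_R is taken here to be the relative FIR deviation (R^(D) − R^(d))/R^(D), a definition the abstract does not give, and closed-form Gaussian scores replace the trained models the paper uses; at small τ the value approaches the (D−d)/(D−2) reference line drawn in the figure.

```python
# Minimal sketch (assumed D_R definition, not the paper's code): relative FIR
# deviation between a D-dimensional embedding and a d-dimensional truncation
# of the same 2D Gaussian, compared against the (D-d)/(D-2) reference from
# Figure 8. Data y ~ N(0, I_2) embedded as x = (y1, y2, 0, ..., 0) in R^dim
# gives mu_tau = N(0, diag(1, 1, 0, ..., 0) + tau*I), so all terms are exact.

def fisher_info(dim: int, tau: float) -> float:
    """I(mu_tau) = tr((Sigma + tau*I)^{-1}) = 2/(1+tau) + (dim-2)/tau."""
    return 2.0 / (1.0 + tau) + (dim - 2) / tau

def fisher_info_rate(dim: int, tau: float) -> float:
    """R(mu_tau) = dI/dtau, available here in closed form."""
    return -2.0 / (1.0 + tau) ** 2 - (dim - 2) / tau ** 2

D, tau = 512, 0.05
for d in [4, 64, 128, 256, 384]:
    R_D, R_d = fisher_info_rate(D, tau), fisher_info_rate(d, tau)
    D_R = (R_D - R_d) / R_D      # assumed definition of the FIR deviation
    ref = (D - d) / (D - 2)      # linear reference line from Figure 8
    print(f"d={d:4d}  D_R={D_R:.4f}  (D-d)/(D-2)={ref:.4f}")
```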

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Optimizing autoencoders to minimize the three specific penalties could lead to better latent spaces for diffusion models without relying on post-hoc fixes.
  • Similar geometric analysis might apply to other generative processes that rely on latent representations, such as in variational autoencoders for other tasks.
  • Testing the framework on new architectures like transformer-based encoders could reveal whether the conditions generalize beyond the tested models.

Load-bearing premise

The separation of the MMSE rate of change into distinct Fisher information and Fisher information rate contributions is valid for the diffusion trajectories and encoder mappings considered.

What would settle it

Observing a latent space where the encoder violates the derived FIR preservation conditions but the diffusion model still trains without performance loss compared to the data space would falsify the framework.

Figures

Figures reproduced from arXiv: 2604.02751 by Dongmian Zou, Gilad Lerman, Jing Gu, Morteza Mardani, Wonjun Lee.

Figure 1. Geometric interpretation of encoder assumptions.

Figure 2. Values of I (left) and R (right) plotted versus the noise variance τ, computed from tiny diffusion models trained on different data representations. Pixel curves correspond to models trained on x ∼ N(0, I₂). Latent curves correspond to models trained on encoded data z = E(x), where the pointwise activation E is indicated in the legend; for Leaky ReLU, α denotes the negative slope.

Figure 3. FIR deviation D_R vs. (a) δ₀, (b) d, and (c) ε₀ in toy settings. Data y ∼ N(0, I₂) are embedded as x = (y₁, y₂, 0, …, 0) ∈ ℝ^D and encoded to z = E(x) ∈ ℝ^d. We compute D_R from R^(D)(µ_τ) and R^(d)((µ_Z)_τ) using diffusion models trained on x and z. Curves denote fixed noise variance τ. Solid lines y = 1.25δ₀ (a) and y = (D−d)/(D−2) (b) serve as linear references. Encoder E setups: (a) D = d = 2, E(x) = …

Figure 4. Values of (a) I and (b) R plotted versus the noise variance τ, computed from diffusion models trained on different data representations. The pixel curves correspond to models trained directly on FFHQ images. The latent curves correspond to models trained on latent representations of an image encoder (GPE or VAE) pretrained on FFHQ. We show √τ ∈ [0.01, 80], excluding smaller τ due to numerical instability…

Figure 5. (a) Samples from a diffusion model trained directly on …

Figure 6. Values of (a) I and (b) R plotted versus the noise variance τ for diffusion models trained on different data representations. Pixel and latent curves denote models trained on FFHQ images and their pretrained NVAE latents, respectively. NVAE was pretrained on FFHQ, with spatial size 20 × d_z × d_z (d_z indicated in the legend). We show √τ ∈ [0.01, 80], excluding smaller τ due to numerical instability…

Figure 7. FIR deviation D_R vs. (a) τ and (b) δ₀ in toy settings. Data y ∼ N(0, I₂) are embedded as x = (y₁, y₂, 0, …, 0) ∈ ℝ^D and encoded to z = E(x) ∈ ℝ^d. We consider D = d = 2 and E(x) = Ax with A = diag(√(1+δ₀), √(1−δ₀)). We compute D_R from R^(D)(µ_τ) and R^(d)((µ_Z)_τ) using diffusion models trained on x and z. Solid line y = 1.25δ₀ in (b) serves as a linear reference.

Figure 8. FIR deviation D_R vs. (a) τ and (b) d in toy settings. Data y ∼ N(0, I₂) are embedded as x = (y₁, y₂, 0, …, 0) ∈ ℝ^D with D = 512, and z = (y₁, y₂, 0, …, 0) ∈ ℝ^d. We compute D_R from R^(D)(µ_τ) and R^(d)((µ_Z)_τ) using diffusion models trained on x and z. Solid line y = (D−d)/(D−2) in (b) serves as a linear reference.

Figure 9. FIR deviation D_R vs. (a) τ and (b) ε₀ in toy settings. Data y ∼ N(0, I₂) are embedded as x = (y₁, y₂, 0, …, 0) ∈ ℝ^D and encoded to z = E(x) ∈ ℝ^d. We consider D = d = 3 and E((x₁, x₂, 0), ε₀) = …

Figure 10. (a) Samples from a diffusion model trained directly on …

Figure 11. Values of (a) I and (b) R plotted versus the noise variance τ, computed from diffusion models trained on different data representations. The diffusion models use the U-Net architecture within the EDM framework. The pixel curves correspond to models trained directly on FFHQ images. The latent curves correspond to models trained on latent representations of an image encoder (GPE or VAE) pretrained on FFHQ…

Figure 12. Values of (a) I and (b) R plotted versus the noise variance τ for diffusion models trained on different data representations. The diffusion models use the U-Net architecture within the EDM framework. Pixel and latent curves denote models trained on FFHQ images and their pretrained NVAE latents, respectively. NVAE was pretrained on FFHQ, with spatial size 20 × d_z × d_z (d_z indicated in the legend)…

Figure 13. Power spectra of FFHQ images and latent representations. (a) Power spectrum of FFHQ images. (b)–(d) …
read the original abstract

Diffusion models often degrade when trained in latent spaces (e.g., VAEs), yet the formal causes remain poorly understood. We quantify latent-space diffusability through the rate of change of the Minimum Mean Squared Error (MMSE) along the diffusion trajectory. Our framework decomposes this MMSE rate into contributions from Fisher Information (FI) and Fisher Information Rate (FIR). We demonstrate that while global isometry ensures FI alignment, FIR is governed by the encoder's local geometric properties. Our analysis explicitly decouples latent geometric distortion into three measurable penalties: dimensional compression, tangential distortion, and curvature injection. We derive theoretical conditions for FIR preservation across spaces, ensuring maintained diffusability. Experiments across diverse autoencoding architectures validate our framework and establish these efficient FI and FIR metrics as a robust diagnostic suite for identifying and mitigating latent diffusion failure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to quantify latent-space diffusability for diffusion models by tracking the time derivative of the Minimum Mean Squared Error (MMSE) along diffusion trajectories. It decomposes this derivative into separate Fisher Information (FI) and Fisher Information Rate (FIR) contributions, shows that global isometry aligns FI while local encoder geometry controls FIR, decouples latent geometric distortion into the three penalties of dimensional compression, tangential distortion, and curvature injection, derives theoretical conditions for FIR preservation, and validates the resulting FI/FIR metrics as diagnostics via experiments on diverse autoencoding architectures.

Significance. If the decomposition and preservation conditions hold under the requisite regularity assumptions, the framework supplies a concrete geometric diagnostic suite (FI and FIR) for diagnosing and mitigating diffusion degradation in latent spaces. The explicit three-penalty decoupling and the link between encoder local geometry and FIR could inform VAE design choices and provide falsifiable predictions for when latent diffusion succeeds or fails.

major comments (2)
  1. [Abstract] Abstract (and the central derivation): the decomposition of the MMSE rate of change into separate FI and FIR terms requires interchanging differentiation and integration over the probability path p(x,t) and the encoder map. This interchange holds only under regularity conditions (continuous differentiability of the density in appropriate Sobolev norms, Lipschitz continuity of the encoder, and sufficient smoothness of the diffusion schedule) that are neither stated nor verified. Because this split is load-bearing for the subsequent decoupling into the three geometric penalties and for the FIR-preservation theorems, the claims remain non-rigorous without these conditions. (A sufficient form of the interchange condition is stated after this list.)
  2. [Abstract] Abstract: the manuscript asserts that global isometry ensures FI alignment while FIR is governed by local encoder properties, yet provides no explicit statement of the isometry assumption or the precise local geometric quantities (e.g., the metric tensor or Jacobian factors) used to derive the three penalties. Without these definitions, it is impossible to confirm that the penalties are exhaustive or that the preservation conditions follow.
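
For the interchange flagged in major comment 1, a standard sufficient condition (differentiation under the integral sign via dominated convergence) would read:

```latex
% Sufficient condition for the interchange the referee requests; the
% manuscript would need to verify it along the probability path p(x, t).
% If f(., t) is integrable, \partial_t f(x, t) exists for a.e. x, and
% |\partial_t f(x, t)| <= g(x) for some integrable g, uniformly in t near t_0,
% then
\frac{d}{dt} \int f(x, t)\, dx \;=\; \int \frac{\partial f}{\partial t}(x, t)\, dx .
```
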
minor comments (2)
  1. [Abstract] The abstract contains no equations, proof sketches, or experimental details (e.g., architectures, datasets, or quantitative metrics), which makes immediate assessment of the validation experiments difficult.
  2. Consider adding a short table or figure that reports the measured FI and FIR values across the tested autoencoders together with the observed diffusion performance, to make the diagnostic utility concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on rigor and definitional clarity. We address each point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Abstract] Abstract (and the central derivation): the decomposition of the MMSE rate of change into separate FI and FIR terms requires interchanging differentiation and integration over the probability path p(x,t) and the encoder map. This interchange holds only under regularity conditions (continuous differentiability of the density in appropriate Sobolev norms, Lipschitz continuity of the encoder, and sufficient smoothness of the diffusion schedule) that are neither stated nor verified. Because this split is load-bearing for the subsequent decoupling into the three geometric penalties and for the FIR-preservation theorems, the claims remain non-rigorous without these conditions.

    Authors: We acknowledge that the interchange of differentiation and integration requires explicit regularity conditions for rigor. In the revised manuscript we will add a dedicated paragraph in the theoretical framework (new Section 2.3) stating the assumptions: continuous differentiability of p(x,t) in the requisite Sobolev norms, Lipschitz continuity of the encoder map, and C^2 smoothness of the diffusion schedule. We will also include a short verification that these conditions are satisfied by the standard variance-preserving Gaussian diffusion process used in all experiments. This addition directly supports the subsequent derivations without changing any results. revision: yes

  2. Referee: [Abstract] Abstract: the manuscript asserts that global isometry ensures FI alignment while FIR is governed by local encoder properties, yet provides no explicit statement of the isometry assumption or the precise local geometric quantities (e.g., the metric tensor or Jacobian factors) used to derive the three penalties. Without these definitions, it is impossible to confirm that the penalties are exhaustive or that the preservation conditions follow.

    Authors: We agree that the isometry assumption and the precise local geometric quantities must be stated explicitly. We will revise the abstract and expand Section 3 to define global isometry as the encoder preserving the data manifold metric up to a constant scaling factor, and to specify the local quantities as the Jacobian matrix of the encoder together with the induced Riemannian metric tensor on the latent space. Under this definition the three penalties arise exhaustively from the first-order expansion of the FIR term, and the FIR-preservation theorems follow as direct corollaries. These clarifications will make the claims verifiable. revision: yes
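
To make the local quantities named in this response concrete, here is a minimal sketch, assuming the induced metric is the pullback G(x) = J_E(x)ᵀJ_E(x) of the Euclidean latent metric under the encoder Jacobian; the toy encoder and all names below are illustrative, not the paper's architecture.

```python
# Minimal sketch (assumed formulation, not the paper's code): the local
# geometric quantities named in the rebuttal, for an encoder E: R^D -> R^d.
# Assumption: the induced metric is the pullback G(x) = J_E(x)^T J_E(x) of the
# Euclidean latent metric; E is a scaled isometry along the data manifold
# exactly when G acts as c * I on tangent directions.
import torch
from torch.func import jacrev

D, d = 8, 4
encoder = torch.nn.Sequential(            # illustrative stand-in encoder
    torch.nn.Linear(D, 16), torch.nn.Tanh(), torch.nn.Linear(16, d),
)

def pullback_metric(x: torch.Tensor) -> torch.Tensor:
    """G(x) = J^T J, with J the (d x D) encoder Jacobian at x."""
    J = jacrev(encoder)(x)                # shape (d, D)
    return J.T @ J                        # (D, D), PSD, rank <= d

x = torch.randn(D)
G = pullback_metric(x)
eigvals = torch.linalg.eigvalsh(G)
# Rank drop in G signals dimensional compression; spread among the nonzero
# eigenvalues (deviation from c * I) signals tangential distortion.
print("nonzero spectrum of G(x):", eigvals[eigvals > 1e-8])
```

Curvature injection, the third penalty, would additionally involve second derivatives of E (e.g., via torch.func.hessian applied per output coordinate).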

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper defines diffusability via the MMSE rate of change along diffusion trajectories and decomposes it into FI and FIR contributions using information-geometric identities. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or tautological redefinition of the target quantity. The three geometric penalties and FIR-preservation conditions are derived from the decomposition without circular grounding. The framework remains independent of its inputs and is validated experimentally against external autoencoding architectures.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated, so the ledger is populated with the minimal domain assumptions required by the described framework.

axioms (1)
  • domain assumption The rate of change of MMSE along the diffusion trajectory admits a decomposition into Fisher Information and Fisher Information Rate terms
    This decomposition is the load-bearing step that allows the subsequent geometric analysis.
invented entities (1)
  • FIR preservation conditions no independent evidence
    purpose: Theoretical conditions ensuring diffusability is maintained when mapping from data space to latent space
    Derived in the paper but no external falsifiable prediction or independent evidence is mentioned in the abstract.

pith-pipeline@v0.9.0 · 5442 in / 1445 out tokens · 41041 ms · 2026-05-13T19:58:34.543047+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  1. [1] Amari, S.-i.: Information Geometry and Its Applications. Applied Mathematical Sciences, vol. 194. Springer, Tokyo (2016). https://doi.org/10.1007/978-4-431-55978-8

  2. [2] Arvanitidis, G., Hansen, L.K., Hauberg, S.: Latent space oddity: on the curvature of deep generative models. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=r1Y7pYxRb

  3. [3] Baade, A., Chan, E.R., Sargent, K., Chen, C., Johnson, J., Adeli, E., Fei-Fei, L.: Latent forcing: Reordering the diffusion trajectory for pixel-space image generation. arXiv preprint arXiv:2602.11401 (2026)

  4. [4] Bakry, D., Émery, M.: Diffusions hypercontractives. In: Séminaire de Probabilités XIX 1983/84, pp. 177–206. Springer (1985)

  5. [5] Chen, N., Klushyn, A., Ferroni, F., Bayer, J., Van Der Smagt, P.: Learning flat latent manifolds with VAEs. In: International Conference on Machine Learning (2020)

  6. [6] Cho, S., Lee, J., Kim, D.: Hyperbolic VAE via latent Gaussian distributions. Advances in Neural Information Processing Systems 36, 569–588 (2023)

  7. [7] Costa, M.: A new entropy power inequality. IEEE Transactions on Information Theory 31(6), 751–760 (1985)

  8. [8] Efron, B.: Tweedie's formula and selection bias. Journal of the American Statistical Association 106(496), 1602–1614 (2011). https://doi.org/10.1198/jasa.2011.tm11181. PMID: 22505788

  9. [9] Guo, D., Shamai, S., Verdú, S.: Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory 51(4), 1261–1282 (2005). https://doi.org/10.1109/TIT.2005.844072

  10. [10] Heek, J., Hoogeboom, E., Mensink, T., Salimans, T.: Unified latents (UL): How to train your latents. arXiv preprint arXiv:2602.17270 (2026)

  11. [11] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6840–

  12. [12] Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf (continuation of the entry for [11])

  13. [13] Hutchinson, M.F.: A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation 18(3), 1059–1076 (1989)

  14. [14] Karczewski, R., Heinonen, M., Pouplin, A., Hauberg, S., Garg, V.K.: The spacetime of diffusion models: An information geometry perspective. In: The Fourteenth International Conference on Learning Representations (2026). https://openreview.net/forum?id=qCsbYJZRA5

  15. [15] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35, 26565–26577 (2022)

  16. [16] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)

  17. [17] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: International Conference on Learning Representations (2014)

  18. [18] Kouzelis, T., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: EQ-VAE: Equivariance regularized latent space for improved generative image modeling. In: Forty-second International Conference on Machine Learning (2025)

  19. [19] Lee, H., Kim, M., Jang, S., Jeong, J., Hwang, S.J.: Enhancing variational autoencoders with smooth robust latent encoding. arXiv preprint arXiv:2504.17219 (2025)

  20. [20] Lee, J., Shin, J., Choi, H., Lee, J.: Latent diffusion models with masked autoencoders. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17422–17431 (2025)

  21. [21] Lee, W., O'Neill, R.C.W., Zou, D., Calder, J., Lerman, G.: Geometry-preserving encoder/decoder in latent generative models (2025). https://arxiv.org/abs/2501.09876

  22. [22] Lobashev, A., Guskov, D., Larchenko, M., Tamm, M.: Hessian geometry of latent space in generative models. In: Forty-second International Conference on Machine Learning (2025). https://openreview.net/forum?id=H8JTsbG4KW

  23. [23] Sáez de Ocáriz Borde, H., Arroyo, A., Morales, I., Posner, I., Dong, X.: Neural latent geometry search: Product manifold inference via Gromov-Hausdorff-informed Bayesian optimization. Advances in Neural Information Processing Systems 36, 38370–38403 (2023)

  24. [24] Palma, A., Rybakov, S., Hetzel, L., Günnemann, S., Theis, F.J.: Enforcing latent Euclidean geometry in single-cell VAEs for manifold interpolation. In: Forty-second International Conference on Machine Learning (2025). https://openreview.net/forum?id=DoDXFkF10S

  25. [25] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205 (2023)

  26. [26] Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., Courville, A.: On the spectral bias of neural networks. In: International Conference on Machine Learning, pp. 5301–5310. PMLR (2019)

  27. [27] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695 (June 2022)

  28. [28] Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301 (2025)

  29. [29] Skorokhodov, I., Girish, S., Hu, B., Menapace, W., Li, Y., Abdal, R., Tulyakov, S., Siarohin, A.: Improving the diffusability of autoencoders. In: Forty-second International Conference on Machine Learning (2025). https://openreview.net/forum?id=2hEDcA7xy4

  30. [30] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=PxTIG12RRHS

  31. [31] Vahdat, A., Kautz, J.: NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems 33, 19667–19679 (2020)

  32. [32] Villani, C.: Optimal Transport: Old and New, vol. 338. Springer Science & Business Media (2009)

  33. [33] Xu, Z.Q.J., Zhang, Y., Luo, T., Xiao, Y., Ma, Z.: Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523 (2019)

  34. [34] Yang, J., Li, T., Fan, L., Tian, Y., Wang, Y.: Latent denoising makes good visual tokenizers. In: The Fourteenth International Conference on Learning Representations (2026). https://openreview.net/forum?id=1jBsi98fVe

  35. [35] Yue, K., Jia, M., Hou, J., Goldstein, T.: Image generation with a sphere encoder (2026). https://arxiv.org/abs/2602.15030

  36. [36] Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. In: The Fourteenth International Conference on Learning Representations (2026). https://openreview.net/forum?id=0u1LigJaab

  37. [37] Inversion gives δ ∝ ε_GPE^{1/(4(m+5))}. For ε, the remainder R(x) = ‖E(x) − E(0) − ∇E(0)x‖² is analyzed in L¹(M, µ); Taylor's theorem provides R(x) ≤ (L_E²/4)‖x‖⁴, with the integral over M split into a good slice A₀ and a bad slice B₀ …

  38. [38] These vectors are reshaped into 16×16 tensors using a serpentine ordering and normalized using Gaussian statistics prior to diffusion training. Diffusion models are then trained separately on the GPE and VAE latent representations. https://github.com/wonjunee/GPE_codes …