pith. machine review for the scientific record.

arxiv: 2604.11026 · v3 · submitted 2026-04-13 · 💻 cs.LG · cs.AI

Recognition: unknown

Optimal Stability of KL Divergence under Gaussian Perturbations

Jialu Pan, Ji Wang, Keqin Li, Nan Hu, Yufeng Zhang, Zhenbang Chen

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KL divergence · Gaussian perturbations · stability bounds · out-of-distribution detection · flow-based models · relaxed triangle inequality · moment conditions

The pith

For any distribution P with finite second moment, KL(P || N2) is at least KL(P || N1) − O(√ε) whenever the two Gaussians N1 and N2 differ by at most ε in KL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a stability bound showing that small changes to a Gaussian reference distribution affect the KL divergence from an arbitrary P by at most O(√ε), where ε is the KL size of the change. This holds as long as P has a finite second moment and does not require P to be Gaussian itself. The authors further establish that the √ε dependence is tight by exhibiting matching lower bounds even inside the Gaussian family. The result removes the Gaussian-only restriction that limited earlier relaxed triangle inequalities for KL. It directly supports KL-based out-of-distribution detection in flow-based models and other non-Gaussian settings common in deep learning.

Core claim

Let P be any distribution with finite second moment and let N1, N2 be multivariate Gaussians. If KL(P || N1) is large and KL(N1 || N2) ≤ ε, then KL(P || N2) ≥ KL(P || N1) − O(√ε). The paper also proves that this √ε rate cannot be improved in general, even when P itself is Gaussian.
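
A quick way to see what the claim says in practice, as a sanity check rather than a proof: the sketch below (ours, not the paper's) keeps everything one-dimensional and Gaussian so that every KL term has a closed form, and uses the mean shift δ = √(2ε) that makes KL(N1 || N2) exactly ε; the constant in front of √ε is whatever the chosen P induces, not the paper's explicit constant.

    # Closed-form sanity check of KL(P || N2) >= KL(P || N1) - O(sqrt(eps)).
    # All three distributions are 1-D Gaussians chosen for illustration only.
    import math

    def kl_gauss(m1, s1, m2, s2):
        """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form."""
        return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

    P, N1 = (6.0, 1.0), (0.0, 1.0)        # a distant "data" Gaussian and the reference
    for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
        delta = math.sqrt(2 * eps)        # mean shift giving KL(N1 || N2) = eps exactly
        N2 = (delta, 1.0)
        drop = kl_gauss(*P, *N1) - kl_gauss(*P, *N2)
        print(f"eps={eps:.0e}  drop={drop:.4f}  drop/sqrt(eps)={drop / math.sqrt(eps):.3f}")
    # drop/sqrt(eps) stays near a fixed constant (which depends on P's second
    # moment), so KL(P || N2) never falls more than O(sqrt(eps)) below KL(P || N1).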

What carries the argument

Relaxed triangle inequality for KL divergence under Gaussian perturbations, derived from moment bounds and the specific geometry of Gaussians.
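
The step that makes the moment condition do the work can be written in one line (a standard decomposition, consistent with though not copied verbatim from the paper's argument): whenever all terms are finite,

\[
\mathrm{KL}(P\|\mathcal N_2) - \mathrm{KL}(P\|\mathcal N_1)
= \mathbb{E}_{X\sim P}\!\left[\log \frac{n_1(X)}{n_2(X)}\right],
\qquad
\log \frac{n_1(x)}{n_2(x)} = \tfrac12\, x^{\top}\!\left(\Sigma_2^{-1}-\Sigma_1^{-1}\right) x + b^{\top}x + c,
\]

with b and c determined by the Gaussian means and covariances. The log-density ratio is a quadratic polynomial in x whose coefficients shrink as KL(N1 || N2) → 0, so its expectation under P is finite and small exactly when P has a finite second moment; no structure beyond that is needed from P.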

If this is right

  • KL-based out-of-distribution scoring becomes rigorous for flow-based generative models that are not purely Gaussian (a minimal scoring sketch follows this list).
  • KL reasoning can be applied directly in reinforcement learning and deep learning pipelines without forcing Gaussian assumptions on the target distribution.
  • The tightness of the √ε rate limits how large a Gaussian perturbation can be tolerated before the stability guarantee breaks.
  • Classical Gaussian-only stability results are now special cases of a more general statement.
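
A hedged sketch of how the first bullet might be operationalized; the flow's encoder `encode`, the batch-level Gaussian fit, and the N(0, I) base are our illustrative choices, not the paper's detector.

    # Score a batch by the closed-form KL between a Gaussian fitted to the
    # flow's latents and the standard-normal base the flow was trained toward.
    import numpy as np

    def kl_fit_to_standard_normal(z):
        """KL( N(mean(z), Cov(z)) || N(0, I) ) for latents z of shape (n, d)."""
        mu = z.mean(axis=0)
        cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(z.shape[1])   # regularized
        _, logdet = np.linalg.slogdet(cov)
        d = z.shape[1]
        return 0.5 * (np.trace(cov) + mu @ mu - d - logdet)

    def ood_score(batch, encode):
        z = encode(batch)        # hypothetical flow forward pass, shape (n, d)
        return kl_fit_to_standard_normal(z)
    # The stability bound says replacing the base by any Gaussian within eps in KL
    # can lower this score by at most O(sqrt(eps)), so a large OOD gap survives
    # small miscalibration of the reference Gaussian.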

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same moment condition might yield analogous stability results for other f-divergences.
  • Numerical checks on simple non-Gaussian mixtures could verify whether the O(√ε) constant is sharp in practice (one such check is sketched after this list).
  • In variational inference the bound suggests that small perturbations to an approximate posterior Gaussian incur only modest extra KL cost when the true posterior has finite variance.
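
A sketch of the check suggested in the second bullet, under our own illustrative choices: P is a 1-D two-component Gaussian mixture, N1 and N2 are mean-shifted unit Gaussians, and both KL terms are estimated by plain Monte Carlo on a shared sample.

    # Empirically probe the drop KL(P||N1) - KL(P||N2) against sqrt(eps)
    # for a non-Gaussian P. Illustration only; not the paper's experiment.
    import numpy as np

    rng = np.random.default_rng(0)

    def log_mix(x):   # log density of P = 0.5*N(2,1) + 0.5*N(6,1)
        a = -0.5 * (x - 2.0) ** 2
        b = -0.5 * (x - 6.0) ** 2
        return np.logaddexp(a, b) + np.log(0.5) - 0.5 * np.log(2 * np.pi)

    def log_gauss(x, m):   # log density of N(m, 1)
        return -0.5 * (x - m) ** 2 - 0.5 * np.log(2 * np.pi)

    comp = rng.integers(0, 2, size=200_000)
    x = rng.normal(np.where(comp == 0, 2.0, 6.0), 1.0)    # shared sample from P

    kl_p_n1 = np.mean(log_mix(x) - log_gauss(x, 0.0))     # N1 = N(0, 1)
    for eps in [1e-1, 1e-2, 1e-3]:
        delta = np.sqrt(2 * eps)                          # KL(N1||N2) = eps via mean shift
        kl_p_n2 = np.mean(log_mix(x) - log_gauss(x, delta))
        drop = kl_p_n1 - kl_p_n2
        print(f"eps={eps:.0e}  drop={drop:+.4f}  drop/sqrt(eps)={drop / np.sqrt(eps):+.3f}")
    # If drop/sqrt(eps) settles near a constant as eps shrinks, the sqrt(eps)
    # rate is visible even for this non-Gaussian P; the constant itself is only
    # an empirical hint, not the paper's bound.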

Load-bearing premise

P must have finite second moment; without it the stated bound can fail.
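
A small illustration (ours, not the paper's) of why this premise is load-bearing: for a standard Cauchy P the second moment is infinite, so the cross-entropy to any Gaussian, and with it KL(P || N), blows up and the bound's premises cannot even be stated.

    # For Cauchy P, the empirical second moment never stabilizes as n grows,
    # reflecting E_P[X^2] = infinity; hence KL(P || N) is infinite for any Gaussian N.
    import numpy as np
    rng = np.random.default_rng(1)
    for n in [10**3, 10**5, 10**7]:
        x = rng.standard_cauchy(n)
        print(f"n={n:>9d}  empirical E[X^2] = {np.mean(x**2):.3e}")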

What would settle it

A concrete distribution P with infinite second moment together with Gaussians N1, N2 where KL(P||N2) drops by more than any fixed multiple of √ε below KL(P||N1) for arbitrarily small ε.

Figures

Figures reproduced from arXiv: 2604.11026 by Jialu Pan, Ji Wang, Keqin Li, Nan Hu, Yufeng Zhang, Zhenbang Chen.

Figure 1: Distribution of model log-likelihood. Glow model trained on CIFAR.
Figure 2: KL divergence-based analysis for OOD detection for the Gaussian
Figure 3: f(x) = x − log x − 1.
Figure 4: KL divergence between an arbitrary distribution
read the original abstract

We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let $P$ be a distribution with finite second moment, and let $\mathcal{N}_1$ and $\mathcal{N}_2$ be multivariate Gaussian distributions. We show that if $KL(P||\mathcal{N}_1)$ is large and $KL(\mathcal{N}_1||\mathcal{N}_2)$ is at most $\epsilon$, then $KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{\epsilon})$. Moreover, we prove that this $\sqrt{\epsilon}$ rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims to establish a sharp stability result for KL divergence under Gaussian perturbations: for any distribution P with finite second moment and multivariate Gaussians N1, N2, if KL(P || N1) is large and KL(N1 || N2) ≤ ε then KL(P || N2) ≥ KL(P || N1) − O(√ε). The √ε rate is shown to be optimal even when restricting to the Gaussian family. The result removes the all-Gaussian assumption from prior relaxed triangle inequalities and is applied to justify KL-based OOD detection in flow-based generative models.

Significance. If the central derivation holds, the result is significant because it supplies the first explicit, optimal one-sided stability bound that applies to non-Gaussian P under only a second-moment condition. The explicit Gaussian constructions establishing optimality and the careful treatment of KL asymmetry are strengths that directly support applications in deep generative models and reinforcement learning where strong Gaussian assumptions are unrealistic.

minor comments (4)
  1. Abstract and §1: the qualifier “KL(P||N1) is large” is used without a precise threshold; the main theorem statement should clarify whether the O(√ε) bound holds for all finite-second-moment P or only when KL(P||N1) exceeds a quantity depending on ε and the second moments of P.
  2. §3 (proof of the lower bound): the argument that the difference of Gaussian log-densities is a quadratic polynomial controlled by KL(N1||N2) ≤ ε is sketched but the explicit constant in the O(√ε) term is not displayed; adding the dependence on dimension d and the second-moment bound of P would make the result more usable.
  3. §4 (optimality construction): the mean-shift example is convincing, yet it would help to state the exact scaling of the mean displacement with √ε and to confirm that the resulting KL(P||N2) drop is exactly Θ(√ε) rather than o(√ε) for the chosen sequence of ε (a one-dimensional version of this computation is sketched after this list).
  4. Notation: the symbols n1 and n2 for the Gaussian densities are introduced without a global definition; a short notation table or consistent use of N1(x), N2(x) would improve readability.
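
For concreteness, the scaling requested in comment 3 can be made explicit in one dimension (our computation, not quoted from the paper). With N1 = N(0, 1), N2 = N(δ, 1) and P = N(m, 1),

\[
\mathrm{KL}(\mathcal N_1\|\mathcal N_2)=\frac{\delta^{2}}{2}=\epsilon
\;\Longrightarrow\;
\delta=\sqrt{2\epsilon},
\qquad
\mathrm{KL}(P\|\mathcal N_1)-\mathrm{KL}(P\|\mathcal N_2)
=\frac{m^{2}-(m-\delta)^{2}}{2}
= m\sqrt{2\epsilon}-\epsilon,
\]

which is Θ(√ε) for any fixed m > 0, not o(√ε): the displacement scales exactly as √(2ε), and the constant in front of √ε grows with m, i.e. with how far P sits from the reference Gaussian.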

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our manuscript and for recommending minor revision. The referee's summary accurately captures the central result: a sharp one-sided stability bound for KL(P || N2) in terms of KL(P || N1) when N1 and N2 are Gaussians, P has finite second moments, and the bound is optimal even within the Gaussian family. As the report contains no specific major comments, we have no revisions to propose at this stage.

Circularity Check

0 steps flagged

No significant circularity; self-contained mathematical derivation

full rationale

The paper derives the claimed stability bound KL(P||N2) ≥ KL(P||N1) - O(√ε) directly from the explicit form of Gaussian log-densities and the finite-second-moment assumption on P, which guarantees that the expectation of the quadratic difference exists and can be controlled by the parameter distance induced by KL(N1||N2) ≤ ε. The one-sided lower bound is obtained by discarding the positive part of the difference; optimality follows from explicit mean-shift constructions within the Gaussian family that achieve a matching Θ(√ε) drop. No fitted parameters are renamed as predictions, no self-citations serve as load-bearing premises for the core inequality, and the result does not reduce to a redefinition or ansatz imported from prior author work. The derivation is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central result rests on the domain assumption of finite second moment for P; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption P has finite second moment
    Invoked to control the tail behavior needed for the stability inequality.

pith-pipeline@v0.9.0 · 5598 in / 1054 out tokens · 33564 ms · 2026-05-10T15:41:14.425561+00:00 · methodology

discussion (0)

