pith. machine review for the scientific record.

arxiv: 2604.11026 · v3 · submitted 2026-04-13 · 💻 cs.LG · cs.AI

Recognition: unknown

Optimal Stability of KL Divergence under Gaussian Perturbations

Jialu Pan, Ji Wang, Keqin Li, Nan Hu, Yufeng Zhang, Zhenbang Chen

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KL divergence · Gaussian perturbations · stability bounds · out-of-distribution detection · flow-based models · relaxed triangle inequality · moment conditions

The pith

For any distribution P with finite second moment, KL(P || N2) is at least KL(P || N1) − O(√ε) whenever the two Gaussians N1 and N2 differ by at most ε in KL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a stability bound showing that small changes to a Gaussian reference distribution affect the KL divergence from an arbitrary P by at most O(√ε), where ε is the KL size of the change. This holds as long as P has a finite second moment and does not require P to be Gaussian itself. The authors further establish that the √ε dependence is tight by exhibiting matching lower bounds even inside the Gaussian family. The result removes the Gaussian-only restriction that limited earlier relaxed triangle inequalities for KL. It directly supports KL-based out-of-distribution detection in flow-based models and other non-Gaussian settings common in deep learning.

Core claim

Let P be any distribution with finite second moment and let N1, N2 be multivariate Gaussians. If KL(P || N1) is large and KL(N1 || N2) ≤ ε, then KL(P || N2) ≥ KL(P || N1) − O(√ε). The paper also proves that this √ε rate cannot be improved in general, even when P itself is Gaussian.
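
A quick way to see what the claim says in practice, as a sanity check rather than a proof: the sketch below (ours, not the paper's) keeps everything one-dimensional and Gaussian so that every KL term has a closed form, and uses the mean shift δ = √(2ε) that makes KL(N1 || N2) exactly ε; the constant in front of √ε is whatever the chosen P induces, not the paper's explicit constant.

    # Closed-form sanity check of KL(P || N2) >= KL(P || N1) - O(sqrt(eps)).
    # All three distributions are 1-D Gaussians chosen for illustration only.
    import math

    def kl_gauss(m1, s1, m2, s2):
        """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form."""
        return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

    P, N1 = (6.0, 1.0), (0.0, 1.0)        # a distant "data" Gaussian and the reference
    for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
        delta = math.sqrt(2 * eps)        # mean shift giving KL(N1 || N2) = eps exactly
        N2 = (delta, 1.0)
        drop = kl_gauss(*P, *N1) - kl_gauss(*P, *N2)
        print(f"eps={eps:.0e}  drop={drop:.4f}  drop/sqrt(eps)={drop / math.sqrt(eps):.3f}")
    # drop/sqrt(eps) stays near a fixed constant (which depends on P's second
    # moment), so KL(P || N2) never falls more than O(sqrt(eps)) below KL(P || N1).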

What carries the argument

Relaxed triangle inequality for KL divergence under Gaussian perturbations, derived from moment bounds and the specific geometry of Gaussians.
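
The step that makes the moment condition do the work can be written in one line (a standard decomposition, consistent with though not copied verbatim from the paper's argument): whenever all terms are finite,

\[
\mathrm{KL}(P\|\mathcal N_2) - \mathrm{KL}(P\|\mathcal N_1)
= \mathbb{E}_{X\sim P}\!\left[\log \frac{n_1(X)}{n_2(X)}\right],
\qquad
\log \frac{n_1(x)}{n_2(x)} = \tfrac12\, x^{\top}\!\left(\Sigma_2^{-1}-\Sigma_1^{-1}\right) x + b^{\top}x + c,
\]

with b and c determined by the Gaussian means and covariances. The log-density ratio is a quadratic polynomial in x whose coefficients shrink as KL(N1 || N2) → 0, so its expectation under P is finite and small exactly when P has a finite second moment; no structure beyond that is needed from P.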

If this is right

  • KL-based out-of-distribution scoring becomes rigorous for flow-based generative models that are not purely Gaussian (a minimal scoring sketch follows this list).
  • KL reasoning can be applied directly in reinforcement learning and deep learning pipelines without forcing Gaussian assumptions on the target distribution.
  • The tightness of the √ε rate limits how large a Gaussian perturbation can be tolerated before the stability guarantee breaks.
  • Classical Gaussian-only stability results are now special cases of a more general statement.
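
A hedged sketch of how the first bullet might be operationalized; the flow's encoder `encode`, the batch-level Gaussian fit, and the N(0, I) base are our illustrative choices, not the paper's detector.

    # Score a batch by the closed-form KL between a Gaussian fitted to the
    # flow's latents and the standard-normal base the flow was trained toward.
    import numpy as np

    def kl_fit_to_standard_normal(z):
        """KL( N(mean(z), Cov(z)) || N(0, I) ) for latents z of shape (n, d)."""
        mu = z.mean(axis=0)
        cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(z.shape[1])   # regularized
        _, logdet = np.linalg.slogdet(cov)
        d = z.shape[1]
        return 0.5 * (np.trace(cov) + mu @ mu - d - logdet)

    def ood_score(batch, encode):
        z = encode(batch)        # hypothetical flow forward pass, shape (n, d)
        return kl_fit_to_standard_normal(z)
    # The stability bound says replacing the base by any Gaussian within eps in KL
    # can lower this score by at most O(sqrt(eps)), so a large OOD gap survives
    # small miscalibration of the reference Gaussian.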

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same moment condition might yield analogous stability results for other f-divergences.
  • Numerical checks on simple non-Gaussian mixtures could verify whether the O(√ε) constant is sharp in practice (one such check is sketched after this list).
  • In variational inference the bound suggests that small perturbations to an approximate posterior Gaussian incur only modest extra KL cost when the true posterior has finite variance.
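
A sketch of the check suggested in the second bullet, under our own illustrative choices: P is a 1-D two-component Gaussian mixture, N1 and N2 are mean-shifted unit Gaussians, and both KL terms are estimated by plain Monte Carlo on a shared sample.

    # Empirically probe the drop KL(P||N1) - KL(P||N2) against sqrt(eps)
    # for a non-Gaussian P. Illustration only; not the paper's experiment.
    import numpy as np

    rng = np.random.default_rng(0)

    def log_mix(x):   # log density of P = 0.5*N(2,1) + 0.5*N(6,1)
        a = -0.5 * (x - 2.0) ** 2
        b = -0.5 * (x - 6.0) ** 2
        return np.logaddexp(a, b) + np.log(0.5) - 0.5 * np.log(2 * np.pi)

    def log_gauss(x, m):   # log density of N(m, 1)
        return -0.5 * (x - m) ** 2 - 0.5 * np.log(2 * np.pi)

    comp = rng.integers(0, 2, size=200_000)
    x = rng.normal(np.where(comp == 0, 2.0, 6.0), 1.0)    # shared sample from P

    kl_p_n1 = np.mean(log_mix(x) - log_gauss(x, 0.0))     # N1 = N(0, 1)
    for eps in [1e-1, 1e-2, 1e-3]:
        delta = np.sqrt(2 * eps)                          # KL(N1||N2) = eps via mean shift
        kl_p_n2 = np.mean(log_mix(x) - log_gauss(x, delta))
        drop = kl_p_n1 - kl_p_n2
        print(f"eps={eps:.0e}  drop={drop:+.4f}  drop/sqrt(eps)={drop / np.sqrt(eps):+.3f}")
    # If drop/sqrt(eps) settles near a constant as eps shrinks, the sqrt(eps)
    # rate is visible even for this non-Gaussian P; the constant itself is only
    # an empirical hint, not the paper's bound.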

Load-bearing premise

P must have finite second moment; without it the stated bound can fail.
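
A small illustration (ours, not the paper's) of why this premise is load-bearing: for a standard Cauchy P the second moment is infinite, so the cross-entropy to any Gaussian, and with it KL(P || N), blows up and the bound's premises cannot even be stated.

    # For Cauchy P, the empirical second moment never stabilizes as n grows,
    # reflecting E_P[X^2] = infinity; hence KL(P || N) is infinite for any Gaussian N.
    import numpy as np
    rng = np.random.default_rng(1)
    for n in [10**3, 10**5, 10**7]:
        x = rng.standard_cauchy(n)
        print(f"n={n:>9d}  empirical E[X^2] = {np.mean(x**2):.3e}")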

What would settle it

A concrete distribution P with infinite second moment together with Gaussians N1, N2 where KL(P||N2) drops by more than any fixed multiple of √ε below KL(P||N1) for arbitrarily small ε.

Figures

Figures reproduced from arXiv: 2604.11026 by Jialu Pan, Ji Wang, Keqin Li, Nan Hu, Yufeng Zhang, Zhenbang Chen.

Figure 1: Distribution of model log-likelihood. Glow model trained on CIFAR.
Figure 2: KL divergence-based analysis for OOD detection for the Gaussian
Figure 3: f(x) = x − log x − 1.
Figure 4: KL divergence between an arbitrary distribution
read the original abstract

We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let $P$ be a distribution with finite second moment, and let $\mathcal{N}_1$ and $\mathcal{N}_2$ be multivariate Gaussian distributions. We show that if $KL(P||\mathcal{N}_1)$ is large and $KL(\mathcal{N}_1||\mathcal{N}_2)$ is at most $\epsilon$, then $KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{\epsilon})$. Moreover, we prove that this $\sqrt{\epsilon}$ rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims to establish a sharp stability result for KL divergence under Gaussian perturbations: for any distribution P with finite second moment and multivariate Gaussians N1, N2, if KL(P || N1) is large and KL(N1 || N2) ≤ ε then KL(P || N2) ≥ KL(P || N1) − O(√ε). The √ε rate is shown to be optimal even when restricting to the Gaussian family. The result removes the all-Gaussian assumption from prior relaxed triangle inequalities and is applied to justify KL-based OOD detection in flow-based generative models.

Significance. If the central derivation holds, the result is significant because it supplies the first explicit, optimal one-sided stability bound that applies to non-Gaussian P under only a second-moment condition. The explicit Gaussian constructions establishing optimality and the careful treatment of KL asymmetry are strengths that directly support applications in deep generative models and reinforcement learning where strong Gaussian assumptions are unrealistic.

minor comments (4)
  1. Abstract and §1: the qualifier “KL(P||N1) is large” is used without a precise threshold; the main theorem statement should clarify whether the O(√ε) bound holds for all finite-second-moment P or only when KL(P||N1) exceeds a quantity depending on ε and the second moments of P.
  2. §3 (proof of the lower bound): the argument that the difference of Gaussian log-densities is a quadratic polynomial controlled by KL(N1||N2) ≤ ε is sketched but the explicit constant in the O(√ε) term is not displayed; adding the dependence on dimension d and the second-moment bound of P would make the result more usable.
  3. §4 (optimality construction): the mean-shift example is convincing, yet it would help to state the exact scaling of the mean displacement with √ε and to confirm that the resulting KL(P||N2) drop is exactly Θ(√ε) rather than o(√ε) for the chosen sequence of ε (a one-dimensional version of this computation is sketched after this list).
  4. Notation: the symbols n1 and n2 for the Gaussian densities are introduced without a global definition; a short notation table or consistent use of N1(x), N2(x) would improve readability.
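
For concreteness, the scaling requested in comment 3 can be made explicit in one dimension (our computation, not quoted from the paper). With N1 = N(0, 1), N2 = N(δ, 1) and P = N(m, 1),

\[
\mathrm{KL}(\mathcal N_1\|\mathcal N_2)=\frac{\delta^{2}}{2}=\epsilon
\;\Longrightarrow\;
\delta=\sqrt{2\epsilon},
\qquad
\mathrm{KL}(P\|\mathcal N_1)-\mathrm{KL}(P\|\mathcal N_2)
=\frac{m^{2}-(m-\delta)^{2}}{2}
= m\sqrt{2\epsilon}-\epsilon,
\]

which is Θ(√ε) for any fixed m > 0, not o(√ε): the displacement scales exactly as √(2ε), and the constant in front of √ε grows with m, i.e. with how far P sits from the reference Gaussian.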

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our manuscript and for recommending minor revision. The referee's summary accurately captures the central result: a sharp one-sided stability bound for KL(P || N2) in terms of KL(P || N1) when N1 and N2 are Gaussians, P has finite second moments, and the bound is optimal even within the Gaussian family. As the report contains no specific major comments, we have no revisions to propose at this stage.

Circularity Check

0 steps flagged

No significant circularity; self-contained mathematical derivation

full rationale

The paper derives the claimed stability bound KL(P||N2) ≥ KL(P||N1) - O(√ε) directly from the explicit form of Gaussian log-densities and the finite-second-moment assumption on P, which guarantees that the expectation of the quadratic difference exists and can be controlled by the parameter distance induced by KL(N1||N2) ≤ ε. The one-sided lower bound is obtained by discarding the positive part of the difference; optimality follows from explicit mean-shift constructions within the Gaussian family that achieve a matching Θ(√ε) drop. No fitted parameters are renamed as predictions, no self-citations serve as load-bearing premises for the core inequality, and the result does not reduce to a redefinition or ansatz imported from prior author work. The derivation is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central result rests on the domain assumption of finite second moment for P; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption P has finite second moment
    Invoked to control the tail behavior needed for the stability inequality.

pith-pipeline@v0.9.0 · 5598 in / 1054 out tokens · 33564 ms · 2026-05-10T15:41:14.425561+00:00 · methodology

discussion (0)

