Pith · machine review for the scientific record

arXiv: 2605.10668 · v1 · submitted 2026-05-11 · 💻 cs.LG · math.OC · math.ST · stat.TH

Recognition: 2 theorem links


A Spectral Framework for Closed-Form Relative Density Estimation

Francis Bach (SIERRA)

Pith reviewed 2026-05-12 04:02 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · math.ST · stat.TH
keywords relative density estimation · KL divergence · spectral methods · closed-form estimation · log-density ratio · f-divergences · linear models · moment-based estimators

The pith

Relative log-density estimation reduces to closed-form spectral formulas from first and second feature moments in linear models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the Kullback-Leibler divergence between two linearly parameterized distributions can be rewritten as an integral of weighted chi-squared divergences. This rewriting converts the estimation task into a family of least-squares problems whose solutions depend only on the first and second moments of the chosen features. The resulting spectral formulas give explicit, non-iterative estimators for both the divergence value and the underlying log-density ratio. Because the formulas remain closed-form even for unnormalized and conditional models, they avoid the optimization loops required by standard variational approaches. The same construction extends to a larger family of f-divergences while preserving the moment-based character of the estimators.
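As background for this rewriting, one standard identity of this type (stated here for orientation; the paper's exact weighting of the chi-squared terms may differ) expresses the KL divergence through chi-squared divergences to mixtures of the two distributions:

\[
\mathrm{KL}(p \,\|\, q) \;=\; \int_0^1 \frac{1}{1-\rho}\, \chi^2\!\big(p \,\big\|\, \rho\, p + (1-\rho)\, q\big)\, \mathrm{d}\rho,
\qquad
\chi^2(p \,\|\, r) \;=\; \int \frac{\big(p(x) - r(x)\big)^2}{r(x)}\, \mathrm{d}x .
\]

In the paper's framework, each of these weighted chi-squared terms yields a least-squares problem whose solution depends only on first- and second-order feature moments, which is where the closed-form character comes from.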

Core claim

By expressing the KL divergence as an integral of weighted chi-squared divergences, the authors obtain an explicit spectral formula that depends solely on first- and second-order feature moments and directly supplies closed-form estimators for both divergences and log-density potentials in linearly parameterized models, including unnormalized and conditional cases.

What carries the argument

The spectral formula, obtained by converting the integral representation of the KL divergence into a family of least-squares problems whose solutions are expressed through first- and second-order feature moments.
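To make the moment-only character concrete, here is a minimal sketch of the classical least-squares density-ratio construction in the spirit of uLSIF (Kanamori, Hido, and Sugiyama, 2009). It is not the paper's spectral formula, but it shares the closed-form, moment-based structure; the feature map, the regularization `lam`, and the assumption that the ratio p/q lies near the span of the features are illustrative choices.

    # A minimal sketch (not the paper's spectral estimator): a closed-form,
    # moment-only estimator of the Pearson chi-squared divergence via a
    # least-squares fit of the density ratio, uLSIF-style.
    import numpy as np

    def chi2_moment_estimator(xp, xq, features, lam=1e-3):
        """Closed-form estimate of the Pearson chi^2 divergence chi^2(p||q).

        xp, xq   : samples from p and q, arrays of shape (n, d)
        features : callable mapping an (n, d) array to an (n, k) feature matrix
        lam      : ridge regularization for the ratio coefficients
        """
        phi_p, phi_q = features(xp), features(xq)
        mu_p = phi_p.mean(axis=0)                    # first moment under p
        sigma_q = phi_q.T @ phi_q / phi_q.shape[0]   # second moment under q
        # Least-squares fit of the density ratio p/q by theta^T phi(x):
        theta = np.linalg.solve(sigma_q + lam * np.eye(len(mu_p)), mu_p)
        # Plug-in Pearson divergence: E_q[(p/q)^2] - 1 ~ theta^T mu_p - 1.
        return theta @ mu_p - 1.0, theta

Roughly speaking, the paper aggregates a family of such problems, indexed by the weighting in the integral representation, into the spectral formula; the sketch shows only a single member of that family.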

If this is right

  • Estimators for a broad class of f-divergences are obtained in the same closed-form manner.
  • The method combines directly with kernelization or with features learned by neural networks.
  • Convergence guarantees apply to the resulting moment-based estimators.
  • Empirical performance can be compared against logistic and softmax regression baselines on synthetic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same moment-based route may simplify training of energy-based models by replacing variational bounds with explicit formulas.
  • Conditional models that currently rely on softmax normalization could be replaced by direct closed-form ratio estimators in some settings.
  • The framework suggests a moment-matching route to density-ratio estimation that could be tested on high-dimensional data with learned features.

Load-bearing premise

The probabilistic models must be linear in the chosen features, and the integral representation of the KL divergence must hold for those features without further restrictions.

What would settle it

Compute the closed-form estimator on a pair of simple Gaussians with known closed-form KL value and check whether the numerical result matches the analytic KL to machine precision.
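A minimal sketch of that check, assuming a hypothetical estimate_kl_spectral(xp, xq, features) standing in for the paper's closed-form estimator (not implemented here); only the analytic reference value for univariate Gaussians is computed.

    # A minimal sketch of the proposed sanity check. `estimate_kl_spectral` is a
    # hypothetical placeholder for the paper's closed-form estimator.
    import numpy as np

    def kl_gaussian_1d(m1, s1, m2, s2):
        """Analytic KL( N(m1, s1^2) || N(m2, s2^2) )."""
        return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

    rng = np.random.default_rng(0)
    m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
    xp = rng.normal(m1, s1, size=100_000)   # samples from p
    xq = rng.normal(m2, s2, size=100_000)   # samples from q

    kl_true = kl_gaussian_1d(m1, s1, m2, s2)
    # kl_hat = estimate_kl_spectral(xp, xq, features=...)   # hypothetical
    # check: abs(kl_hat - kl_true) should shrink as n and the feature space grow
    print(f"analytic KL = {kl_true:.6f}")

Note that machine-precision agreement is only expected when exact population moments are plugged in and the log-density ratio lies in the feature span; with finite samples the check is agreement up to the estimator's statistical error.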

Figures

Figures reproduced from arXiv: 2605.10668 by Francis Bach (SIERRA).

Figure 1: Left and Center: Comparison of quantum, regular, and debiased estimators of the KL divergence for q uniform on [0, 1] and p with a density having one square-integrable derivative, with a kernel equivalent to κ = 0 in Prop. 3, with a fixed n-dependent schedule for λ. Left: comparison of estimators with increasing n (mean and standard errors over 64 replications). Center: scaling laws with a power-law fit (d…

Figure 2: Left and Center: Comparison of softmax regression (equivalent here to the variational method, see App. J) and the spectral estimator, with data in two dimensions and class-conditional distributions that are combinations of a few cosines. Left: we perform PCA on a large set of random ReLU features, and then learn with an increasing number of features, comparing softmax regression (solved with Newton’s metho…

Figure 3: Left: Number of latent dimensions needed for estimating mutual information. Right: Comparison of estimators of v and w through their normalized divergence estimates (lower is better), by selecting the best hyperparameter on validation data over 4 replications (X = R^d, with d ∈ {2, 4, 8, 16}, n = 10000 training observations, and m = 50 hidden neurons).

Figure 4: Estimation of log(dp/dq(x)) for x ∈ [0, 1], and feature spaces composed of frequencies up to order k (space of dimension 2k + 1), with sines and cosines, with n = 100000 (thus approaching the population limit). From left to right: estimation of 1 − dp/dq(x) using our closed-form estimator for the KL divergence, then estimation of log(dp/dq(x)) with the KL estimator, the Pearson estimator (aimed at estimati…

Figure 5: Estimation of log(dp(x1, x2)/dp(x1)dp(x2)) for x1 ∈ [0, 1] and x2 ∈ {1, 2, 3}, corresponding to a softmax regression problem, with feature spaces composed of frequencies up to order k (space of dimension 2k + 1), with sines and cosines, and n = 100000 (thus approaching the population limit), with one plot for each value of x2. From left to right: KL estimator, Pearson estimator, then classical variational …
original abstract

We propose a closed-form spectral framework for relative log-density estimation in linearly parameterized probabilistic models, including unnormalized and conditional models. This is achieved by representing the Kullback-Leibler (KL) divergence as an integral of weighted chi-squared divergences, converting KL estimation into a family of least-squares problems. We derive an explicit spectral formula based only on first- and second-order feature moments, yielding closed-form estimators of both divergences and log-density potentials for fixed features. The framework extends to a broad class of f-divergences and can be combined with kernelization or feature learning with neural networks. We prove convergence guarantees for the resulting estimators and empirically compare them on synthetic data with optimization-based variational formulations, including logistic and softmax regression for normalized conditional models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a closed-form spectral framework for relative log-density estimation in linearly parameterized probabilistic models (including unnormalized and conditional ones). It represents the KL divergence as an integral of weighted chi-squared divergences, converts the problem into a family of least-squares problems, and derives an explicit spectral formula using only first- and second-order moments of fixed features. This yields closed-form estimators for divergences and log-density potentials. The framework extends to f-divergences, supports kernelization or neural feature learning, includes convergence guarantees, and is empirically compared to optimization-based variational methods on synthetic data.

Significance. If the central derivations hold, the work provides an efficient closed-form alternative to variational optimization for density ratio and divergence estimation in a broad class of models. The explicit use of moment-based spectral formulas, combined with proved convergence rates and direct empirical comparisons (including logistic/softmax regression), represents a practical and theoretically grounded contribution that could simplify computations in unnormalized models and integrate with modern feature learning techniques.

major comments (1)
  1. [§3] §3 (derivation of the spectral formula): The conversion of KL(p||q) into an integral of weighted chi-squared divergences is load-bearing for the closed-form claim, yet the manuscript does not explicitly state the required conditions on the feature functions (e.g., boundedness, completeness of the span, or absolute continuity to eliminate residual cross-terms). Without these, the subsequent least-squares spectral estimator may not be unbiased for the true KL, undermining the convergence guarantees stated in §4.
minor comments (2)
  1. [§3] Notation for the weighting measure in the integral representation (around Eq. (5)) is introduced without a clear definition of its support or normalization, which could confuse readers when extending to conditional models.
  2. [§5] The experimental section compares against logistic and softmax regression but does not report the exact feature dimension or regularization values used in the spectral estimator, making reproducibility of the reported error curves difficult.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and insightful comments on our manuscript. We address the major comment point by point below, and we will revise the paper to incorporate the necessary clarifications.

point-by-point responses
  1. Referee: [§3] §3 (derivation of the spectral formula): The conversion of KL(p||q) into an integral of weighted chi-squared divergences is load-bearing for the closed-form claim, yet the manuscript does not explicitly state the required conditions on the feature functions (e.g., boundedness, completeness of the span, or absolute continuity to eliminate residual cross-terms). Without these, the subsequent least-squares spectral estimator may not be unbiased for the true KL, undermining the convergence guarantees stated in §4.

    Authors: We agree that explicitly stating the conditions on the feature functions is important for rigor. In the derivation of the spectral formula in §3, we implicitly rely on the feature functions being bounded to allow for the interchange of integration and expectation, and on the linear span of the features being complete in the L2 space with respect to the distribution q to ensure that the weighted chi-squared divergences fully represent the KL divergence without residual cross-terms. Absolute continuity is ensured by the setup of the probabilistic models considered. In the revised manuscript, we will add a dedicated paragraph in §3 to explicitly list these assumptions and discuss how they guarantee the unbiasedness of the estimator. This will also strengthen the justification for the convergence rates in §4. We do not believe this changes the main results but improves the clarity and completeness of the presentation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from integral representation and moments

full rationale

The paper starts from the representation of KL as an integral of weighted chi-squared divergences, converts this to a family of least-squares problems over linearly parameterized models, and derives the explicit spectral formula directly from first- and second-order feature moments. No step reduces the final estimator or log-density potential to a fitted parameter by construction, and no step relies on load-bearing self-citations or imported uniqueness theorems. The abstract and claimed chain treat the spectral solution as an algebraic consequence of the moment matrices, with convergence guarantees stated separately; this is a standard non-circular derivation from stated assumptions to closed-form output.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the linear parameterization of the models and the validity of rewriting KL as an integral of weighted chi-squared divergences; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption KL divergence can be expressed as an integral of weighted chi-squared divergences
    Invoked to convert the estimation problem into least-squares problems
  • domain assumption Models are linearly parameterized in the chosen features
    Required for the closed-form spectral solution to apply directly

pith-pipeline@v0.9.0 · 5421 in / 1088 out tokens · 32746 ms · 2026-05-12T04:02:06.617150+00:00 · methodology


