Pith · machine review for the scientific record

arXiv: 2605.10668 · v1 · submitted 2026-05-11 · 💻 cs.LG · math.OC · math.ST · stat.TH

Recognition: 2 theorem links


A Spectral Framework for Closed-Form Relative Density Estimation

Francis Bach (SIERRA)

Pith reviewed 2026-05-12 04:02 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · math.ST · stat.TH
keywords relative density estimation · KL divergence · spectral methods · closed-form estimation · log-density ratio · f-divergences · linear models · moment-based estimators

The pith

Relative log-density estimation reduces to closed-form spectral formulas from first and second feature moments in linear models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the Kullback-Leibler divergence between two linearly parameterized distributions can be rewritten as an integral of weighted chi-squared divergences. This rewriting converts the estimation task into a family of least-squares problems whose solutions depend only on the first and second moments of the chosen features. The resulting spectral formulas give explicit, non-iterative estimators for both the divergence value and the underlying log-density ratio. Because the formulas remain closed-form even for unnormalized and conditional models, they avoid the optimization loops required by standard variational approaches. The same construction extends to a larger family of f-divergences while preserving the moment-based character of the estimators.
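As background for this rewriting, one standard identity of this type (stated here for orientation; the paper's exact weighting of the chi-squared terms may differ) expresses the KL divergence through chi-squared divergences to mixtures of the two distributions:

\[
\mathrm{KL}(p \,\|\, q) \;=\; \int_0^1 \frac{1}{1-\rho}\, \chi^2\!\big(p \,\big\|\, \rho\, p + (1-\rho)\, q\big)\, \mathrm{d}\rho,
\qquad
\chi^2(p \,\|\, r) \;=\; \int \frac{\big(p(x) - r(x)\big)^2}{r(x)}\, \mathrm{d}x .
\]

In the paper's framework, each of these weighted chi-squared terms yields a least-squares problem whose solution depends only on first- and second-order feature moments, which is where the closed-form character comes from.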

Core claim

By expressing the KL divergence as an integral of weighted chi-squared divergences, the authors obtain an explicit spectral formula that depends solely on first- and second-order feature moments and directly supplies closed-form estimators for both divergences and log-density potentials in linearly parameterized models, including unnormalized and conditional cases.

What carries the argument

The spectral formula, obtained by converting the integral representation of the KL divergence into a family of least-squares problems whose solutions are expressed through first- and second-order feature moments.
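To make the moment-only character concrete, here is a minimal sketch of the classical least-squares density-ratio construction in the spirit of uLSIF (Kanamori, Hido, and Sugiyama, 2009). It is not the paper's spectral formula, but it shares the closed-form, moment-based structure; the feature map, the regularization `lam`, and the assumption that the ratio p/q lies near the span of the features are illustrative choices.

    # A minimal sketch (not the paper's spectral estimator): a closed-form,
    # moment-only estimator of the Pearson chi-squared divergence via a
    # least-squares fit of the density ratio, uLSIF-style.
    import numpy as np

    def chi2_moment_estimator(xp, xq, features, lam=1e-3):
        """Closed-form estimate of the Pearson chi^2 divergence chi^2(p||q).

        xp, xq   : samples from p and q, arrays of shape (n, d)
        features : callable mapping an (n, d) array to an (n, k) feature matrix
        lam      : ridge regularization for the ratio coefficients
        """
        phi_p, phi_q = features(xp), features(xq)
        mu_p = phi_p.mean(axis=0)                    # first moment under p
        sigma_q = phi_q.T @ phi_q / phi_q.shape[0]   # second moment under q
        # Least-squares fit of the density ratio p/q by theta^T phi(x):
        theta = np.linalg.solve(sigma_q + lam * np.eye(len(mu_p)), mu_p)
        # Plug-in Pearson divergence: E_q[(p/q)^2] - 1 ~ theta^T mu_p - 1.
        return theta @ mu_p - 1.0, theta

Roughly speaking, the paper aggregates a family of such problems, indexed by the weighting in the integral representation, into the spectral formula; the sketch shows only a single member of that family.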

If this is right

  • Estimators for a broad class of f-divergences are obtained in the same closed-form manner.
  • The method combines directly with kernelization or with features learned by neural networks.
  • Convergence guarantees apply to the resulting moment-based estimators.
  • Empirical performance can be compared against logistic and softmax regression baselines on synthetic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same moment-based route may simplify training of energy-based models by replacing variational bounds with explicit formulas.
  • Conditional models that currently rely on softmax normalization could be replaced by direct closed-form ratio estimators in some settings.
  • The framework suggests a moment-matching route to density-ratio estimation that could be tested on high-dimensional data with learned features.

Load-bearing premise

The probabilistic models must be linear in the chosen features, and the integral representation of the KL divergence must hold for those features without further restrictions.

What would settle it

Compute the closed-form estimator on a pair of simple Gaussians with known closed-form KL value and check whether the numerical result matches the analytic KL to machine precision.
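A minimal sketch of that check, assuming a hypothetical estimate_kl_spectral(xp, xq, features) standing in for the paper's closed-form estimator (not implemented here); only the analytic reference value for univariate Gaussians is computed.

    # A minimal sketch of the proposed sanity check. `estimate_kl_spectral` is a
    # hypothetical placeholder for the paper's closed-form estimator.
    import numpy as np

    def kl_gaussian_1d(m1, s1, m2, s2):
        """Analytic KL( N(m1, s1^2) || N(m2, s2^2) )."""
        return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

    rng = np.random.default_rng(0)
    m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
    xp = rng.normal(m1, s1, size=100_000)   # samples from p
    xq = rng.normal(m2, s2, size=100_000)   # samples from q

    kl_true = kl_gaussian_1d(m1, s1, m2, s2)
    # kl_hat = estimate_kl_spectral(xp, xq, features=...)   # hypothetical
    # check: abs(kl_hat - kl_true) should shrink as n and the feature space grow
    print(f"analytic KL = {kl_true:.6f}")

Note that machine-precision agreement is only expected when exact population moments are plugged in and the log-density ratio lies in the feature span; with finite samples the check is agreement up to the estimator's statistical error.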

Figures

Figures reproduced from arXiv: 2605.10668 by Francis Bach (SIERRA).

Figure 1: Left and Center: Comparison of quantum, regular, and debiased estimators of the KL divergence for q uniform on [0, 1] and p with a density having one square-integrable derivative, with a kernel equivalent to κ = 0 in Prop. 3, with a fixed n-dependent schedule for λ. Left: comparison of estimators with increasing n (mean and standard errors over 64 replications). Center: scaling laws with a power-law fit (d…

Figure 2: Left and Center: Comparison of softmax regression (equivalent here to the variational method, see App. J) and the spectral estimator, with data in two dimensions and class-conditional distributions that are combinations of a few cosines. Left: we perform PCA on a large set of random ReLU features, and then learn with an increasing number of features, comparing softmax regression (solved with Newton’s metho…

Figure 3: Left: Number of latent dimensions needed for estimating mutual information. Right: Comparison of estimators of v and w through their normalized divergence estimates (lower is better), by selecting the best hyperparameter on validation data over 4 replications (X = R^d, with d ∈ {2, 4, 8, 16}, n = 10000 training observations, and m = 50 hidden neurons).

Figure 4: Estimation of log(dp/dq(x)) for x ∈ [0, 1], and feature spaces composed of frequencies up to order k (space of dimension 2k + 1), with sines and cosines, with n = 100000 (thus approaching the population limit). From left to right: estimation of 1 − dp/dq(x) using our closed-form estimator for the KL divergence, then estimation of log(dp/dq(x)) with the KL estimator, the Pearson estimator (aimed at estimati…

Figure 5: Estimation of log(dp(x1, x2)/dp(x1)dp(x2)) for x1 ∈ [0, 1] and x2 ∈ {1, 2, 3}, corresponding to a softmax regression problem, with feature spaces composed of frequencies up to order k (space of dimension 2k + 1), with sines and cosines, and n = 100000 (thus approaching the population limit), with one plot for each value of x2. From left to right: KL estimator, Pearson estimator, then classical variational …
original abstract

We propose a closed-form spectral framework for relative log-density estimation in linearly parameterized probabilistic models, including unnormalized and conditional models. This is achieved by representing the Kullback-Leibler (KL) divergence as an integral of weighted chi-squared divergences, converting KL estimation into a family of least-squares problems. We derive an explicit spectral formula based only on first- and second-order feature moments, yielding closed-form estimators of both divergences and log-density potentials for fixed features. The framework extends to a broad class of f-divergences and can be combined with kernelization or feature learning with neural networks. We prove convergence guarantees for the resulting estimators and empirically compare them on synthetic data with optimization-based variational formulations, including logistic and softmax regression for normalized conditional models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a closed-form spectral framework for relative log-density estimation in linearly parameterized probabilistic models (including unnormalized and conditional ones). It represents the KL divergence as an integral of weighted chi-squared divergences, converts the problem into a family of least-squares problems, and derives an explicit spectral formula using only first- and second-order moments of fixed features. This yields closed-form estimators for divergences and log-density potentials. The framework extends to f-divergences, supports kernelization or neural feature learning, includes convergence guarantees, and is empirically compared to optimization-based variational methods on synthetic data.

Significance. If the central derivations hold, the work provides an efficient closed-form alternative to variational optimization for density ratio and divergence estimation in a broad class of models. The explicit use of moment-based spectral formulas, combined with proved convergence rates and direct empirical comparisons (including logistic/softmax regression), represents a practical and theoretically grounded contribution that could simplify computations in unnormalized models and integrate with modern feature learning techniques.

major comments (1)
  1. [§3] §3 (derivation of the spectral formula): The conversion of KL(p||q) into an integral of weighted chi-squared divergences is load-bearing for the closed-form claim, yet the manuscript does not explicitly state the required conditions on the feature functions (e.g., boundedness, completeness of the span, or absolute continuity to eliminate residual cross-terms). Without these, the subsequent least-squares spectral estimator may not be unbiased for the true KL, undermining the convergence guarantees stated in §4.
minor comments (2)
  1. [§3] Notation for the weighting measure in the integral representation (around Eq. (5)) is introduced without a clear definition of its support or normalization, which could confuse readers when extending to conditional models.
  2. [§5] The experimental section compares against logistic and softmax regression but does not report the exact feature dimension or regularization values used in the spectral estimator, making reproducibility of the reported error curves difficult.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and insightful comments on our manuscript. We address the major comment point by point below, and we will revise the paper to incorporate the necessary clarifications.

point-by-point responses
  1. Referee: [§3] §3 (derivation of the spectral formula): The conversion of KL(p||q) into an integral of weighted chi-squared divergences is load-bearing for the closed-form claim, yet the manuscript does not explicitly state the required conditions on the feature functions (e.g., boundedness, completeness of the span, or absolute continuity to eliminate residual cross-terms). Without these, the subsequent least-squares spectral estimator may not be unbiased for the true KL, undermining the convergence guarantees stated in §4.

    Authors: We agree that explicitly stating the conditions on the feature functions is important for rigor. In the derivation of the spectral formula in §3, we implicitly rely on the feature functions being bounded to allow for the interchange of integration and expectation, and on the linear span of the features being complete in the L2 space with respect to the distribution q to ensure that the weighted chi-squared divergences fully represent the KL divergence without residual cross-terms. Absolute continuity is ensured by the setup of the probabilistic models considered. In the revised manuscript, we will add a dedicated paragraph in §3 to explicitly list these assumptions and discuss how they guarantee the unbiasedness of the estimator. This will also strengthen the justification for the convergence rates in §4. We do not believe this changes the main results but improves the clarity and completeness of the presentation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from integral representation and moments

full rationale

The paper starts from the representation of KL as an integral of weighted chi-squared divergences, converts this to a family of least-squares problems over linearly parameterized models, and derives the explicit spectral formula directly from first- and second-order feature moments. No step reduces the final estimator or log-density potential to a fitted parameter by construction, and no step relies on load-bearing self-citations or imported uniqueness theorems. The abstract and claimed chain treat the spectral solution as an algebraic consequence of the moment matrices, with convergence guarantees stated separately; this is a standard non-circular derivation from stated assumptions to closed-form output.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the linear parameterization of the models and the validity of rewriting KL as an integral of weighted chi-squared divergences; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption KL divergence can be expressed as an integral of weighted chi-squared divergences
    Invoked to convert the estimation problem into least-squares problems
  • domain assumption Models are linearly parameterized in the chosen features
    Required for the closed-form spectral solution to apply directly

pith-pipeline@v0.9.0 · 5421 in / 1088 out tokens · 32746 ms · 2026-05-12T04:02:06.617150+00:00 · methodology


