pith. machine review for the scientific record.

arxiv: 2605.06883 · v1 · submitted 2026-05-07 · 📊 stat.ML · cs.LG


Kernel Selection is Model Selection: A Unified Complexity-Penalized Approach for MMD Two-Sample Tests

Xiaoming Huo, Yijin Ni

Pith reviewed 2026-05-11 00:58 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords MMD · two-sample test · kernel selection · model selection · complexity penalty · nonparametric testing · deep kernels · Type-I error control

The pith

Treating kernel selection as model selection with a complexity penalty lets MMD tests optimize kernels directly from data while keeping unconditional Type-I validity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames data-driven kernel choice in MMD two-sample testing as a model-selection task rather than an afterthought. From uniform concentration inequalities it derives a penalty term that grows with the richness of the allowed kernel family, so the optimized statistic stays controlled rather than overfit. This penalty replaces the need for finite grids or ratio-based criteria, letting the procedure maximize the test statistic over continuous parameter spaces that include bandwidths, polynomial features, and deep-network weights. A reader would care because fixed kernels miss some distributional differences, while earlier adaptive methods either suffer variance collapse or cannot scale beyond small discrete sets.
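
To make the mechanics concrete, here is a minimal numerical sketch of the idea for the simplest case, a Gaussian kernel with one bandwidth. Everything below is illustrative: the function names, the placeholder penalty `c * sqrt(log n / n)`, and the constant `c` are ours, not the paper's; the real CP-MMD penalty is derived from the uniform concentration bound over the chosen kernel class.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mmd2_unbiased(X, Y, log_sigma):
    """Unbiased MMD^2 estimate with Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sigma2 = np.exp(2.0 * log_sigma)
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-d2 / (2.0 * sigma2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

def cp_objective(log_sigma, X, Y, c=1.0):
    """Complexity-penalized criterion: empirical MMD^2 minus a placeholder penalty.
    For this one-parameter class the placeholder is class-level (it does not vary
    with the bandwidth); in the paper the penalty scales with the richness of the
    kernel class, which matters most for deep-kernel classes."""
    n = min(len(X), len(Y))
    penalty = c * np.sqrt(np.log(n) / n)   # illustrative stand-in, not the paper's bound
    return mmd2_unbiased(X, Y, log_sigma) - penalty

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))
Y = rng.normal(0.3, 1.0, size=(200, 5))   # small mean shift

# Grid-free selection: maximize the penalized criterion over a continuous log-bandwidth.
res = minimize_scalar(lambda s: -cp_objective(s, X, Y), bounds=(-3.0, 3.0), method="bounded")
print("selected log-bandwidth:", res.x, "penalized statistic:", -res.fun)
```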

Core claim

We establish data-driven kernel selection as a model selection problem and propose Complexity-Penalized MMD (CP-MMD), a criterion obtained by applying the two-sample uniform concentration inequality to the post-optimization MMD problem. The resulting penalty bounds the empirical MMD by the complexity of the kernel search space, mathematically absorbing the cost of optimization. CP-MMD therefore permits direct, grid-free maximization over continuous parametric classes that include scalar bandwidths, polynomial-feature bandwidths, and deep-network parameters. By formally accounting for optimization complexity, the procedure maximizes true test power while ensuring unconditional Type-I validity.

What carries the argument

The complexity penalty inside CP-MMD, obtained by substituting the optimized kernel into a uniform concentration bound so that the penalty grows with the richness of the kernel family and thereby controls dependence induced by data-driven selection.
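
In schematic form, with our notation rather than the paper's (pen_n(K, δ) stands for whatever complexity functional the derivation produces; the exact centering and constants will differ):

```latex
\widehat{J}_{\mathrm{CP}}(k) \;:=\; \widehat{\mathrm{MMD}}^2(k) - \mathrm{pen}_n(\mathcal{K}, \delta),
\qquad
\hat{k} \;:=\; \arg\max_{k \in \mathcal{K}} \widehat{J}_{\mathrm{CP}}(k),
\qquad \text{where} \qquad
\Pr\!\Bigl( \sup_{k \in \mathcal{K}}
  \bigl| \widehat{\mathrm{MMD}}^2(k) - \mathrm{MMD}^2(k) \bigr|
  > \mathrm{pen}_n(\mathcal{K}, \delta) \Bigr) \;\le\; \delta .
```

Under H0, where MMD²(k) = 0 for every k in the class, the event sup_k Ĵ_CP(k) > 0 then has probability at most δ, which is the sense in which the penalty is said to absorb the cost of optimization.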

If this is right

  • CP-MMD can be maximized directly over continuous kernel parameters instead of being restricted to finite grids.
  • The test controls Type-I error unconditionally at any fixed significance level when the penalty is used.
  • Power is at least as high as existing grid-based or ratio-based MMD procedures across linear, polynomial-feature, and deep-kernel regimes.
  • The same penalty construction extends to any kernel family whose complexity measure can be bounded, including those parameterized by neural networks; a minimal deep-kernel sketch follows below.
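
For the deep-kernel regime specifically, here is a minimal sketch of what grid-free, penalized maximization could look like. The MLP architecture follows the common setup quoted under Figure 1 (d → 200 → 200 → 10, LeakyReLU); the spectral-norm-product penalty with coefficient `c1` is our illustrative stand-in, motivated by the quantity Π(t) tracked in Figures 4–5, and is not the paper's derived penalty.

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Feature map for a deep kernel: a Gaussian kernel is applied to MLP features."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, 200), nn.LeakyReLU(),
            nn.Linear(200, 200), nn.LeakyReLU(),
            nn.Linear(200, 10),
        )
    def forward(self, x):
        return self.net(x)

def mmd2_biased(phi_x, phi_y, sigma=1.0):
    """Biased (V-statistic) MMD^2 on learned features, kept simple for illustration."""
    Z = torch.cat([phi_x, phi_y], dim=0)
    K = torch.exp(-torch.cdist(Z, Z) ** 2 / (2.0 * sigma ** 2))
    n = phi_x.shape[0]
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()

def spectral_norm_product(model):
    """Product of layer spectral norms, the quantity Figures 4-5 track along training."""
    prod = torch.ones(())
    for layer in model.net:
        if isinstance(layer, nn.Linear):
            prod = prod * torch.linalg.matrix_norm(layer.weight, ord=2)
    return prod

def cp_step(model, opt, X, Y, c1=1e-2):
    """One ascent step on an illustrative penalized objective
    J = MMD^2 - c1 * (product of spectral norms); not the paper's exact penalty."""
    opt.zero_grad()
    J = mmd2_biased(model(X), model(Y)) - c1 * spectral_norm_product(model)
    (-J).backward()            # maximize J by descending on -J
    opt.step()
    return float(J)

# Under H0 (X and Y from the same distribution) the penalty is what keeps the
# objective from being driven upward by overfitting the kernel to the sample.
d = 28
X, Y = torch.randn(200, d), torch.randn(200, d)
model = FeatureNet(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):
    J = cp_step(model, opt, X, Y)
```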

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same penalty idea could be carried over to other kernel-based procedures where the kernel is chosen from data, such as kernel-based regression or independence testing.
  • In high-dimensional settings where no single fixed kernel works well, CP-MMD offers a systematic way to let the kernel adapt while retaining a valid p-value.
  • If the concentration inequality can be tightened after optimization, the penalty term itself could be made smaller, increasing power without losing validity.

Load-bearing premise

The two-sample uniform concentration inequality derived in earlier work continues to hold without extra adjustments once the kernel has been chosen from the same data.

What would settle it

A Monte Carlo experiment on identical distributions in which the empirical rejection rate of the CP-MMD test, after grid-free optimization over a rich parametric kernel family, exceeds the nominal alpha level.
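
A sketch of the shape such an experiment could take, under heavy simplification: `optimized_mmd2` below maximizes a plain (unpenalized) biased MMD statistic over a continuous bandwidth, and calibration uses a permutation null that re-runs the selection on every shuffled split. All names and choices here are ours; to run the actual check one would swap in the paper's CP-MMD criterion and its calibration in place of these stand-ins.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import cdist

def optimized_mmd2(X, Y):
    """Grid-free bandwidth selection: maximize a biased (V-statistic) MMD^2 over log-sigma."""
    def neg_stat(log_sigma):
        s2 = np.exp(2.0 * log_sigma)
        k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * s2))
        return -(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())
    return -minimize_scalar(neg_stat, bounds=(-3.0, 3.0), method="bounded").fun

def selection_aware_permutation_test(X, Y, alpha=0.05, n_perm=200, rng=None):
    """Permutation null that repeats the kernel selection on every shuffled split,
    so the rejection decision accounts for the adaptivity of the procedure."""
    rng = rng or np.random.default_rng()
    Z, n = np.vstack([X, Y]), len(X)
    obs = optimized_mmd2(X, Y)
    null = [optimized_mmd2(Z[p[:n]], Z[p[n:]])
            for p in (rng.permutation(len(Z)) for _ in range(n_perm))]
    p_value = (1 + sum(s >= obs for s in null)) / (1 + n_perm)
    return p_value <= alpha

# Monte Carlo under H0 (P = Q): a calibrated procedure should reject in roughly
# alpha = 5% of repetitions; a rate well above that would be the falsifying evidence.
rng = np.random.default_rng(1)
n_rep = 100   # expensive, since the selection is re-run for every permutation
rejections = sum(
    selection_aware_permutation_test(rng.normal(size=(100, 5)),
                                     rng.normal(size=(100, 5)), rng=rng)
    for _ in range(n_rep)
)
print(f"empirical Type-I rate: {rejections / n_rep:.3f}")
```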

Figures

Figures reproduced from arXiv: 2605.06883 by Xiaoming Huo, Yijin Ni.

Figure 1: Class-richness sweep on a fixed signal (±1 SE): as w grows, Liu collapses through variance collapse, Plain overfits, while CP-MMD holds power 1.00. Common setup: unless otherwise stated, all tests use significance level α = 0.05 with a permutation null based on N_perm = 200 permutations; deep-kernel experiments use the same MLP architecture across methods, d → 200 → 200 → 10 with LeakyReLU activations, …
Figure 2: Three-regime head-to-head (dot labels = exact power, shaded bands = …).
Figure 3: (a) Higgs power (d = 28, n ∈ {200, 500}): CP-MMD-polynomial3 matches Median (1.00) and beats MMDAgg (0.98) at n = 500; CP-MMD-deep dominates Liu/Plain on the same MLP (0.92 vs. 0.10/0.70). (b) Deployment per-test cost (d = 28, n = 200; log–log axes): CP-MMD references are B-free (flat); MMDAgg grows linearly; MMD-FUSE near-constant. At B = 10, CP-MMD-polynomial3 (399 ms) is 1.6× faster than MMDAgg (637 ms). CP-MMD …
Figure 4: Training dynamics under H0, five independent runs. (a) J_Liu diverges to 10³–10⁴ despite P = Q. (b) J_CP stays negative. (c) The empirical MMD hovers near zero (collapse). (d) The spectral-norm product along the trajectory, Π(t) := ∏_{j=1}^L ‖W_j^(t)‖₂ (the y-axis), plateaus under CP-MMD while growing unboundedly under plain maximization.
Figure 5: Empirical Π(T) at convergence (blue circles) decreases with Ĉ₁ across seven orders of magnitude (monotone in the productive range Ĉ₁ ∈ [10⁻⁵, 10], with mild floor saturation at Ĉ₁ = 10²). The blue dotted line is the least-squares fit to the productive range [10⁻³, 10⁻¹], with empirical slope ≈ −0.5. The qualitative observation Π(T) < ∞ for every Ĉ₁ > 0 extends across the entire non-zero sweep. …
Original abstract

The Maximum Mean Discrepancy (MMD) is a cornerstone statistic for nonparametric two-sample testing, but its test power is dictated entirely by the chosen kernel. Because any fixed kernel inherently fails to distinguish certain distributions, the kernel must be dynamically optimized. However, data-driven optimization violates the foundational i.i.d. assumption, forcing a strict trade-off in existing frameworks. Ratio criteria ignore this dependence, inducing overfitting and variance collapse on rich kernel classes. Conversely, aggregation methods bypass the dependence using finite grids, but this strategy cannot scale to continuous search spaces like deep kernels. To break this dichotomy, we establish data-driven kernel selection as a model selection problem. We propose Complexity-Penalized MMD (CP-MMD), a criterion derived by applying the two-sample uniform concentration inequality of preceding works to the post-optimization MMD problem. The resulting penalty bounds the empirical MMD by the complexity of the kernel search space, mathematically absorbing the cost of optimization, so that CP-MMD enables direct, grid-free maximization over continuous parametric classes, including scalar bandwidths, polynomial feature bandwidths, and deep network parameters. By formally accounting for optimization complexity, we prove that CP-MMD maximizes true test power while ensuring unconditional Type-I validity. Consequently, CP-MMD enables grid-free kernel selection across linear, polynomial-feature, and deep regimes, matching or exceeding state-of-the-art test power.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Complexity-Penalized MMD (CP-MMD) for data-driven kernel selection in nonparametric two-sample testing. It treats kernel optimization as a model selection problem and derives a penalty by applying prior two-sample uniform concentration inequalities directly to the post-optimization MMD statistic. This penalty is claimed to bound the empirical MMD by the complexity of the kernel class, enabling grid-free maximization over continuous parametric families (scalar bandwidths, polynomial features, deep network parameters) while proving unconditional Type-I validity and maximization of true test power.

Significance. If the central derivation holds, the work would offer a unified, scalable framework for kernel selection in MMD tests that avoids both the overfitting of ratio-based criteria and the grid restrictions of aggregation methods. It would enable principled optimization over rich continuous classes including deep kernels, with explicit accounting for optimization complexity. The claimed proofs of validity and power maximization constitute a substantive technical contribution if rigorously established.

major comments (1)
  1. [Abstract] Abstract (and the derivation of CP-MMD): the unconditional Type-I validity claim rests on applying the two-sample uniform concentration inequality of prior work directly to the MMD after data-driven kernel optimization. Because optimization over continuous classes (including deep parameters) introduces dependence between the selected kernel and the test samples, the standard fixed-kernel bound does not automatically transfer; an explicit correction (e.g., chaining argument or union bound over the optimization path) is required to prevent inflation of the bound. This step is load-bearing for the validity guarantee and must be detailed to support the central claim.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'mathematically absorbing the cost of optimization' is informal; a precise statement of how the penalty is obtained from the concentration inequality would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the insightful comments on our manuscript proposing CP-MMD for kernel selection in MMD tests. We address the major comment on the Type-I validity derivation below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and the derivation of CP-MMD): the unconditional Type-I validity claim rests on applying the two-sample uniform concentration inequality of prior work directly to the MMD after data-driven kernel optimization. Because optimization over continuous classes (including deep parameters) introduces dependence between the selected kernel and the test samples, the standard fixed-kernel bound does not automatically transfer; an explicit correction (e.g., chaining argument or union bound over the optimization path) is required to prevent inflation of the bound. This step is load-bearing for the validity guarantee and must be detailed to support the central claim.

    Authors: The referee correctly identifies the importance of handling data dependence in the validity proof. Our approach applies the uniform concentration inequality from prior work directly to the supremum over the kernel class. Because this inequality provides a simultaneous bound over all kernels in the class, it holds for the data-driven optimized kernel as well, without the need for further corrections. The complexity penalty is precisely the term arising from this uniform bound, which absorbs the optimization cost. We will revise the manuscript to explicitly state this reasoning in the abstract and derivation, including a brief explanation of why uniformity ensures validity post-optimization. revision: yes
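
A schematic way to state the uniformity argument the authors invoke, in our notation (pen_n is the class-level bound from the concentration inequality):

```latex
\Bigl\{ \bigl| \widehat{\mathrm{MMD}}^2(\hat{k}) - \mathrm{MMD}^2(\hat{k}) \bigr|
        > \mathrm{pen}_n(\mathcal{K}, \delta) \Bigr\}
\;\subseteq\;
\Bigl\{ \sup_{k \in \mathcal{K}}
        \bigl| \widehat{\mathrm{MMD}}^2(k) - \mathrm{MMD}^2(k) \bigr|
        > \mathrm{pen}_n(\mathcal{K}, \delta) \Bigr\}
\quad \text{for any data-dependent } \hat{k} \in \mathcal{K},
```

so a bound that holds simultaneously over the class transfers to the selected kernel without a separate post-selection correction. On this reading, the referee's concern concentrates on pen_n(K, δ) itself: for continuous, deep-parameterized classes, the covering-number or chaining constants that make it finite and sharp are exactly where the details matter.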

Circularity Check

0 steps flagged

No circularity: CP-MMD penalty derived from external concentration inequality applied to optimized statistic

full rationale

The derivation applies a two-sample uniform concentration inequality from preceding works directly to the post-optimization MMD to obtain the complexity penalty in CP-MMD. This step imports an independent bound rather than defining the penalty via the target power or validity quantities, so the subsequent claims of power maximization and unconditional Type-I control follow from the imported inequality without reducing to a self-definition or fitted input by construction. No self-citation chains, ansatz smuggling, or renaming of known results appear in the load-bearing steps; the approach remains self-contained against the cited external bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or new entities are mentioned; the method builds on an existing inequality as the key assumption.

axioms (1)
  • domain assumption The two-sample uniform concentration inequality from preceding works can be applied to the MMD statistic after kernel optimization.
    This is explicitly used to derive the complexity penalty for the post-optimization problem.

pith-pipeline@v0.9.0 · 5553 in / 1284 out tokens · 53203 ms · 2026-05-11T00:58:10.697655+00:00 · methodology



Reference graph

Works this paper leans on

22 extracted references · 1 canonical work page

  1. [1]

    Searching for exotic particles in high-energy physics with deep learning

Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5: 4308, 2014

  2. [2]

Local Rademacher complexities

Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4): 1497--1537, 2005

  3. [3]

    Spectrally-normalized margin bounds for neural networks

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017

  4. [4]

    MMD-Fuse : Learning and combining kernels for two-sample testing without data splitting

    Felix Biggs, Antonin Schrab, and Arthur Gretton. MMD-Fuse : Learning and combining kernels for two-sample testing without data splitting. In Advances in Neural Information Processing Systems, volume 36, 2023

  5. [5]

    Concentration Inequalities: A Nonasymptotic Theory of Independence

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013

  6. [6]

Statistical Inference

    G. Casella and R.L. Berger. Statistical Inference. Duxbury advanced series. Duxbury Press, 2nd edition, 2002. ISBN 9780534243128

  7. [7]

    A kernel test of goodness of fit

    Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pages 2606--2615, 2016

  8. [8]

    A plant-wide industrial process control problem

James J Downs and Ernest F Vogel. A plant-wide industrial process control problem. Computers & Chemical Engineering, 17(3): 245--255, 1993

  9. [9]

    A kernel statistical test of independence

Arthur Gretton, Kenji Fukumizu, Choon H Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, volume 20, 2007

  10. [10]

    A kernel two-sample test

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13: 723--773, 2012a

  11. [11]

    Optimal kernel choice for large-scale two-sample tests

Arthur Gretton, Bharath K Sriperumbudur, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, and Kenji Fukumizu. Optimal kernel choice for large-scale two-sample tests. Advances in Neural Information Processing Systems, 25, 2012b

  12. [12]

    Testing Statistical Hypotheses

    Erich L Lehmann and Joseph P Romano. Testing Statistical Hypotheses. Springer, 3rd edition, 2005

  13. [13]

    Learning deep kernels for non-parametric two-sample tests

    Feng Liu, Wenkai Xu, Jie Lu, Guangquan Zhang, Arthur Gretton, and Danica J Sutherland. Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning, pages 6316--6326. PMLR, 2020

  14. [14]

A kernelized Stein discrepancy for goodness-of-fit tests

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276--284, 2016

  15. [15]

    Uniform concentration and symmetrization for weak interactions

    Andreas Maurer and Massimiliano Pontil. Uniform concentration and symmetrization for weak interactions. In Conference on Learning Theory, pages 2372--2387. PMLR, 2019

  16. [16]

    On the problem of the most efficient tests of statistical hypotheses

Jerzy Neyman and Egon S Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231: 289--337, 1933

  17. [17]

A uniform concentration inequality for kernel-based two-sample statistics

    Yijin Ni and Xiaoming Huo. A uniform concentration inequality for kernel-based two-sample statistics, 2024. URL https://arxiv.org/abs/2405.14051

  18. [18]

    On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions

Aaditya Ramdas, Sashank J Reddi, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI Conference on Artificial Intelligence, volume 29, pages 3571--3577, 2015

  19. [19]

    MMD aggregated two-sample test

Antonin Schrab, Ilmun Kim, Michèle Albert, Béatrice Laurent, Benjamin Guedj, and Arthur Gretton. MMD aggregated two-sample test. Journal of Machine Learning Research, 24(194): 1--81, 2023

  20. [20]

    Hilbert space embeddings and metrics on probability measures

Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11: 1517--1561, 2010

  21. [21]

    Generative models and model criticism via optimized maximum mean discrepancy

    Dougal J Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alexander Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations, 2017

  22. [22]

    High-Dimensional Statistics: A Non-Asymptotic Viewpoint

    Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019