pith. machine review for the scientific record.

arxiv: 2605.06883 · v1 · submitted 2026-05-07 · 📊 stat.ML · cs.LG


Kernel Selection is Model Selection: A Unified Complexity-Penalized Approach for MMD Two-Sample Tests

Xiaoming Huo, Yijin Ni

Pith reviewed 2026-05-11 00:58 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords MMD · two-sample test · kernel selection · model selection · complexity penalty · nonparametric testing · deep kernels · Type-I error control

The pith

Treating kernel selection as model selection with a complexity penalty lets MMD tests optimize kernels directly from data while keeping unconditional Type-I validity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames data-driven kernel choice in MMD two-sample testing as a model-selection task rather than an afterthought. From uniform concentration inequalities it derives a penalty term that grows with the richness of the allowed kernel family, so the optimized statistic stays controlled rather than overfit. This penalty replaces the need for finite grids or ratio-based criteria, letting the procedure maximize the test statistic over continuous parameter spaces that include bandwidths, polynomial features, and deep-network weights. A reader would care because fixed kernels miss some distributional differences, while earlier adaptive methods either suffer variance collapse or cannot scale beyond small discrete sets.
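
To make the mechanics concrete, here is a minimal numerical sketch of the idea for the simplest case, a Gaussian kernel with one bandwidth. Everything below is illustrative: the function names, the placeholder penalty `c * sqrt(log n / n)`, and the constant `c` are ours, not the paper's; the real CP-MMD penalty is derived from the uniform concentration bound over the chosen kernel class.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mmd2_unbiased(X, Y, log_sigma):
    """Unbiased MMD^2 estimate with Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sigma2 = np.exp(2.0 * log_sigma)
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-d2 / (2.0 * sigma2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

def cp_objective(log_sigma, X, Y, c=1.0):
    """Complexity-penalized criterion: empirical MMD^2 minus a placeholder penalty.
    For this one-parameter class the placeholder is class-level (it does not vary
    with the bandwidth); in the paper the penalty scales with the richness of the
    kernel class, which matters most for deep-kernel classes."""
    n = min(len(X), len(Y))
    penalty = c * np.sqrt(np.log(n) / n)   # illustrative stand-in, not the paper's bound
    return mmd2_unbiased(X, Y, log_sigma) - penalty

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))
Y = rng.normal(0.3, 1.0, size=(200, 5))   # small mean shift

# Grid-free selection: maximize the penalized criterion over a continuous log-bandwidth.
res = minimize_scalar(lambda s: -cp_objective(s, X, Y), bounds=(-3.0, 3.0), method="bounded")
print("selected log-bandwidth:", res.x, "penalized statistic:", -res.fun)
```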

Core claim

We establish data-driven kernel selection as a model selection problem and propose Complexity-Penalized MMD (CP-MMD), a criterion obtained by applying the two-sample uniform concentration inequality to the post-optimization MMD problem. The resulting penalty bounds the empirical MMD by the complexity of the kernel search space, mathematically absorbing the cost of optimization. CP-MMD therefore permits direct, grid-free maximization over continuous parametric classes that include scalar bandwidths, polynomial-feature bandwidths, and deep-network parameters. By formally accounting for optimization complexity, the procedure maximizes true test power while ensuring unconditional Type-I validity.

What carries the argument

The complexity penalty inside CP-MMD, obtained by substituting the optimized kernel into a uniform concentration bound so that the penalty grows with the richness of the kernel family and thereby controls dependence induced by data-driven selection.
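
In schematic form, with our notation rather than the paper's (pen_n(K, δ) stands for whatever complexity functional the derivation produces; the exact centering and constants will differ):

```latex
\widehat{J}_{\mathrm{CP}}(k) \;:=\; \widehat{\mathrm{MMD}}^2(k) - \mathrm{pen}_n(\mathcal{K}, \delta),
\qquad
\hat{k} \;:=\; \arg\max_{k \in \mathcal{K}} \widehat{J}_{\mathrm{CP}}(k),
\qquad \text{where} \qquad
\Pr\!\Bigl( \sup_{k \in \mathcal{K}}
  \bigl| \widehat{\mathrm{MMD}}^2(k) - \mathrm{MMD}^2(k) \bigr|
  > \mathrm{pen}_n(\mathcal{K}, \delta) \Bigr) \;\le\; \delta .
```

Under H0, where MMD²(k) = 0 for every k in the class, the event sup_k Ĵ_CP(k) > 0 then has probability at most δ, which is the sense in which the penalty is said to absorb the cost of optimization.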

If this is right

  • CP-MMD can be maximized directly over continuous kernel parameters instead of being restricted to finite grids.
  • The test controls Type-I error unconditionally at any fixed significance level when the penalty is used.
  • Power is at least as high as existing grid-based or ratio-based MMD procedures across linear, polynomial-feature, and deep-kernel regimes.
  • The same penalty construction extends to any kernel family whose complexity measure can be bounded, including those parameterized by neural networks; a minimal deep-kernel sketch follows below.
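
For the deep-kernel regime specifically, here is a minimal sketch of what grid-free, penalized maximization could look like. The MLP architecture follows the common setup quoted under Figure 1 (d → 200 → 200 → 10, LeakyReLU); the spectral-norm-product penalty with coefficient `c1` is our illustrative stand-in, motivated by the quantity Π(t) tracked in Figures 4–5, and is not the paper's derived penalty.

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Feature map for a deep kernel: a Gaussian kernel is applied to MLP features."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, 200), nn.LeakyReLU(),
            nn.Linear(200, 200), nn.LeakyReLU(),
            nn.Linear(200, 10),
        )
    def forward(self, x):
        return self.net(x)

def mmd2_biased(phi_x, phi_y, sigma=1.0):
    """Biased (V-statistic) MMD^2 on learned features, kept simple for illustration."""
    Z = torch.cat([phi_x, phi_y], dim=0)
    K = torch.exp(-torch.cdist(Z, Z) ** 2 / (2.0 * sigma ** 2))
    n = phi_x.shape[0]
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()

def spectral_norm_product(model):
    """Product of layer spectral norms, the quantity Figures 4-5 track along training."""
    prod = torch.ones(())
    for layer in model.net:
        if isinstance(layer, nn.Linear):
            prod = prod * torch.linalg.matrix_norm(layer.weight, ord=2)
    return prod

def cp_step(model, opt, X, Y, c1=1e-2):
    """One ascent step on an illustrative penalized objective
    J = MMD^2 - c1 * (product of spectral norms); not the paper's exact penalty."""
    opt.zero_grad()
    J = mmd2_biased(model(X), model(Y)) - c1 * spectral_norm_product(model)
    (-J).backward()            # maximize J by descending on -J
    opt.step()
    return float(J)

# Under H0 (X and Y from the same distribution) the penalty is what keeps the
# objective from being driven upward by overfitting the kernel to the sample.
d = 28
X, Y = torch.randn(200, d), torch.randn(200, d)
model = FeatureNet(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):
    J = cp_step(model, opt, X, Y)
```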

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same penalty idea could be carried over to other kernel-based procedures where the kernel is chosen from data, such as kernel-based regression or independence testing.
  • In high-dimensional settings where no single fixed kernel works well, CP-MMD offers a systematic way to let the kernel adapt while retaining a valid p-value.
  • If the concentration inequality can be tightened after optimization, the penalty term itself could be made smaller, increasing power without losing validity.

Load-bearing premise

The two-sample uniform concentration inequality derived in earlier work continues to hold without extra adjustments once the kernel has been chosen from the same data.

What would settle it

A Monte Carlo experiment on identical distributions in which the empirical rejection rate of the CP-MMD test, after grid-free optimization over a rich parametric kernel family, exceeds the nominal alpha level.
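
A sketch of the shape such an experiment could take, under heavy simplification: `optimized_mmd2` below maximizes a plain (unpenalized) biased MMD statistic over a continuous bandwidth, and calibration uses a permutation null that re-runs the selection on every shuffled split. All names and choices here are ours; to run the actual check one would swap in the paper's CP-MMD criterion and its calibration in place of these stand-ins.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import cdist

def optimized_mmd2(X, Y):
    """Grid-free bandwidth selection: maximize a biased (V-statistic) MMD^2 over log-sigma."""
    def neg_stat(log_sigma):
        s2 = np.exp(2.0 * log_sigma)
        k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * s2))
        return -(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())
    return -minimize_scalar(neg_stat, bounds=(-3.0, 3.0), method="bounded").fun

def selection_aware_permutation_test(X, Y, alpha=0.05, n_perm=200, rng=None):
    """Permutation null that repeats the kernel selection on every shuffled split,
    so the rejection decision accounts for the adaptivity of the procedure."""
    rng = rng or np.random.default_rng()
    Z, n = np.vstack([X, Y]), len(X)
    obs = optimized_mmd2(X, Y)
    null = [optimized_mmd2(Z[p[:n]], Z[p[n:]])
            for p in (rng.permutation(len(Z)) for _ in range(n_perm))]
    p_value = (1 + sum(s >= obs for s in null)) / (1 + n_perm)
    return p_value <= alpha

# Monte Carlo under H0 (P = Q): a calibrated procedure should reject in roughly
# alpha = 5% of repetitions; a rate well above that would be the falsifying evidence.
rng = np.random.default_rng(1)
n_rep = 100   # expensive, since the selection is re-run for every permutation
rejections = sum(
    selection_aware_permutation_test(rng.normal(size=(100, 5)),
                                     rng.normal(size=(100, 5)), rng=rng)
    for _ in range(n_rep)
)
print(f"empirical Type-I rate: {rejections / n_rep:.3f}")
```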

Figures

Figures reproduced from arXiv: 2605.06883 by Xiaoming Huo, Yijin Ni.

Figure 1: Class-richness sweep on a fixed signal (±1 SE): as w grows, Liu collapses through variance collapse, Plain overfits, while CP-MMD holds power 1.00. Common setup: unless otherwise stated, all tests use significance level α = 0.05 with a permutation null based on N_perm = 200 permutations; deep-kernel experiments use the same MLP architecture across methods, d → 200 → 200 → 10 with LeakyReLU activations, …
Figure 2: Three-regime head-to-head (dot labels = exact power, shaded bands = …).
Figure 3: (a) Higgs power (d = 28, n ∈ {200, 500}): CP-MMD-polynomial3 matches Median (1.00) and beats MMDAgg (0.98) at n = 500; CP-MMD-deep dominates Liu/Plain on the same MLP (0.92 vs. 0.10/0.70). (b) Deployment per-test cost (d = 28, n = 200; log–log axes): CP-MMD references are B-free (flat); MMDAgg grows linearly; MMD-FUSE near-constant. At B = 10, CP-MMD-polynomial3 (399 ms) is 1.6× faster than MMDAgg (637 ms). CP-MMD …
Figure 4: Training dynamics under H0, five independent runs. (a) J_Liu diverges to 10³–10⁴ despite P = Q. (b) J_CP stays negative. (c) The empirical MMD hovers near zero (collapse). (d) The spectral-norm product along the trajectory, Π(t) := ∏_{j=1}^L ‖W_j^(t)‖₂ (the y-axis), plateaus under CP-MMD while growing unboundedly under plain maximization.
Figure 5: Empirical Π(T) at convergence (blue circles) decreases with Ĉ₁ across seven orders of magnitude (monotone in the productive range Ĉ₁ ∈ [10⁻⁵, 10], with mild floor saturation at Ĉ₁ = 10²). The blue dotted line is the least-squares fit to the productive range [10⁻³, 10⁻¹], with empirical slope ≈ −0.5. The qualitative observation Π(T) < ∞ for every Ĉ₁ > 0 extends across the entire non-zero sweep. …
Original abstract

The Maximum Mean Discrepancy (MMD) is a cornerstone statistic for nonparametric two-sample testing, but its test power is dictated entirely by the chosen kernel. Because any fixed kernel inherently fails to distinguish certain distributions, the kernel must be dynamically optimized. However, data-driven optimization violates the foundational i.i.d. assumption, forcing a strict trade-off in existing frameworks. Ratio criteria ignore this dependence, inducing overfitting and variance collapse on rich kernel classes. Conversely, aggregation methods bypass the dependence using finite grids, but this strategy cannot scale to continuous search spaces like deep kernels. To break this dichotomy, we establish data-driven kernel selection as a model selection problem. We propose Complexity-Penalized MMD (CP-MMD), a criterion derived by applying the two-sample uniform concentration inequality of preceding works to the post-optimization MMD problem. The resulting penalty bounds the empirical MMD by the complexity of the kernel search space, mathematically absorbing the cost of optimization, so that CP-MMD enables direct, grid-free maximization over continuous parametric classes, including scalar bandwidths, polynomial feature bandwidths, and deep network parameters. By formally accounting for optimization complexity, we prove that CP-MMD maximizes true test power while ensuring unconditional Type-I validity. Consequently, CP-MMD enables grid-free kernel selection across linear, polynomial-feature, and deep regimes, matching or exceeding state-of-the-art test power.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Complexity-Penalized MMD (CP-MMD) for data-driven kernel selection in nonparametric two-sample testing. It treats kernel optimization as a model selection problem and derives a penalty by applying prior two-sample uniform concentration inequalities directly to the post-optimization MMD statistic. This penalty is claimed to bound the empirical MMD by the complexity of the kernel class, enabling grid-free maximization over continuous parametric families (scalar bandwidths, polynomial features, deep network parameters) while proving unconditional Type-I validity and maximization of true test power.

Significance. If the central derivation holds, the work would offer a unified, scalable framework for kernel selection in MMD tests that avoids both the overfitting of ratio-based criteria and the grid restrictions of aggregation methods. It would enable principled optimization over rich continuous classes including deep kernels, with explicit accounting for optimization complexity. The claimed proofs of validity and power maximization constitute a substantive technical contribution if rigorously established.

major comments (1)
  1. [Abstract] Abstract (and the derivation of CP-MMD): the unconditional Type-I validity claim rests on applying the two-sample uniform concentration inequality of prior work directly to the MMD after data-driven kernel optimization. Because optimization over continuous classes (including deep parameters) introduces dependence between the selected kernel and the test samples, the standard fixed-kernel bound does not automatically transfer; an explicit correction (e.g., chaining argument or union bound over the optimization path) is required to prevent inflation of the bound. This step is load-bearing for the validity guarantee and must be detailed to support the central claim.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'mathematically absorbing the cost of optimization' is informal; a precise statement of how the penalty is obtained from the concentration inequality would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the insightful comments on our manuscript proposing CP-MMD for kernel selection in MMD tests. We address the major comment on the Type-I validity derivation below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and the derivation of CP-MMD): the unconditional Type-I validity claim rests on applying the two-sample uniform concentration inequality of prior work directly to the MMD after data-driven kernel optimization. Because optimization over continuous classes (including deep parameters) introduces dependence between the selected kernel and the test samples, the standard fixed-kernel bound does not automatically transfer; an explicit correction (e.g., chaining argument or union bound over the optimization path) is required to prevent inflation of the bound. This step is load-bearing for the validity guarantee and must be detailed to support the central claim.

    Authors: The referee correctly identifies the importance of handling data dependence in the validity proof. Our approach applies the uniform concentration inequality from prior work directly to the supremum over the kernel class. Because this inequality provides a simultaneous bound over all kernels in the class, it holds for the data-driven optimized kernel as well, without the need for further corrections. The complexity penalty is precisely the term arising from this uniform bound, which absorbs the optimization cost. We will revise the manuscript to explicitly state this reasoning in the abstract and derivation, including a brief explanation of why uniformity ensures validity post-optimization. revision: yes
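
A schematic way to state the uniformity argument the authors invoke, in our notation (pen_n is the class-level bound from the concentration inequality):

```latex
\Bigl\{ \bigl| \widehat{\mathrm{MMD}}^2(\hat{k}) - \mathrm{MMD}^2(\hat{k}) \bigr|
        > \mathrm{pen}_n(\mathcal{K}, \delta) \Bigr\}
\;\subseteq\;
\Bigl\{ \sup_{k \in \mathcal{K}}
        \bigl| \widehat{\mathrm{MMD}}^2(k) - \mathrm{MMD}^2(k) \bigr|
        > \mathrm{pen}_n(\mathcal{K}, \delta) \Bigr\}
\quad \text{for any data-dependent } \hat{k} \in \mathcal{K},
```

so a bound that holds simultaneously over the class transfers to the selected kernel without a separate post-selection correction. On this reading, the referee's concern concentrates on pen_n(K, δ) itself: for continuous, deep-parameterized classes, the covering-number or chaining constants that make it finite and sharp are exactly where the details matter.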

Circularity Check

0 steps flagged

No circularity: CP-MMD penalty derived from external concentration inequality applied to optimized statistic

full rationale

The derivation applies a two-sample uniform concentration inequality from preceding works directly to the post-optimization MMD to obtain the complexity penalty in CP-MMD. This step imports an independent bound rather than defining the penalty via the target power or validity quantities, so the subsequent claims of power maximization and unconditional Type-I control follow from the imported inequality without reducing to a self-definition or fitted input by construction. No self-citation chains, ansatz smuggling, or renaming of known results appear in the load-bearing steps; the approach remains self-contained against the cited external bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or new entities are mentioned; the method builds on an existing inequality as the key assumption.

axioms (1)
  • domain assumption The two-sample uniform concentration inequality from preceding works can be applied to the MMD statistic after kernel optimization.
    This is explicitly used to derive the complexity penalty for the post-optimization problem.

pith-pipeline@v0.9.0 · 5553 in / 1284 out tokens · 53203 ms · 2026-05-11T00:58:10.697655+00:00 · methodology



Reference graph

Works this paper leans on

22 extracted references · 1 canonical work page

  1. [1]

    Searching for exotic particles in high-energy physics with deep learning

Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5: 4308, 2014

  2. [2]

Local Rademacher complexities

Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4): 1497--1537, 2005

  3. [3]

    Spectrally-normalized margin bounds for neural networks

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017

  4. [4]

    MMD-Fuse : Learning and combining kernels for two-sample testing without data splitting

    Felix Biggs, Antonin Schrab, and Arthur Gretton. MMD-Fuse : Learning and combining kernels for two-sample testing without data splitting. In Advances in Neural Information Processing Systems, volume 36, 2023

  5. [5]

    Concentration Inequalities: A Nonasymptotic Theory of Independence

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013

  6. [6]

Statistical Inference

    G. Casella and R.L. Berger. Statistical Inference. Duxbury advanced series. Duxbury Press, 2nd edition, 2002. ISBN 9780534243128

  7. [7]

    A kernel test of goodness of fit

    Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pages 2606--2615, 2016

  8. [8]

    A plant-wide industrial process control problem

James J Downs and Ernest F Vogel. A plant-wide industrial process control problem. Computers & Chemical Engineering, 17(3): 245--255, 1993

  9. [9]

    A kernel statistical test of independence

Arthur Gretton, Kenji Fukumizu, Choon H Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, volume 20, 2007

  10. [10]

    A kernel two-sample test

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13: 723--773, 2012a

  11. [11]

    Optimal kernel choice for large-scale two-sample tests

Arthur Gretton, Bharath K Sriperumbudur, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, and Kenji Fukumizu. Optimal kernel choice for large-scale two-sample tests. Advances in Neural Information Processing Systems, 25, 2012b

  12. [12]

    Testing Statistical Hypotheses

    Erich L Lehmann and Joseph P Romano. Testing Statistical Hypotheses. Springer, 3rd edition, 2005

  13. [13]

    Learning deep kernels for non-parametric two-sample tests

    Feng Liu, Wenkai Xu, Jie Lu, Guangquan Zhang, Arthur Gretton, and Danica J Sutherland. Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning, pages 6316--6326. PMLR, 2020

  14. [14]

A kernelized Stein discrepancy for goodness-of-fit tests

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276--284, 2016

  15. [15]

    Uniform concentration and symmetrization for weak interactions

    Andreas Maurer and Massimiliano Pontil. Uniform concentration and symmetrization for weak interactions. In Conference on Learning Theory, pages 2372--2387. PMLR, 2019

  16. [16]

    On the problem of the most efficient tests of statistical hypotheses

Jerzy Neyman and Egon S Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231: 289--337, 1933

  17. [17]

A uniform concentration inequality for kernel-based two-sample statistics

    Yijin Ni and Xiaoming Huo. A uniform concentration inequality for kernel-based two-sample statistics, 2024. URL https://arxiv.org/abs/2405.14051

  18. [18]

    On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions

Aaditya Ramdas, Sashank J Reddi, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI Conference on Artificial Intelligence, volume 29, pages 3571--3577, 2015

  19. [19]

    MMD aggregated two-sample test

Antonin Schrab, Ilmun Kim, Michèle Albert, Béatrice Laurent, Benjamin Guedj, and Arthur Gretton. MMD aggregated two-sample test. Journal of Machine Learning Research, 24(194): 1--81, 2023

  20. [20]

    Hilbert space embeddings and metrics on probability measures

Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11: 1517--1561, 2010

  21. [21]

    Generative models and model criticism via optimized maximum mean discrepancy

    Dougal J Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alexander Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations, 2017

  22. [22]

    High-Dimensional Statistics: A Non-Asymptotic Viewpoint

    Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019