pith. machine review for the scientific record.

arxiv: 2604.19560 · v1 · submitted 2026-04-21 · 💻 cs.LG · math.OC · stat.ML

Recognition: unknown

Separating Geometry from Probability in the Analysis of Generalization

Benjamin Recht, Maxim Raginsky

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:30 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords generalization bounds · sensitivity analysis · optimization · deterministic bounds · variational principles · machine learning theory · data perturbations

The pith

Generalization bounds can be derived deterministically by analyzing how optimization solutions change under data perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues for analyzing generalization in machine learning without relying on unverifiable i.i.d. assumptions about the data. Instead, it applies sensitivity analysis of optimization problems with respect to perturbations in their data, obtaining deterministic variational principles. These principles relate in-sample performance to out-of-sample performance through an error term that measures how geometrically close the two datasets are. Statistical assumptions can be applied afterward to characterize when this error term remains small. A sympathetic reader would care because this separates the deterministic, geometry-based part of generalization from the probabilistic modeling.

Core claim

By framing generalization as sensitivity analysis of optimization solutions to perturbations in the training data, the authors derive deterministic bounds in the form of variational principles. These principles connect in-sample and out-of-sample evaluations via an error term that quantifies the closeness between out-of-sample data and in-sample data. Probabilistic assumptions are then introduced only ex post to bound or describe the situations in which the error term is small on average or with high probability.
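The abstract gives no explicit formulas, so the deterministic shape of such a principle can only be sketched under added assumptions. The snippet below fixes a model w, uses the absolute-error loss (which is Lipschitz in the data point, with a constant depending on ||w||), and checks that the in-/out-of-sample evaluation gap is bounded by a matched-pair proximity term, with no probabilistic assumption anywhere. The loss and the proximity measure are illustrative stand-ins, not the paper's construction.

```python
import numpy as np

# Hypothetical setup: absolute-error loss l(w, (x, y)) = |<w, x> - y|,
# which is Lipschitz in the data point (x, y) with constant L = max(||w||, 1).
rng = np.random.default_rng(0)
n, d = 200, 5
w = rng.normal(size=d)                            # a fixed model, not trained here
X = rng.normal(size=(n, d)); y = rng.normal(size=n)       # in-sample data S
Xp = X + 0.01 * rng.normal(size=(n, d))                   # out-of-sample data S'
yp = y + 0.01 * rng.normal(size=n)

risk = lambda X_, y_: np.abs(X_ @ w - y_).mean()
gap = abs(risk(Xp, yp) - risk(X, y))              # |R_{S'}(w) - R_S(w)|

# Proximity term: average matched-pair distance between S and S', scaled by L.
L = max(np.linalg.norm(w), 1.0)
proximity = L * (np.linalg.norm(Xp - X, axis=1) + np.abs(yp - y)).mean()

assert gap <= proximity                           # holds pointwise; no probability used
```

The inequality holds for every pair of datasets, which is exactly the "deterministic first, probability ex post" ordering: a statistical model would enter only when asking how small `proximity` tends to be.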

What carries the argument

sensitivity analysis of solutions of optimization problems to perturbations in the problem data

If this is right

  • Generalization bounds hold by deterministic means without any probabilistic assumptions on the data.
  • The gap between in-sample and out-of-sample performance is controlled by a term that quantifies closeness of the two datasets.
  • Statistical tools can be applied after deriving the bound to show when the closeness term is small either on average or with high probability.
  • The approach applies to general optimization-based learning schemes rather than being limited to specific models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could extend to domains like robust optimization or control where data is not randomly sampled.
  • Algorithms might be designed to explicitly minimize the data-closeness error term during training.
  • Empirical tests could compute the derived error term on real datasets and check its correlation with observed generalization gaps.
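The last extension above can be sketched directly: vary a perturbation scale, record both the observed gap of a trained ridge model and a simple proximity term, and check that the two move together. Every quantity here (the ridge objective, the proximity measure) is an illustrative stand-in, since the abstract specifies neither.

```python
import numpy as np

# Does a dataset-proximity term track the observed in-/out-of-sample gap?
rng = np.random.default_rng(1)
n, d = 100, 5
gaps, proximities = [], []
for eps in np.linspace(0.0, 0.5, 20):             # perturbation scales
    X = rng.normal(size=(n, d)); y = X @ np.ones(d) + rng.normal(size=n)
    Xp = X + eps * rng.normal(size=(n, d)); yp = y + eps * rng.normal(size=n)
    w = np.linalg.solve(X.T @ X + n * 0.1 * np.eye(d), X.T @ y)  # ridge ERM on S
    risk = lambda A, b: np.mean((A @ w - b) ** 2)
    gaps.append(abs(risk(Xp, yp) - risk(X, y)))   # observed generalization gap
    proximities.append(np.mean(np.linalg.norm(Xp - X, axis=1) + np.abs(yp - y)))

r = np.corrcoef(gaps, proximities)[0, 1]          # expected to be strongly positive
```

A real test would use the paper's own error term in place of the ad hoc proximity measure; the point is only that both sides of the claimed variational principle are cheaply computable on data.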

Load-bearing premise

Sensitivity analysis of optimization problems to data perturbations can be carried out for typical machine-learning objectives in a way that produces nontrivial variational principles relating in-sample and out-of-sample performance.

What would settle it

A counterexample: a standard machine-learning objective, such as empirical risk minimization, for which sensitivity analysis to data changes fails to produce a variational principle bounding the performance gap by a term that measures dataset proximity.

read the original abstract

The goal of machine learning is to find models that minimize prediction error on data that has not yet been seen. Its operational paradigm assumes access to a dataset $S$ and articulates a scheme for evaluating how well a given model performs on an arbitrary sample. The sample can be $S$ (in which case we speak of "in-sample" performance) or some entirely new $S'$ (in which case we speak of "out-of-sample" performance). Traditional analysis of generalization assumes that both in- and out-of-sample data are i.i.d. draws from an infinite population. However, these probabilistic assumptions cannot be verified even in principle. This paper presents an alternative view of generalization through the lens of sensitivity analysis of solutions of optimization problems to perturbations in the problem data. Under this framework, generalization bounds are obtained by purely deterministic means and take the form of variational principles that relate in-sample and out-of-sample evaluations through an error term that quantifies how close out-of-sample data are to in-sample data. Statistical assumptions can then be used ex post to characterize the situations when this error term is small (either on average or with high probability).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an alternative framework for generalization analysis in machine learning that replaces unverifiable i.i.d. assumptions with deterministic sensitivity analysis of empirical risk minimization problems to perturbations in the data. This yields variational principles relating in-sample and out-of-sample performance via an error term that quantifies proximity between the datasets, after which probabilistic statements are applied only ex post to characterize when the error term is small.

Significance. If the claimed variational principles can be derived nontrivially for standard machine-learning objectives, the separation of deterministic geometry from probability would be a significant conceptual advance, as it directly addresses the unverifiability of i.i.d. assumptions while still allowing statistical characterization of the error term. The manuscript deserves credit for outlining a logically coherent program that, at the level of the stated claim, avoids the common circularity of fitting bounds to the data they purport to predict.

major comments (2)
  1. Abstract: The abstract describes the intended framework but supplies no derivations, examples, or error-term characterizations, so it is impossible to verify whether the claimed variational principles actually hold or are nontrivial.
  2. The weakest assumption (sensitivity analysis of optimization problems to data perturbations producing nontrivial variational principles for typical ML objectives) is not demonstrated; without explicit construction of the sensitivity measure or the resulting relation between in- and out-of-sample risks, the central claim cannot be assessed for load-bearing content.
minor comments (1)
  1. The manuscript would benefit from at least one worked example (e.g., for linear regression or a simple convex objective) showing the explicit form of the variational principle and the proximity error term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive summary and for identifying the need for greater explicitness in the presentation of our framework. We agree that the manuscript would be strengthened by adding concrete derivations and examples, and we will incorporate these changes in the revised version.

read point-by-point responses
  1. Referee: Abstract: The abstract describes the intended framework but supplies no derivations, examples, or error-term characterizations, so it is impossible to verify whether the claimed variational principles actually hold or are nontrivial.

    Authors: We agree that the abstract remains at a high level of description. In the revision we will expand it to include a brief statement of the main variational principle relating in-sample and out-of-sample risk through the sensitivity-based error term, together with a short characterization of that term for the empirical risk minimization problem. This will allow readers to see the nontrivial content immediately while preserving the abstract's brevity. revision: yes

  2. Referee: The weakest assumption (sensitivity analysis of optimization problems to data perturbations producing nontrivial variational principles for typical ML objectives) is not demonstrated; without explicit construction of the sensitivity measure or the resulting relation between in- and out-of-sample risks, the central claim cannot be assessed for load-bearing content.

    Authors: The manuscript presents the general deterministic sensitivity framework and its separation from subsequent probabilistic statements, but does not contain explicit constructions for standard objectives. We will add a new section that constructs the sensitivity measure explicitly for the empirical risk minimization problem with squared loss, deriving the Lipschitz constant of the solution map with respect to data perturbations and the resulting variational relation between in-sample and out-of-sample risks. This will demonstrate the nontriviality of the principle for a canonical machine-learning objective. revision: yes
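The worked example promised in this response has a standard closed form that can be checked numerically. For ridge-regularized least squares, the solution map from labels to weights is linear, so its Lipschitz constant with respect to label perturbations is an operator norm. A minimal sketch, assuming this is the kind of construction intended (the paper's actual sensitivity measure may differ):

```python
import numpy as np

# Ridge solution map: y -> w(y) = (X^T X + n*lam*I)^{-1} X^T y is linear in y,
# so its Lipschitz constant w.r.t. label perturbations is the matrix 2-norm of
# M = (X^T X + n*lam*I)^{-1} X^T (largest singular value).
rng = np.random.default_rng(2)
n, d, lam = 50, 4, 0.1
X = rng.normal(size=(n, d))
M = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T)
lipschitz = np.linalg.norm(M, 2)                  # operator norm of the solution map

y = rng.normal(size=n)
yp = y + 0.05 * rng.normal(size=n)                # perturbed labels ("out-of-sample" S')
dw = np.linalg.norm(M @ yp - M @ y)               # ||w(S') - w(S)||

assert dw <= lipschitz * np.linalg.norm(yp - y) + 1e-12   # deterministic sensitivity bound
```

This is the deterministic half of the program: the bound on the solution change holds for every perturbation, and only the size of ||yp - y|| invites statistical treatment.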

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's core program derives generalization relations deterministically from sensitivity analysis of optimization problems to data perturbations, producing variational principles that bound the difference between in-sample and out-of-sample risk by a proximity error term. Probabilistic assumptions are applied only afterward to characterize the size of that term. This ordering ensures the central relations are obtained independently of the statistical conclusions they later support, with no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the result to its own inputs by construction. The approach is self-contained against external benchmarks once the deterministic sensitivity analysis is granted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that optimization problems arising in machine learning admit quantifiable sensitivity to data perturbations that can be expressed variationally; no free parameters or invented entities are visible from the abstract.

axioms (1)
  • domain assumption: Solutions of typical machine-learning optimization problems are sensitive to perturbations in the training data in a manner that can be captured by variational principles relating in-sample and out-of-sample evaluations.
    This is the central premise that allows the deterministic step to precede any statistical analysis.

pith-pipeline@v0.9.0 · 5503 in / 1218 out tokens · 45185 ms · 2026-05-10T02:30:35.159880+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 3 canonical work pages

  1. [1]

    Perturbation Analysis of Optimization Problems

    J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer New York, 2000

  2. [2]

    Stability and generalization

    Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499--526, 2002

  3. [3]

    PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

    Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, 2007

  4. [4]

    Prediction, Learning, and Games

    Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006

  5. [5]

    On the generalization ability of on-line learning algorithms

    Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050--2057, 2004

  6. [6]

    Implicit Functions and Solution Mappings: A View from Variational Analysis

    Asen L. Dontchev and R. Tyrrell Rockafellar. Implicit Functions and Solution Mappings: A View from Variational Analysis. Springer New York, 2009

  7. [7]

    A general lower bound on the number of examples needed for learning

    Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247--261, 1989

  8. [8]

    Optimal approximation and error bounds

    Michael Golomb and Hans F. Weinberger. Optimal approximation and error bounds. In Rudolph E. Langer, editor, On Numerical Approximation, pages 117--190. University of Wisconsin Press, 1959

  9. [9]

    Patterns, predictions, and actions: Foundations of machine learning

    Moritz Hardt and Benjamin Recht. Patterns, predictions, and actions: Foundations of machine learning. Princeton University Press, 2022

  10. [10]

    Asymptotics for minimisers of convex processes

    Nils Lid Hjort and David Pollard. Asymptotics for minimisers of convex processes, May 1993. URL https://arxiv.org/abs/1107.3806

  11. [11]

    Local Rademacher complexities and oracle inequalities in risk minimization

    Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593--2696, 2006

  12. [12]

    Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008

    Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008. Springer Berlin Heidelberg, 2011

  13. [13]

    All of linear regression

    Arun K. Kuchibhotla, Lawrence D. Brown, Andreas Buja, and Junhui Cai. All of linear regression, October 2019. URL https://arxiv.org/abs/1910.06386

  14. [14]

    Interpolating classifiers make few mistakes

    Tengyuan Liang and Benjamin Recht. Interpolating classifiers make few mistakes. Journal of Machine Learning Research, 24(20):1--27, 2023

  15. [15]

    Optimization by Vector Space Methods

    David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, Inc., 1969

  16. [16]

    C. A. Micchelli and T. J. Rivlin. A survey of optimal recovery. In C. A. Micchelli and T. J. Rivlin, editors, Optimal Estimation in Approximation Theory, pages 1--54. Springer, Boston, MA, 1977

  17. [17]

    Basic inequalities for first-order optimization with applications to statistical risk analysis

    Seunghoon Paik, Kangjie Zhou, Matus Telgarsky, and Ryan J. Tibshirani. Basic inequalities for first-order optimization with applications to statistical risk analysis. December 2025. URL https://arxiv.org/abs/2512.24999

  18. [18]

    Private communication

    Philippe Rigollet, 2021. Private communication

  19. [19]

    Optimal approximation

    Arthur Sard. Optimal approximation. Journal of Functional Analysis, 1(2):222--244, 1967

  20. [20]

    Approximation based on nonscalar observations

    Arthur Sard. Approximation based on nonscalar observations. Journal of Approximation Theory, 8(4):315--334, 1973

  21. [21]

    Learnability, stability and uniform convergence

    Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635--2670, 2010

  22. [22]

    Perturbation analysis of optimization problems in Banach spaces

    Alexander Shapiro. Perturbation analysis of optimization problems in Banach spaces. Numerical Functional Analysis and Optimization, 13(1--2):97--116, January 1992

  23. [23]

    Quantitative stability in stochastic programming

    Alexander Shapiro. Quantitative stability in stochastic programming. Mathematical Programming, 67(1):99--108, 1994

  24. [24]

    Universality, characteristic kernels and RKHS embedding of measures

    Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R.G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389--2410, 2011

  25. [25]

    J. F. Traub, G. W. Wasilkowski, and H. Woźniakowski. Information-Based Complexity. Academic Press, 1988

  26. [26]

    Empirical Processes in M-Estimation

    Sara van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000

  27. [27]

    Theory of pattern recognition

    Vladimir Vapnik and Alexey Chervonenkis. Theory of pattern recognition. Nauka, Moscow, 1974

  28. [28]

    Representer theorems in Banach spaces: Minimum norm interpolation, regularized learning and semi-discrete inverse problems

    Rui Wang and Yuesheng Xu. Representer theorems in Banach spaces: Minimum norm interpolation, regularized learning and semi-discrete inverse problems. Journal of Machine Learning Research, 22(225):1--65, 2021

  29. [29]

    Scattered Data Approximation

    Holger Wendland. Scattered Data Approximation. Cambridge University Press, 2005