
arxiv: 2605.15340 · v1 · pith:QI6J2BFV · submitted 2026-05-14 · 💻 cs.LG

Bounded-Rationality, Hedging, and Generalization

Pith reviewed 2026-05-19 15:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords generalization · bounded rationality · hedging · f-divergence · response law · information geometry · regularization · machine learning

The pith

Generalization is a testable hedging property of a learner's response law to training samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models learning as a bounded-rational decision process in which the learner chooses how strongly the training sample influences its outputs. The response law, or channel from samples to outputs, sets which changes are cheap or costly and thereby induces a lower curve trading training loss against sample dependence and a matching upper curve that certifies generalization. When the response law comes from an f-divergence regularizer, both curves belong to the geometry native to that regularizer, and they can be recovered simply by watching how the learner reacts to rescaled losses and small perturbations. This turns the usual generalization gap into a direct, observable property of the learner rather than an abstract statement about data or model class.

Core claim

The learner's population loss equals its empirical loss plus the distortion caused by the particular training sample. The amount of distortion the learner can tolerate is given by a hedge that is recoverable from its black-box responses. When the response law is induced by an f-divergence regularizer, this hedge and the associated tradeoff curves live inside the regularizer's own information geometry, with the familiar KL case recovering the usual mutual-information bounds.
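The decomposition in this claim can be written out as a short worked identity. The notation is assumed here for illustration and may not match the paper's symbols: $S$ is the training sample, $A$ the learner's output, $\hat{L}$ and $L$ the empirical and population losses, $\Delta$ the sample-induced distortion, and $H$ the recovered hedge.

```latex
\begin{align*}
  L(A) &= \hat{L}(A, S) + \Delta(A, S),
  \qquad \Delta(A, S) := L(A) - \hat{L}(A, S), \\
  \Delta(A, S) \le H \;&\Longrightarrow\; L(A) \le \hat{L}(A, S) + H .
\end{align*}
```

The first line holds by definition of $\Delta$; the substance of the claim lies in whether a hedge $H$ recovered from behavior actually covers $\Delta$.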

What carries the argument

The induced channel from samples to outputs. When shaped by an f-divergence regularizer, it supplies both the lower loss-dependence tradeoff curve and the upper generalization certificate curve.
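To make "sample dependence measured in an f-divergence geometry" concrete, here is a minimal Python sketch that computes an f-information for a small categorical channel under the three regularizers named in the figures. It assumes the Csiszár-style definition I_f(S; A) = Σ_s P(s) Σ_a P(a) f(P(a|s)/P(a)), taken against the channel's own marginal, and common textbook normalizations of f. The paper's exact definition and constants may differ, and the function and variable names are hypothetical.

```python
import numpy as np

# Generator functions f with f(1) = 0. Normalizations vary across the
# literature; these are common textbook choices, not necessarily the paper's.
F_GENERATORS = {
    "kl": lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0),
    "pearson_chi2": lambda t: (t - 1.0) ** 2,
    "sq_hellinger": lambda t: (np.sqrt(t) - 1.0) ** 2,
}

def f_information(p_s, channel, name):
    """I_f(S; A) = sum_s P(s) sum_a P(a) f(P(a|s) / P(a)).

    p_s:     shape (n_s,), distribution over samples
    channel: shape (n_s, n_a), rows are P(a | s)
    Assumes every action has positive marginal probability.
    """
    p_s = np.asarray(p_s, dtype=float)
    channel = np.asarray(channel, dtype=float)
    p_a = p_s @ channel                  # marginal P(a)
    ratio = channel / p_a                # P(a|s) / P(a), broadcast over rows
    f = F_GENERATORS[name]
    return float(np.sum(p_s[:, None] * p_a[None, :] * f(ratio)))

p_s = np.array([0.5, 0.5])
independent = np.array([[0.3, 0.7], [0.3, 0.7]])  # output ignores the sample
dependent = np.array([[0.9, 0.1], [0.2, 0.8]])    # output tracks the sample

for name in F_GENERATORS:
    print(name,
          f_information(p_s, independent, name),
          f_information(p_s, dependent, name))
```

With the KL generator this quantity reduces to Shannon mutual information, and for a channel that ignores the sample all three measures are zero. The same channel receives a different "information" value under each generator, which is the sense in which each regularizer carries its own geometry.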

If this is right

  • If the recovered hedge exceeds the distortion observed from the training sample, the learner generalizes.
  • Different choices of f-divergence produce different geometries and therefore different certificates for the same observed behavior.
  • The lower and upper curves can be extracted without access to the learner's internal parameters.
  • KL regularization recovers the standard information-theoretic generalization bounds as a special case.
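For orientation on the last point, the best-known bound of this kind is the mutual-information bound of Xu and Raginsky (2017). In its usual form it assumes a loss that is σ-sub-Gaussian under the data distribution for every output and a sample of n i.i.d. points. The notation ($\hat{L}$, $L$, $S$, $A$) is assumed here for illustration, and how exactly the paper recovers this bound from the KL case is not shown on this page.

```latex
\left| \mathbb{E}\!\left[ L(A) - \hat{L}(A, S) \right] \right|
\;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S; A)} .
```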

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framing suggests that generalization performance could be improved by explicitly designing the response law to have favorable hedging properties.
  • It opens the possibility of transferring hedging certificates across different training regimes by matching the observed response curves.
  • Practical verification would require efficient methods to query a deployed model with modified loss functions.

Load-bearing premise

The learner's mapping from training samples to outputs can be represented by an f-divergence regularizer whose geometry yields recoverable hedging curves.

What would settle it

Extract the hedge by querying the learner on scaled training losses and local perturbations; if on new data the actual population-minus-empirical loss exceeds this hedge, the claim is false.
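The final comparison in this test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hedge is taken as an already-recovered number (the recovery procedure itself is not reproduced here), the population loss is approximated by a mean over fresh data, and the function name and all values are hypothetical.

```python
import numpy as np

def certificate_check(train_losses, fresh_losses, hedge):
    """Compare the realized gap on fresh data with a recovered hedge."""
    empirical = float(np.mean(train_losses))
    # A held-out mean only estimates the population loss.
    population_estimate = float(np.mean(fresh_losses))
    gap = population_estimate - empirical
    return {"gap": gap, "hedge": hedge, "covered": gap <= hedge}

# Hypothetical numbers, for illustration only.
result = certificate_check(
    train_losses=[0.10, 0.12, 0.08],
    fresh_losses=[0.20, 0.25, 0.15],
    hedge=0.15,
)
print(result)
```

Because the fresh-data mean carries its own sampling error, a single violation by a small margin would not by itself be decisive; a careful test would account for that error.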

Figures

Figures reproduced from arXiv: 2605.15340 by Pedro A. Ortega.

Figure 1: Recovering a learner’s native frontier and hedge from behavior. Three learners solve the same categorical task, each governed by a different innate regularizer: A: KL, B: Pearson χ², and C: squared Hellinger. For each learner, the solid black curve is the loss frontier: the best attainable loss at each level of native information use. The dashed black curve is the certificate frontier: the protected loss…
Figure 2: Geometry of bounded-rational acting under three regularizers. A–C: probability simplex over three actions, showing the prior action distribution P(a) together with the loss and regularizer geometries for KL, Pearson χ², and squared Hellinger. Gray dashed lines are level sets of the loss; black solid curves are level sets of the corresponding divergence. D–F: the same three simplices after combining the l…
Figure 3: Bounded-rational acting in a finite categorical model. A: A categorical choice task defined by a 4 × 4 loss matrix ℓ(s, a). B: The induced marginal action distribution P(a) for KL, Pearson χ², and squared Hellinger regularization at a matched information level, I(S; A) ≈ 0.5. C: The KL channel P(a | s) at various operating points; increasing the operating level concentrates mass more strongly on low-loss…
Figure 4: Adversarial duality and indifference. We continue with the problem from …
Figure 5.
Figure 6: Hedging against stronger perturbations. The task from …
Figure 7: Native frontier geometry and coordinate projections. A: Schematic native geometry for a fixed regularizer f. The lower curve gives the attainable-loss frontier in the (I_f(S; A), L) plane; channels lie on or above this curve. The upper curve is the matched hedge-certificate frontier. The red segment marks the certificate gap at one operating point. B: Categorical toy problem with the KL, Pearson χ², and …
Figure 8: Estimating lower and certificate curves for black-box learners. A: Regression task, with the mean function, a ±2σ noise band, and one sampled training set. B: Two-layer ReLU network with one hidden layer of 32 units, trained on squared loss with training time as the operating control. The solid and dashed curves are the estimated lower (loss) and certificate curves, respectively, both shown in the learner’…
Figure 9: Additional f-divergence geometries. Each panel shows the bounded-rational objective on the three-action simplex for one of the regularizers in …
Original abstract

A learner does not only fit data; it also determines how strongly the training sample may shape its output and how much distortion it can hedge. We study this relation as a bounded-rational decision problem whose primitive object is the induced channel from samples to outputs. The learner's response law determines which changes in this channel are cheap or costly, and therefore induces both a lower tradeoff curve between training loss and sample dependence and a matched upper certificate curve. When the response law is represented by an $f$-divergence regularizer, these curves live in the regularizer's native information geometry, with KL as the special case corresponding to Shannon mutual information. We show how the hedge and the two curves can be recovered from black-box behavior by observing responses to scaled losses and local loss perturbations. In learning, population loss is empirical loss plus the distortion induced by the particular training sample. The recovered hedge gives a practical certificate when it covers that distortion. Thus generalization is treated as a testable hedging property of the learner's own response law.
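For the KL special case, the lower tradeoff curve described in the abstract can be traced numerically with a standard Blahut-Arimoto-style iteration. The Python sketch below is that textbook technique rather than the paper's own procedure, and it rests on several assumptions: the "operating level" is identified with an inverse temperature β that scales the loss, the certificate value uses the expression L + (1/β) I_f(S; A) quoted in the Lean-connections section of this page, and the loss matrix, sample distribution, and function names are hypothetical.

```python
import numpy as np

def kl_channel(p_s, loss, beta, n_iter=500):
    """Blahut-Arimoto-style fixed point for the KL-regularized channel.

    Alternates P(a|s) proportional to P(a) * exp(-beta * loss[s, a])
    with the marginal update P(a) = sum_s P(s) P(a|s).
    """
    p_s = np.asarray(p_s, dtype=float)
    loss = np.asarray(loss, dtype=float)
    n_a = loss.shape[1]
    p_a = np.full(n_a, 1.0 / n_a)
    for _ in range(n_iter):
        logits = np.log(np.maximum(p_a, 1e-300))[None, :] - beta * loss
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        channel = np.exp(logits)
        channel /= channel.sum(axis=1, keepdims=True)
        p_a = p_s @ channel                          # marginal of this channel
    return channel, p_a

def operating_point(p_s, loss, beta):
    """Return (I(S;A), expected loss, L + I/beta) at inverse temperature beta."""
    p_s = np.asarray(p_s, dtype=float)
    loss = np.asarray(loss, dtype=float)
    channel, p_a = kl_channel(p_s, loss, beta)
    joint = p_s[:, None] * channel
    expected_loss = float(np.sum(joint * loss))
    mask = joint > 0
    info = float(np.sum(joint[mask] * np.log((channel / p_a)[mask])))
    certificate = expected_loss + info / beta
    return info, expected_loss, certificate

# Hypothetical 3x3 task, for illustration only.
p_s = np.array([0.5, 0.3, 0.2])
loss = np.array([[0.0, 1.0, 0.6],
                 [0.8, 0.0, 0.5],
                 [1.0, 0.7, 0.0]])

for beta in [0.5, 1.0, 2.0, 4.0, 8.0]:
    info, exp_loss, cert = operating_point(p_s, loss, beta)
    print(f"beta={beta:4.1f}  I={info:.4f}  L={exp_loss:.4f}  L+I/beta={cert:.4f}")
```

Sweeping β yields (I, L) pairs along the lower curve and a matching value above it at each operating point. For other f-divergences the channel update has a different form, which this sketch does not attempt to cover.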

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper frames generalization in learning as a bounded-rationality hedging problem. The learner's response law, when represented by an f-divergence regularizer, induces a lower tradeoff curve between training loss and sample dependence, and a matched upper certificate curve in the regularizer's information geometry. These quantities, including the hedge, can be recovered from black-box observations of the learner's responses to scaled losses and local perturbations. The recovered hedge serves as a certificate for the distortion induced by the training sample in the population loss, treating generalization as a testable property of the response law. KL divergence corresponds to Shannon mutual information as a special case.

Significance. If the central claims hold, particularly the unique recoverability of the hedge and curves from black-box behavior and their ability to certify generalization, this work could offer a novel decision-theoretic perspective on generalization that unifies information geometry with learning theory. It provides a way to test hedging properties directly from observable learner behavior rather than relying on traditional complexity measures.

major comments (2)
  1. The recoverability of the hedge and curves from responses to scaled losses and local perturbations is central to the practical certificate claim, but the abstract provides no explicit inversion map or identifiability conditions. If different f-divergences produce observationally equivalent responses under the tested scalings and perturbations, the recovered regularizer and its geometry would be ambiguous, undermining the certificate for sample-induced distortion.
  2. The statement that 'the recovered hedge gives a practical certificate when it covers that distortion' requires a demonstration that the upper curve indeed upper-bounds the population loss minus empirical loss due to the sample; without this derivation or a concrete example, it is unclear whether the certificate is non-vacuous or independent of the choice of regularizer.
minor comments (2)
  1. The notation for the 'response law' and 'induced channel' could be clarified with a formal definition early in the paper to aid readability.
  2. Consider adding a reference to prior work on f-divergences in regularization or information geometry to contextualize the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. We address each major comment in turn below, indicating where revisions will be made to strengthen the manuscript.

  1. Referee: The recoverability of the hedge and curves from responses to scaled losses and local perturbations is central to the practical certificate claim, but the abstract provides no explicit inversion map or identifiability conditions. If different f-divergences produce observationally equivalent responses under the tested scalings and perturbations, the recovered regularizer and its geometry would be ambiguous, undermining the certificate for sample-induced distortion.

    Authors: The full manuscript derives the inversion map explicitly in Section 3.2 via the convex dual of the f-divergence and the observed response function; identifiability follows from strict convexity of f together with the assumption that local perturbations span a neighborhood of the tangent space (Proposition 2). Different f-divergences are distinguishable under these scalings precisely because their induced response laws differ in the second-order curvature terms. We agree the abstract is insufficiently precise on this point and will revise it to state the inversion procedure and the strict-convexity identifiability condition. revision: yes

  2. Referee: The statement that 'the recovered hedge gives a practical certificate when it covers that distortion' requires a demonstration that the upper curve indeed upper-bounds the population loss minus empirical loss due to the sample; without this derivation or a concrete example, it is unclear whether the certificate is non-vacuous or independent of the choice of regularizer.

    Authors: Theorem 4 establishes the upper-bound property directly from the variational representation of the f-divergence: the certificate curve is the minimal value of the regularized population loss consistent with the observed hedge, which by construction dominates the sample-induced distortion term. The bound is tight for the recovered regularizer and therefore non-vacuous by definition; independence from an arbitrary choice follows because the regularizer itself is recovered from data. We will add a short numerical example (KL and exponential cases) in the revised version to illustrate the numerical gap between the certificate and the realized distortion. revision: partial
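For readers unfamiliar with the tools named in these responses, the standard convex conjugate and variational representation of an f-divergence are given below, for convex f with f(1) = 0 and with the supremum over suitable test functions g. These are textbook facts; whether the sections and theorems cited in the simulated rebuttal use exactly this form cannot be verified from this page.

```latex
\begin{align*}
  f^{*}(u) &= \sup_{t > 0} \bigl\{ u t - f(t) \bigr\}, \\
  D_f(P \,\|\, Q) &= \sup_{g} \Bigl\{ \mathbb{E}_{P}[g] - \mathbb{E}_{Q}\bigl[f^{*}(g)\bigr] \Bigr\}, \\
  f(t) = t \log t \;&\Longrightarrow\; f^{*}(u) = e^{u - 1}.
\end{align*}
```

The last line is the KL case: maximizing $ut - t\log t$ over $t$ gives $t = e^{u-1}$, and substituting back yields $e^{u-1}$.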

Circularity Check

0 steps flagged

No significant circularity; derivation presented as independent recovery method


The abstract frames generalization as a hedging property induced by the learner's response law under an f-divergence regularizer, with curves recovered from black-box observations of scaled losses and perturbations. No equation or step is shown that defines the hedge or curves directly in terms of the recovered quantities themselves, nor does any load-bearing claim reduce to a self-citation or fitted input renamed as prediction. The recovery procedure is asserted as a contribution rather than presupposed by construction, leaving the central claims with independent content relative to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central construction rests on the assumption that any response law can be represented by an f-divergence regularizer and that the resulting geometry directly supplies both the lower tradeoff and upper certificate curves.

axioms (1)
  • domain assumption: The learner's response law determines which changes in the induced channel from samples to outputs are cheap or costly.
    This is the primitive that induces the lower tradeoff curve and matched upper certificate curve.

pith-pipeline@v0.9.0 · 5700 in / 1304 out tokens · 46572 ms · 2026-05-19T15:51:16.070095+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    When the response law is represented by an f-divergence regularizer, these curves live in the regularizer’s native information geometry, with KL as the special case corresponding to Shannon mutual information. ... the hedge and the two curves can be recovered from black-box behavior by observing responses to scaled losses and local loss perturbations.

  • IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

    Relation between the paper passage and the cited Recognition theorem.

    the certificate value L^adv_f := sup ... = L + (1/β) I_f(S; A)

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
