pith. machine review for the scientific record.

arxiv: 2605.12190 · v1 · submitted 2026-05-12 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Information-Theoretic Generalization Bounds for Sequential Decision Making

Futoshi Futami, Masahiro Fujisawa

Pith reviewed 2026-05-13 03:50 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords generalization bounds · conditional mutual information · sequential decision making · online learning · multi-armed bandits · active learning · information theory

The pith

Under row-wise exchangeability, sequential conditional mutual information bounds the generalization gap in sequential decision making.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a sequential supersample framework to obtain information-theoretic generalization bounds in adaptive data settings. The framework separates the learner's filtration from a proof-side supersample enlargement used for ghost-sample comparisons. Under a row-wise exchangeability assumption, the sequential generalization gap is upper-bounded by the sequential CMI, the sum over rounds of selector-loss mutual information terms. A Bernstein-type refinement of the bound achieves faster rates when variance conditions hold. The bounds apply to online learning, streaming active learning with importance weighting, and stochastic multi-armed bandits.

Core claim

We develop a sequential supersample construction that decouples the learner's filtration from the proof-side ghost samples. Under the row-wise exchangeability assumption, the expected generalization gap is upper-bounded by the sequential CMI, which aggregates round-wise mutual information between the selector and the loss. We further derive a Bernstein-type bound that improves the rate when the variance of the loss is small.

What carries the argument

The sequential supersample framework, which separates the learner filtration from a proof-side enlargement used for ghost comparisons, and the sequential CMI, defined as the sum of round-wise selector-loss information terms.
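The abstract does not state the bound's exact form. Modeled on the batch CMI bound of Steinke and Zakynthinou, one plausible shape for the slow-rate sequential bound is the following sketch, where U_t is the round-t selector, L_t the loss on the supersample pair, and F_{t-1} the learner filtration; all notation here is assumed, not the paper's.

```latex
% Hypothetical shape of the slow-rate bound; notation assumed, not the paper's.
\mathrm{SCMI}(T) \;=\; \sum_{t=1}^{T} I\bigl(U_t;\, L_t \,\bigm|\, \mathcal{F}_{t-1}\bigr),
\qquad
\mathbb{E}\bigl[\mathrm{gen}(T)\bigr] \;\lesssim\; \sqrt{\frac{\mathrm{SCMI}(T)}{T}}\, .
```

The Bernstein-type refinement would then replace the square-root dependence with a variance-weighted term when the round-wise losses have small conditional variance.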

If this is right

  • The bounds apply to online learning algorithms.
  • They extend to streaming active learning with importance weighting.
  • The strategy bounds generalization in stochastic multi-armed bandits.
  • Bernstein refinement yields faster rates under low variance conditions.
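The construction details are not in the abstract, so the following is only a minimal sketch of what a sequential supersample for a two-armed stochastic bandit could look like: each round draws a pair of i.i.d. rewards for the chosen arm, an independent selector bit picks which coordinate the learner sees, and the other coordinate is the proof-side ghost. The epsilon-greedy learner and all names (`ghost`, `selector`, `arm_means`) are illustrative assumptions, not the paper's construction.

```python
import random

random.seed(0)

T = 1000
arm_means = [0.4, 0.6]  # fixed Bernoulli arms: rewards i.i.d. per arm

def pull(arm):
    """One Bernoulli reward from a fixed arm distribution."""
    return 1.0 if random.random() < arm_means[arm] else 0.0

# Sequential supersample: each round draws a PAIR of i.i.d. rewards for the
# pulled arm. The selector bit u picks which coordinate the learner sees; the
# other ("ghost") coordinate exists only for the proof-side comparison.
history = []
counts, sums = [0, 0], [0.0, 0.0]
for t in range(T):
    # epsilon-greedy learner, adapted to the realized filtration only
    if random.random() < 0.1 or 0 in counts:
        arm = random.randrange(2)
    else:
        arm = 0 if sums[0] / counts[0] > sums[1] / counts[1] else 1
    pair = (pull(arm), pull(arm))  # two i.i.d. coordinates for this round's row
    u = random.randrange(2)        # selector bit, independent of the data
    seen, ghost = pair[u], pair[1 - u]
    counts[arm] += 1
    sums[arm] += seen              # learner updates on the seen coordinate only
    history.append((arm, u, seen, ghost))

# Since seen and ghost rewards of the same arm share one marginal, their
# average gap should be small -- the quantity the bound controls in expectation.
gap = sum(h[2] - h[3] for h in history) / T
print(f"mean seen-ghost gap: {gap:+.3f}")
```

The decoupling matters: the ghost coordinate is drawn from the same arm marginal but never enters the learner's update, which is how the sketch keeps the enlargement independent of the adaptive choices.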

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of filtrations may enable similar bounds for other adaptive algorithms like reinforcement learning.
  • Sequential CMI could serve as a diagnostic for generalization performance in real-world sequential systems.
  • This work opens paths to information-theoretic analysis of more general causal decision processes.

Load-bearing premise

The row-wise exchangeability assumption that allows the sequential generalization gap to be controlled by sequential CMI.
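A toy, single-row version of the premise under the simplest data model (the paper's rows may carry richer structure): with two i.i.d. coordinates and an independent uniform selector, swapping the seen and ghost coordinates leaves the row's joint law exactly unchanged. A small exact enumeration checks this.

```python
from itertools import product

def row_law(p):
    """Exact joint law of (seen, ghost) for one supersample row:
    two i.i.d. Bernoulli(p) coordinates and an independent uniform selector."""
    law = {}
    for z0, z1, u in product((0, 1), repeat=3):
        prob = (p if z0 else 1 - p) * (p if z1 else 1 - p) * 0.5
        seen, ghost = (z0, z1) if u == 0 else (z1, z0)
        law[(seen, ghost)] = law.get((seen, ghost), 0.0) + prob
    return law

law = row_law(0.3)
swapped = {(g, s): pr for (s, g), pr in law.items()}
# Row-wise exchangeability: swapping seen and ghost leaves the law unchanged.
print(all(abs(law[k] - swapped[k]) < 1e-12 for k in law))  # → True
```

The referee's worry is precisely that this invariance can fail once later rows are conditioned on an adaptive filtration, which is why the assumption is load-bearing.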

What would settle it

A counterexample where rows are not exchangeable and the generalization gap is larger than the sequential CMI predicts.

read the original abstract

Information-theoretic generalization bounds based on the supersample construction are a central tool for algorithm-dependent generalization analysis in the batch i.i.d. setting. However, existing supersample conditional mutual information (CMI) bounds do not directly apply to sequential decision-making problems such as online learning, streaming active learning, and bandits, where data are revealed adaptively and the learner evolves along a causal trajectory. To address this limitation, we develop a sequential supersample framework that separates the learner filtration from a proof-side enlargement used for ghost-coordinate comparisons. Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI, a sum of roundwise selector-loss information terms. We also establish a Bernstein-type refinement that yields faster rates under suitable variance conditions. The selector-SCMI proof strategy applies to online learning, streaming active learning with importance weighting, and stochastic multi-armed bandits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper develops a sequential supersample framework extending information-theoretic generalization bounds to adaptive sequential decision-making settings (online learning, streaming active learning, stochastic bandits). Under a row-wise exchangeability assumption on the supersample, the sequential generalization gap is bounded by sequential CMI (a sum of round-wise selector-loss conditional mutual information terms). A Bernstein-type refinement is also derived for faster rates under suitable variance conditions. The selector-SCMI strategy is claimed to apply directly to the listed applications.

Significance. If the row-wise exchangeability assumption can be rigorously justified or conditioned for adaptive policies, this would meaningfully extend CMI-based, algorithm-dependent generalization analysis beyond the batch i.i.d. case to causal, sequential trajectories. The Bernstein refinement is a constructive addition that could yield practical rate improvements when variance is controlled. The work addresses a genuine technical gap, but its significance hinges on whether the core assumption survives the adaptive dependence structures in the target domains.

major comments (2)
  1. [Abstract and the statement of the main theorem] The row-wise exchangeability assumption is load-bearing for the central claim (abstract: 'Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI'). In stochastic bandits and online learning, the learner's adaptive action selection (or importance weights) conditions later rows on the filtration of prior outcomes, so the joint law of the supersample rows no longer matches the real trajectory. The manuscript must either prove that the assumption is preserved for the policies considered or delineate the precise additional conditions (e.g., on the reward distributions or policy class) under which the ghost-sample comparison remains valid.
  2. [Abstract and Section 3 (framework definition)] The separation of 'learner filtration from a proof-side enlargement' is invoked to handle causality, yet the abstract provides no derivation details or error analysis. Without the explicit construction of the sequential supersample (how the ghost coordinates are sampled and how the CMI terms are defined round-wise), it is impossible to verify that the bound is non-vacuous or that it avoids the circularity that would arise if the enlargement itself depended on the adaptive choices.
minor comments (1)
  1. The abstract is technically dense; a single additional sentence clarifying the difference between the sequential supersample and the standard batch supersample would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work extending information-theoretic generalization bounds to sequential decision-making. The feedback highlights key points about the role of the row-wise exchangeability assumption and the need for explicit details on the supersample construction. We address each major comment below and will revise the manuscript to incorporate clarifications.

read point-by-point responses
  1. Referee: [Abstract and the statement of the main theorem] The row-wise exchangeability assumption is load-bearing for the central claim (abstract: 'Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI'). In stochastic bandits and online learning, the learner's adaptive action selection (or importance weights) conditions later rows on the filtration of prior outcomes, so the joint law of the supersample rows no longer matches the real trajectory. The manuscript must either prove that the assumption is preserved for the policies considered or delineate the precise additional conditions (e.g., on the reward distributions or policy class) under which the ghost-sample comparison remains valid.

    Authors: We agree that row-wise exchangeability is a central assumption and does not hold unconditionally under arbitrary adaptive dependence. The manuscript presents the bound as holding under this assumption (Section 3), without claiming it is automatically satisfied for all policies. For the listed applications, the assumption can be ensured by enlarging the probability space with independent ghost samples drawn from the same marginal distributions as the real data, decoupled from the learner's adaptive choices. This is feasible under standard i.i.d. reward or data assumptions (e.g., fixed arm distributions in bandits, or i.i.d. instances in online learning). We will add a dedicated subsection (new Section 4.4) that explicitly delineates these conditions for each application, including when the policy class or reward distributions allow the ghost coordinates to remain exchangeable row-wise conditional on the filtration. This revision will clarify the scope without overclaiming generality. revision: yes

  2. Referee: [Abstract and Section 3 (framework definition)] The separation of 'learner filtration from a proof-side enlargement' is invoked to handle causality, yet the abstract provides no derivation details or error analysis. Without the explicit construction of the sequential supersample (how the ghost coordinates are sampled and how the CMI terms are defined round-wise), it is impossible to verify that the bound is non-vacuous or that it avoids the circularity that would arise if the enlargement itself depended on the adaptive choices.

    Authors: The abstract is intentionally high-level, consistent with standard practice, and does not contain derivation details. The explicit construction appears in Section 3: the learner filtration is the sigma-algebra generated by the realized observations and actions, while the proof-side enlargement augments the space with ghost coordinates sampled independently from the same marginals (using auxiliary randomness independent of the policy). The round-wise SCMI terms are then I(selector_t; loss on ghost | past filtration), with the ghosts fixed across rounds for exchangeability. This decouples the enlargement from adaptive choices, avoiding circularity. To make verification easier, we will expand Section 3 with a step-by-step sampling procedure, add an appendix containing a filtration diagram, and include a short error analysis showing the bound remains non-vacuous when the CMI terms are finite (as is standard in the batch CMI literature). revision: yes
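To make one round-wise term concrete: each SCMI summand is a conditional mutual information between the selector and a loss. Below is a minimal plug-in estimator for a single unconditional discrete I(U; L) from samples, with the filtration conditioning dropped for brevity; the function name and the two toy joint laws are illustrative, not from the paper.

```python
import math
from collections import Counter

def plugin_mi(pairs):
    """Plug-in estimate of I(U; L) in nats from samples of the pair (u, l)."""
    n = len(pairs)
    joint = Counter(pairs)
    pu = Counter(u for u, _ in pairs)
    pl = Counter(l for _, l in pairs)
    mi = 0.0
    for (u, l), c in joint.items():
        p_ul = c / n
        # I(U;L) = sum p(u,l) * log( p(u,l) / (p(u) p(l)) )
        mi += p_ul * math.log(p_ul * n * n / (pu[u] * pl[l]))
    return mi

# Independent selector and loss: the round's information term vanishes.
indep = [(u, l) for u in (0, 1) for l in (0, 1)]  # uniform product law
# Fully dependent: the loss reveals the selector bit, giving I = log 2 nats.
dep = [(0, 0), (1, 1)]

print(round(plugin_mi(indep), 6), round(plugin_mi(dep), 6))  # → 0.0 0.693147
```

A bound built from such terms is non-vacuous exactly when each summand is finite, which is the point the authors promise to make explicit in the expanded Section 3.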

Circularity Check

0 steps flagged

No significant circularity; bound derived conditionally from exchangeability assumption

full rationale

The paper introduces a sequential supersample framework that separates learner filtration from ghost-sample enlargement and then states that, under the row-wise exchangeability assumption, the generalization gap is controlled by the sum of round-wise selector-loss CMI terms. This is a standard conditional derivation: the bound follows from the assumption plus the new construction rather than reducing to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. No equations in the abstract or description exhibit self-definitional closure, ansatz smuggling, or renaming of known results. The result is self-contained relative to the stated assumption and the supersample construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Ledger populated from abstract only; central claim rests on one domain assumption and introduces two new conceptual entities without independent evidence.

axioms (1)
  • domain assumption row-wise exchangeability assumption
    Invoked to separate learner filtration from proof-side enlargement and to control the generalization gap by sequential CMI.
invented entities (2)
  • sequential supersample framework no independent evidence
    purpose: Separates the learner filtration from a proof-side enlargement used for ghost-coordinate comparisons in sequential settings.
    New construction introduced to extend supersample methods beyond i.i.d. batch data.
  • sequential CMI no independent evidence
    purpose: Bounds the sequential generalization gap as a sum of roundwise selector-loss information terms.
    New quantity defined for the adaptive case.

pith-pipeline@v0.9.0 · 5448 in / 1219 out tokens · 85356 ms · 2026-05-13T03:50:50.702424+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

  1. Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 49–56, 2009.
  2. Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
  3. T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
  4. F. Futami and M. Fujisawa. Time-independent information-theoretic generalization bounds for SGLD. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  5. Peter Grünwald, Thomas Steinke, and Lydia Zakynthinou. PAC-Bayes, MAC-Bayes and conditional mutual information: fast rate bounds that handle general VC classes. In Conference on Learning Theory, pages 2217–2247. PMLR, 2021.
  6. Maxime Haddouche and Benjamin Guedj. Online PAC-Bayes learning. Advances in Neural Information Processing Systems, 35:25725–25738, 2022.
  7. Maxime Haddouche and Benjamin Guedj. PAC-Bayes generalisation bounds for heavy-tailed losses through supermartingales. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=qxrwt6F3sf.
  8. Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M. Roy, and Gintare Karolina Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Advances in Neural Information Processing Systems, 33:9925–9935, 2020.
  9. H. Harutyunyan, M. Raginsky, G. Ver Steeg, and A. Galstyan. Information-theoretic generalization bounds for black-box learning algorithms. In Advances in Neural Information Processing Systems, pages 24670–24682, 2021.
  10. F. Hellström and G. Durisi. A new family of generalization bounds using samplewise evaluated CMI. In Advances in Neural Information Processing Systems, pages 20648–20660, 2022.
  11. Gerhard Kramer. Directed Information for Channels with Feedback, volume 11. Hartung-Gorre, Germany, 1998.
  12. James Massey et al. Causality, feedback and directed information. In Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), volume 2, page 1, 1990.
  13. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2nd edition, 2018.
  14. Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. The Journal of Machine Learning Research, 16(1):155–186, 2015.
  15. D. Russo and J. Zou. Controlling bias in adaptive data analysis using information theory. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51, pages 1232–1240, 2016.
  16. Milad Sefidgaran, Abdellatif Zaidi, and Piotr Krasnowski. Minimum description length and generalization guarantees for representation learning. Advances in Neural Information Processing Systems, 36:1489–1525, 2023.
  17. Yevgeny Seldin, Nicolò Cesa-Bianchi, Peter Auer, François Laviolette, and John Shawe-Taylor. PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, pages 98–111. JMLR Workshop and Conference Proceedings, 2012.
  18. T. Steinke and L. Zakynthinou. Reasoning about generalization via conditional mutual information. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3437–3452. PMLR, 2020.
  19. Sekhar Tatikonda and Sanjoy Mitter. The capacity of channels with feedback. IEEE Transactions on Information Theory, 55(1):323–349, 2008.
  20. H. Wang, R. Gao, and F. P. Calmon. Generalization bounds for noisy iterative algorithms using properties of additive noise channels. Journal of Machine Learning Research, 24(26):1–43, 2023.
  21. Z. Wang and Y. Mao. On the generalization of models trained with SGD: information-theoretic bounds and implications. In The Tenth International Conference on Learning Representations, 2022.
  22. Z. Wang and Y. Mao. Tighter information-theoretic generalization bounds from supersamples. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 36111–36137, 2023.
  23. A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, volume 30, pages 2524–2533, 2017.