Recognition: 2 theorem links
Information-Theoretic Generalization Bounds for Sequential Decision Making
Pith reviewed 2026-05-13 03:50 UTC · model grok-4.3
The pith
Under row-wise exchangeability, sequential conditional mutual information bounds the generalization gap in sequential decision making.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a sequential supersample construction that decouples the learner's filtration from the proof-side ghost samples. Under the row-wise exchangeability assumption, the expected generalization gap is upper-bounded by the sequential CMI, which aggregates round-wise mutual information between the selector and the loss. We further derive a Bernstein-type bound that improves the rate when the variance of the loss is small.
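As a reading aid, here is the shape such a bound typically takes; the square-root form and the constant c are assumptions borrowed from the batch CMI literature, not the paper's stated theorem. S_t denotes the round-t selector, L_t the corresponding loss evaluation, and G_{t-1} the learner filtration after round t-1.

```latex
% Assumed shape only; exact constants and conditioning are the paper's.
\mathrm{SCMI}_n \;=\; \sum_{t=1}^{n} I\bigl(S_t;\, L_t \,\bigm|\, \mathcal{G}_{t-1}\bigr),
\qquad
\mathbb{E}[\mathrm{gen}_n] \;\le\; c\,\sqrt{\frac{\mathrm{SCMI}_n}{n}}.
```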
What carries the argument
The sequential supersample framework separating learner filtration from proof-side enlargement for ghost comparisons, and the sequential CMI as the sum of roundwise selector-loss information terms.
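To make the construction concrete, here is a minimal Python sketch of one plausible reading (all names are hypothetical; the paper's actual definitions live in its Section 3): each round draws an exchangeable pair, a selector bit decides which coordinate the learner sees, and the other coordinate stays on the proof side.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(action, z):
    # Toy squared loss between a scalar action and a sample.
    return (action - z) ** 2

def learner(history):
    # Running-mean learner: measurable w.r.t. the realized filtration only.
    reals = [z for (_, z) in history]
    return float(np.mean(reals)) if reals else 0.0

def sequential_supersample_gap(env_sample, n_rounds=1000):
    """One trajectory of a sequential supersample construction (sketch).

    Each round draws an exchangeable row (z0, z1) of two i.i.d. samples;
    a fair selector bit picks which coordinate the learner sees, and the
    other coordinate serves as the proof-side ghost.  The ghost never
    enters the learner's history, so the proof-side enlargement cannot
    feed back into the adaptive choices.
    """
    history, gaps = [], []
    for _ in range(n_rounds):
        z0, z1 = env_sample(), env_sample()   # row-wise exchangeable pair
        eps = int(rng.integers(0, 2))         # selector bit
        real, ghost = (z0, z1) if eps == 0 else (z1, z0)
        action = learner(history)             # adapted to G_{t-1} only
        gaps.append(loss(action, ghost) - loss(action, real))
        history.append((action, real))        # ghost stays proof-side
    return float(np.mean(gaps))

# Usage: a Gaussian environment; the averaged gap is a Monte Carlo proxy
# for the sequential generalization gap the paper bounds by sequential CMI.
print(sequential_supersample_gap(lambda: rng.normal()))
```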
If this is right
- The bounds apply to online learning algorithms.
- They extend to streaming active learning with importance weighting.
- The strategy bounds generalization in stochastic multi-armed bandits.
- Bernstein refinement yields faster rates under low-variance conditions (a hedged sketch of the typical rate shape follows this list).
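By analogy with fast-rate CMI bounds in the batch setting, a Bernstein-type refinement typically has the shape below; V_n is a round-wise variance proxy and is our notation, not the paper's.

```latex
% Assumed Bernstein-type shape (batch-CMI analogy, not the paper's theorem):
\mathbb{E}[\mathrm{gen}_n]
\;\lesssim\;
\sqrt{\frac{V_n\,\mathrm{SCMI}_n}{n}} \;+\; \frac{\mathrm{SCMI}_n}{n}.
```

When V_n is small, the first term shrinks and the overall rate moves from n^{-1/2} toward n^{-1}, which is the usual sense in which "faster rates under low variance" is meant.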
Where Pith is reading between the lines
- The separation of filtrations may enable similar bounds for other adaptive algorithms like reinforcement learning.
- Sequential CMI could serve as a diagnostic for generalization performance in real-world sequential systems.
- This work opens paths to information-theoretic analysis of more general causal decision processes.
Load-bearing premise
The row-wise exchangeability assumption that allows the sequential generalization gap to be controlled by sequential CMI.
What would settle it
A counterexample where rows are not exchangeable and the generalization gap is larger than the sequential CMI predicts.
Original abstract
Information-theoretic generalization bounds based on the supersample construction are a central tool for algorithm-dependent generalization analysis in the batch i.i.d. setting. However, existing supersample conditional mutual information (CMI) bounds do not directly apply to sequential decision-making problems such as online learning, streaming active learning, and bandits, where data are revealed adaptively and the learner evolves along a causal trajectory. To address this limitation, we develop a sequential supersample framework that separates the learner filtration from a proof-side enlargement used for ghost-coordinate comparisons. Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI, a sum of roundwise selector–loss information terms. We also establish a Bernstein-type refinement that yields faster rates under suitable variance conditions. The selector-SCMI proof strategy applies to online learning, streaming active learning with importance weighting, and stochastic multi-armed bandits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a sequential supersample framework extending information-theoretic generalization bounds to adaptive sequential decision-making settings (online learning, streaming active learning, stochastic bandits). Under a row-wise exchangeability assumption on the supersample, the sequential generalization gap is bounded by sequential CMI (a sum of round-wise selector-loss conditional mutual information terms). A Bernstein-type refinement is also derived for faster rates under suitable variance conditions. The selector-SCMI strategy is claimed to apply directly to the listed applications.
Significance. If the row-wise exchangeability assumption can be rigorously justified or conditioned for adaptive policies, this would meaningfully extend CMI-based, algorithm-dependent generalization analysis beyond the batch i.i.d. case to causal, sequential trajectories. The Bernstein refinement is a constructive addition that could yield practical rate improvements when variance is controlled. The work addresses a genuine technical gap, but its significance hinges on whether the core assumption survives the adaptive dependence structures in the target domains.
major comments (2)
- [Abstract and the statement of the main theorem] The row-wise exchangeability assumption is load-bearing for the central claim (abstract: 'Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI'). In stochastic bandits and online learning, the learner's adaptive action selection (or importance weights) conditions later rows on the filtration of prior outcomes, so the joint law of the supersample rows no longer matches the real trajectory. The manuscript must either prove that the assumption is preserved for the policies considered or delineate the precise additional conditions (e.g., on the reward distributions or policy class) under which the ghost-sample comparison remains valid.
- [Abstract and Section 3 (framework definition)] The separation of 'learner filtration from a proof-side enlargement' is invoked to handle causality, yet the abstract provides no derivation details or error analysis. Without the explicit construction of the sequential supersample (how the ghost coordinates are sampled and how the CMI terms are defined round-wise), it is impossible to verify that the bound is non-vacuous or that it avoids the circularity that would arise if the enlargement itself depended on the adaptive choices.
minor comments (1)
- The abstract is technically dense; a single additional sentence clarifying the difference between the sequential supersample and the standard batch supersample would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our work extending information-theoretic generalization bounds to sequential decision-making. The feedback highlights key points about the role of the row-wise exchangeability assumption and the need for explicit details on the supersample construction. We address each major comment below and will revise the manuscript to incorporate clarifications.
Point-by-point responses
- Referee: [Abstract and the statement of the main theorem] The row-wise exchangeability assumption is load-bearing for the central claim (abstract: 'Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI'). In stochastic bandits and online learning, the learner's adaptive action selection (or importance weights) conditions later rows on the filtration of prior outcomes, so the joint law of the supersample rows no longer matches the real trajectory. The manuscript must either prove that the assumption is preserved for the policies considered or delineate the precise additional conditions (e.g., on the reward distributions or policy class) under which the ghost-sample comparison remains valid.
Authors: We agree that row-wise exchangeability is a central assumption and does not hold unconditionally under arbitrary adaptive dependence. The manuscript presents the bound as holding under this assumption (Section 3), without claiming it is automatically satisfied for all policies. For the listed applications, the assumption can be ensured by enlarging the probability space with independent ghost samples drawn from the same marginal distributions as the real data, decoupled from the learner's adaptive choices. This is feasible under standard i.i.d. reward or data assumptions (e.g., fixed arm distributions in bandits, or i.i.d. instances in online learning). We will add a dedicated subsection (new Section 4.4) that explicitly delineates these conditions for each application, including when the policy class or reward distributions allow the ghost coordinates to remain exchangeable row-wise conditional on the filtration. This revision will clarify the scope without overclaiming generality. revision: yes
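A small Monte Carlo illustration of the decoupling the authors describe (illustrative only; the statistic f is arbitrary and hypothetical): when both coordinates of a row are independent draws from the same marginal, the row is exchangeable, so any statistic of the pair is distribution-invariant under swapping.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustration of row-wise exchangeability under the authors' enlargement:
# both coordinates are independent draws from the same marginal, decoupled
# from any adaptive choice, so (z0, z1) and (z1, z0) share one joint law.
n = 200_000
z0 = rng.exponential(size=n)          # "real" coordinate
z1 = rng.exponential(size=n)          # ghost coordinate, same marginal
f = lambda a, b: a - 2.0 * b          # arbitrary asymmetric test statistic
print(np.quantile(f(z0, z1), [0.1, 0.5, 0.9]))
print(np.quantile(f(z1, z0), [0.1, 0.5, 0.9]))  # matches up to MC error
# If z1 were instead sampled conditionally on z0 (adaptive dependence),
# the two quantile vectors could differ: the failure mode the referee flags.
```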
- Referee: [Abstract and Section 3 (framework definition)] The separation of 'learner filtration from a proof-side enlargement' is invoked to handle causality, yet the abstract provides no derivation details or error analysis. Without the explicit construction of the sequential supersample (how the ghost coordinates are sampled and how the CMI terms are defined round-wise), it is impossible to verify that the bound is non-vacuous or that it avoids the circularity that would arise if the enlargement itself depended on the adaptive choices.
Authors: The abstract is intentionally high-level, consistent with standard practice, and does not contain derivation details. The explicit construction appears in Section 3: the learner filtration is the sigma-algebra generated by the realized observations and actions, while the proof-side enlargement augments the space with ghost coordinates sampled independently from the same marginals (using auxiliary randomness independent of the policy). The round-wise SCMI terms are then I(selector_t; loss on ghost | past filtration), with the ghosts fixed across rounds for exchangeability. This decouples the enlargement from adaptive choices, avoiding circularity. To make verification easier, we will expand Section 3 with a step-by-step sampling procedure, add an appendix containing a filtration diagram, and include a short error analysis showing the bound remains non-vacuous when the CMI terms are finite (as is standard in the batch CMI literature). revision: yes
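A toy sketch of the aggregation the authors describe, using a plug-in estimator over binary variables (our simplification: the conditioning on G_{t-1} is dropped and losses are discretized to one bit, so this illustrates only how round-wise terms sum into sequential CMI, not the exact quantity).

```python
import numpy as np

rng = np.random.default_rng(1)

def binary_mi(x, y):
    """Plug-in mutual information (in nats) between two binary arrays."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

# eps[:, t] is the round-t selector over many independent trajectories;
# bit[:, t] is a one-bit discretization of the ghost-coordinate loss,
# here synthetically correlated with the selector via a 10% flip.
n_traj, n_rounds = 5000, 10
eps = rng.integers(0, 2, size=(n_traj, n_rounds))
bit = eps ^ (rng.random((n_traj, n_rounds)) < 0.1).astype(int)

# Sequential CMI aggregates the round-wise selector-loss information terms.
scmi_hat = sum(binary_mi(eps[:, t], bit[:, t]) for t in range(n_rounds))
print(f"estimated SCMI over {n_rounds} rounds: {scmi_hat:.3f} nats")
```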
Circularity Check
No significant circularity; the bound is derived conditionally from the exchangeability assumption.
Full rationale
The paper introduces a sequential supersample framework that separates learner filtration from ghost-sample enlargement and then states that, under the row-wise exchangeability assumption, the generalization gap is controlled by the sum of roundwise selector-loss CMI terms. This is a standard conditional derivation: the bound follows from the assumption plus the new construction rather than reducing to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. No equations in the abstract or description exhibit the patterns of self-definitional closure, ansatz smuggling, or renaming of known results. The result is self-contained given the stated assumption and the supersample construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: row-wise exchangeability
invented entities (2)
- sequential supersample framework · no independent evidence
- sequential CMI · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI, a sum of roundwise selector–loss information terms."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear · "Lemma 1. Suppose that (3) and (6) hold. Then Err^seq_n = (2/n) ∑_t E[E[ε_t L^+_t | G_{t-1}]]."
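Restated for readability (our reconstruction of the garbled display; the summation range t = 1..n is inferred from the 2/n factor, with ε_t the round-t selector, L_t^+ the ghost-coordinate loss, and G_{t-1} the filtration after round t-1):

```latex
% Lemma 1 (reconstructed display; indexing inferred from context):
\mathrm{Err}^{\mathrm{seq}}_{n}
\;=\;
\frac{2}{n}\sum_{t=1}^{n}
\mathbb{E}\Bigl[\,\mathbb{E}\bigl[\varepsilon_t\, L^{+}_{t} \bigm| \mathcal{G}_{t-1}\bigr]\Bigr].
```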
Reference graph
Works this paper leans on
- [1] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 49–56, 2009.
- [2] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
- [3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
- [4] F. Futami and M. Fujisawa. Time-independent information-theoretic generalization bounds for SGLD. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [5] Peter Grünwald, Thomas Steinke, and Lydia Zakynthinou. PAC-Bayes, MAC-Bayes and conditional mutual information: Fast rate bounds that handle general VC classes. In Conference on Learning Theory, pages 2217–2247. PMLR, 2021.
- [6] Maxime Haddouche and Benjamin Guedj. Online PAC-Bayes learning. Advances in Neural Information Processing Systems, 35:25725–25738, 2022.
- [7] Maxime Haddouche and Benjamin Guedj. PAC-Bayes generalisation bounds for heavy-tailed losses through supermartingales. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=qxrwt6F3sf.
- [8] Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M. Roy, and Gintare Karolina Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Advances in Neural Information Processing Systems, 33:9925–9935, 2020.
- [9] H. Harutyunyan, M. Raginsky, G. Ver Steeg, and A. Galstyan. Information-theoretic generalization bounds for black-box learning algorithms. In Advances in Neural Information Processing Systems, pages 24670–24682, 2021.
- [10] F. Hellström and G. Durisi. A new family of generalization bounds using samplewise evaluated CMI. In Advances in Neural Information Processing Systems, pages 20648–20660, 2022.
- [11] Gerhard Kramer. Directed Information for Channels with Feedback, volume 11. Hartung-Gorre, Germany, 1998.
- [12] James Massey et al. Causality, feedback and directed information. In Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), volume 2, page 1, 1990.
- [13] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2nd edition, 2018. ISBN 0262039400.
- [14] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. The Journal of Machine Learning Research, 16(1):155–186, 2015.
- [15] D. Russo and J. Zou. Controlling bias in adaptive data analysis using information theory. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51, pages 1232–1240, 2016.
- [16] Milad Sefidgaran, Abdellatif Zaidi, and Piotr Krasnowski. Minimum description length and generalization guarantees for representation learning. Advances in Neural Information Processing Systems, 36:1489–1525, 2023.
- [17] Yevgeny Seldin, Nicolò Cesa-Bianchi, Peter Auer, François Laviolette, and John Shawe-Taylor. PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, pages 98–111. JMLR Workshop and Conference Proceedings, 2012.
- [18] T. Steinke and L. Zakynthinou. Reasoning about generalization via conditional mutual information. In Proceedings of the Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3437–3452. PMLR, 2020.
- [19] Sekhar Tatikonda and Sanjoy Mitter. The capacity of channels with feedback. IEEE Transactions on Information Theory, 55(1):323–349, 2008.
- [20] H. Wang, R. Gao, and F. P. Calmon. Generalization bounds for noisy iterative algorithms using properties of additive noise channels. Journal of Machine Learning Research, 24(26):1–43, 2023.
- [21] Z. Wang and Y. Mao. On the generalization of models trained with SGD: Information-theoretic bounds and implications. In The Tenth International Conference on Learning Representations, 2022.
- [22] Z. Wang and Y. Mao. Tighter information-theoretic generalization bounds from supersamples. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 36111–36137, 2023.
- [23] A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, volume 30, pages 2524–2533, 2017.