Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence
Pith reviewed 2026-05-10 14:28 UTC · model grok-4.3
The pith
Adaptive spectral routing achieves the minimax-optimal rate of order √(Tmix/n) for majority-vote ensembles on Markov chains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For stationary, reversible, geometrically ergodic Markov chains in fixed dimension, no measurable estimator attains excess classification risk smaller than order √(Tmix/n). On the AR(1) subclass, uniform bagging incurs risk at least of order Tmix/√n. Adaptive spectral routing, which partitions the training data according to the empirical Fiedler eigenvector of the dependency graph, attains the optimal √(Tmix/n) rate on graph-regular chains up to a lower-order geometric cut term, without knowledge of the mixing time.
What carries the argument
The empirical Fiedler eigenvector of the dependency graph, which partitions the training data into groups with reduced dependence before majority voting.
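The mechanism can be sketched in a few lines. The graph construction below is a hypothetical stand-in (a lag-1 temporal dependency graph, i.e. a path); the paper's actual dependency graph may be built differently, for example from estimated correlations:

```python
import numpy as np

def spectral_route(n_samples: int) -> np.ndarray:
    """Split time indices 0..n-1 into two groups by the sign of the
    Fiedler vector of a lag-1 dependency graph (a path graph).
    Hypothetical sketch; the paper's graph construction may differ."""
    n = n_samples
    # Adjacency of the path graph: sample t depends on sample t-1.
    A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    L = np.diag(A.sum(axis=1)) - A   # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]             # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= 0              # boolean group labels

# For a path graph the Fiedler vector is cosine-shaped, so the split is
# into two contiguous halves of the chain; base learners would then be
# trained per group and combined by majority vote.
groups = spectral_route(200)
```

The contiguous split is the intuition behind the dependence reduction: samples far apart along the chain are nearly independent, so each group behaves closer to an i.i.d. sample than the full trajectory does.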
If this is right
- Uniform bagging suffers excess risk of at least order Tmix/√n on the AR(1) subclass, a √Tmix algorithmic gap relative to the minimax rate.
- The optimal rate is achieved without knowledge of the mixing time.
- The lower bound holds for the full class of stationary reversible geometrically ergodic chains while the upper bound holds for graph-regular subclasses.
- The rates are confirmed in experiments on synthetic Markov chains, 2D spatial grids, the UCR time series archive, and Atari DQN replay buffers.
Where Pith is reading between the lines
- The spectral partitioning idea could apply to other ensemble methods such as boosting when data exhibits Markov dependence.
- In reinforcement learning, routing replay buffer samples this way might further stabilize target networks beyond uniform sampling.
- Combining the approach with dimension reduction would be a natural next step to handle settings beyond fixed ambient dimension.
Load-bearing premise
The underlying process is a stationary reversible geometrically ergodic Markov chain in fixed ambient dimension, with the matching upper bound restricted to a graph-regular subclass.
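As a concrete instance of the premise, the AR(1) process x_t = ρ·x_{t-1} + ε_t with |ρ| < 1 is stationary, reversible, and geometrically ergodic: its autocorrelation decays as ρ^t, so a mixing-time proxy scales like 1/(1-ρ). A minimal numerical check, using that decay as an illustrative convention rather than the paper's exact Tmix definition:

```python
import numpy as np

rho, n, lag = 0.9, 200_000, 10
rng = np.random.default_rng(0)
eps = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):                  # AR(1): x_t = rho * x_{t-1} + eps_t
    x[t] = rho * x[t - 1] + eps[t]
# Geometric ergodicity shows up as geometric autocorrelation decay rho**lag,
# so a proxy mixing time scales like 1 / (1 - rho).
emp_corr = np.corrcoef(x[:-lag], x[lag:])[0, 1]
```

With ρ = 0.9 and lag 10, the empirical lag-10 correlation should land near ρ^10 ≈ 0.35, confirming the geometric decay the premise relies on.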
What would settle it
An estimator achieving excess risk o(√(Tmix/n)) on some geometrically ergodic chain at large sample size, or a graph-regular chain on which spectral routing's excess risk exceeds order √(Tmix/n).
Original abstract
Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than Ω(√(Tmix/n)). We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal, with excess risk bounded below by Ω(Tmix/√n), exhibiting a √Tmix algorithmic gap. Finally, we propose adaptive spectral routing, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate O(√(Tmix/n)) up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of Tmix. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nyström approximation, and bounded non-stationarity are developed as supporting material in the appendix.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript establishes a minimax lower bound of Ω(√(Tmix/n)) on excess classification risk for any measurable estimator, under the class of stationary, reversible, geometrically ergodic Markov chains in fixed ambient dimension. It shows that dependence-agnostic uniform bagging is suboptimal, attaining only Ω(Tmix/√n) on the AR(1) witness subclass used for the lower bound. The paper then introduces adaptive spectral routing, which constructs a dependency graph from the data, extracts its empirical Fiedler vector to partition the training set, and achieves the matching rate O(√(Tmix/n)) up to a lower-order geometric cut term on a graph-regular subclass, without requiring knowledge of Tmix. The results are illustrated with experiments on synthetic chains, 2D grids, the UCR archive, and Atari DQN replay buffers.
Significance. If the central claims hold, the work is significant for providing the first tight minimax characterization of how Markov dependence degrades classical ensemble variance reduction, together with an adaptive algorithm that attains the optimal rate on the relevant subclass. The construction of a matching upper bound via spectral partitioning without explicit mixing-time input, and the explicit demonstration of a √Tmix algorithmic gap for uniform bagging, are notable strengths. These results have direct implications for ensemble methods in time-series, spatial data, and reinforcement-learning replay buffers.
major comments (2)
- [Lower-bound section and graph-regular subclass definition] The lower-bound construction uses an AR(1) witness subclass of stationary reversible geometrically ergodic chains; the manuscript should explicitly verify that this subclass is contained in the graph-regular class on which the upper bound is proved, so that the Ω(√(Tmix/n)) and O(√(Tmix/n)) rates are directly comparable without additional restrictions.
- [Adaptive spectral routing algorithm and its analysis] The adaptive spectral routing algorithm relies on the empirical Fiedler vector of a data-derived dependency graph; the proof that this yields the minimax rate without knowledge of Tmix should clarify the approximation error between the empirical and population Fiedler vectors under only geometric ergodicity (rather than stronger mixing assumptions).
minor comments (2)
- [Upper-bound theorem statement] The geometric cut term appearing in the upper-bound rate should be defined explicitly in the main text (not only in the appendix) and its order relative to √(Tmix/n) should be stated clearly.
- [Notation and definitions] Notation for Tmix and the graph-regular subclass should be introduced at first use in the main body rather than deferred to the appendix.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation of minor revision. The comments are constructive and help strengthen the connection between the lower and upper bounds as well as the clarity of the spectral analysis. We address each major comment below.
Point-by-point responses
-
Referee: [Lower-bound section and graph-regular subclass definition] The lower-bound construction uses an AR(1) witness subclass of stationary reversible geometrically ergodic chains; the manuscript should explicitly verify that this subclass is contained in the graph-regular class on which the upper bound is proved, so that the Ω(√(Tmix/n)) and O(√(Tmix/n)) rates are directly comparable without additional restrictions.
Authors: We agree that an explicit containment argument is needed for direct rate comparison. In the revised manuscript we will insert a short lemma in the lower-bound section establishing that the AR(1) witness subclass satisfies the graph-regular conditions. The AR(1) process generates a dependency graph that is a path (hence bounded degree and regular spectral properties), and the geometric ergodicity assumption already ensures the required uniform spectral gap. This addition removes any ambiguity and confirms that the Ω(√(Tmix/n)) lower bound holds inside the graph-regular subclass. revision: yes
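The rebuttal's path-graph observation can be checked directly: the Laplacian of the path P_n has spectral gap λ2 = 2(1 − cos(π/n)) ≈ π²/n², a closed form that pins down the "bounded degree and regular spectral properties" claim. A minimal numerical check, under the lag-1 graph convention assumed here:

```python
import numpy as np

n = 50
# Laplacian of the path graph induced by lag-1 AR(1) dependence.
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A
lam2 = np.sort(np.linalg.eigvalsh(L))[1]     # Fiedler value (spectral gap)
closed_form = 2 * (1 - np.cos(np.pi / n))    # known path-graph formula
assert abs(lam2 - closed_form) < 1e-10
assert np.all(A.sum(axis=1) <= 2)            # bounded degree (at most 2)
```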
-
Referee: [Adaptive spectral routing algorithm and its analysis] The adaptive spectral routing algorithm relies on the empirical Fiedler vector of a data-derived dependency graph; the proof that this yields the minimax rate without knowledge of Tmix should clarify the approximation error between the empirical and population Fiedler vectors under only geometric ergodicity (rather than stronger mixing assumptions).
Authors: We thank the referee for highlighting the need for an explicit error bound. The existing proof already invokes geometric ergodicity to control the deviation of the empirical adjacency matrix via standard concentration inequalities for Markov chains, followed by a Davis-Kahan perturbation argument on the Laplacian. In the revision we will expand this step into a dedicated paragraph that states the precise bound on ||v̂_emp - v_pop||, shows that the resulting additive term is o(√(Tmix/n)), and reiterates that only geometric ergodicity (no stronger mixing) is used. This will make the argument self-contained and transparent. revision: yes
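The Davis-Kahan step the authors describe can be illustrated numerically: perturb a population Laplacian by a small symmetric error E and check the sin-theta bound sin θ(v̂, v) ≤ 2√2·‖E‖/δ (the Yu-Wang-Samworth variant, with δ the eigengap around λ2). The path Laplacian and random E below are toy stand-ins for the paper's empirical-versus-population comparison:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
# Population Laplacian: path graph standing in for the dependency graph.
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A
# Small symmetric perturbation standing in for empirical estimation error.
S = rng.normal(size=(n, n))
S = S + S.T
E = 1e-3 * S / np.linalg.norm(S, 2)           # operator norm exactly 1e-3
vals, vecs = np.linalg.eigh(L)
_, vecs_hat = np.linalg.eigh(L + E)
v, v_hat = vecs[:, 1], vecs_hat[:, 1]         # population vs perturbed Fiedler
sin_theta = np.sqrt(1.0 - min(1.0, abs(v @ v_hat)) ** 2)
gap = min(vals[1] - vals[0], vals[2] - vals[1])
# Davis-Kahan (Yu-Wang-Samworth form): sin(theta) <= 2*sqrt(2)*||E|| / gap
assert sin_theta <= 2 * np.sqrt(2) * np.linalg.norm(E, 2) / gap
```

When ‖E‖ shrinks at the Markov-chain concentration rate, the same inequality is what converts adjacency-matrix error into the o(√(Tmix/n)) Fiedler-vector term the authors cite.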
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper derives an information-theoretic minimax lower bound of Ω(√(Tmix/n)) for any measurable estimator on the class of stationary, reversible, geometrically ergodic Markov chains in fixed dimension, using standard techniques that do not reduce to fitted parameters or self-referential definitions. The adaptive spectral routing algorithm is constructed independently via the empirical Fiedler vector of a data-derived graph and attains the matching rate (modulo lower-order terms) on the graph-regular subclass containing the AR(1) witness, without knowledge of Tmix. This is standard minimax structure with no load-bearing self-citation chains, no renaming of known results as new derivations, and no predictions that are forced by construction from inputs. The analysis is externally falsifiable and self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: training data follows a stationary, reversible, geometrically ergodic Markov chain in fixed ambient dimension.
- Domain assumption: the upper-bound result holds on a graph-regular subclass of such chains.
Reference graph
Works this paper leans on
- [1] François Abélès, Eugenio Clerico, and Massimiliano Pontil. Online-to-PAC generalization bounds under graph-mixing dependencies. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 2025.
- [2] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104–114. PMLR, 2020.
- [3] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
- [4] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
- [5] Angus Dempster, François Petitjean, and Geoffrey I. Webb. ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery, 34(5):1454–1495, 2020.
- [6] Houtao Deng, George Runger, Eugene Tuv, and Vladimir Martyanov. A time series forest for classification and feature extraction. Information Sciences, 239:142–153, 2013.
- [7] Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298–305, 1973.
- [8] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [9] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [10] Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93–117, 2017.
- [11] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [12] Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via variance. Journal of Machine Learning Research, 17(1):841–881, 2016.
- [13] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems, volume 21, 2008.
- [14] Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms. In Advances in Neural Information Processing Systems, volume 33, pages 16666–16676, 2020.
- [15] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, volume 29, 2016.
- [16] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [17] Dimitris N. Politis and Joseph P. Romano. A circular block-resampling procedure for stationary data. In Raoul LePage and Lynne Billard, editors, Exploring the Limits of Bootstrap, pages 263–270. Wiley, 1992.
- [18] Dimitris N. Politis and Joseph P. Romano. The stationary bootstrap. Journal of the American Statistical Association, 89(428):1303–1313, 1994.
- [19] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.
- [20] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
- [21] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
- [22] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.
- [23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [24] Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, 36(2):555–586, 2008.
- [25] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, volume 13, pages 682–688, 2001.
- [26] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.
- [27] Rui Zhang, Liu Liu, and Xiaojin Zhu. Graph-dependent generalization bounds. In International Conference on Machine Learning. PMLR, 2019.
- [28] Ingvar Ziemann and Steve Hanneke. Learning with little mixing. In Advances in Neural Information Processing Systems, volume 35, 2022.