Bellman-sufficient Information Complexity

Yunbei Xu

arxiv: 2606.11171 · v6 · pith:7JNU6U3Bnew · submitted 2026-06-09 · 💻 cs.LG · cond-mat.stat-mech· cs.IT· math.IT· math.OC· math.ST· stat.TH

Bellman-sufficient Information Complexity

Yunbei Xu This is my paper

Pith reviewed 2026-07-03 23:42 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.stat-mechcs.ITmath.ITmath.OCmath.STstat.TH

keywords information complexitysequential decision makingBellman equationssufficient statisticsreinforcement learningdynamic programminginformation bounds

0 comments

The pith

Bellman-sufficient state representations unify upper and lower bounds on information complexity via a log-penalized dynamic program.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes Bellman-sufficient information complexity as a representation-level framework for sequential decision making under a fixed-truth environment space. It defines a Bellman-sufficient state representation as interactive sufficient statistics for the environment, paired with an information index that is often the optimal decision or value. Upper bounds arise from organizing learning as a dynamic program on this state with a logarithmic information potential. Lower bounds use a Bellman-Fano certificate that propagates separate recursions for information gain and ghost mass on the same representation. When the log-penalized upper value and ghost-quantile lower certificate close at the same radius, they certify identical complexity scales, positioning existing algorithms as certificates or relaxations of this shared program.

Core claim

Bellman-sufficient information complexity organizes learning as a dynamic program on a sufficient state with a logarithmic information potential for the index, and establishes a conditional Bellman information-risk sandwich where the log-penalized Bellman upper value and the ghost-quantile lower certificate certify the same complexity scale when they close at the same radius.

What carries the argument

The Bellman-sufficient state representation, an interactive notion of sufficient statistics for the fixed-truth environment space Ω, together with the information index Y=χ(Ω) often chosen as the optimal decision or value object.

If this is right

Popular algorithms appear as tractable certificates or relaxations of the common log-potential Bellman program.
Matching upper and lower bounds certify the complexity scale for the given information index.
Learning is reduced to propagating Bellman recursions separately for information gain and ghost mass on the sufficient state.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework suggests that complexity analyses can be derived by verifying whether an algorithm solves or relaxes the shared Bellman program.
Choosing the information index as the optimal decision object may reduce the state space needed for exact complexity certification.

Load-bearing premise

A Bellman-sufficient state representation exists that serves as sufficient statistics for the fixed-truth environment space.

What would settle it

A sequential decision problem with a Bellman-sufficient state representation where the log-penalized Bellman upper value and the ghost-quantile lower certificate fail to close at the same radius.

read the original abstract

We develop Bellman-sufficient information complexity, a formal representation-level framework for sequential decision making. The primitive benchmark is a fixed-truth environment space $\Omega$ with unrestricted nonanticipating algorithms. The intrinsic object is a Bellman-sufficient state representation, serving as an interactive notion of sufficient statistics, together with an information index $Y=\chi(\Omega)$, often the optimal decision or value object rather than the full environment. On the upper-bound side, learning is organized as a dynamic program on the sufficient state, equipped with a logarithmic information potential for the index. On the lower-bound side, a Bellman-Fano certificate uses the same state representation and information index, but propagates separate Bellman recursions for information gain and ghost mass. The central matching statement is therefore a conditional Bellman information-risk sandwich: when the log-penalized Bellman upper value and the ghost-quantile lower certificate close at the same radius, they certify the same complexity scale. Popular algorithms then appear as tractable certificates or relaxations of this common log-potential Bellman program, rather than as separate notions of information complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a Bellman-based framework for information complexity but the central sandwich rests on an unproven existence claim for the sufficient state representation.

read the letter

The punchline is that this paper offers a Bellman-based organizing framework for information complexity in sequential decisions, but the central conditional sandwich depends on a Bellman-sufficient state representation whose existence and constructibility for arbitrary environments is not established.

The paper develops this by taking a fixed-truth environment space Ω and defining a Bellman-sufficient state representation as an interactive sufficient statistic. It pairs this with an information index Y=χ(Ω), typically the optimal decision or value. On the upper side, it sets up a dynamic program on the sufficient state with a logarithmic information potential. On the lower side, it uses a Bellman-Fano certificate that propagates separate recursions for information gain and ghost mass. The matching claim is that when the log-penalized upper value and the ghost-quantile lower certificate close at the same radius, they certify the same complexity scale. Popular algorithms are presented as tractable certificates or relaxations of this common program.

This approach is new in its specific combination of Bellman sufficiency with these dual recursions and the sandwich theorem. It does well in providing a unified view that could help organize work in theoretical reinforcement learning.

The soft spot is the reliance on the existence of that single representation usable for both bounds. Without a general existence proof or construction that works for any Ω and remains sufficient under nonanticipating policies, the conditional matching may not apply in many cases. If the representation isn't available or differs between upper and lower, the certification doesn't go through. The abstract treats it as the intrinsic object, but the details on when it holds are crucial.

This paper is for researchers focused on theoretical aspects of reinforcement learning and information complexity. Readers interested in new frameworks for lower bounds and algorithm unification would find value in exploring the ideas, provided they check the conditions.

It shows honest engagement with building a formal structure, so it deserves a serious referee to examine the proofs and the representation's properties.

I recommend sending it out for peer review.

Referee Report

2 major / 2 minor

Summary. The paper develops 'Bellman-sufficient information complexity' as a representation-level framework for sequential decision making under a fixed-truth environment space Ω with nonanticipating algorithms. The core object is a Bellman-sufficient state representation (interactive sufficient statistic) paired with an information index Y=χ(Ω), typically an optimal decision or value object. Upper bounds are obtained via a logarithmic information-potential dynamic program on this state; lower bounds use a Bellman-Fano certificate that runs separate recursions on information gain and ghost mass. The central claim is a conditional 'Bellman information-risk sandwich': when the log-penalized upper value and ghost-quantile lower certificate close at the same radius, they certify the same complexity scale. Popular algorithms are positioned as tractable certificates or relaxations of this common program.

Significance. If the central conditional sandwich holds with a well-defined sufficient representation, the work would supply a unified Bellman-program view that links upper and lower information-complexity bounds in interactive settings and recasts existing algorithms as approximations to the same object. This could strengthen theoretical understanding of information complexity beyond separate upper/lower analyses. The manuscript does not appear to contain machine-checked proofs or reproducible code, but the attempt at a representation-level unification is a substantive contribution if the existence and non-circularity conditions are made rigorous.

major comments (2)

[Definition of Bellman-sufficient state representation and central matching statement] The existence, uniqueness, and constructibility of a Bellman-sufficient state representation for arbitrary fixed-truth Ω (serving simultaneously for the upper log-penalized DP and the lower Bellman-Fano certificate) is not established. This is load-bearing for the central conditional sandwich claim, because without a general existence result or explicit conditions guaranteeing sufficiency under nonanticipating policies, the matching statement cannot certify complexity for general Ω. (See the definition and usage of the representation in the sections introducing the intrinsic object and the sandwich statement.)
[Bellman-Fano certificate recursions and sandwich statement] The lower-bound Bellman-Fano certificate propagates separate recursions on information gain and ghost mass; it must be shown that these do not reduce to the upper log-penalized value by construction (or via fitted quantities) when the same representation and index Y=χ(Ω) are used. Otherwise the conditional closure at the same radius is tautological rather than certifying a shared complexity scale. (See the recursions for the upper value and the ghost-quantile lower certificate.)

minor comments (2)

[Notation and definitions] Notation for the information index Y=χ(Ω) and the radius at which the sandwich closes should be introduced with explicit dependence on the state representation to avoid ambiguity when the same symbols are reused for upper and lower objects.
[Discussion of popular algorithms] The abstract states that 'popular algorithms then appear as tractable certificates'; the manuscript should include at least one concrete example (e.g., a specific algorithm and the corresponding relaxation of the Bellman program) to illustrate the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the load-bearing aspects of the central claims. The feedback is helpful for clarifying the scope of the conditional sandwich. We respond to each major comment below and will revise accordingly.

read point-by-point responses

Referee: [Definition of Bellman-sufficient state representation and central matching statement] The existence, uniqueness, and constructibility of a Bellman-sufficient state representation for arbitrary fixed-truth Ω (serving simultaneously for the upper log-penalized DP and the lower Bellman-Fano certificate) is not established. This is load-bearing for the central conditional sandwich claim, because without a general existence result or explicit conditions guaranteeing sufficiency under nonanticipating policies, the matching statement cannot certify complexity for general Ω. (See the definition and usage of the representation in the sections introducing the intrinsic object and the sandwich statement.)

Authors: The manuscript defines the Bellman-sufficient state representation as an interactive sufficient statistic for the index Y=χ(Ω) and presents the sandwich statement explicitly as conditional on the existence of such a representation that works for both the upper and lower programs. We provide concrete constructions for several canonical Ω (e.g., finite MDPs, linear bandits) and show that popular algorithms induce valid representations. However, a general existence/uniqueness theorem for arbitrary Ω is not supplied. We will add an explicit subsection stating the minimal conditions (nonanticipation, measurability of the index, and closure under the Bellman operator) under which a representation is guaranteed to exist, together with a discussion of when these conditions hold or fail. This makes the conditional nature of the sandwich fully rigorous without claiming universality. revision: yes
Referee: [Bellman-Fano certificate recursions and sandwich statement] The lower-bound Bellman-Fano certificate propagates separate recursions on information gain and ghost mass; it must be shown that these do not reduce to the upper log-penalized value by construction (or via fitted quantities) when the same representation and index Y=χ(Ω) are used. Otherwise the conditional closure at the same radius is tautological rather than certifying a shared complexity scale. (See the recursions for the upper value and the ghost-quantile lower certificate.)

Authors: The upper program is a single log-penalized dynamic program whose value function directly encodes the information potential for Y. The lower Bellman-Fano certificate maintains two coupled but distinct recursions: one tracking cumulative information gain and a second tracking the ghost-mass quantile that lower-bounds the risk. These recursions are not algebraically identical to the upper value function; the ghost-mass term introduces an additional quantile operation absent from the upper DP. The sandwich is therefore non-tautological: it holds only when the two independently computed quantities meet at the same radius. We will insert a short lemma in the revision that explicitly contrasts the two Bellman operators and shows that equality of the resulting radii is a substantive matching condition rather than an identity. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description posit a Bellman-sufficient state representation and information index Y=χ(Ω) as primitive objects for a fixed-truth environment space Ω. Upper-bound DP and lower-bound Bellman-Fano certificate are described as separate recursions that match conditionally when closing at the same radius. No equations, fitted parameters, or self-citations are exhibited that reduce the sandwich statement or complexity scale to an input by construction. The framework treats the representation as an assumption serving as interactive sufficient statistics rather than deriving it from the claimed complexity measure. This is the most common honest non-finding for a definitional framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review; the framework rests on the existence of sufficient state representations and the choice of information index, but no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description.

axioms (2)

domain assumption Existence of a Bellman-sufficient state representation for the environment space Ω
The intrinsic object of the framework is defined to be this representation.
domain assumption The information index Y=χ(Ω) can be taken as the optimal decision or value object
Stated as the typical choice for the index in the abstract.

invented entities (2)

Bellman-sufficient state representation no independent evidence
purpose: Interactive sufficient statistics for sequential decisions
New primitive introduced as the core object of the framework.
Bellman-Fano certificate no independent evidence
purpose: Lower-bound propagation via separate recursions on information gain and ghost mass
Invented construct for the lower side of the sandwich.

pith-pipeline@v0.9.1-grok · 5727 in / 1501 out tokens · 24457 ms · 2026-07-03T23:42:32.205124+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

[1]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, D\' a vid P\' a l, and Csaba Szepesv\' a ri. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011

work page 2011
[2]

Bartlett, and Elad Hazan

Jacob Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, 2011

work page 2011
[3]

Minimax policies for adversarial and stochastic bandits

Jean-Yves Audibert and S\' e bastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009

work page 2009
[4]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235--256, 2002

work page 2002
[5]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, 2017

work page 2017
[6]

Bandits with knapsacks

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. Journal of the ACM, 65(3):13:1--13:55, 2018

work page 2018
[7]

Blind network revenue management

Omar Besbes and Assaf Zeevi. Blind network revenue management. Operations Research, 60(6):1537--1550, 2012

work page 2012
[8]

Barron and Thomas M

Andrew R. Barron and Thomas M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034--1054, 1991

work page 1991
[9]

Dynamic Programming

Richard Bellman. Dynamic Programming. Princeton University Press, 1957

work page 1957
[10]

Bertsekas

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Volumes I and II. Athena Scientific, 4th edition, 2017

work page 2017
[11]

Prediction, Learning, and Games

Nicol \`o Cesa-Bianchi and G \'a bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006

work page 2006
[12]

Foster, Yanjun Han, Jian Qian, Alexander Rakhlin, and Yunbei Xu

Fan Chen, Dylan J. Foster, Yanjun Han, Jian Qian, Alexander Rakhlin, and Yunbei Xu. Assouad, Fano, and Le Cam with interaction: A unifying framework for lower bounds. arXiv:2410.05117, 2024

work page arXiv 2024
[13]

Decision transformer: Reinforcement learning via sequence modeling

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, volume 34, 2021

work page 2021
[14]

Cormen, Charles E

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press and McGraw--Hill, 2nd edition, 2001

work page 2001
[15]

Du, Sham M

Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In Proceedings of the 38th International Conference on Machine Learning, 2021

work page 2021
[16]

Ronald A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, 222:309--368, 1922

work page 1922
[17]

The statistical complexity of interactive decision making,

Dylan J. Foster, Sham M. Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv:2112.13487, 2021

work page arXiv 2021
[18]

Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan

Dylan J. Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the complexity of adversarial decision making. In Advances in Neural Information Processing Systems, 2022

work page 2022
[19]

Foster, Noah Golowich, and Yanjun Han

Dylan J. Foster, Noah Golowich, and Yanjun Han. Tight guarantees for interactive decision making with the decision-estimation coefficient. arXiv:2301.08215, 2023

work page arXiv 2023
[20]

Hazan, S

E. Hazan, S. Shalev-Shwartz, and N. Srebro. Research program: Theory of learning in dynamical systems. arXiv preprint arXiv:2512.19410, 2025

work page arXiv 2025
[21]

Schapire

Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, 2017

work page 2017
[22]

Minimal expected regret in linear quadratic control

Yassir Jedra and Alexandre Proutiere. Minimal expected regret in linear quadratic control. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022

work page 2022
[23]

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv:1907.05388, 2019

work page arXiv 1907
[24]

Asymptotically efficient adaptive control in stochastic regression models

Tze Leung Lai. Asymptotically efficient adaptive control in stochastic regression models. Advances in Applied Mathematics, 7(1):23--45, 1986

work page 1986
[25]

Bandit Algorithms

Tor Lattimore and Csaba Szepesv \'a ri. Bandit Algorithms. Cambridge University Press, 2020

work page 2020
[26]

Mirror descent and the information ratio

Tor Lattimore and Andr\' a s Gy\"orgy. Mirror descent and the information ratio. In Proceedings of the 34th Annual Conference on Learning Theory, 2021

work page 2021
[27]

A lower bound for adaptive linear regression

Tor Lattimore. A lower bound for adaptive linear regression. In Proceedings of the 34th International Conference on Algorithmic Learning Theory, 2023

work page 2023
[28]

Pointwise Generalization in Deep Neural Networks

Shaojie Li and Yunbei Xu. Pointwise generalization in deep neural networks. arXiv:2605.18598, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds

Hao Liang, Cheng Tang, and Yunzong Xu. Finite-time queue peak laws in stochastic networks: Logarithmic scaling after geometric thresholds. arXiv preprint arXiv:2606.18218, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Decision making in hybrid environments: A model aggregation approach

Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Decision making in hybrid environments: A model aggregation approach. In Proceedings of the Thirty Eighth Conference on Learning Theory, 2025

work page 2025
[31]

An improved model-free decision-estimation coefficient with applications in adversarial MDPs

Haolin Liu, Chen-Yu Wei, and Julian Zimmert. An improved model-free decision-estimation coefficient with applications in adversarial MDPs. In Proceedings of the International Conference on Learning Representations, 2026

work page 2026
[32]

Yujie Liu, Vincent Y. F. Tan, and Yunbei Xu. Finite-time minimax bounds and an optimal Lyapunov policy in queueing control. arXiv preprint arXiv:2506.18278, 2025

work page arXiv 2025
[33]

On the power of adaptivity for -best arm identification in linear bandits

Arnab Maiti, Yunbei Xu, and Kevin Jamieson. On the power of adaptivity for -best arm identification in linear bandits. In Proceedings of the 39th Annual Conference on Learning Theory, vol. 336, pages 4911--4968, 2026

work page 2026
[34]

Certainty equivalence is efficient for linear quadratic control

Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, volume 32, 2019

work page 2019
[35]

Relax and randomize: From value to algorithms

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems, 2012

work page 2012
[36]

BISTRO : An efficient relaxation-based method for contextual bandits

Alexander Rakhlin and Karthik Sridharan. BISTRO : An efficient relaxation-based method for contextual bandits. In Proceedings of the 33rd International Conference on Machine Learning, 2016

work page 2016
[37]

A tour of reinforcement learning: The view from continuous control

Benjamin Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2(1):253--279, 2019

work page 2019
[38]

Learning to optimize via information-directed sampling

Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems 27, 2014

work page 2014
[39]

David Silver and Richard S. Sutton. Welcome to the era of experience. Preprint of a chapter to appear in Designing an Intelligence, MIT Press, 2025

work page 2025
[40]

Naive exploration is optimal for online LQR

Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR . In Proceedings of the 37th International Conference on Machine Learning, 2020

work page 2020
[41]

Nonasymptotic analysis of Monte Carlo tree search

Devavrat Shah, Qiaomin Xie, and Zhi Xu. Nonasymptotic analysis of Monte Carlo tree search. Research, 70(6):3234--3260, 2022

work page 2022
[42]

Kakade, and Matthias Seeger

Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010

work page 2010
[43]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Learning: An Introduction. MIT Press, 2nd edition, 2018

work page 2018
[44]

Open problem: Tight online confidence intervals for RKHS elements

Sattar Vakili, Jonathan Scarlett, and Tara Javidi. Open problem: Tight online confidence intervals for RKHS elements. In Proceedings of the 34th Annual Conference on Learning Theory, 2021

work page 2021
[45]

Pointwise Complexity for Gaussian Fields: Upper Envelopes, Algorithmic Lower Bounds, and Separation

Yunbei Xu. Pointwise complexity for Gaussian fields: Upper envelopes, algorithmic lower bounds, and separation. arXiv:2606.07931, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Bayesian design principles for frequentist sequential learning

Yunbei Xu and Assaf Zeevi. Bayesian design principles for frequentist sequential learning. Journal of the ACM, 72(5), Article 37, 2025

work page 2025
[47]

Yuhong Yang and Andrew R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564--1599, 1999

work page 1999
[48]

Andrew C. Yao. Probabilistic computations: Toward a unified measure of complexity. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, 1977

work page 1977
[49]

Assouad, Fano, and Le Cam

Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423--435. Springer, New York, 1997

work page 1997
[50]

Information-theoretic upper and lower bounds for statistical estimation

Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307--1321, 2006

work page 2006

[1] [1]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, D\' a vid P\' a l, and Csaba Szepesv\' a ri. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011

work page 2011

[2] [2]

Bartlett, and Elad Hazan

Jacob Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, 2011

work page 2011

[3] [3]

Minimax policies for adversarial and stochastic bandits

Jean-Yves Audibert and S\' e bastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009

work page 2009

[4] [4]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235--256, 2002

work page 2002

[5] [5]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, 2017

work page 2017

[6] [6]

Bandits with knapsacks

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. Journal of the ACM, 65(3):13:1--13:55, 2018

work page 2018

[7] [7]

Blind network revenue management

Omar Besbes and Assaf Zeevi. Blind network revenue management. Operations Research, 60(6):1537--1550, 2012

work page 2012

[8] [8]

Barron and Thomas M

Andrew R. Barron and Thomas M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034--1054, 1991

work page 1991

[9] [9]

Dynamic Programming

Richard Bellman. Dynamic Programming. Princeton University Press, 1957

work page 1957

[10] [10]

Bertsekas

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Volumes I and II. Athena Scientific, 4th edition, 2017

work page 2017

[11] [11]

Prediction, Learning, and Games

Nicol \`o Cesa-Bianchi and G \'a bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006

work page 2006

[12] [12]

Foster, Yanjun Han, Jian Qian, Alexander Rakhlin, and Yunbei Xu

Fan Chen, Dylan J. Foster, Yanjun Han, Jian Qian, Alexander Rakhlin, and Yunbei Xu. Assouad, Fano, and Le Cam with interaction: A unifying framework for lower bounds. arXiv:2410.05117, 2024

work page arXiv 2024

[13] [13]

Decision transformer: Reinforcement learning via sequence modeling

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, volume 34, 2021

work page 2021

[14] [14]

Cormen, Charles E

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press and McGraw--Hill, 2nd edition, 2001

work page 2001

[15] [15]

Du, Sham M

Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In Proceedings of the 38th International Conference on Machine Learning, 2021

work page 2021

[16] [16]

Ronald A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, 222:309--368, 1922

work page 1922

[17] [17]

The statistical complexity of interactive decision making,

Dylan J. Foster, Sham M. Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv:2112.13487, 2021

work page arXiv 2021

[18] [18]

Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan

Dylan J. Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the complexity of adversarial decision making. In Advances in Neural Information Processing Systems, 2022

work page 2022

[19] [19]

Foster, Noah Golowich, and Yanjun Han

Dylan J. Foster, Noah Golowich, and Yanjun Han. Tight guarantees for interactive decision making with the decision-estimation coefficient. arXiv:2301.08215, 2023

work page arXiv 2023

[20] [20]

Hazan, S

E. Hazan, S. Shalev-Shwartz, and N. Srebro. Research program: Theory of learning in dynamical systems. arXiv preprint arXiv:2512.19410, 2025

work page arXiv 2025

[21] [21]

Schapire

Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, 2017

work page 2017

[22] [22]

Minimal expected regret in linear quadratic control

Yassir Jedra and Alexandre Proutiere. Minimal expected regret in linear quadratic control. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022

work page 2022

[23] [23]

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv:1907.05388, 2019

work page arXiv 1907

[24] [24]

Asymptotically efficient adaptive control in stochastic regression models

Tze Leung Lai. Asymptotically efficient adaptive control in stochastic regression models. Advances in Applied Mathematics, 7(1):23--45, 1986

work page 1986

[25] [25]

Bandit Algorithms

Tor Lattimore and Csaba Szepesv \'a ri. Bandit Algorithms. Cambridge University Press, 2020

work page 2020

[26] [26]

Mirror descent and the information ratio

Tor Lattimore and Andr\' a s Gy\"orgy. Mirror descent and the information ratio. In Proceedings of the 34th Annual Conference on Learning Theory, 2021

work page 2021

[27] [27]

A lower bound for adaptive linear regression

Tor Lattimore. A lower bound for adaptive linear regression. In Proceedings of the 34th International Conference on Algorithmic Learning Theory, 2023

work page 2023

[28] [28]

Pointwise Generalization in Deep Neural Networks

Shaojie Li and Yunbei Xu. Pointwise generalization in deep neural networks. arXiv:2605.18598, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds

Hao Liang, Cheng Tang, and Yunzong Xu. Finite-time queue peak laws in stochastic networks: Logarithmic scaling after geometric thresholds. arXiv preprint arXiv:2606.18218, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[30] [30]

Decision making in hybrid environments: A model aggregation approach

Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Decision making in hybrid environments: A model aggregation approach. In Proceedings of the Thirty Eighth Conference on Learning Theory, 2025

work page 2025

[31] [31]

An improved model-free decision-estimation coefficient with applications in adversarial MDPs

Haolin Liu, Chen-Yu Wei, and Julian Zimmert. An improved model-free decision-estimation coefficient with applications in adversarial MDPs. In Proceedings of the International Conference on Learning Representations, 2026

work page 2026

[32] [32]

Yujie Liu, Vincent Y. F. Tan, and Yunbei Xu. Finite-time minimax bounds and an optimal Lyapunov policy in queueing control. arXiv preprint arXiv:2506.18278, 2025

work page arXiv 2025

[33] [33]

On the power of adaptivity for \( \)-best arm identification in linear bandits

Arnab Maiti, Yunbei Xu, and Kevin Jamieson. On the power of adaptivity for \( \)-best arm identification in linear bandits. In Proceedings of the 39th Annual Conference on Learning Theory, vol. 336, pages 4911--4968, 2026

work page 2026

[34] [34]

Certainty equivalence is efficient for linear quadratic control

Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, volume 32, 2019

work page 2019

[35] [35]

Relax and randomize: From value to algorithms

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems, 2012

work page 2012

[36] [36]

BISTRO : An efficient relaxation-based method for contextual bandits

Alexander Rakhlin and Karthik Sridharan. BISTRO : An efficient relaxation-based method for contextual bandits. In Proceedings of the 33rd International Conference on Machine Learning, 2016

work page 2016

[37] [37]

A tour of reinforcement learning: The view from continuous control

Benjamin Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2(1):253--279, 2019

work page 2019

[38] [38]

Learning to optimize via information-directed sampling

Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems 27, 2014

work page 2014

[39] [39]

David Silver and Richard S. Sutton. Welcome to the era of experience. Preprint of a chapter to appear in Designing an Intelligence, MIT Press, 2025

work page 2025

[40] [40]

Naive exploration is optimal for online LQR

Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR . In Proceedings of the 37th International Conference on Machine Learning, 2020

work page 2020

[41] [41]

Nonasymptotic analysis of Monte Carlo tree search

Devavrat Shah, Qiaomin Xie, and Zhi Xu. Nonasymptotic analysis of Monte Carlo tree search. Research, 70(6):3234--3260, 2022

work page 2022

[42] [42]

Kakade, and Matthias Seeger

Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010

work page 2010

[43] [43]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Learning: An Introduction. MIT Press, 2nd edition, 2018

work page 2018

[44] [44]

Open problem: Tight online confidence intervals for RKHS elements

Sattar Vakili, Jonathan Scarlett, and Tara Javidi. Open problem: Tight online confidence intervals for RKHS elements. In Proceedings of the 34th Annual Conference on Learning Theory, 2021

work page 2021

[45] [45]

Pointwise Complexity for Gaussian Fields: Upper Envelopes, Algorithmic Lower Bounds, and Separation

Yunbei Xu. Pointwise complexity for Gaussian fields: Upper envelopes, algorithmic lower bounds, and separation. arXiv:2606.07931, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[46] [46]

Bayesian design principles for frequentist sequential learning

Yunbei Xu and Assaf Zeevi. Bayesian design principles for frequentist sequential learning. Journal of the ACM, 72(5), Article 37, 2025

work page 2025

[47] [47]

Yuhong Yang and Andrew R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564--1599, 1999

work page 1999

[48] [48]

Andrew C. Yao. Probabilistic computations: Toward a unified measure of complexity. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, 1977

work page 1977

[49] [49]

Assouad, Fano, and Le Cam

Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423--435. Springer, New York, 1997

work page 1997

[50] [50]

Information-theoretic upper and lower bounds for statistical estimation

Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307--1321, 2006

work page 2006