pith. sign in

arxiv: 2606.11171 · v6 · pith:7JNU6U3Bnew · submitted 2026-06-09 · 💻 cs.LG · cond-mat.stat-mech· cs.IT· math.IT· math.OC· math.ST· stat.TH

Bellman-sufficient Information Complexity

Pith reviewed 2026-07-03 23:42 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.stat-mechcs.ITmath.ITmath.OCmath.STstat.TH
keywords information complexitysequential decision makingBellman equationssufficient statisticsreinforcement learningdynamic programminginformation bounds
0
0 comments X

The pith

Bellman-sufficient state representations unify upper and lower bounds on information complexity via a log-penalized dynamic program.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes Bellman-sufficient information complexity as a representation-level framework for sequential decision making under a fixed-truth environment space. It defines a Bellman-sufficient state representation as interactive sufficient statistics for the environment, paired with an information index that is often the optimal decision or value. Upper bounds arise from organizing learning as a dynamic program on this state with a logarithmic information potential. Lower bounds use a Bellman-Fano certificate that propagates separate recursions for information gain and ghost mass on the same representation. When the log-penalized upper value and ghost-quantile lower certificate close at the same radius, they certify identical complexity scales, positioning existing algorithms as certificates or relaxations of this shared program.

Core claim

Bellman-sufficient information complexity organizes learning as a dynamic program on a sufficient state with a logarithmic information potential for the index, and establishes a conditional Bellman information-risk sandwich where the log-penalized Bellman upper value and the ghost-quantile lower certificate certify the same complexity scale when they close at the same radius.

What carries the argument

The Bellman-sufficient state representation, an interactive notion of sufficient statistics for the fixed-truth environment space Ω, together with the information index Y=χ(Ω) often chosen as the optimal decision or value object.

If this is right

  • Popular algorithms appear as tractable certificates or relaxations of the common log-potential Bellman program.
  • Matching upper and lower bounds certify the complexity scale for the given information index.
  • Learning is reduced to propagating Bellman recursions separately for information gain and ghost mass on the sufficient state.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework suggests that complexity analyses can be derived by verifying whether an algorithm solves or relaxes the shared Bellman program.
  • Choosing the information index as the optimal decision object may reduce the state space needed for exact complexity certification.

Load-bearing premise

A Bellman-sufficient state representation exists that serves as sufficient statistics for the fixed-truth environment space.

What would settle it

A sequential decision problem with a Bellman-sufficient state representation where the log-penalized Bellman upper value and the ghost-quantile lower certificate fail to close at the same radius.

read the original abstract

We develop Bellman-sufficient information complexity, a formal representation-level framework for sequential decision making. The primitive benchmark is a fixed-truth environment space $\Omega$ with unrestricted nonanticipating algorithms. The intrinsic object is a Bellman-sufficient state representation, serving as an interactive notion of sufficient statistics, together with an information index $Y=\chi(\Omega)$, often the optimal decision or value object rather than the full environment. On the upper-bound side, learning is organized as a dynamic program on the sufficient state, equipped with a logarithmic information potential for the index. On the lower-bound side, a Bellman-Fano certificate uses the same state representation and information index, but propagates separate Bellman recursions for information gain and ghost mass. The central matching statement is therefore a conditional Bellman information-risk sandwich: when the log-penalized Bellman upper value and the ghost-quantile lower certificate close at the same radius, they certify the same complexity scale. Popular algorithms then appear as tractable certificates or relaxations of this common log-potential Bellman program, rather than as separate notions of information complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops 'Bellman-sufficient information complexity' as a representation-level framework for sequential decision making under a fixed-truth environment space Ω with nonanticipating algorithms. The core object is a Bellman-sufficient state representation (interactive sufficient statistic) paired with an information index Y=χ(Ω), typically an optimal decision or value object. Upper bounds are obtained via a logarithmic information-potential dynamic program on this state; lower bounds use a Bellman-Fano certificate that runs separate recursions on information gain and ghost mass. The central claim is a conditional 'Bellman information-risk sandwich': when the log-penalized upper value and ghost-quantile lower certificate close at the same radius, they certify the same complexity scale. Popular algorithms are positioned as tractable certificates or relaxations of this common program.

Significance. If the central conditional sandwich holds with a well-defined sufficient representation, the work would supply a unified Bellman-program view that links upper and lower information-complexity bounds in interactive settings and recasts existing algorithms as approximations to the same object. This could strengthen theoretical understanding of information complexity beyond separate upper/lower analyses. The manuscript does not appear to contain machine-checked proofs or reproducible code, but the attempt at a representation-level unification is a substantive contribution if the existence and non-circularity conditions are made rigorous.

major comments (2)
  1. [Definition of Bellman-sufficient state representation and central matching statement] The existence, uniqueness, and constructibility of a Bellman-sufficient state representation for arbitrary fixed-truth Ω (serving simultaneously for the upper log-penalized DP and the lower Bellman-Fano certificate) is not established. This is load-bearing for the central conditional sandwich claim, because without a general existence result or explicit conditions guaranteeing sufficiency under nonanticipating policies, the matching statement cannot certify complexity for general Ω. (See the definition and usage of the representation in the sections introducing the intrinsic object and the sandwich statement.)
  2. [Bellman-Fano certificate recursions and sandwich statement] The lower-bound Bellman-Fano certificate propagates separate recursions on information gain and ghost mass; it must be shown that these do not reduce to the upper log-penalized value by construction (or via fitted quantities) when the same representation and index Y=χ(Ω) are used. Otherwise the conditional closure at the same radius is tautological rather than certifying a shared complexity scale. (See the recursions for the upper value and the ghost-quantile lower certificate.)
minor comments (2)
  1. [Notation and definitions] Notation for the information index Y=χ(Ω) and the radius at which the sandwich closes should be introduced with explicit dependence on the state representation to avoid ambiguity when the same symbols are reused for upper and lower objects.
  2. [Discussion of popular algorithms] The abstract states that 'popular algorithms then appear as tractable certificates'; the manuscript should include at least one concrete example (e.g., a specific algorithm and the corresponding relaxation of the Bellman program) to illustrate the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the load-bearing aspects of the central claims. The feedback is helpful for clarifying the scope of the conditional sandwich. We respond to each major comment below and will revise accordingly.

read point-by-point responses
  1. Referee: [Definition of Bellman-sufficient state representation and central matching statement] The existence, uniqueness, and constructibility of a Bellman-sufficient state representation for arbitrary fixed-truth Ω (serving simultaneously for the upper log-penalized DP and the lower Bellman-Fano certificate) is not established. This is load-bearing for the central conditional sandwich claim, because without a general existence result or explicit conditions guaranteeing sufficiency under nonanticipating policies, the matching statement cannot certify complexity for general Ω. (See the definition and usage of the representation in the sections introducing the intrinsic object and the sandwich statement.)

    Authors: The manuscript defines the Bellman-sufficient state representation as an interactive sufficient statistic for the index Y=χ(Ω) and presents the sandwich statement explicitly as conditional on the existence of such a representation that works for both the upper and lower programs. We provide concrete constructions for several canonical Ω (e.g., finite MDPs, linear bandits) and show that popular algorithms induce valid representations. However, a general existence/uniqueness theorem for arbitrary Ω is not supplied. We will add an explicit subsection stating the minimal conditions (nonanticipation, measurability of the index, and closure under the Bellman operator) under which a representation is guaranteed to exist, together with a discussion of when these conditions hold or fail. This makes the conditional nature of the sandwich fully rigorous without claiming universality. revision: yes

  2. Referee: [Bellman-Fano certificate recursions and sandwich statement] The lower-bound Bellman-Fano certificate propagates separate recursions on information gain and ghost mass; it must be shown that these do not reduce to the upper log-penalized value by construction (or via fitted quantities) when the same representation and index Y=χ(Ω) are used. Otherwise the conditional closure at the same radius is tautological rather than certifying a shared complexity scale. (See the recursions for the upper value and the ghost-quantile lower certificate.)

    Authors: The upper program is a single log-penalized dynamic program whose value function directly encodes the information potential for Y. The lower Bellman-Fano certificate maintains two coupled but distinct recursions: one tracking cumulative information gain and a second tracking the ghost-mass quantile that lower-bounds the risk. These recursions are not algebraically identical to the upper value function; the ghost-mass term introduces an additional quantile operation absent from the upper DP. The sandwich is therefore non-tautological: it holds only when the two independently computed quantities meet at the same radius. We will insert a short lemma in the revision that explicitly contrasts the two Bellman operators and shows that equality of the resulting radii is a substantive matching condition rather than an identity. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description posit a Bellman-sufficient state representation and information index Y=χ(Ω) as primitive objects for a fixed-truth environment space Ω. Upper-bound DP and lower-bound Bellman-Fano certificate are described as separate recursions that match conditionally when closing at the same radius. No equations, fitted parameters, or self-citations are exhibited that reduce the sandwich statement or complexity scale to an input by construction. The framework treats the representation as an assumption serving as interactive sufficient statistics rather than deriving it from the claimed complexity measure. This is the most common honest non-finding for a definitional framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review; the framework rests on the existence of sufficient state representations and the choice of information index, but no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description.

axioms (2)
  • domain assumption Existence of a Bellman-sufficient state representation for the environment space Ω
    The intrinsic object of the framework is defined to be this representation.
  • domain assumption The information index Y=χ(Ω) can be taken as the optimal decision or value object
    Stated as the typical choice for the index in the abstract.
invented entities (2)
  • Bellman-sufficient state representation no independent evidence
    purpose: Interactive sufficient statistics for sequential decisions
    New primitive introduced as the core object of the framework.
  • Bellman-Fano certificate no independent evidence
    purpose: Lower-bound propagation via separate recursions on information gain and ghost mass
    Invented construct for the lower side of the sandwich.

pith-pipeline@v0.9.1-grok · 5727 in / 1501 out tokens · 24457 ms · 2026-07-03T23:42:32.205124+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

  1. [1]

    Improved algorithms for linear stochastic bandits

    Yasin Abbasi-Yadkori, D\' a vid P\' a l, and Csaba Szepesv\' a ri. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011

  2. [2]

    Bartlett, and Elad Hazan

    Jacob Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, 2011

  3. [3]

    Minimax policies for adversarial and stochastic bandits

    Jean-Yves Audibert and S\' e bastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009

  4. [4]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235--256, 2002

  5. [5]

    Minimax regret bounds for reinforcement learning

    Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, 2017

  6. [6]

    Bandits with knapsacks

    Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. Journal of the ACM, 65(3):13:1--13:55, 2018

  7. [7]

    Blind network revenue management

    Omar Besbes and Assaf Zeevi. Blind network revenue management. Operations Research, 60(6):1537--1550, 2012

  8. [8]

    Barron and Thomas M

    Andrew R. Barron and Thomas M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034--1054, 1991

  9. [9]

    Dynamic Programming

    Richard Bellman. Dynamic Programming. Princeton University Press, 1957

  10. [10]

    Bertsekas

    Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Volumes I and II. Athena Scientific, 4th edition, 2017

  11. [11]

    Prediction, Learning, and Games

    Nicol \`o Cesa-Bianchi and G \'a bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006

  12. [12]

    Foster, Yanjun Han, Jian Qian, Alexander Rakhlin, and Yunbei Xu

    Fan Chen, Dylan J. Foster, Yanjun Han, Jian Qian, Alexander Rakhlin, and Yunbei Xu. Assouad, Fano, and Le Cam with interaction: A unifying framework for lower bounds. arXiv:2410.05117, 2024

  13. [13]

    Decision transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, volume 34, 2021

  14. [14]

    Cormen, Charles E

    Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press and McGraw--Hill, 2nd edition, 2001

  15. [15]

    Du, Sham M

    Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In Proceedings of the 38th International Conference on Machine Learning, 2021

  16. [16]

    Ronald A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, 222:309--368, 1922

  17. [17]

    The statistical complexity of interactive decision making,

    Dylan J. Foster, Sham M. Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv:2112.13487, 2021

  18. [18]

    Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan

    Dylan J. Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the complexity of adversarial decision making. In Advances in Neural Information Processing Systems, 2022

  19. [19]

    Foster, Noah Golowich, and Yanjun Han

    Dylan J. Foster, Noah Golowich, and Yanjun Han. Tight guarantees for interactive decision making with the decision-estimation coefficient. arXiv:2301.08215, 2023

  20. [20]

    Hazan, S

    E. Hazan, S. Shalev-Shwartz, and N. Srebro. Research program: Theory of learning in dynamical systems. arXiv preprint arXiv:2512.19410, 2025

  21. [21]

    Schapire

    Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, 2017

  22. [22]

    Minimal expected regret in linear quadratic control

    Yassir Jedra and Alexandre Proutiere. Minimal expected regret in linear quadratic control. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022

  23. [23]

    Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv:1907.05388, 2019

  24. [24]

    Asymptotically efficient adaptive control in stochastic regression models

    Tze Leung Lai. Asymptotically efficient adaptive control in stochastic regression models. Advances in Applied Mathematics, 7(1):23--45, 1986

  25. [25]

    Bandit Algorithms

    Tor Lattimore and Csaba Szepesv \'a ri. Bandit Algorithms. Cambridge University Press, 2020

  26. [26]

    Mirror descent and the information ratio

    Tor Lattimore and Andr\' a s Gy\"orgy. Mirror descent and the information ratio. In Proceedings of the 34th Annual Conference on Learning Theory, 2021

  27. [27]

    A lower bound for adaptive linear regression

    Tor Lattimore. A lower bound for adaptive linear regression. In Proceedings of the 34th International Conference on Algorithmic Learning Theory, 2023

  28. [28]

    Pointwise Generalization in Deep Neural Networks

    Shaojie Li and Yunbei Xu. Pointwise generalization in deep neural networks. arXiv:2605.18598, 2026

  29. [29]

    Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds

    Hao Liang, Cheng Tang, and Yunzong Xu. Finite-time queue peak laws in stochastic networks: Logarithmic scaling after geometric thresholds. arXiv preprint arXiv:2606.18218, 2026

  30. [30]

    Decision making in hybrid environments: A model aggregation approach

    Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Decision making in hybrid environments: A model aggregation approach. In Proceedings of the Thirty Eighth Conference on Learning Theory, 2025

  31. [31]

    An improved model-free decision-estimation coefficient with applications in adversarial MDPs

    Haolin Liu, Chen-Yu Wei, and Julian Zimmert. An improved model-free decision-estimation coefficient with applications in adversarial MDPs. In Proceedings of the International Conference on Learning Representations, 2026

  32. [32]

    Yujie Liu, Vincent Y. F. Tan, and Yunbei Xu. Finite-time minimax bounds and an optimal Lyapunov policy in queueing control. arXiv preprint arXiv:2506.18278, 2025

  33. [33]

    On the power of adaptivity for \( \)-best arm identification in linear bandits

    Arnab Maiti, Yunbei Xu, and Kevin Jamieson. On the power of adaptivity for \( \)-best arm identification in linear bandits. In Proceedings of the 39th Annual Conference on Learning Theory, vol. 336, pages 4911--4968, 2026

  34. [34]

    Certainty equivalence is efficient for linear quadratic control

    Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, volume 32, 2019

  35. [35]

    Relax and randomize: From value to algorithms

    Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems, 2012

  36. [36]

    BISTRO : An efficient relaxation-based method for contextual bandits

    Alexander Rakhlin and Karthik Sridharan. BISTRO : An efficient relaxation-based method for contextual bandits. In Proceedings of the 33rd International Conference on Machine Learning, 2016

  37. [37]

    A tour of reinforcement learning: The view from continuous control

    Benjamin Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2(1):253--279, 2019

  38. [38]

    Learning to optimize via information-directed sampling

    Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems 27, 2014

  39. [39]

    David Silver and Richard S. Sutton. Welcome to the era of experience. Preprint of a chapter to appear in Designing an Intelligence, MIT Press, 2025

  40. [40]

    Naive exploration is optimal for online LQR

    Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR . In Proceedings of the 37th International Conference on Machine Learning, 2020

  41. [41]

    Nonasymptotic analysis of Monte Carlo tree search

    Devavrat Shah, Qiaomin Xie, and Zhi Xu. Nonasymptotic analysis of Monte Carlo tree search. Research, 70(6):3234--3260, 2022

  42. [42]

    Kakade, and Matthias Seeger

    Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010

  43. [43]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto. Learning: An Introduction. MIT Press, 2nd edition, 2018

  44. [44]

    Open problem: Tight online confidence intervals for RKHS elements

    Sattar Vakili, Jonathan Scarlett, and Tara Javidi. Open problem: Tight online confidence intervals for RKHS elements. In Proceedings of the 34th Annual Conference on Learning Theory, 2021

  45. [45]

    Pointwise Complexity for Gaussian Fields: Upper Envelopes, Algorithmic Lower Bounds, and Separation

    Yunbei Xu. Pointwise complexity for Gaussian fields: Upper envelopes, algorithmic lower bounds, and separation. arXiv:2606.07931, 2026

  46. [46]

    Bayesian design principles for frequentist sequential learning

    Yunbei Xu and Assaf Zeevi. Bayesian design principles for frequentist sequential learning. Journal of the ACM, 72(5), Article 37, 2025

  47. [47]

    Yuhong Yang and Andrew R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564--1599, 1999

  48. [48]

    Andrew C. Yao. Probabilistic computations: Toward a unified measure of complexity. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, 1977

  49. [49]

    Assouad, Fano, and Le Cam

    Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423--435. Springer, New York, 1997

  50. [50]

    Information-theoretic upper and lower bounds for statistical estimation

    Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307--1321, 2006