Volatility promotes exploration and stochasticity suppresses it in Gaussian state-space bandits, shown by extending Gittins indices and deriving the CAUSE exploration bonus via control-as-inference.
Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
background 1polarities
background 1representative citing papers
Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no performance loss.
COMPASS-Hedge is presented as the first parameter-free full-information anytime algorithm that simultaneously delivers minimax-optimal adversarial regret, instance-optimal stochastic regret, and Õ(1) regret to a baseline policy.
Develops COF algorithm for MAB-CS that intelligently checks cheap arm feasibility by pooling samples, with generalized instance-dependent lower bounds and matching upper bounds on cumulative cost and quality regret.
citing papers explorer
-
Not all uncertainty is alike: volatility, stochasticity, and exploration
Volatility promotes exploration and stochasticity suppresses it in Gaussian state-space bandits, shown by extending Gittins indices and deriving the CAUSE exploration bonus via control-as-inference.
-
The Context Gathering Decision Process: A POMDP Framework for Agentic Search
Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no performance loss.
-
Learning Safely Without Knowing the World:COMPASS-Hedge
COMPASS-Hedge is presented as the first parameter-free full-information anytime algorithm that simultaneously delivers minimax-optimal adversarial regret, instance-optimal stochastic regret, and Õ(1) regret to a baseline policy.
-
Cost-Ordered Feasibility for Multi-Armed Bandits with Cost Subsidy
Develops COF algorithm for MAB-CS that intelligently checks cheap arm feasibility by pooling samples, with generalized instance-dependent lower bounds and matching upper bounds on cumulative cost and quality regret.