Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Eric Moulines; Nicholas Bambos; Rahul Singh; Siddharth Chandak; Vivek S. Borkar

arxiv: 2602.16274 · v2 · pith:D26ZTUJYnew · submitted 2026-02-18 · 💻 cs.LG · stat.ML

Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Rahul Singh , Siddharth Chandak , Eric Moulines , Vivek S. Borkar , Nicholas Bambos This is my paper

Pith reviewed 2026-05-21 11:53 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords online Q-learningregret boundsstochastic approximationMarkov decision processesconcentration inequalitiesexploration in RLtime-inhomogeneous Markov chains

0 comments

The pith

Classical online Q-learning achieves sublinear regret in discounted MDPs without optimism or bonuses via a new concentration bound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard online Q-learning can deliver the first regret guarantees in infinite-horizon discounted Markov decision processes without adding optimism or bonus terms. Boltzmann Q-learning with decaying temperature produces sublinear regret when the MDP has large action gaps but can degrade toward linear regret for small gaps. A smoothed epsilon-greedy scheme that blends epsilon-greedy and Boltzmann exploration removes the gap dependence and yields a near-O(N to the 9/10) regret bound that holds with high probability. Both regret and sample-complexity results rest on a fresh high-probability concentration inequality for contractive stochastic approximation whose transition kernel changes with both the current iterate and time, and whose contraction factor approaches one. A reader would care because the result applies directly to the simplest, most commonly implemented Q-learning algorithm without requiring algorithmic modifications that are hard to tune in practice.

Core claim

Classical online Q-learning obtains the first regret bound for infinite-horizon discounted MDPs by deriving a high-probability concentration inequality for contractive Markovian stochastic approximation whose transition dynamics are both iterate-dependent and time-inhomogeneous and whose contraction factor converges to one; this inequality is used to analyze Boltzmann Q-learning with decaying temperature (gap-dependent regret) and a smoothed epsilon_n-greedy scheme (gap-robust near-O(N^{9/10}) regret).

What carries the argument

High-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transitions whose contraction factor approaches one asymptotically.

If this is right

Regret and sample-complexity bounds hold with high probability for both Boltzmann and smoothed epsilon-greedy Q-learning.
The regret of Boltzmann Q-learning scales with the size of the suboptimality gap of the underlying MDP.
The smoothed epsilon-greedy scheme removes dependence on the gap and still delivers sublinear regret of order N to the 9/10.
The same concentration tool applies to other contractive stochastic approximation recursions whose kernels are time-inhomogeneous.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The concentration result may be reusable for analyzing other non-stationary stochastic approximation schemes in reinforcement learning or optimization.
Practical implementations could adopt the smoothed epsilon-greedy rule to obtain gap-robust performance without extra tuning parameters.
The framework suggests that similar high-probability bounds could be obtained for undiscounted or average-reward settings by relaxing the contraction assumption further.

Load-bearing premise

The derived concentration inequality continues to hold when the contraction factor of the Markov chain approaches one and the transition kernel changes with both the current parameter and time.

What would settle it

An explicit MDP and sequence of iterates where the proposed concentration bound is violated while the contraction factor still approaches one, or a concrete run of the smoothed epsilon-greedy algorithm whose cumulative regret grows faster than N to the 9/10.

read the original abstract

We present the first regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes (MDPs), without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $\epsilon_n$-Greedy exploration scheme that combines $\epsilon_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. We also obtain sample complexity guarantees, with both regret and sample complexity bounds holding with high probability. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our framework is allowed to converge to one asymptotically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

First high-prob regret bound for plain online Q-learning without optimism, via a new concentration tool for SA with contraction drifting to 1, but the N^{9/10} rate and schedule dependence are the real limits.

read the letter

The paper's core contribution is a high-probability regret bound for classical online Q-learning in infinite-horizon discounted MDPs that avoids any optimism or bonus terms. They first show that plain Boltzmann Q-learning with decaying temperature gives sublinear regret only when the MDP gap is large enough, then introduce a smoothed epsilon_n-greedy scheme that mixes the two and delivers a gap-robust near-O(N^{9/10}) regret bound along with matching sample-complexity results. The technical engine is a new concentration inequality for contractive Markovian stochastic approximation where both the transition kernel and the contraction factor are iterate- and time-dependent and the factor is allowed to approach 1 asymptotically. That inequality looks like the genuinely new piece relative to prior work on SA and RL regret.

Referee Report

2 major / 2 minor

Summary. The paper claims to deliver the first regret bound for classical online Q-learning in infinite-horizon discounted MDPs without optimism or bonus terms. It first studies Boltzmann Q-learning with decaying temperature, establishing that regret is sublinear for large suboptimality gaps but can approach linear for small gaps. It then introduces a Smoothed ε_n-Greedy scheme that combines ε_n-greedy and Boltzmann exploration to obtain a gap-robust high-probability regret bound of near-Õ(N^{9/10}), along with sample-complexity guarantees. The analysis rests on a new high-probability concentration inequality for contractive Markovian stochastic approximation whose transition kernel and contraction factor are both iterate- and time-dependent, with the contraction factor permitted to approach 1 asymptotically.

Significance. If the new concentration bound is rigorously established and correctly controls accumulated errors under the paper's exploration schedules, the result would be a meaningful advance: it supplies gap-robust regret guarantees for unmodified Q-learning, avoiding the artificial bonuses common in prior work, and the SA tool may be reusable for other time-inhomogeneous contractive processes.

major comments (2)

[Proof of the concentration inequality (likely §4 or Theorem 3.1)] The central high-probability bound for contractive Markovian SA (whose contraction factor converges to 1) must explicitly control the accumulated deviation over intervals where the instantaneous contraction is arbitrarily close to 1. Under the ε_n schedules that may be forced to decay slowly in small-gap MDPs, any failure-probability or deviation term that grows with the length of such intervals would invalidate the uniform high-probability near-Õ(N^{9/10}) regret claim.
[Regret analysis for Smoothed ε_n-Greedy (likely §5)] The gap-robustness argument for the Smoothed ε_n-Greedy scheme relies on the SA bound delivering the stated deviation uniformly over all MDPs. If the proof only obtains a bound whose constants or failure probability implicitly depend on the minimal gap (via the duration the contraction stays near 1), the claimed independence from the gap is not supported.

minor comments (2)

[Algorithm description] The precise definition of the 'smoothed' combination of ε_n-greedy and Boltzmann exploration should be stated explicitly, including how the temperature and ε_n schedules interact.
[Preliminaries] Notation for the time-dependent contraction factor and the iterate-dependent transition kernel should be introduced once and used consistently to avoid ambiguity in the SA analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying these important points regarding the concentration bound and its application to gap-robust regret. We address each major comment below with clarifications on the proof structure.

read point-by-point responses

Referee: [Proof of the concentration inequality (likely §4 or Theorem 3.1)] The central high-probability bound for contractive Markovian SA (whose contraction factor converges to 1) must explicitly control the accumulated deviation over intervals where the instantaneous contraction is arbitrarily close to 1. Under the ε_n schedules that may be forced to decay slowly in small-gap MDPs, any failure-probability or deviation term that grows with the length of such intervals would invalidate the uniform high-probability near-Õ(N^{9/10}) regret claim.

Authors: We appreciate this observation. In the proof of Theorem 3.1, the error process is analyzed by partitioning the time horizon into intervals where the instantaneous contraction factor is bounded away from 1 by a positive constant (using standard geometric contraction) and intervals where it approaches 1. For the latter intervals, we employ a telescoping sum combined with a Freedman's inequality for the martingale increments, where the length of each such interval is controlled by the fixed decay schedule of ε_n (which does not depend on the gap for the smoothed scheme). The resulting deviation per interval is O(sqrt(L log(1/δ)) + L^{1/2} terms), and a union bound over O(log N) intervals yields an overall failure probability independent of the gap and a total accumulated deviation compatible with the near-Õ(N^{9/10}) rate. We will add an explicit remark and a short lemma in §4 to highlight this partitioning and the gap-independent control of interval lengths. revision: partial
Referee: [Regret analysis for Smoothed ε_n-Greedy (likely §5)] The gap-robustness argument for the Smoothed ε_n-Greedy scheme relies on the SA bound delivering the stated deviation uniformly over all MDPs. If the proof only obtains a bound whose constants or failure probability implicitly depend on the minimal gap (via the duration the contraction stays near 1), the claimed independence from the gap is not supported.

Authors: The Smoothed ε_n-Greedy exploration schedule is defined with a fixed decay rate for ε_n that is chosen independently of any suboptimality gap; the same schedule is used for all MDPs. Consequently, the time spent with contraction factor near 1 is governed solely by this schedule and the discount factor, not by the minimal gap. Theorem 3.1 states the high-probability bound with constants depending only on the discount factor, state-action space cardinality, and universal constants, without reference to the gap. The regret analysis in §5 then invokes this uniform bound directly, establishing the near-Õ(N^{9/10}) guarantee that holds for every MDP. We will insert a clarifying paragraph in §5 that explicitly notes the gap-independence of the constants and failure probability, together with a forward reference to the relevant steps in the proof of Theorem 3.1. revision: partial

Circularity Check

0 steps flagged

Derivation self-contained via new concentration bound

full rationale

The paper introduces a novel high-probability concentration bound for contractive Markovian stochastic approximation with time- and iterate-dependent dynamics where the contraction factor approaches 1. This bound is then used to analyze the regret of Boltzmann Q-learning and a smoothed epsilon-greedy scheme, yielding the claimed near-O(N^{9/10}) regret and sample complexity results. No load-bearing steps reduce to self-definitions, fitted predictions, or self-citation chains; the central claims rest on the independent derivation of the concentration inequality rather than on parameters chosen to match the regret target. The analysis is presented as flowing directly from the new bound without circular reductions.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on a newly introduced concentration inequality for time-inhomogeneous contractive stochastic approximation; standard discounted-MDP contraction and high-probability martingale arguments are invoked but not detailed in the abstract.

free parameters (2)

temperature decay schedule
Decaying temperature parameter in Boltzmann Q-learning is a design choice whose rate affects the regret dependence on gaps.
epsilon_n schedule
Exploration probability schedule in the smoothed epsilon-greedy scheme is chosen to achieve the stated N^{9/10} rate.

axioms (2)

domain assumption The underlying MDP is infinite-horizon discounted with a contraction factor strictly less than one.
Standard assumption for discounted MDPs invoked to guarantee existence of optimal Q-function.
ad hoc to paper The stochastic approximation iterates remain contractive even as the contraction factor approaches one.
Key technical assumption enabling the new concentration bound.

pith-pipeline@v0.9.0 · 5729 in / 1492 out tokens · 65005 ms · 2026-05-21T11:53:13.249414+00:00 · methodology

Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)