pith. machine review for the scientific record.

arxiv: 2510.16132 · v2 · submitted 2025-10-17 · 💻 cs.LG · math.OC · stat.ML

Recognition: unknown

A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

Authors on Pith: no claims yet
classification: 💻 cs.LG · math.OC · stat.ML
keywords: learning, policies, policy, q-learning, time-varying, analysis
original abstract

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $\mathcal{O}(1/\xi^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty]\le \xi$. This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$, highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to an optimal one. Numerical results support our theory. Technically, rapidly time-varying learning policies induce time-inhomogeneous Markovian noise, creating significant analytical challenges under minimal exploration. To address this, we develop a Poisson-equation-based decomposition of the Markovian noise under a lazy transition matrix, separating it into a martingale-difference term and residual terms. The residuals are controlled via sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may extend to other RL algorithms with time-varying policies, such as single-timescale actor-critic methods and learning-in-games algorithms.
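The on-policy setting the abstract describes is concrete enough to sketch. Below is a minimal illustration, not the paper's algorithm or constants: tabular Q-learning where the behavior policy is a softmax over the current Q-estimate, so the sampling distribution changes at every iteration. The synthetic MDP, the temperature schedule, and the step sizes are all assumptions made for the sketch.

```python
import numpy as np

# Minimal sketch of on-policy tabular Q-learning with a time-varying
# learning policy, in the spirit of the paper's setting. The MDP below,
# the softmax temperature schedule, and the step sizes are illustrative
# assumptions, not the paper's choices.

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
gamma = 0.9

# Random synthetic MDP: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def softmax_policy(q_row, tau):
    """Softmax (Boltzmann) learning policy over actions for one state."""
    z = (q_row - q_row.max()) / tau  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

Q = np.zeros((n_states, n_actions))
s = 0
K = 50_000
for k in range(1, K + 1):
    alpha = 1.0 / k ** 0.75               # diminishing step size (assumed)
    tau = max(0.05, 1.0 / np.log(k + 2))  # slowly cooling temperature (assumed)

    # On-policy sampling: the behavior policy pi_k is derived from the
    # current Q, so it changes at every iteration.
    pi_s = softmax_policy(Q[s], tau)
    a = rng.choice(n_actions, p=pi_s)

    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Standard Q-learning update along the single on-policy trajectory.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```

Because $\pi_k$ depends on $Q_k$, the trajectory noise is time-inhomogeneous Markovian. The Poisson-equation device the abstract alludes to is shown here only in its generic form, not the paper's lazy-matrix variant: for a mean-zero noise function $f$ under kernel $P$, solve $U - PU = f$, which yields

$$ f(X_t) = \underbrace{U(X_t) - (PU)(X_{t-1})}_{\text{martingale difference}} \;+\; \underbrace{(PU)(X_{t-1}) - (PU)(X_t)}_{\text{telescoping residual}}. $$

With a time-varying kernel $P_k$, the solution $U_k$ drifts with $(Q_k, \pi_k)$, which is why the paper controls the residual terms through a sensitivity analysis of the Poisson equation solution.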

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Achieving $\epsilon^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

cs.LG · 2026-05 · unverdicted · novelty 8.0

    Single-loop actor-critic achieves the first Õ(ε^{-2}) sample complexity for ε-optimal policies under minimal irreducibility assumptions.