Stochastic Mirror Descent under Iterate-Dependent Markov Noise: Analysis in the Asymptotic and Finite Time Regimes
Pith reviewed 2026-05-19 14:31 UTC · model grok-4.3
pith:LCPWCFYM Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{LCPWCFYM}
Prints a linked pith:LCPWCFYM badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
Stochastic mirror descent converges almost surely under iterate-dependent Markov noise for both convex and non-convex problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that stochastic mirror descent achieves almost sure convergence for convex and non-convex problems under the mild assumption of Lipschitz continuity of the objective function, without requiring differentiability, when stochastic gradients arise from an iterate-dependent Markov chain. For smooth objectives, finite-time concentration bounds are obtained; in the convex regime these yield a sample complexity identical to the classical stochastic mirror descent rate under i.i.d. noise, while the non-convex bound is expressed in terms of the norm of the Riemannian gradient over the probability simplex.
What carries the argument
The ergodicity and mixing properties of the iterate-dependent Markov chain, which control the bias and temporal correlations introduced by the state-dependent sampling.
If this is right
- Almost sure convergence holds for both convex and non-convex objectives under only Lipschitz continuity.
- In the convex setting the algorithm retains the optimal sample complexity of standard stochastic mirror descent with independent noise.
- Finite-time concentration bounds become available once the objective is smooth.
- A single analysis framework covers both the asymptotic almost-sure regime and the finite-time regime.
Where Pith is reading between the lines
- The same mixing-based control could be applied to other first-order methods such as stochastic gradient descent under dependent noise.
- The results justify using mirror descent in adaptive systems where decisions directly shape the distribution of future observations.
- Practitioners facing decision-dependent uncertainty can invoke these guarantees without adding extra smoothness or independence assumptions.
Load-bearing premise
The Markov chain generated by the iterate-dependent sampling distribution must satisfy sufficient mixing or ergodicity conditions that allow the bias and temporal dependence to be controlled.
What would settle it
A simulation in which the iterate-dependent transition kernel is constructed to produce a slowly mixing or non-ergodic chain, followed by direct observation that almost sure convergence of the mirror-descent iterates fails.
read the original abstract
We study a stochastic optimization problem in which the sampling distribution depends on the decision variable, and the available samples are generated through an iterate-dependent Markov chain. Such settings arise naturally in problems with decision-dependent uncertainty; however, they introduce bias and temporal dependence, which render standard techniques developed for i.i.d.\ noise inapplicable. In this work, we analyze the stochastic mirror descent algorithm under iterate-dependent Markov noise. We first establish almost sure convergence for both convex and non-convex problems under the mild assumption of Lipschitz continuity of the objective function, without requiring differentiability. We then derive finite-time concentration bounds for smooth objectives. In the convex setting, the resulting sample complexity matches the classical rate of stochastic mirror descent under i.i.d.\ noise. In the non-convex setting, we obtain a sample complexity bound in terms of the norm of the Riemannian gradient over the probability simplex. Overall, our results establish a unified convergence framework for stochastic mirror descent with state-dependent Markov noise, and highlight its behavior in both convex and non-convex regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes stochastic mirror descent (SMD) for stochastic optimization where the sampling distribution is iterate-dependent and samples are generated by a Markov chain whose transition kernel depends on the current decision variable x_k. It claims almost-sure convergence for both convex and non-convex problems under the sole assumption of Lipschitz continuity of the objective (no differentiability required), followed by finite-time concentration bounds for smooth objectives. In the convex case the sample complexity is stated to match the classical i.i.d. rate; in the non-convex case a bound is given in terms of the norm of the Riemannian gradient over the probability simplex.
Significance. If the central claims hold, the work supplies a unified convergence theory for SMD under decision-dependent Markov noise, a setting that arises in adaptive sampling and online learning with state-dependent uncertainty. The matching of the classical i.i.d. sample complexity in the convex regime is a notable strength, as is the extension to non-convex problems without differentiability. No machine-checked proofs or reproducible code are mentioned, but the asymptotic-plus-finite-time treatment would be a useful addition to the stochastic-approximation literature if the mixing control is rigorous.
major comments (2)
- [§3.1, Theorem 4.1] §3.1, Assumption (A2) and Theorem 4.1: the almost-sure convergence statement for Lipschitz (but non-differentiable) objectives rests on controlling the one-step bias E[∇f(x_k, ξ_k) | x_k] − ∇f(x_k) induced by the x_k-dependent kernel P_{x_k}. The manuscript invokes a uniform ergodicity rate ρ < 1 independent of x, yet the only structural assumption supplied is Lipschitz continuity of f; no explicit uniform mixing condition or step-size domination argument is given that would guarantee the bias is summable when the stationary measure π_x itself moves with x_k. This is load-bearing for the a.s. claim.
- [§5.2, Eq. (28)] §5.2, Eq. (28): the finite-time concentration bound for the convex case is asserted to recover the classical O(1/√T) rate. The derivation appears to absorb the Markov dependence into a constant that is independent of T, but the proof sketch does not quantify how the transient bias accumulates over the first O(log T) steps when the mixing time may depend on the current x_k; a concrete bound on the total variation distance after one step of P_{x_k} is needed to confirm the constant remains O(1).
minor comments (2)
- [§2] Notation: the symbol π_x is introduced in §2 but used interchangeably with the stationary distribution of P_x; a single consistent definition would improve readability.
- [Figure 1] Figure 1: the plotted trajectories for the non-convex case lack error bars or shaded regions indicating variability across the 20 independent runs; adding these would clarify the concentration claim.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable feedback on our manuscript. We address each of the major comments in detail below and indicate the revisions we plan to make.
read point-by-point responses
-
Referee: [§3.1, Theorem 4.1] §3.1, Assumption (A2) and Theorem 4.1: the almost-sure convergence statement for Lipschitz (but non-differentiable) objectives rests on controlling the one-step bias E[∇f(x_k, ξ_k) | x_k] − ∇f(x_k) induced by the x_k-dependent kernel P_{x_k}. The manuscript invokes a uniform ergodicity rate ρ < 1 independent of x, yet the only structural assumption supplied is Lipschitz continuity of the objective (no differentiability required), no explicit uniform mixing condition or step-size domination argument is given that would guarantee the bias is summable when the stationary measure π_x itself moves with x_k. This is load-bearing for the a.s. claim.
Authors: We appreciate the referee's careful scrutiny of the almost-sure convergence result. In the manuscript, Assumption (A2) explicitly posits a uniform geometric ergodicity condition with rate ρ < 1 that holds uniformly over all x in the domain. This assumption is independent of the Lipschitz continuity of f, which is used separately to control the variation in the stationary distributions π_x as x changes. Specifically, the Lipschitz property ensures that the total variation distance between π_x and π_y is bounded by a constant times ||x - y||, allowing us to show that the bias term is O(α_k) where α_k is the step size, and thus summable under standard step-size conditions. We will add a dedicated remark in Section 3.1 to explicitly derive the bound on the one-step bias using these elements. revision: partial
-
Referee: [§5.2, Eq. (28)] §5.2, Eq. (28): the finite-time concentration bound for the convex case is asserted to recover the classical O(1/√T) rate. The derivation appears to absorb the Markov dependence into a constant that is independent of T, but the proof sketch does not quantify how the transient bias accumulates over the first O(log T) steps when the mixing time may depend on the current x_k; a concrete bound on the total variation distance after one step of P_{x_k} is needed to confirm the constant remains O(1).
Authors: Thank you for this observation regarding the finite-time bounds. In the proof of the concentration inequality in Section 5.2, we do provide a bound on the total variation distance ||P_x - π_x||_{TV} ≤ C ρ, where ρ is from Assumption (A2) and C is independent of x due to the uniform ergodicity. For the transient phase, since the step sizes are decreasing, the number of steps where the mixing has not fully occurred is logarithmic in T, and the accumulated bias is absorbed into the O(1) constant because the initial transient contributes a term that is bounded by a constant independent of T. However, to make this fully rigorous and transparent, we will include an explicit lemma bounding the total variation distance after one step and show how it affects the overall constant in the revised manuscript. revision: yes
Circularity Check
No circularity: derivation relies on external stochastic approximation theorems under stated ergodicity assumptions
full rationale
The paper establishes almost-sure convergence and finite-time bounds for stochastic mirror descent under iterate-dependent Markov noise by invoking standard results from stochastic approximation (e.g., Borkar-type arguments for controlled Markov chains) together with explicit mixing/ergodicity conditions on the x-dependent transition kernel. These conditions are treated as modeling assumptions rather than derived quantities, and the sample-complexity claims are obtained by direct comparison to the i.i.d. case without fitting parameters to the target bounds. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or described proof outline; the central claims remain independent of the results they establish.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We first establish almost sure convergence for both convex and non-convex problems under the mild assumption of Lipschitz continuity of the objective function, without requiring differentiability.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the iterates{x n}generated by the stochastic mirror descent algorithm in(2.3)converge almost surely to the set of stationary pointsS.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A. AGARWAL, S. M. KAKADE, J. D. LEE,ANDG. MAHAJAN,On the theory of policy gradient methods: Optimality, approximation, and distribution shift, Journal of Machine Learning Research, 22 (2021), pp. 1–76
work page 2021
-
[2]
V. S. BORKAR,Stochastic approximation with ‘controlled markov’noise, Systems & control letters, 55 (2006), pp. 139–145
work page 2006
-
[3]
BOUMAL,An Introduction to Optimization on Smooth Manifolds, Cambridge University Press, 2023
N. BOUMAL,An Introduction to Optimization on Smooth Manifolds, Cambridge University Press, 2023
work page 2023
-
[4]
BUBECK,Introduction to Online Optimization, Lecture notes, Princeton University, 2012
S. BUBECK,Introduction to Online Optimization, Lecture notes, Princeton University, 2012
work page 2012
-
[5]
E. CHE, J. DONG,ANDX. T. TONG,Stochastic gradient descent with adaptive data, Operations Research, (2026)
work page 2026
-
[6]
F. H. CLARKE,Optimization and nonsmooth analysis, SIAM, 1990
work page 1990
-
[7]
CORTES,Discontinuous dynamical systems, IEEE Control systems magazine, 28 (2008), pp
J. CORTES,Discontinuous dynamical systems, IEEE Control systems magazine, 28 (2008), pp. 36–73
work page 2008
- [8]
- [9]
-
[10]
A. DOUCET ANDV. TADIC,Asymptotic bias of stochastic gradient search, Annals of Applied Probability, 27 (2017)
work page 2017
-
[11]
J. C. DUCHI,Introductory lectures on stochastic optimization, 2018
work page 2018
-
[12]
J. C. DUCHI, A. AGARWAL, M. JOHANSSON,ANDM. I. JORDAN,Ergodic mirror descent, SIAM Journal on Optimization, 22 (2012), pp. 1549–1578
work page 2012
-
[13]
M. EVEN,Stochastic gradient descent under markovian sampling schemes, in International Conference on Machine Learning, PMLR, 2023, pp. 9412–9439
work page 2023
-
[14]
I. HALPERIN,Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions: by Warren B. Powell (ed.), Wiley (2022). Hardback. ISBN 9781119815051., vol. 22, Taylor & Francis, 2022
work page 2022
-
[15]
A. HAUSWIRTH, S. BOLOGNANI,ANDF. DORFLER,Projected dynamical systems on irregular, non- euclidean domains for nonlinear optimization, SIAM Journal on Control and Optimization, 59 (2021), pp. 635–668
work page 2021
- [16]
-
[17]
T. LI, F. WU,ANDG. LAN,Stochastic first-order methods for average-reward markov decision processes, Mathematics of Operations Research, 50 (2025), pp. 3125–3160
work page 2025
- [18]
-
[19]
A. NAGURNEY ANDD. ZHANG,Projected dynamical systems and variational inequalities with applications, vol. 2, Springer Science & Business Media, 2012
work page 2012
-
[20]
A. NEMIROVSKI, A. JUDITSKY, G. LAN,ANDA. SHAPIRO,Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609, https://doi.org/10.1 137/070704277
work page 2009
- [21]
-
[22]
A. K. PAUL, A. D. MAHINDRAKAR,ANDR. K. KALAIMANI,Convergence analysis of stochastic saddle point mirror descent algorithm: A projected dynamical viewpoint, IEEE Transactions on Automatic Control, (2025)
work page 2025
-
[23]
R. T. ROCKAFELLAR,Convex analysis, Princeton Mathematical Series, Princeton University Press, Prince- ton, N. J., 1970
work page 1970
-
[24]
A. ROY, K. BALASUBRAMANIAN,ANDS. GHADIMI,Constrained stochastic nonconvex optimization with state-dependent markov data, Advances in neural information processing systems, 35 (2022), pp. 23256– 23270
work page 2022
-
[25]
V. SOLODKIN, A. VEPRIKOV, A. CHERNYAVSKIY,ANDA. BEZNOSIKOV,Methods for optimization prob- lems with markovian stochasticity and non-euclidean geometry, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, 2026, pp. 25508–25517
work page 2026
-
[26]
T. SUN, Y. SUN,ANDW. YIN,On markov chain gradient descent, Advances in neural information processing systems, 31 (2018)
work page 2018
-
[27]
VERSHYNIN,High-dimensional probability, University of California, Irvine, 10 (2020), p
R. VERSHYNIN,High-dimensional probability, University of California, Irvine, 10 (2020), p. 31
work page 2020
-
[28]
L. XIAO,On the convergence rates of policy gradient methods, Journal of Machine Learning Research, 23 (2022), pp. 1–36
work page 2022
-
[29]
V. G. YAJI ANDS. BHATNAGAR,Stochastic recursive inclusions in two timescales with nonadditive iterate- dependent markov noise, Mathematics of Operations Research, 45 (2020), pp. 1405–1444
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.