Bias-Controlled Primal-Dual Natural Actor-Critic: Optimal Rates for Constrained Multi-Objective Average-Reward RL
Pith reviewed 2026-06-25 23:38 UTC · model grok-4.3
The pith
An MLMC-based primal-dual natural actor-critic achieves optimal tilde O(1/sqrt(T)) rates for constrained multi-objective average-reward RL without mixing-time knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The MLMC-based primal-dual Natural Actor-Critic algorithm controls bias in scalarized objectives, constraint evaluation, and actor-critic estimation without requiring mixing-time knowledge of the MDP. It achieves optimal global convergence and constraint-violation rates of tilde O(1/sqrt(T)) for concave scalarized multi-objective RL in the average-reward setting, both with and without constraints.
What carries the argument
Multi-level Monte Carlo sampling applied to the primal and dual updates to offset bias from nonlinearity of the scalarization functions f and g.
If this is right
- Optimal rates hold for both constrained and unconstrained concave scalarized multi-objective problems.
- The same rates are obtained without any mixing-time information even when scalarization is absent.
- Constraint violation converges at the same optimal rate as the objective.
- The bias-control technique applies simultaneously to policy-gradient, actor-critic, and constraint-evaluation steps.
Where Pith is reading between the lines
- The approach may extend to other stochastic optimization settings where nonlinear utilities create similar gradient bias.
- Practical deployment could allow safer multi-objective RL in domains such as resource allocation where mixing times are hard to estimate.
- The method suggests a template for removing mixing-time dependence from other average-reward algorithms.
Load-bearing premise
The bias introduced by nonlinearity of f and g in policy-gradient and actor-critic updates, along with bias in constraint evaluation, can be controlled via MLMC without requiring mixing-time knowledge of the MDP.
What would settle it
An empirical run on a finite-state MDP with unknown mixing time in which the observed convergence rate remains worse than 1/sqrt(T) after MLMC bias correction would falsify the rate claim.
read the original abstract
Many reinforcement learning (RL) problems in the infinite-horizon average-reward setting require optimizing multiple conflicting objectives while satisfying multiple safety constraints. A common approach is concave scalarization, where the agent maximizes a utility $ f(J^\pi_{r_1}, \ldots, J^\pi_{r_M}) $ subject to a scalarized constraint $ g(J^\pi_{c_1}, \ldots, J^\pi_{c_N}) \ge 0 $, where $J^\pi_{r_m}$ and $J^\pi_{c_n}$ denote the average-reward and cost under policy $\pi$. However, the nonlinearity of $f$ and $g$ introduces bias in policy-gradient and actor-critic methods, since gradients must be evaluated using noisy estimates of $J^\pi,$ and $ \mathbb{E}[\partial f(J^\pi)] \neq \partial f(\mathbb{E}[J^\pi]),$ and this bias propagates through both primal and dual updates. We propose an MLMC-based primal-dual Natural Actor-Critic algorithm for average-reward MDPs that controls bias in scalarized objectives, constraint evaluation, and actor-critic estimation without requiring mixing-time knowledge. We show that the algorithm achieves optimal global convergence and constraint-violation rates of $ \tilde{O}(1/\sqrt{T}) $. To our knowledge, this is the first result establishing optimal convergence for concave scalarized multi-objective RL in the average-reward setting, both with and without constraints, and the first to do so without mixing-time information even in the absence of scalarization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Bias-Controlled Primal-Dual Natural Actor-Critic algorithm that employs Multilevel Monte Carlo (MLMC) to control bias arising from nonlinear scalarization functions f and g in policy-gradient and actor-critic updates for constrained multi-objective average-reward RL. It claims to achieve optimal global convergence and constraint-violation rates of ilde{O}(1/\sqrt{T}) without requiring mixing-time knowledge of the MDP, both with and without constraints, and presents this as the first such result for concave scalarized multi-objective RL in the average-reward setting.
Significance. If the central rates hold, the result would be significant as the first optimal-rate guarantee for this class of problems in average-reward MDPs. The MLMC-based bias control for nonlinear objectives and constraints, together with the mixing-time-free property, would constitute a technical advance over prior primal-dual actor-critic analyses that typically require mixing-time bounds.
major comments (1)
- [MLMC bias-control argument (proof of main convergence theorem, likely §5)] The load-bearing step is the MLMC construction that simultaneously removes the nonlinearity bias E[∇f(J^π)] ≠ ∇f(E[J^π]) (and the analogous term for g), produces sufficiently unbiased estimates for the natural actor-critic gradient and dual constraint, and does so with total sample cost that still yields the claimed ilde{O}(1/\sqrt{T}) rates while remaining agnostic to the unknown mixing time. In average-reward MDPs the bias of a length-L trajectory estimator decays as O(ρ^L) where ρ depends on the mixing time; the proof must therefore contain either a mixing-time-free bias bound or an adaptive level-selection rule whose overhead does not spoil the rate. This argument is not visible from the abstract and is the single most critical point for the optimality claim.
minor comments (1)
- [Abstract] The abstract uses the nonstandard token "\tilde{O}"; replace with the conventional \tilde{O} notation for consistency with the rest of the manuscript.
Simulated Author's Rebuttal
We thank the referee for the careful review and for recognizing the potential significance of the result. We address the major comment on the MLMC bias-control argument below.
read point-by-point responses
-
Referee: [MLMC bias-control argument (proof of main convergence theorem, likely §5)] The load-bearing step is the MLMC construction that simultaneously removes the nonlinearity bias E[∇f(J^π)] ≠ ∇f(E[J^π]) (and the analogous term for g), produces sufficiently unbiased estimates for the natural actor-critic gradient and dual constraint, and does so with total sample cost that still yields the claimed ãO(1/√T) rates while remaining agnostic to the unknown mixing time. In average-reward MDPs the bias of a length-L trajectory estimator decays as O(ρ^L) where ρ depends on the mixing time; the proof must therefore contain either a mixing-time-free bias bound or an adaptive level-selection rule whose overhead does not spoil the rate. This argument is not visible from the abstract and is the single most critical point for the optimality claim.
Authors: Section 5 contains the full MLMC construction. The estimator uses an adaptive level-selection rule driven by empirical variance and bias estimates (no explicit ρ is required). The telescoping MLMC sum removes the nonlinearity bias for both f and g while the per-iteration sample overhead remains logarithmic; the overall complexity stays ãO(T) and yields the stated ãO(1/√T) rates. We have added a concise overview of this adaptive rule in Section 3.2 and a remark after Theorem 1 to make the argument visible without reference to the abstract. revision: yes
Circularity Check
No significant circularity; derivation self-contained from proposed MLMC algorithm
full rationale
The paper proposes an MLMC-based primal-dual Natural Actor-Critic algorithm that controls bias from nonlinearity of f and g without requiring mixing-time knowledge, then derives the ilde{O}(1/\sqrt{T}) rates from this construction. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the abstract or description. The central claim is presented as following from the algorithm's bias-control mechanism rather than reducing to its inputs by definition. This is the most common honest outcome for a methods paper whose result is externally falsifiable via the stated rates.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Reinforcement learning for joint optimization of multiple rewards.Journal of Machine Learning Research, 24(49):1–41, 2023
Mridul Agarwal and Vaneet Aggarwal. Reinforcement learning for joint optimization of multiple rewards.Journal of Machine Learning Research, 24(49):1–41, 2023
2023
-
[2]
Multi-objective reinforcement learning with non-linear scalarization
Mridul Agarwal, Vaneet Aggarwal, and Tian Lan. Multi-objective reinforcement learning with non-linear scalarization. InProceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 9–17, 2022
2022
-
[3]
Concave utility reinforcement learning with zero-constraint violations.Transactions on Machine Learning Research, 2022
Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Concave utility reinforcement learning with zero-constraint violations.Transactions on Machine Learning Research, 2022
2022
-
[4]
Regret guarantees for model-based rein- forcement learning with long-term average constraints
Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Regret guarantees for model-based rein- forcement learning with long-term average constraints. InUncertainty in Artificial Intelligence, pages 22–31. PMLR, 2022
2022
-
[5]
Joint optimization of concave scalarized multi-objective reinforcement learning with policy gradient based algorithm.Journal of Artificial Intelligence Research, 74, 2022
Qinbo Bai, Mridul Agarwal, and Vaneet Aggarwal. Joint optimization of concave scalarized multi-objective reinforcement learning with policy gradient based algorithm.Journal of Artificial Intelligence Research, 74, 2022
2022
-
[6]
Learning general parameterized policies for infinite horizon average reward constrained mdps via primal-dual policy gradient algorithm
Qinbo Bai, Washim U Mondal, and Vaneet Aggarwal. Learning general parameterized policies for infinite horizon average reward constrained mdps via primal-dual policy gradient algorithm. Advances in Neural Information Processing Systems, 37:108566–108599, 2024
2024
-
[7]
Regret analysis of policy gradient algorithm for infinite horizon average reward markov decision processes
Qinbo Bai, Washim Uddin Mondal, and Vaneet Aggarwal. Regret analysis of policy gradient algorithm for infinite horizon average reward markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10980–10988, 2024
2024
-
[8]
Learning infinite-horizon average-reward markov decision process with constraints
Liyu Chen, Rahul Jain, and Haipeng Luo. Learning infinite-horizon average-reward markov decision process with constraints. InInternational Conference on Machine Learning, pages 3246–3270. PMLR, 2022
2022
-
[9]
Finite-time analysis of single-timescale actor-critic.Advances in Neural Information Processing Systems, 36:7017–7049, 2023
Xuyang Chen and Lin Zhao. Finite-time analysis of single-timescale actor-critic.Advances in Neural Information Processing Systems, 36:7017–7049, 2023
2023
-
[10]
Conver- gence and sample complexity of natural policy gradient primal-dual methods for constrained mdps.Journal of Machine Learning Research, 26(256):1–76, 2025
Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Basar, and Mihailo R Jovanovic. Conver- gence and sample complexity of natural policy gradient primal-dual methods for constrained mdps.Journal of Machine Learning Research, 26(256):1–76, 2025
2025
-
[11]
Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester
Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, bench- marks and analysis.Mach. Learn., 110(9):2419–2468, September 2021
2021
-
[12]
Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate policies
Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, and Niao He. Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate policies. InInternational Conference on Machine Learning, pages 9827–9869. PMLR, 2023
2023
-
[13]
Swetha Ganesh and Vaneet Aggarwal. Breaking the bias barrier in concave multi-objective reinforcement learning.arXiv preprint arXiv:2603.08518, 2026
-
[14]
A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach
Swetha Ganesh, Washim Uddin Mondal, and Vaneet Aggarwal. A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach. InProceedings of the 42nd International Conference on Machine Learning, pages 18206–18227, 2025
2025
-
[15]
Constraints driven safe reinforcement learning for autonomous driving decision-making.IEEE Access, 12:128007– 128023, 2024
Fei Gao, Xiaodong Wang, Yuze Fan, Zhenhai Gao, and Rui Zhao. Constraints driven safe reinforcement learning for autonomous driving decision-making.IEEE Access, 12:128007– 128023, 2024
2024
-
[16]
Reinforcement learning for constrained markov decision processes
Ather Gattami, Qinbo Bai, and Vaneet Aggarwal. Reinforcement learning for constrained markov decision processes. InInternational Conference on Artificial Intelligence and Statistics, pages 2656–2664. PMLR, 2021
2021
-
[17]
A practical guide to multi-objective reinforcement learning and planning
Conor F Hayes et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022. 11
2022
-
[18]
Global convergence of policy gradient in average reward mdps
Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Yehuda Levy, R Srikant, and Shie Mannor. Global convergence of policy gradient in average reward mdps. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[19]
Ankita Kushwaha, Kiran Ravish, Preeti Lamba, and Pawan Kumar. A survey of safe reinforce- ment learning and constrained mdps: A technical survey on single-agent and multi-agent safety. arXiv preprint arXiv:2505.17342, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Data center cooling using model-predictive control.Advances in Neural Information Processing Systems, 31, 2018
Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control.Advances in Neural Information Processing Systems, 31, 2018
2018
-
[21]
An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020
Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020
2020
-
[22]
Modified policy iteration for exponen- tial cost risk sensitive mdps
Yashaswini Murthy, Mehrdad Moharrami, and R Srikant. Modified policy iteration for exponen- tial cost risk sensitive mdps. InLearning for Dynamics and Control Conference, pages 395–406. PMLR, 2023
2023
-
[23]
Two-timescale critic-actor for average reward mdps with function approximation
Prashansa Panda and Shalabh Bhatnagar. Two-timescale critic-actor for average reward mdps with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 19813–19820, 2025
2025
-
[24]
Stochastic variance-reduced policy gradient
Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. InInternational conference on machine learning, pages 4026–4035. PMLR, 2018
2018
-
[25]
Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level monte carlo actor-critic
Wesley A Suttle, Amrit Bedi, Bhrij Patel, Brian M Sadler, Alec Koppel, and Dinesh Manocha. Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level monte carlo actor-critic. InInternational Conference on Machine Learning, pages 33240–33267. PMLR, 2023
2023
-
[26]
Convergence of actor-critic with multi- layer neural networks.Advances in neural information processing systems, 36:9279–9321, 2023
Haoxing Tian, Alex Olshevsky, and Yannis Paschalidis. Convergence of actor-critic with multi- layer neural networks.Advances in neural information processing systems, 36:9279–9321, 2023
2023
-
[27]
Non-asymptotic analysis for single-loop (natural) actor-critic with compatible function approximation
Yudan Wang, Yue Wang, Yi Zhou, and Shaofeng Zou. Non-asymptotic analysis for single-loop (natural) actor-critic with compatible function approximation. InInternational Conference on Machine Learning, pages 51771–51824. PMLR, 2024
2024
-
[28]
Model- free reinforcement learning in infinite-horizon average-reward markov decision processes
Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model- free reinforcement learning in infinite-horizon average-reward markov decision processes. In International conference on machine learning, pages 10170–10180. PMLR, 2020
2020
-
[29]
A provably-efficient model-free algorithm for infinite- horizon average-reward constrained markov decision processes
Honghao Wei, Xin Liu, and Lei Ying. A provably-efficient model-free algorithm for infinite- horizon average-reward constrained markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3868–3876, 2022
2022
-
[30]
A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617– 17628, 2020
Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617– 17628, 2020
2020
-
[31]
Sample efficient policy gradient methods with recursive variance reduction
Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. InInternational Conference on Learning Representations, 2020
2020
-
[32]
Improving sample complexity bounds for (natural) actor-critic algorithms.Advances in Neural Information Processing Systems, 33:4358–4369, 2020
Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for (natural) actor-critic algorithms.Advances in Neural Information Processing Systems, 33:4358–4369, 2020
2020
-
[33]
Global convergence for average reward constrained MDPs with primal-dual actor critic algorithm
Yang Xu, Swetha Ganesh, Washim Uddin Mondal, Qinbo Bai, and Vaneet Aggarwal. Global convergence for average reward constrained MDPs with primal-dual actor critic algorithm. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 12
2025
-
[34]
Reward is enough for convex mdps.Advances in Neural Information Processing Systems, 34:25746–25759, 2021
Tom Zahavy, Brendan O’Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is enough for convex mdps.Advances in Neural Information Processing Systems, 34:25746–25759, 2021
2021
-
[35]
Finite sample analysis of average-reward td learning and q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021
Sheng Zhang, Zhe Zhang, and Siva Theja Maguluri. Finite sample analysis of average-reward td learning and q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021
2021
-
[36]
Network resource allocation algorithm using reinforcement learning policy-based network in a smart grid scenario.Electronics, 12(15):3330, 2023
Zhe Zheng, Yu Han, Yingying Chi, Fusheng Yuan, Wenpeng Cui, Hailong Zhu, Yi Zhang, and Peiying Zhang. Network resource allocation algorithm using reinforcement learning policy-based network in a smart grid scenario.Electronics, 12(15):3330, 2023
2023
-
[37]
Rl-mpca: A reinforcement learning based multi-phase computation allocation approach for recommender systems
Jiahong Zhou, Shunhui Mao, Guoliang Yang, Bo Tang, Qianlong Xie, Lebin Lin, Xingxing Wang, and Dong Wang. Rl-mpca: A reinforcement learning based multi-phase computation allocation approach for recommender systems. InProceedings of the ACM Web Conference 2023, pages 3214–3224, 2023
2023
-
[38]
Anchor-changing regularized natural policy gradient for multi-objective reinforcement learning
Ruida Zhou, Tao Liu, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Anchor-changing regularized natural policy gradient for multi-objective reinforcement learning. InAdvances in Neural Information Processing Systems, 2022. 13 A Pseudocode Algorithm 1:Multi-Objective Primal-Dual NAC (MO-PDNAC) Input :Parameter classΘ,feature matrixΦ,primal stepsizeα,du...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.