Bias-Controlled Primal-Dual Natural Actor-Critic: Optimal Rates for Constrained Multi-Objective Average-Reward RL

Ankur Naskar; Swetha Ganesh; Vaneet Aggarwal

arxiv: 2606.25012 · v1 · pith:IEMYOYVJnew · submitted 2026-06-23 · 💻 cs.LG

Bias-Controlled Primal-Dual Natural Actor-Critic: Optimal Rates for Constrained Multi-Objective Average-Reward RL

Ankur Naskar , Swetha Ganesh , Vaneet Aggarwal This is my paper

Pith reviewed 2026-06-25 23:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-objective reinforcement learningaverage-reward RLconstrained RLactor-criticprimal-dual methodsbias controlmulti-level Monte Carlonatural actor-critic

0 comments

The pith

An MLMC-based primal-dual natural actor-critic achieves optimal tilde O(1/sqrt(T)) rates for constrained multi-objective average-reward RL without mixing-time knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a primal-dual natural actor-critic algorithm that applies multi-level Monte Carlo sampling to offset bias arising from nonlinear scalarization of multiple rewards and costs. It establishes that this bias control yields global convergence and constraint-violation bounds of order 1 over square root of the time horizon in infinite-horizon average-reward MDPs. A sympathetic reader cares because standard policy-gradient and actor-critic updates become biased when the utility and constraint functions are nonlinear, and most prior analyses demand knowledge of the MDP mixing time that is rarely available. The result covers both the constrained and unconstrained cases and is presented as the first to reach the optimal rate in this setting without that information.

Core claim

The MLMC-based primal-dual Natural Actor-Critic algorithm controls bias in scalarized objectives, constraint evaluation, and actor-critic estimation without requiring mixing-time knowledge of the MDP. It achieves optimal global convergence and constraint-violation rates of tilde O(1/sqrt(T)) for concave scalarized multi-objective RL in the average-reward setting, both with and without constraints.

What carries the argument

Multi-level Monte Carlo sampling applied to the primal and dual updates to offset bias from nonlinearity of the scalarization functions f and g.

If this is right

Optimal rates hold for both constrained and unconstrained concave scalarized multi-objective problems.
The same rates are obtained without any mixing-time information even when scalarization is absent.
Constraint violation converges at the same optimal rate as the objective.
The bias-control technique applies simultaneously to policy-gradient, actor-critic, and constraint-evaluation steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other stochastic optimization settings where nonlinear utilities create similar gradient bias.
Practical deployment could allow safer multi-objective RL in domains such as resource allocation where mixing times are hard to estimate.
The method suggests a template for removing mixing-time dependence from other average-reward algorithms.

Load-bearing premise

The bias introduced by nonlinearity of f and g in policy-gradient and actor-critic updates, along with bias in constraint evaluation, can be controlled via MLMC without requiring mixing-time knowledge of the MDP.

What would settle it

An empirical run on a finite-state MDP with unknown mixing time in which the observed convergence rate remains worse than 1/sqrt(T) after MLMC bias correction would falsify the rate claim.

read the original abstract

Many reinforcement learning (RL) problems in the infinite-horizon average-reward setting require optimizing multiple conflicting objectives while satisfying multiple safety constraints. A common approach is concave scalarization, where the agent maximizes a utility $ f(J^\pi_{r_1}, \ldots, J^\pi_{r_M}) $ subject to a scalarized constraint $ g(J^\pi_{c_1}, \ldots, J^\pi_{c_N}) \ge 0 $, where $J^\pi_{r_m}$ and $J^\pi_{c_n}$ denote the average-reward and cost under policy $\pi$. However, the nonlinearity of $f$ and $g$ introduces bias in policy-gradient and actor-critic methods, since gradients must be evaluated using noisy estimates of $J^\pi,$ and $ \mathbb{E}[\partial f(J^\pi)] \neq \partial f(\mathbb{E}[J^\pi]),$ and this bias propagates through both primal and dual updates. We propose an MLMC-based primal-dual Natural Actor-Critic algorithm for average-reward MDPs that controls bias in scalarized objectives, constraint evaluation, and actor-critic estimation without requiring mixing-time knowledge. We show that the algorithm achieves optimal global convergence and constraint-violation rates of $ \tilde{O}(1/\sqrt{T}) $. To our knowledge, this is the first result establishing optimal convergence for concave scalarized multi-objective RL in the average-reward setting, both with and without constraints, and the first to do so without mixing-time information even in the absence of scalarization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims the first optimal ilde O(1/sqrt T) rates for constrained multi-objective average-reward RL by using MLMC to control nonlinearity bias in a primal-dual natural actor-critic without mixing-time knowledge.

read the letter

The main takeaway is that this paper gives an MLMC-based primal-dual natural actor-critic algorithm that claims optimal global convergence and constraint violation rates of ilde O(1/sqrt T) for concave scalarized multi-objective RL in the average-reward setting, both with and without constraints, and without any mixing-time information.

What is new is the extension of bias-controlled actor-critic methods to the multi-objective constrained case under average reward, plus the removal of the usual mixing-time dependence even in the non-scalarized setting. The work does a solid job identifying how nonlinearity in f and g creates bias in the policy-gradient and dual updates because E[grad f(J^pi)] differs from grad f(E[J^pi]), and it positions MLMC as the tool to handle bias in both the objective and the constraint estimates.

The soft spot is the MLMC construction itself. To reach the claimed rate, the multilevel estimator must keep bias from finite-length trajectories below 1/sqrt T while controlling variance, all without knowing the geometric decay rate that depends on the unknown mixing time. Standard MLMC level selection needs that decay information or an upper bound; if the paper only has an adaptive rule whose extra samples do not spoil the overall rate, or a mixing-free bias bound, then the argument closes. Otherwise the central rate claim rests on a step that is not visible from the abstract and needs close checking in the proofs.

This paper is for researchers working on rate-optimal methods for safe and multi-objective RL in infinite-horizon settings. A reader who cares about average-reward formulations and explicit bias control would get concrete technical value from the development. The work shows clear engagement with the actor-critic and MLMC literature, so it deserves a serious referee even if the bias-control argument needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper proposes a Bias-Controlled Primal-Dual Natural Actor-Critic algorithm that employs Multilevel Monte Carlo (MLMC) to control bias arising from nonlinear scalarization functions f and g in policy-gradient and actor-critic updates for constrained multi-objective average-reward RL. It claims to achieve optimal global convergence and constraint-violation rates of ilde{O}(1/\sqrt{T}) without requiring mixing-time knowledge of the MDP, both with and without constraints, and presents this as the first such result for concave scalarized multi-objective RL in the average-reward setting.

Significance. If the central rates hold, the result would be significant as the first optimal-rate guarantee for this class of problems in average-reward MDPs. The MLMC-based bias control for nonlinear objectives and constraints, together with the mixing-time-free property, would constitute a technical advance over prior primal-dual actor-critic analyses that typically require mixing-time bounds.

major comments (1)

[MLMC bias-control argument (proof of main convergence theorem, likely §5)] The load-bearing step is the MLMC construction that simultaneously removes the nonlinearity bias E[∇f(J^π)] ≠ ∇f(E[J^π]) (and the analogous term for g), produces sufficiently unbiased estimates for the natural actor-critic gradient and dual constraint, and does so with total sample cost that still yields the claimed ilde{O}(1/\sqrt{T}) rates while remaining agnostic to the unknown mixing time. In average-reward MDPs the bias of a length-L trajectory estimator decays as O(ρ^L) where ρ depends on the mixing time; the proof must therefore contain either a mixing-time-free bias bound or an adaptive level-selection rule whose overhead does not spoil the rate. This argument is not visible from the abstract and is the single most critical point for the optimality claim.

minor comments (1)

[Abstract] The abstract uses the nonstandard token "\tilde{O}"; replace with the conventional \tilde{O} notation for consistency with the rest of the manuscript.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for recognizing the potential significance of the result. We address the major comment on the MLMC bias-control argument below.

read point-by-point responses

Referee: [MLMC bias-control argument (proof of main convergence theorem, likely §5)] The load-bearing step is the MLMC construction that simultaneously removes the nonlinearity bias E[∇f(J^π)] ≠ ∇f(E[J^π]) (and the analogous term for g), produces sufficiently unbiased estimates for the natural actor-critic gradient and dual constraint, and does so with total sample cost that still yields the claimed ãO(1/√T) rates while remaining agnostic to the unknown mixing time. In average-reward MDPs the bias of a length-L trajectory estimator decays as O(ρ^L) where ρ depends on the mixing time; the proof must therefore contain either a mixing-time-free bias bound or an adaptive level-selection rule whose overhead does not spoil the rate. This argument is not visible from the abstract and is the single most critical point for the optimality claim.

Authors: Section 5 contains the full MLMC construction. The estimator uses an adaptive level-selection rule driven by empirical variance and bias estimates (no explicit ρ is required). The telescoping MLMC sum removes the nonlinearity bias for both f and g while the per-iteration sample overhead remains logarithmic; the overall complexity stays ãO(T) and yields the stated ãO(1/√T) rates. We have added a concise overview of this adaptive rule in Section 3.2 and a remark after Theorem 1 to make the argument visible without reference to the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from proposed MLMC algorithm

full rationale

The paper proposes an MLMC-based primal-dual Natural Actor-Critic algorithm that controls bias from nonlinearity of f and g without requiring mixing-time knowledge, then derives the ilde{O}(1/\sqrt{T}) rates from this construction. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the abstract or description. The central claim is presented as following from the algorithm's bias-control mechanism rather than reducing to its inputs by definition. This is the most common honest outcome for a methods paper whose result is externally falsifiable via the stated rates.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, new entities, or non-standard axioms are described. Standard RL assumptions (e.g., bounded rewards, existence of stationary policies) are implicitly used but not detailed.

pith-pipeline@v0.9.1-grok · 5834 in / 1195 out tokens · 29095 ms · 2026-06-25T23:38:34.681097+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Reinforcement learning for joint optimization of multiple rewards.Journal of Machine Learning Research, 24(49):1–41, 2023

Mridul Agarwal and Vaneet Aggarwal. Reinforcement learning for joint optimization of multiple rewards.Journal of Machine Learning Research, 24(49):1–41, 2023

2023
[2]

Multi-objective reinforcement learning with non-linear scalarization

Mridul Agarwal, Vaneet Aggarwal, and Tian Lan. Multi-objective reinforcement learning with non-linear scalarization. InProceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 9–17, 2022

2022
[3]

Concave utility reinforcement learning with zero-constraint violations.Transactions on Machine Learning Research, 2022

Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Concave utility reinforcement learning with zero-constraint violations.Transactions on Machine Learning Research, 2022

2022
[4]

Regret guarantees for model-based rein- forcement learning with long-term average constraints

Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Regret guarantees for model-based rein- forcement learning with long-term average constraints. InUncertainty in Artificial Intelligence, pages 22–31. PMLR, 2022

2022
[5]

Joint optimization of concave scalarized multi-objective reinforcement learning with policy gradient based algorithm.Journal of Artificial Intelligence Research, 74, 2022

Qinbo Bai, Mridul Agarwal, and Vaneet Aggarwal. Joint optimization of concave scalarized multi-objective reinforcement learning with policy gradient based algorithm.Journal of Artificial Intelligence Research, 74, 2022

2022
[6]

Learning general parameterized policies for infinite horizon average reward constrained mdps via primal-dual policy gradient algorithm

Qinbo Bai, Washim U Mondal, and Vaneet Aggarwal. Learning general parameterized policies for infinite horizon average reward constrained mdps via primal-dual policy gradient algorithm. Advances in Neural Information Processing Systems, 37:108566–108599, 2024

2024
[7]

Regret analysis of policy gradient algorithm for infinite horizon average reward markov decision processes

Qinbo Bai, Washim Uddin Mondal, and Vaneet Aggarwal. Regret analysis of policy gradient algorithm for infinite horizon average reward markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10980–10988, 2024

2024
[8]

Learning infinite-horizon average-reward markov decision process with constraints

Liyu Chen, Rahul Jain, and Haipeng Luo. Learning infinite-horizon average-reward markov decision process with constraints. InInternational Conference on Machine Learning, pages 3246–3270. PMLR, 2022

2022
[9]

Finite-time analysis of single-timescale actor-critic.Advances in Neural Information Processing Systems, 36:7017–7049, 2023

Xuyang Chen and Lin Zhao. Finite-time analysis of single-timescale actor-critic.Advances in Neural Information Processing Systems, 36:7017–7049, 2023

2023
[10]

Conver- gence and sample complexity of natural policy gradient primal-dual methods for constrained mdps.Journal of Machine Learning Research, 26(256):1–76, 2025

Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Basar, and Mihailo R Jovanovic. Conver- gence and sample complexity of natural policy gradient primal-dual methods for constrained mdps.Journal of Machine Learning Research, 26(256):1–76, 2025

2025
[11]

Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester

Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, bench- marks and analysis.Mach. Learn., 110(9):2419–2468, September 2021

2021
[12]

Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate policies

Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, and Niao He. Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate policies. InInternational Conference on Machine Learning, pages 9827–9869. PMLR, 2023

2023
[13]

Breaking the bias barrier in concave multi-objective reinforcement learning.arXiv preprint arXiv:2603.08518, 2026

Swetha Ganesh and Vaneet Aggarwal. Breaking the bias barrier in concave multi-objective reinforcement learning.arXiv preprint arXiv:2603.08518, 2026

work page arXiv 2026
[14]

A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach

Swetha Ganesh, Washim Uddin Mondal, and Vaneet Aggarwal. A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach. InProceedings of the 42nd International Conference on Machine Learning, pages 18206–18227, 2025

2025
[15]

Constraints driven safe reinforcement learning for autonomous driving decision-making.IEEE Access, 12:128007– 128023, 2024

Fei Gao, Xiaodong Wang, Yuze Fan, Zhenhai Gao, and Rui Zhao. Constraints driven safe reinforcement learning for autonomous driving decision-making.IEEE Access, 12:128007– 128023, 2024

2024
[16]

Reinforcement learning for constrained markov decision processes

Ather Gattami, Qinbo Bai, and Vaneet Aggarwal. Reinforcement learning for constrained markov decision processes. InInternational Conference on Artificial Intelligence and Statistics, pages 2656–2664. PMLR, 2021

2021
[17]

A practical guide to multi-objective reinforcement learning and planning

Conor F Hayes et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022. 11

2022
[18]

Global convergence of policy gradient in average reward mdps

Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Yehuda Levy, R Srikant, and Shie Mannor. Global convergence of policy gradient in average reward mdps. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[19]

A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

Ankita Kushwaha, Kiran Ravish, Preeti Lamba, and Pawan Kumar. A survey of safe reinforce- ment learning and constrained mdps: A technical survey on single-agent and multi-agent safety. arXiv preprint arXiv:2505.17342, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Data center cooling using model-predictive control.Advances in Neural Information Processing Systems, 31, 2018

Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control.Advances in Neural Information Processing Systems, 31, 2018

2018
[21]

An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

2020
[22]

Modified policy iteration for exponen- tial cost risk sensitive mdps

Yashaswini Murthy, Mehrdad Moharrami, and R Srikant. Modified policy iteration for exponen- tial cost risk sensitive mdps. InLearning for Dynamics and Control Conference, pages 395–406. PMLR, 2023

2023
[23]

Two-timescale critic-actor for average reward mdps with function approximation

Prashansa Panda and Shalabh Bhatnagar. Two-timescale critic-actor for average reward mdps with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 19813–19820, 2025

2025
[24]

Stochastic variance-reduced policy gradient

Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. InInternational conference on machine learning, pages 4026–4035. PMLR, 2018

2018
[25]

Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level monte carlo actor-critic

Wesley A Suttle, Amrit Bedi, Bhrij Patel, Brian M Sadler, Alec Koppel, and Dinesh Manocha. Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level monte carlo actor-critic. InInternational Conference on Machine Learning, pages 33240–33267. PMLR, 2023

2023
[26]

Convergence of actor-critic with multi- layer neural networks.Advances in neural information processing systems, 36:9279–9321, 2023

Haoxing Tian, Alex Olshevsky, and Yannis Paschalidis. Convergence of actor-critic with multi- layer neural networks.Advances in neural information processing systems, 36:9279–9321, 2023

2023
[27]

Non-asymptotic analysis for single-loop (natural) actor-critic with compatible function approximation

Yudan Wang, Yue Wang, Yi Zhou, and Shaofeng Zou. Non-asymptotic analysis for single-loop (natural) actor-critic with compatible function approximation. InInternational Conference on Machine Learning, pages 51771–51824. PMLR, 2024

2024
[28]

Model- free reinforcement learning in infinite-horizon average-reward markov decision processes

Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model- free reinforcement learning in infinite-horizon average-reward markov decision processes. In International conference on machine learning, pages 10170–10180. PMLR, 2020

2020
[29]

A provably-efficient model-free algorithm for infinite- horizon average-reward constrained markov decision processes

Honghao Wei, Xin Liu, and Lei Ying. A provably-efficient model-free algorithm for infinite- horizon average-reward constrained markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3868–3876, 2022

2022
[30]

A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617– 17628, 2020

Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617– 17628, 2020

2020
[31]

Sample efficient policy gradient methods with recursive variance reduction

Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. InInternational Conference on Learning Representations, 2020

2020
[32]

Improving sample complexity bounds for (natural) actor-critic algorithms.Advances in Neural Information Processing Systems, 33:4358–4369, 2020

Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for (natural) actor-critic algorithms.Advances in Neural Information Processing Systems, 33:4358–4369, 2020

2020
[33]

Global convergence for average reward constrained MDPs with primal-dual actor critic algorithm

Yang Xu, Swetha Ganesh, Washim Uddin Mondal, Qinbo Bai, and Vaneet Aggarwal. Global convergence for average reward constrained MDPs with primal-dual actor critic algorithm. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 12

2025
[34]

Reward is enough for convex mdps.Advances in Neural Information Processing Systems, 34:25746–25759, 2021

Tom Zahavy, Brendan O’Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is enough for convex mdps.Advances in Neural Information Processing Systems, 34:25746–25759, 2021

2021
[35]

Finite sample analysis of average-reward td learning and q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021

Sheng Zhang, Zhe Zhang, and Siva Theja Maguluri. Finite sample analysis of average-reward td learning and q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021

2021
[36]

Network resource allocation algorithm using reinforcement learning policy-based network in a smart grid scenario.Electronics, 12(15):3330, 2023

Zhe Zheng, Yu Han, Yingying Chi, Fusheng Yuan, Wenpeng Cui, Hailong Zhu, Yi Zhang, and Peiying Zhang. Network resource allocation algorithm using reinforcement learning policy-based network in a smart grid scenario.Electronics, 12(15):3330, 2023

2023
[37]

Rl-mpca: A reinforcement learning based multi-phase computation allocation approach for recommender systems

Jiahong Zhou, Shunhui Mao, Guoliang Yang, Bo Tang, Qianlong Xie, Lebin Lin, Xingxing Wang, and Dong Wang. Rl-mpca: A reinforcement learning based multi-phase computation allocation approach for recommender systems. InProceedings of the ACM Web Conference 2023, pages 3214–3224, 2023

2023
[38]

Anchor-changing regularized natural policy gradient for multi-objective reinforcement learning

Ruida Zhou, Tao Liu, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Anchor-changing regularized natural policy gradient for multi-objective reinforcement learning. InAdvances in Neural Information Processing Systems, 2022. 13 A Pseudocode Algorithm 1:Multi-Objective Primal-Dual NAC (MO-PDNAC) Input :Parameter classΘ,feature matrixΦ,primal stepsizeα,du...

2022

[1] [1]

Reinforcement learning for joint optimization of multiple rewards.Journal of Machine Learning Research, 24(49):1–41, 2023

Mridul Agarwal and Vaneet Aggarwal. Reinforcement learning for joint optimization of multiple rewards.Journal of Machine Learning Research, 24(49):1–41, 2023

2023

[2] [2]

Multi-objective reinforcement learning with non-linear scalarization

Mridul Agarwal, Vaneet Aggarwal, and Tian Lan. Multi-objective reinforcement learning with non-linear scalarization. InProceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 9–17, 2022

2022

[3] [3]

Concave utility reinforcement learning with zero-constraint violations.Transactions on Machine Learning Research, 2022

Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Concave utility reinforcement learning with zero-constraint violations.Transactions on Machine Learning Research, 2022

2022

[4] [4]

Regret guarantees for model-based rein- forcement learning with long-term average constraints

Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Regret guarantees for model-based rein- forcement learning with long-term average constraints. InUncertainty in Artificial Intelligence, pages 22–31. PMLR, 2022

2022

[5] [5]

Joint optimization of concave scalarized multi-objective reinforcement learning with policy gradient based algorithm.Journal of Artificial Intelligence Research, 74, 2022

Qinbo Bai, Mridul Agarwal, and Vaneet Aggarwal. Joint optimization of concave scalarized multi-objective reinforcement learning with policy gradient based algorithm.Journal of Artificial Intelligence Research, 74, 2022

2022

[6] [6]

Learning general parameterized policies for infinite horizon average reward constrained mdps via primal-dual policy gradient algorithm

Qinbo Bai, Washim U Mondal, and Vaneet Aggarwal. Learning general parameterized policies for infinite horizon average reward constrained mdps via primal-dual policy gradient algorithm. Advances in Neural Information Processing Systems, 37:108566–108599, 2024

2024

[7] [7]

Regret analysis of policy gradient algorithm for infinite horizon average reward markov decision processes

Qinbo Bai, Washim Uddin Mondal, and Vaneet Aggarwal. Regret analysis of policy gradient algorithm for infinite horizon average reward markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10980–10988, 2024

2024

[8] [8]

Learning infinite-horizon average-reward markov decision process with constraints

Liyu Chen, Rahul Jain, and Haipeng Luo. Learning infinite-horizon average-reward markov decision process with constraints. InInternational Conference on Machine Learning, pages 3246–3270. PMLR, 2022

2022

[9] [9]

Finite-time analysis of single-timescale actor-critic.Advances in Neural Information Processing Systems, 36:7017–7049, 2023

Xuyang Chen and Lin Zhao. Finite-time analysis of single-timescale actor-critic.Advances in Neural Information Processing Systems, 36:7017–7049, 2023

2023

[10] [10]

Conver- gence and sample complexity of natural policy gradient primal-dual methods for constrained mdps.Journal of Machine Learning Research, 26(256):1–76, 2025

Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Basar, and Mihailo R Jovanovic. Conver- gence and sample complexity of natural policy gradient primal-dual methods for constrained mdps.Journal of Machine Learning Research, 26(256):1–76, 2025

2025

[11] [11]

Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester

Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, bench- marks and analysis.Mach. Learn., 110(9):2419–2468, September 2021

2021

[12] [12]

Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate policies

Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, and Niao He. Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate policies. InInternational Conference on Machine Learning, pages 9827–9869. PMLR, 2023

2023

[13] [13]

Breaking the bias barrier in concave multi-objective reinforcement learning.arXiv preprint arXiv:2603.08518, 2026

Swetha Ganesh and Vaneet Aggarwal. Breaking the bias barrier in concave multi-objective reinforcement learning.arXiv preprint arXiv:2603.08518, 2026

work page arXiv 2026

[14] [14]

A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach

Swetha Ganesh, Washim Uddin Mondal, and Vaneet Aggarwal. A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach. InProceedings of the 42nd International Conference on Machine Learning, pages 18206–18227, 2025

2025

[15] [15]

Constraints driven safe reinforcement learning for autonomous driving decision-making.IEEE Access, 12:128007– 128023, 2024

Fei Gao, Xiaodong Wang, Yuze Fan, Zhenhai Gao, and Rui Zhao. Constraints driven safe reinforcement learning for autonomous driving decision-making.IEEE Access, 12:128007– 128023, 2024

2024

[16] [16]

Reinforcement learning for constrained markov decision processes

Ather Gattami, Qinbo Bai, and Vaneet Aggarwal. Reinforcement learning for constrained markov decision processes. InInternational Conference on Artificial Intelligence and Statistics, pages 2656–2664. PMLR, 2021

2021

[17] [17]

A practical guide to multi-objective reinforcement learning and planning

Conor F Hayes et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022. 11

2022

[18] [18]

Global convergence of policy gradient in average reward mdps

Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Yehuda Levy, R Srikant, and Shie Mannor. Global convergence of policy gradient in average reward mdps. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[19] [19]

A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

Ankita Kushwaha, Kiran Ravish, Preeti Lamba, and Pawan Kumar. A survey of safe reinforce- ment learning and constrained mdps: A technical survey on single-agent and multi-agent safety. arXiv preprint arXiv:2505.17342, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Data center cooling using model-predictive control.Advances in Neural Information Processing Systems, 31, 2018

Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control.Advances in Neural Information Processing Systems, 31, 2018

2018

[21] [21]

An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance- reduced) policy gradient and natural policy gradient methods.Advances in Neural Information Processing Systems, 33:7624–7636, 2020

2020

[22] [22]

Modified policy iteration for exponen- tial cost risk sensitive mdps

Yashaswini Murthy, Mehrdad Moharrami, and R Srikant. Modified policy iteration for exponen- tial cost risk sensitive mdps. InLearning for Dynamics and Control Conference, pages 395–406. PMLR, 2023

2023

[23] [23]

Two-timescale critic-actor for average reward mdps with function approximation

Prashansa Panda and Shalabh Bhatnagar. Two-timescale critic-actor for average reward mdps with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 19813–19820, 2025

2025

[24] [24]

Stochastic variance-reduced policy gradient

Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. InInternational conference on machine learning, pages 4026–4035. PMLR, 2018

2018

[25] [25]

Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level monte carlo actor-critic

Wesley A Suttle, Amrit Bedi, Bhrij Patel, Brian M Sadler, Alec Koppel, and Dinesh Manocha. Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level monte carlo actor-critic. InInternational Conference on Machine Learning, pages 33240–33267. PMLR, 2023

2023

[26] [26]

Convergence of actor-critic with multi- layer neural networks.Advances in neural information processing systems, 36:9279–9321, 2023

Haoxing Tian, Alex Olshevsky, and Yannis Paschalidis. Convergence of actor-critic with multi- layer neural networks.Advances in neural information processing systems, 36:9279–9321, 2023

2023

[27] [27]

Non-asymptotic analysis for single-loop (natural) actor-critic with compatible function approximation

Yudan Wang, Yue Wang, Yi Zhou, and Shaofeng Zou. Non-asymptotic analysis for single-loop (natural) actor-critic with compatible function approximation. InInternational Conference on Machine Learning, pages 51771–51824. PMLR, 2024

2024

[28] [28]

Model- free reinforcement learning in infinite-horizon average-reward markov decision processes

Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model- free reinforcement learning in infinite-horizon average-reward markov decision processes. In International conference on machine learning, pages 10170–10180. PMLR, 2020

2020

[29] [29]

A provably-efficient model-free algorithm for infinite- horizon average-reward constrained markov decision processes

Honghao Wei, Xin Liu, and Lei Ying. A provably-efficient model-free algorithm for infinite- horizon average-reward constrained markov decision processes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3868–3876, 2022

2022

[30] [30]

A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617– 17628, 2020

Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617– 17628, 2020

2020

[31] [31]

Sample efficient policy gradient methods with recursive variance reduction

Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. InInternational Conference on Learning Representations, 2020

2020

[32] [32]

Improving sample complexity bounds for (natural) actor-critic algorithms.Advances in Neural Information Processing Systems, 33:4358–4369, 2020

Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for (natural) actor-critic algorithms.Advances in Neural Information Processing Systems, 33:4358–4369, 2020

2020

[33] [33]

Global convergence for average reward constrained MDPs with primal-dual actor critic algorithm

Yang Xu, Swetha Ganesh, Washim Uddin Mondal, Qinbo Bai, and Vaneet Aggarwal. Global convergence for average reward constrained MDPs with primal-dual actor critic algorithm. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 12

2025

[34] [34]

Reward is enough for convex mdps.Advances in Neural Information Processing Systems, 34:25746–25759, 2021

Tom Zahavy, Brendan O’Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is enough for convex mdps.Advances in Neural Information Processing Systems, 34:25746–25759, 2021

2021

[35] [35]

Finite sample analysis of average-reward td learning and q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021

Sheng Zhang, Zhe Zhang, and Siva Theja Maguluri. Finite sample analysis of average-reward td learning and q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021

2021

[36] [36]

Network resource allocation algorithm using reinforcement learning policy-based network in a smart grid scenario.Electronics, 12(15):3330, 2023

Zhe Zheng, Yu Han, Yingying Chi, Fusheng Yuan, Wenpeng Cui, Hailong Zhu, Yi Zhang, and Peiying Zhang. Network resource allocation algorithm using reinforcement learning policy-based network in a smart grid scenario.Electronics, 12(15):3330, 2023

2023

[37] [37]

Rl-mpca: A reinforcement learning based multi-phase computation allocation approach for recommender systems

Jiahong Zhou, Shunhui Mao, Guoliang Yang, Bo Tang, Qianlong Xie, Lebin Lin, Xingxing Wang, and Dong Wang. Rl-mpca: A reinforcement learning based multi-phase computation allocation approach for recommender systems. InProceedings of the ACM Web Conference 2023, pages 3214–3224, 2023

2023

[38] [38]

Anchor-changing regularized natural policy gradient for multi-objective reinforcement learning

Ruida Zhou, Tao Liu, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Anchor-changing regularized natural policy gradient for multi-objective reinforcement learning. InAdvances in Neural Information Processing Systems, 2022. 13 A Pseudocode Algorithm 1:Multi-Objective Primal-Dual NAC (MO-PDNAC) Input :Parameter classΘ,feature matrixΦ,primal stepsizeα,du...

2022