QPLEX Decision Processes: Formulation via Nonlinear Markov Chains and Optimization via Policy Gradients

Antonius B. Dieker; Steven T. Hackman; Yunhao Yan; Zitong Wang

arxiv: 2605.17149 · v1 · pith:HHXHJWETnew · submitted 2026-05-16 · 🧮 math.OC · math.PR

QPLEX Decision Processes: Formulation via Nonlinear Markov Chains and Optimization via Policy Gradients

Antonius B. Dieker , Steven T. Hackman , Zitong Wang , Yunhao Yan This is my paper

Pith reviewed 2026-05-20 14:10 UTC · model grok-4.3

classification 🧮 math.OC math.PR

keywords QPLEX Decision Processesnonlinear Markov chainspolicy gradientsdynamic pricingqueueing systemsservice-level chance constraintstransient analysis

0 comments

The pith

QPLEX Decision Processes embed nonlinear approximations into Markov decisions to solve queueing control on vastly smaller state spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces QPLEX Decision Processes as a model for dynamic control of queueing systems that feature non-stationary arrivals, general service distributions, and service-level chance constraints. It integrates QPLEX approximations, which use nonlinear transition probabilities, directly into a Markov decision framework so that the state space shrinks by orders of magnitude. Forward and backward iterative schemes then compute policy gradients deterministically on this reduced space, removing sampling variance. The method is illustrated on a single-station dynamic pricing problem, where it locates near-optimal policies quickly for both cost-based rewards and those that penalize violations of service-level constraints.

Core claim

QDPs integrate QPLEX approximations into a nonlinear Markov decision framework so that gradients can be computed deterministically on orders-of-magnitude smaller state spaces, enabling near-optimal policies for dynamic pricing in seconds to a minute on a single CPU even with service-level chance constraints.

What carries the argument

QPLEX nonlinear transition probabilities inside a nonlinear Markov decision process, which enable deterministic forward-backward gradient computation on the reduced state space.

If this is right

Near-optimal policies for dynamic pricing using waiting and terminal costs are obtained in seconds on a single CPU.
High-quality practical policies for pricing under service-level chance constraint penalties are obtained in roughly one minute on a single CPU.
The curse of dimensionality from general service times is circumvented because the state space is reduced by orders of magnitude.
Natural-gradient-inspired optimization with block-diagonal Fisher approximations handles the resulting optimization landscapes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The elimination of sampling variance in gradient estimates may prove especially useful for problems with rare events or long planning horizons.
The single-station pricing results suggest the same reduction could be tested on multi-station networks that share similar non-stationary arrival patterns.
Varying the penalty weights on constraint violations provides a direct way to explore the trade-off between revenue and service-level adherence.

Load-bearing premise

The QPLEX nonlinear transition probabilities remain sufficiently accurate for the transient dynamics of the queueing system under the chosen actions and non-stationary arrivals.

What would settle it

Apply the policy obtained from the QDP to a full simulation of the original queueing system and measure whether the realized performance and constraint satisfaction match the values predicted by the reduced-state model.

read the original abstract

We introduce a QPLEX Decision Process (QDP) as a model for dynamic control of queueing systems with non-stationary arrivals, general service distributions, and service-level chance constraints. QDPs integrate QPLEX, a computational modeling methodology for transient analysis of stochastic systems, into a nonlinear Markov decision framework. Since QPLEX approximations use nonlinear transition probabilities with orders-of-magnitude smaller state spaces, QDPs circumvent the curse of dimensionality associated with general service times. Via forward and backward iterative schemes, we can rapidly compute gradients deterministically on the much smaller state space, eliminating sampling variance. We further address optimization through natural-gradient-inspired methods with block-diagonal Fisher approximations. To illustrate the QDP methodology, we formulate a single-station dynamic pricing problem with non-stationary demand as a QDP. When the reward structure uses waiting and terminal costs, our approach can find near-optimal policies in seconds on a single CPU; when the reward structure uses penalties for deviating from service-level chance constraints, the optimization landscape is substantially more challenging yet our approach can find a high-quality, practical policy in approximately a minute on a single CPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This folds QPLEX transient models into a nonlinear MDP so deterministic gradients run on a much smaller state space for queueing control with general services and chance constraints.

read the letter

The main thing here is that the authors embed QPLEX nonlinear transition probabilities into a Markov decision process. That step shrinks the state space enough to let them compute policy gradients forward and backward without sampling, then apply block-diagonal Fisher approximations for the updates. The single-station dynamic pricing example with non-stationary arrivals shows how the framework incorporates waiting costs or service-level penalties and reports run times of seconds to a minute on one CPU.

Referee Report

1 major / 1 minor

Summary. The paper introduces QPLEX Decision Processes (QDPs) that embed QPLEX nonlinear approximations into a Markov decision process framework for dynamic control of queueing systems featuring non-stationary arrivals, general service distributions, and service-level chance constraints. The approach yields a nonlinear Markov chain on a drastically reduced state space, permitting deterministic forward-backward computation of policy gradients and natural-gradient-style optimization with block-diagonal Fisher approximations. The methodology is illustrated on a single-station dynamic pricing problem, where the authors report that near-optimal policies can be obtained in seconds to roughly one minute on a single CPU depending on the reward structure.

Significance. If the QPLEX nonlinear transitions remain sufficiently accurate under the optimized policies, the framework offers a promising route to scalable, sampling-free optimization of transient queueing control problems that would otherwise suffer from the curse of dimensionality. The deterministic gradient computation and reduced-state formulation constitute a clear technical contribution over standard MDP or simulation-based approaches for this class of problems.

major comments (1)

The central performance claims rest on the transfer of policies optimized in the QDP to the original queueing process. The manuscript does not appear to report direct simulation-based validation (e.g., out-of-sample performance on the true system with general service times and non-stationary arrivals) that would confirm the nonlinear transition probabilities remain accurate in the state-action regimes visited by the learned policy, particularly under chance constraints. This validation is load-bearing for the claim that the reduced-state policies are near-optimal in the original system.

minor comments (1)

Notation for the nonlinear transition kernel and the block-diagonal Fisher approximation could be introduced with greater precision in the formulation section to aid readers unfamiliar with QPLEX.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: The central performance claims rest on the transfer of policies optimized in the QDP to the original queueing process. The manuscript does not appear to report direct simulation-based validation (e.g., out-of-sample performance on the true system with general service times and non-stationary arrivals) that would confirm the nonlinear transition probabilities remain accurate in the state-action regimes visited by the learned policy, particularly under chance constraints. This validation is load-bearing for the claim that the reduced-state policies are near-optimal in the original system.

Authors: We agree that direct simulation-based validation on the original system is important for confirming that the QPLEX nonlinear transitions remain accurate under the optimized policies, especially when chance constraints are active. The current manuscript focuses on the QDP formulation, the deterministic gradient computation, and the optimization procedure within the reduced-state nonlinear Markov chain, with performance claims tied to that model. To address the concern, the revised version will include out-of-sample Monte Carlo simulations of the true queueing process (with general service distributions and non-stationary arrivals) under the policies obtained from the QDP. These experiments will report realized costs and empirical constraint violation rates, thereby providing evidence on the transferability of the policies to the original system. revision: yes

Circularity Check

0 steps flagged

No circularity: QDP formulation derives from QPLEX integration and deterministic gradients without reducing to fitted inputs or self-referential definitions

full rationale

The paper's derivation chain formulates QDPs by embedding QPLEX nonlinear transition probabilities into a reduced-state nonlinear MDP, then computes policy gradients via forward/backward iterations on that smaller space. This is presented as a modeling and algorithmic step that circumvents the curse of dimensionality for general service times and non-stationary arrivals. No equation or claim equates a 'prediction' or optimized policy back to the QPLEX parameters by construction; the accuracy of the nonlinear transitions under optimized policies is explicitly an assumption (the weakest one identified), not a derived identity. No load-bearing self-citation chains appear in the provided text for uniqueness theorems or ansatzes; the contribution is the integration and natural-gradient optimization method itself. The framework remains externally falsifiable against the original queueing system, consistent with a self-contained modeling paper rather than a tautological fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view; the central claim rests on the accuracy of QPLEX approximations for transient queueing behavior and on the validity of the nonlinear transition probabilities as a faithful reduced model. No explicit free parameters, axioms, or invented entities are named in the provided text.

pith-pipeline@v0.9.0 · 5743 in / 1157 out tokens · 43086 ms · 2026-05-20T14:10:22.776810+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A QPLEX Decision Process (QDP) is a nonlinear MDP that uses QPLEX-based transition probabilities... nonlinear transition probabilities p(t)_μ(s′|s,a) parameterized by a pmf μ... forward iterative scheme... backward iterative scheme for approximate gradient calculations
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We develop an MDP-style policy gradient framework for nonlinear Markov chains... exponentiated Q-ascent algorithm... block-diagonal Fisher approximations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

[1]

Natural Gradient Works Eﬀiciently in Learning

Shun-ichi Amari. “Natural Gradient Works Eﬀiciently in Learning” . In: Neural Computation 10.2 (1998), pp. 251–276

work page 1998
[2]

Dynamic Pricing for Reusable Resources: The Power of Two Prices

Santiago R. Balseiro, Will Ma, and Wenxin Zhang. “Dynamic Pricing for Reusable Resources: The Power of Two Prices” . In: arXiv preprint arXiv:2308.13822 (2025)

work page arXiv 2025
[3]

Mean field Markov decision processes

Nicole Bäuerle. “Mean field Markov decision processes” . In: Applied Mathematics & Optimization 87.2 (2023), p. 22

work page 2023
[4]

Control of McKean–Vlasov dynamics versus mean field games

René Carmona, François Delarue, and Aimé Lachapelle. “Control of McKean–Vlasov dynamics versus mean field games” . In: Mathematics and Financial Economics 7.2 (2013), pp. 131–166

work page 2013
[5]

Cassandras and Stephane Lafortune

C.G. Cassandras and Stephane Lafortune. Introduction to Discrete Event Systems . 2nd. 2008

work page 2008
[6]

Differentiable Discrete Event Simulation for Queuing Network Control

Ethan Che, Jing Dong, and Hongseok Namkoong. “Differentiable Discrete Event Simulation for Queuing Network Control” . In: arXiv preprint arXiv:2409.03740 (2024)

work page arXiv 2024
[7]

Elements of information theory

Thomas M Cover. Elements of information theory . John Wiley & Sons, 1999

work page 1999
[8]

Dieker and Steven T

Antonius B. Dieker and Steven T. Hackman. QPLEX: A computational modeling and analysis methodology for stochastic systems . Springer Nature, 2025

work page 2025
[9]

The Power of Static Pricing for Reusable Resources

Adam N. Elmachtoub and Jiaqi Shi. “The Power of Static Pricing for Reusable Resources” . In: arXiv preprint arXiv:2302.11723 (2025)

work page arXiv 2025
[10]

Variance reduction techniques for gradient estimates in reinforcement learning

Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. “Variance reduction techniques for gradient estimates in reinforcement learning” . In: Journal of Machine Learning Research 5 (2004), pp. 1471–1530

work page 2004
[11]

Convergence for natural policy gradient on infinite-state queueing MDPs

Isaac Grosof, Siva Theja Maguluri, and R Srikant. “Convergence for natural policy gradient on infinite-state queueing MDPs” . In: arXiv preprint arXiv:2402.05274 (2024)

work page arXiv 2024
[12]

Variance Reduction Methods in Simulation

Shane G. Henderson and Peter W. Glynn. “Variance Reduction Methods in Simulation” . In: Pro- ceedings of the Winter Simulation Conference . 2002, pp. 135–141

work page 2002
[13]

A Natural Policy Gradient

Sham M. Kakade. “A Natural Policy Gradient” . PhD thesis. University College London, 2002. REFERENCES 42

work page 2002
[14]

Kolokoltsov

Vassili N. Kolokoltsov. Nonlinear Markov Processes and Kinetic Equations . Cambridge University Press, 2010

work page 2010
[15]

Actor-Critic Algorithms

Vijay R. Konda and John N. Tsitsiklis. “Actor-Critic Algorithms” . In: NIPS. 2000, pp. 1008–1014

work page 2000
[16]

The empirical likelihood approach to simulation input uncertainty

Henry Lam and Enlu Zhou. “The empirical likelihood approach to simulation input uncertainty” . In: Operations Research 65.2 (2017), pp. 458–475

work page 2017
[17]

Optimizing Neural Networks with Kronecker-factored Ap- proximate Curvature

James Martens and Roger Grosse. “Optimizing Neural Networks with Kronecker-factored Ap- proximate Curvature” . In: ICML. 2015, pp. 2408–2417

work page 2015
[18]

A class of Markov processes associated with nonlinear parabolic equations

H. P. McKean Jr. “A class of Markov processes associated with nonlinear parabolic equations” . In: Proceedings of the National Academy of Sciences of the United States of America 56 (6 1966), pp. 1907–1911

work page 1966
[19]

Proximal Policy Optimization Algorithms

John Schulman et al. “Proximal Policy Optimization Algorithms” . In:arXiv preprint arXiv:1707.06347. 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

Trust Region Policy Optimization

John Schulman et al. “Trust Region Policy Optimization” . In:Proceedings of the 32nd International Conference on Machine Learning . 2015, pp. 1889–1897

work page 2015
[21]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction . 2nd. MIT Press, 2018

work page 2018
[22]

Policy gradient meth- ods for reinforcement learning with function approximation

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. “Policy gradient meth- ods for reinforcement learning with function approximation” . In: Advances in Neural Information Processing Systems. Vol. 12. 2000, pp. 1057–1063

work page 2000
[23]

Topics in propagation of chaos

Alain-Sol Sznitman. “Topics in propagation of chaos” . In: Ecole d’Eté de Probabilités de Saint- Flour XIX—1989 . Springer, 1991, pp. 165–251

work page 1989
[24]

Simple Statistical Gradient-Following Algorithms for Connectionist Rein- forcement Learning

Ronald J. Williams. “Simple Statistical Gradient-Following Algorithms for Connectionist Rein- forcement Learning” . In: Machine Learning 8.3-4 (1992), pp. 229–256

work page 1992
[25]

Optimal prices for finite capacity queueing systems

Serhan Ziya, Hayriye Ayhan, and Robert D. Foley. “Optimal prices for finite capacity queueing systems” . In: Operations Research Letters 34 (2006), pp. 214–218

work page 2006

[1] [1]

Natural Gradient Works Eﬀiciently in Learning

Shun-ichi Amari. “Natural Gradient Works Eﬀiciently in Learning” . In: Neural Computation 10.2 (1998), pp. 251–276

work page 1998

[2] [2]

Dynamic Pricing for Reusable Resources: The Power of Two Prices

Santiago R. Balseiro, Will Ma, and Wenxin Zhang. “Dynamic Pricing for Reusable Resources: The Power of Two Prices” . In: arXiv preprint arXiv:2308.13822 (2025)

work page arXiv 2025

[3] [3]

Mean field Markov decision processes

Nicole Bäuerle. “Mean field Markov decision processes” . In: Applied Mathematics & Optimization 87.2 (2023), p. 22

work page 2023

[4] [4]

Control of McKean–Vlasov dynamics versus mean field games

René Carmona, François Delarue, and Aimé Lachapelle. “Control of McKean–Vlasov dynamics versus mean field games” . In: Mathematics and Financial Economics 7.2 (2013), pp. 131–166

work page 2013

[5] [5]

Cassandras and Stephane Lafortune

C.G. Cassandras and Stephane Lafortune. Introduction to Discrete Event Systems . 2nd. 2008

work page 2008

[6] [6]

Differentiable Discrete Event Simulation for Queuing Network Control

Ethan Che, Jing Dong, and Hongseok Namkoong. “Differentiable Discrete Event Simulation for Queuing Network Control” . In: arXiv preprint arXiv:2409.03740 (2024)

work page arXiv 2024

[7] [7]

Elements of information theory

Thomas M Cover. Elements of information theory . John Wiley & Sons, 1999

work page 1999

[8] [8]

Dieker and Steven T

Antonius B. Dieker and Steven T. Hackman. QPLEX: A computational modeling and analysis methodology for stochastic systems . Springer Nature, 2025

work page 2025

[9] [9]

The Power of Static Pricing for Reusable Resources

Adam N. Elmachtoub and Jiaqi Shi. “The Power of Static Pricing for Reusable Resources” . In: arXiv preprint arXiv:2302.11723 (2025)

work page arXiv 2025

[10] [10]

Variance reduction techniques for gradient estimates in reinforcement learning

Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. “Variance reduction techniques for gradient estimates in reinforcement learning” . In: Journal of Machine Learning Research 5 (2004), pp. 1471–1530

work page 2004

[11] [11]

Convergence for natural policy gradient on infinite-state queueing MDPs

Isaac Grosof, Siva Theja Maguluri, and R Srikant. “Convergence for natural policy gradient on infinite-state queueing MDPs” . In: arXiv preprint arXiv:2402.05274 (2024)

work page arXiv 2024

[12] [12]

Variance Reduction Methods in Simulation

Shane G. Henderson and Peter W. Glynn. “Variance Reduction Methods in Simulation” . In: Pro- ceedings of the Winter Simulation Conference . 2002, pp. 135–141

work page 2002

[13] [13]

A Natural Policy Gradient

Sham M. Kakade. “A Natural Policy Gradient” . PhD thesis. University College London, 2002. REFERENCES 42

work page 2002

[14] [14]

Kolokoltsov

Vassili N. Kolokoltsov. Nonlinear Markov Processes and Kinetic Equations . Cambridge University Press, 2010

work page 2010

[15] [15]

Actor-Critic Algorithms

Vijay R. Konda and John N. Tsitsiklis. “Actor-Critic Algorithms” . In: NIPS. 2000, pp. 1008–1014

work page 2000

[16] [16]

The empirical likelihood approach to simulation input uncertainty

Henry Lam and Enlu Zhou. “The empirical likelihood approach to simulation input uncertainty” . In: Operations Research 65.2 (2017), pp. 458–475

work page 2017

[17] [17]

Optimizing Neural Networks with Kronecker-factored Ap- proximate Curvature

James Martens and Roger Grosse. “Optimizing Neural Networks with Kronecker-factored Ap- proximate Curvature” . In: ICML. 2015, pp. 2408–2417

work page 2015

[18] [18]

A class of Markov processes associated with nonlinear parabolic equations

H. P. McKean Jr. “A class of Markov processes associated with nonlinear parabolic equations” . In: Proceedings of the National Academy of Sciences of the United States of America 56 (6 1966), pp. 1907–1911

work page 1966

[19] [19]

Proximal Policy Optimization Algorithms

John Schulman et al. “Proximal Policy Optimization Algorithms” . In:arXiv preprint arXiv:1707.06347. 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

Trust Region Policy Optimization

John Schulman et al. “Trust Region Policy Optimization” . In:Proceedings of the 32nd International Conference on Machine Learning . 2015, pp. 1889–1897

work page 2015

[21] [21]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction . 2nd. MIT Press, 2018

work page 2018

[22] [22]

Policy gradient meth- ods for reinforcement learning with function approximation

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. “Policy gradient meth- ods for reinforcement learning with function approximation” . In: Advances in Neural Information Processing Systems. Vol. 12. 2000, pp. 1057–1063

work page 2000

[23] [23]

Topics in propagation of chaos

Alain-Sol Sznitman. “Topics in propagation of chaos” . In: Ecole d’Eté de Probabilités de Saint- Flour XIX—1989 . Springer, 1991, pp. 165–251

work page 1989

[24] [24]

Simple Statistical Gradient-Following Algorithms for Connectionist Rein- forcement Learning

Ronald J. Williams. “Simple Statistical Gradient-Following Algorithms for Connectionist Rein- forcement Learning” . In: Machine Learning 8.3-4 (1992), pp. 229–256

work page 1992

[25] [25]

Optimal prices for finite capacity queueing systems

Serhan Ziya, Hayriye Ayhan, and Robert D. Foley. “Optimal prices for finite capacity queueing systems” . In: Operations Research Letters 34 (2006), pp. 214–218

work page 2006