Randomized Optimal Switching Problem and Related Mirror Descent Flow

Yuchao Dong

arxiv: 2606.12875 · v1 · pith:JV6X35LJnew · submitted 2026-06-11 · 🧮 math.OC

Pith reviewed 2026-06-27 06:21 UTC · model grok-4.3

classification 🧮 math.OC

keywords optimal switching · continuous-time reinforcement learning · KL regularization · mirror descent flow · Hamilton-Jacobi-Bellman equation · Gibbs policy · temperature annealing

The pith

The KL-regularized value function solves an elliptic HJB system, yields an explicit Gibbs policy, and approximates the classical optimum with error O(λ log 1/λ).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper relaxes the deterministic optimal switching control of a diffusion process to a randomized Markov chain policy, adding a temperature-weighted KL-divergence term to the cost. The resulting regularized value function is proven to be the unique smooth solution of a coupled elliptic Hamilton-Jacobi-Bellman system, from which the optimal policy is recovered explicitly via an exponential (Gibbs) map on value differences. A mirror descent flow is then defined in the dual logarithmic policy variables; the flow is well-posed, strictly decreases the regularized value, and converges to the classical optimum at explicit rates that depend on whether the temperature λ is held constant or annealed.
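The exponential (Gibbs) map on value differences can be illustrated with a minimal sketch. The formula below is an assumption made for illustration, not the paper's policy: it turns the cost saved by switching from mode i to mode j, taken here as V_i - V_j - c_ij, into normalised weights at a single state, whereas the paper works with a state-dependent generator whose exact form is not reproduced on this page. The names `gibbs_switching_weights`, `values` and `switch_costs` are hypothetical.

```python
import numpy as np

def gibbs_switching_weights(values, switch_costs, i, lam):
    """Hypothetical Gibbs weights for leaving mode i at one fixed state.

    values[j]          : regularized value of mode j at that state (assumed given)
    switch_costs[i, j] : cost of switching from i to j (assumed form)
    lam                : temperature, must be > 0
    Requires at least two modes.
    """
    values = np.asarray(values, dtype=float)
    costs = np.asarray(switch_costs, dtype=float)
    # Cost saved by switching i -> j (costs are minimised, so larger is better).
    gain = values[i] - values - costs[i]
    gain[i] = -np.inf  # staying in mode i is not a switching target
    # Numerically stable exponential map of the value differences.
    w = np.exp((gain - np.max(gain)) / lam)
    return w / w.sum()

if __name__ == "__main__":
    V = [1.0, 0.0, 0.5]
    C = np.zeros((3, 3))
    print(gibbs_switching_weights(V, C, i=0, lam=0.1))   # concentrates on mode 1
    print(gibbs_switching_weights(V, C, i=0, lam=1e6))   # close to uniform over modes 1 and 2
```

A small temperature concentrates the weights on the cheapest target mode, while a large temperature spreads them almost evenly, which is the exploration effect the regularization is meant to provide.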

Core claim

Under mild assumptions on the coefficients, the regularized value function is the unique smooth solution of an elliptic Hamilton-Jacobi-Bellman system, and an explicit optimal Gibbs policy is derived by exponential transformation of value-function differences across modes. The regularized value approximates the classical optimal value with error of order O(λ log 1/λ). The mirror descent flow in the dual logarithmic policy space is well-posed, decreases the value function monotonically, and achieves quantitative error bounds of order O(1/(e^{λ s} - 1) + λ log 1/λ) for constant temperature and O(log s / sqrt(s)) for the annealing schedule λ_s = 1/sqrt(1+s).
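To get a feel for the two rates, the sketch below evaluates the bound expressions with every hidden constant set to 1. That choice is an assumption, since the page does not state the constants, so only the qualitative shape is meaningful: the constant-temperature bound settles at the floor λ log(1/λ), while the annealed bound keeps shrinking. The function names are hypothetical.

```python
import math

def regularization_floor(lam):
    """lam * log(1/lam): the part of the bound that does not depend on s."""
    return lam * math.log(1.0 / lam)

def constant_temperature_bound(s, lam):
    """1/(e^{lam*s} - 1) + lam*log(1/lam), constants assumed equal to 1.

    1/(e^x - 1) is rewritten as e^{-x} / (1 - e^{-x}) so that large
    values of x = lam*s do not overflow.
    """
    x = lam * s
    transient = math.exp(-x) / (-math.expm1(-x))
    return transient + regularization_floor(lam)

def annealed_bound(s):
    """log(s)/sqrt(s) for s > 1, constant assumed equal to 1."""
    return math.log(s) / math.sqrt(s)

if __name__ == "__main__":
    for s in (1e1, 1e2, 1e4, 1e6):
        print(s, constant_temperature_bound(s, lam=0.1), annealed_bound(s))
```

With λ = 0.1 the first column of values flattens out at about 0.23, the value of λ log(1/λ), which is why a fixed temperature cannot drive the error to zero by running the flow longer.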

What carries the argument

The mirror descent flow in the dual logarithmic policy space, a continuous-time dynamical system on randomized switching policies whose trajectories decrease the regularized cost.
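A finite-dimensional toy problem gives some intuition for a flow in logarithmic policy variables. This sketch is not the paper's flow, which acts on state-dependent switching policies of a diffusion. It assumes a single decision over n actions with cost vector `c`, objective ⟨c, π⟩ + λ KL(π ‖ uniform), and dual variable z with π = softmax(z). For this toy objective the entropic mirror descent flow reduces to dz/ds = -(c + λ z), up to a common shift of all components that the softmax ignores. All names and the explicit Euler discretisation are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def regularized_cost(pi, c, lam):
    """<c, pi> + lam * KL(pi || uniform) on a finite action set."""
    n = len(c)
    return float(pi @ c + lam * np.sum(pi * np.log(pi * n)))

def mirror_descent_flow(c, lam, step=0.01, n_steps=5000):
    """Explicit Euler steps for dz/ds = -(c + lam * z), with pi = softmax(z)."""
    c = np.asarray(c, dtype=float)
    z = np.zeros_like(c)  # z = 0 corresponds to the uniform policy
    costs = []
    for _ in range(n_steps):
        costs.append(regularized_cost(softmax(z), c, lam))
        z = z - step * (c + lam * z)
    return softmax(z), costs

if __name__ == "__main__":
    c, lam = np.array([1.0, 0.2, 0.5]), 0.5
    pi, costs = mirror_descent_flow(c, lam)
    print(pi, softmax(-c / lam))   # the flow approaches the Gibbs distribution
    print(costs[0], costs[-1])     # the regularized cost has decreased
```

In this toy setting the limit is the Gibbs distribution softmax(-c/λ) and the dual variable relaxes like e^{-λ s}, which echoes the explicit Gibbs policy and the exponential transient reported for constant temperature.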

If this is right

An explicit randomized optimal policy is obtained directly from the value-function differences without further optimization.
The approximation error vanishes as λ → 0 at the rate O(λ log 1/λ).
The mirror descent flow converges to the classical optimum at the stated rates for both constant and annealing temperature schedules.
The value function decreases monotonically along every trajectory of the flow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization-plus-flow construction may be applied to other continuous-time stochastic control problems that require exploration.
Annealing the temperature according to 1/sqrt(s) supplies a concrete schedule that trades off early exploration against late exploitation while still guaranteeing convergence.
Numerical discretizations of the mirror descent flow could be used to train policies for regime-switching models arising in finance or energy systems.
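As a consistency check on the annealing schedule, one can substitute λ_s = 1/sqrt(1+s) into the regularization error λ log(1/λ). This is only a heuristic reading of where the shape of the annealed rate could come from; it covers the regularization bias alone and is not the paper's proof, which must also control the optimization error along the flow.

```latex
\lambda_s \log\frac{1}{\lambda_s}
  = \frac{1}{\sqrt{1+s}} \,\log\sqrt{1+s}
  = \frac{\log(1+s)}{2\sqrt{1+s}}
  = O\!\left(\frac{\log s}{\sqrt{s}}\right)
  \qquad \text{as } s \to \infty .
```

The bias term therefore decays at the same order as the stated annealed rate O(log s / sqrt(s)).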

Load-bearing premise

The diffusion coefficients and running/transition costs satisfy conditions that guarantee unique smooth solutions to the coupled elliptic HJB system.

What would settle it

A concrete coefficient set for which either the regularized value function is not C^2 or the approximation gap to the classical value exceeds any multiple of λ log(1/λ) for sufficiently small λ.

Original abstract

We study continuous-time reinforcement learning for the optimal switching problem, in which a decision-maker controls a diffusion process by switching among finitely many regimes, incurring both running and transition costs. To enable exploration, we relax the classical deterministic switching control to a randomized framework, where the switching decisions are governed by a continuous-time Markov chain with state-dependent generator, and augment the cost functional with a KL-divergence regularization weighted by a temperature parameter $\lambda$. Under mild assumptions on the coefficients, we establish that the regularized value function is the unique smooth solution of an elliptic Hamilton--Jacobi--Bellman system, and derive an explicit optimal Gibbs policy given by an exponential transformation of the value function differences across modes. We further prove that the regularized value function approximates the classical optimal value function with error of order $O\left(\lambda \log \frac{1}{\lambda}\right)$, which is consistent with analogous bounds established in other entropy-regularized control problems and is believed to be sharp. To solve the regularized problem numerically, we introduce a mirror descent flow in the dual logarithmic policy space, prove its well-posedness and the monotonic decrease of the value function along the flow, and establish quantitative error bound to the classical optimal value function. For a constant temperature scheduler, the convergence rate is of order $O\left(\frac{1}{e^{\lambda s} - 1}+\lambda \log\frac1\lambda\right)$, while under the annealing scheduler $\lambda_s = \frac{1}{\sqrt{1+s}}$, we obtain the rate $O\left(\frac{\log s}{\sqrt{s}}\right)$, which decays to zero as the flow time $s \to \infty$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Mirror descent flow in log-policy space supplies explicit rates for entropy-regularized switching, but the whole chain rests on unstated mild assumptions for HJB uniqueness and smoothness.

The paper sets up entropy-regularized optimal switching for a diffusion controlled by a continuous-time Markov chain and then builds a mirror descent flow directly in the logarithmic policy variables. It supplies two concrete convergence rates to the classical value function: O(1/(e^{λs}-1) + λ log 1/λ) for constant temperature and O(log s / sqrt(s)) for the annealing schedule λ_s = 1/sqrt(1+s). It also gives an explicit Gibbs policy from the value-function differences and recovers the familiar O(λ log 1/λ) approximation error.

Those rates and the flow construction in dual space are the genuinely new pieces. The approximation bound lines up with results already known for other entropy-regularized control problems, which is reassuring. The monotonicity claim along the flow and the well-posedness statement are standard once the HJB is in hand.

The load-bearing step is the claim that, under mild assumptions on the coefficients, the regularized value function is the unique smooth solution of the coupled elliptic HJB system. The abstract never lists the precise hypotheses, so it is impossible to check whether uniform ellipticity holds across regimes or whether the transition costs satisfy the conditions needed for a comparison principle. If that regularity or uniqueness fails, the explicit policy formula and every subsequent error bound collapse. That is the only place where the argument could break.

The work is aimed at people already working on continuous-time switching control or entropy methods in stochastic control. A reader who needs a provably convergent numerical scheme for these problems could extract the flow and the rates. It is too specialized for a general audience.

The formal structure is clear and the citations stay within the relevant literature, so the paper deserves a serious referee once the exact assumptions are written down and the PDE existence proof is verified.

Referee Report

1 major / 0 minor

Summary. The paper considers continuous-time optimal switching control of a diffusion process among finitely many regimes, with running and switching costs. It introduces KL-divergence regularization with temperature λ to obtain a randomized policy, shows that the regularized value function is the unique smooth solution of a coupled elliptic HJB system under mild coefficient assumptions, derives an explicit optimal Gibbs policy via exponential transformation, proves an O(λ log 1/λ) approximation to the unregularized value function, and analyzes a mirror-descent flow in logarithmic policy space whose value decreases monotonically, yielding explicit convergence rates O(1/(e^{λ s}-1) + λ log 1/λ) for constant λ and O(log s / sqrt(s)) for the annealing schedule λ_s = 1/sqrt(1+s).

Significance. If the well-posedness and approximation results hold, the work supplies a rigorous entropy-regularized formulation for continuous-time switching problems together with explicit policies and quantitative convergence rates for the associated mirror-descent dynamics; the explicit rates under both constant and annealing temperature schedules constitute a concrete contribution that could inform numerical schemes in stochastic control.

major comments (1)

[Abstract / existence theorem] Abstract (first paragraph after abstract) and the existence/uniqueness theorem: the claim that the regularized value function is the unique smooth solution of the elliptic HJB system rests on “mild assumptions on the coefficients” whose precise statement (uniform ellipticity constants, growth and regularity conditions on the running and transition costs, and any regime-uniformity requirements) is not quoted. Because this theorem is load-bearing for the explicit Gibbs policy, the O(λ log 1/λ) approximation, and all subsequent convergence rates of the mirror-descent flow, the hypotheses must be stated explicitly in the theorem.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the constructive comment on the presentation of assumptions. We address the point below and will revise the manuscript accordingly.

Referee: [Abstract / existence theorem] Abstract (first paragraph after abstract) and the existence/uniqueness theorem: the claim that the regularized value function is the unique smooth solution of the elliptic HJB system rests on “mild assumptions on the coefficients” whose precise statement (uniform ellipticity constants, growth and regularity conditions on the running and transition costs, and any regime-uniformity requirements) is not quoted. Because this theorem is load-bearing for the explicit Gibbs policy, the O(λ log 1/λ) approximation, and all subsequent convergence rates of the mirror-descent flow, the hypotheses must be stated explicitly in the theorem.

Authors: We agree that the precise hypotheses must appear explicitly in the theorem statement. In the revised manuscript we will insert the full list of assumptions (uniform ellipticity constants, growth and regularity conditions on the running and switching costs, and regime-uniformity requirements) directly into the statement of the existence/uniqueness theorem. The abstract will be updated to refer to these explicitly stated assumptions rather than the generic phrase “mild assumptions.”

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from problem setup and standard HJB analysis

The paper establishes under mild assumptions that the regularized value function is the unique smooth solution to the coupled elliptic HJB system, derives the explicit Gibbs policy via exponential transformation, proves the O(λ log 1/λ) approximation error, and analyzes convergence rates of the introduced mirror descent flow. These steps follow from the KL-regularized formulation, standard stochastic control theory for existence/uniqueness, and direct analysis of the flow's well-posedness and monotonicity; no equations reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations. The chain is independent of the target results and relies on external mathematical facts about HJB equations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework adds the temperature parameter λ and the mirror descent dynamics; it rests on standard existence theory for elliptic systems under coefficient assumptions.

free parameters (1)

λ
Temperature weighting the KL regularization term; controls exploration versus optimality trade-off.

axioms (1)

Domain assumption: mild assumptions on the coefficients of the diffusion and running/transition costs guarantee unique smooth solutions to the coupled elliptic HJB system.
Invoked immediately to obtain existence, uniqueness, and the Gibbs policy representation.

pith-pipeline@v0.9.1-grok · 5829 in / 1468 out tokens · 28795 ms · 2026-06-27T06:21:57.242047+00:00 · methodology


