MedGym:A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning

Akifumi Wachi; Katsuki Fujisawa; Ken Kawano; Kyoung-Sook Kim; Mehrshad Sadria; Richard Weiss; Sebastien Gros; Xiao Hu; Xin Liu; Xun Shen

arxiv: 2606.01028 · v1 · pith:DPSUTSYCnew · submitted 2026-05-31 · 💻 cs.LG

MedGym:A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning

Yuepeng Wang , Ken Kawano , Yongqi Zhou , Yoshihiko Fujisawa , Richard Weiss , Akifumi Wachi , Katsuki Fujisawa , Ying Chen

show 6 more authors

Mehrshad Sadria Xin Liu Kyoung-Sook Kim Xiao Hu Sebastien Gros Xun Shen

This is my paper

Pith reviewed 2026-06-28 17:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords medical reinforcement learningcontinuous-time RLbenchmark environmentdynamic treatment recommendationPhysics-Informed Neural Networkspatient trajectory simulationoffline RLpersonalized treatment

0 comments

The pith

MedGym constructs a continuous-time benchmark for reinforcement learning in dynamic medical treatment from clinical data using Physics-Informed Neural Networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedGym to overcome the mismatch between real medical treatment, where physiology changes continuously and measurements occur at irregular intervals, and existing RL environments that rely on fixed discrete time steps. It builds the benchmark by training Physics-Informed Neural Networks on clinical data to simulate individualized patient trajectories and treatment responses. This setup permits direct testing of RL algorithms on problems such as time-dependent disease progression, safety between interventions, and the difference between offline learning and online deployment. A sympathetic reader would care because medical decisions must account for varying intervals and patient differences, and a configurable continuous-time environment could reveal where current methods fall short in realistic conditions. The benchmark supports both offline and online RL evaluation along clinical axes including personalization and trajectory safety.

Core claim

MedGym models longitudinal patient evolution in a continuous-time framework and constructs a configurable medical RL benchmark from clinical data by using Physics-Informed Neural Networks. The resulting benchmark supports both offline and online RL, and enables direct comparison between discrete-time and continuous-time methods under irregular treatment timing and patient-specific dynamics. Besides, MedGym supports evaluation from clinically important perspectives, including personalization, trajectory-level safety, and the performance gap between model-based offline learning and online deployment.

What carries the argument

MedGym benchmark environment, built by training Physics-Informed Neural Networks on clinical data to generate continuous-time patient-specific dynamics and treatment effects.

If this is right

RL methods can be tested for handling time-interval-dependent disease progression and personalized responses.
Direct comparisons become possible between discrete-time and continuous-time RL approaches on the same patient data.
Evaluation can include trajectory-level safety and the gap between model-based offline policies and online deployment.
The environment remains configurable for different clinical datasets while preserving continuous-time structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the simulated dynamics prove faithful, the benchmark could be used to pre-screen RL policies for safety before any real-patient testing.
The continuous-time formulation might expose failure modes in standard discrete-time RL algorithms when decision intervals vary widely.
Extending the same PINN construction process to new disease areas would require only additional clinical time-series data.
The benchmark could serve as a common testbed for developing new continuous-time RL algorithms tailored to irregular medical observations.

Load-bearing premise

Physics-Informed Neural Networks trained on clinical data can faithfully reproduce continuous-time patient-specific dynamics and treatment effects at irregular intervals.

What would settle it

Generated trajectories in MedGym diverge substantially from held-out real patient records in measured physiological variables or observed treatment responses over multiple irregular intervals.

Figures

Figures reproduced from arXiv: 2606.01028 by Akifumi Wachi, Katsuki Fujisawa, Ken Kawano, Kyoung-Sook Kim, Mehrshad Sadria, Richard Weiss, Sebastien Gros, Xiao Hu, Xin Liu, Xun Shen, Ying Chen, Yongqi Zhou, Yoshihiko Fujisawa, Yuepeng Wang.

**Figure 2.** Figure 2: A diagram of benchmarking pipeline of MedGym. Clinical data, such as the MIMIC-III dataset, are used to train the PINN modules for environment construction, yielding the MedGym environment. The resulting environment is then used to evaluate both offline RL and online RL methods: offline RL algorithms learn policies from training data generated by MedGym, whereas online RL algorithms learn through direct in… view at source ↗

**Figure 3.** Figure 3: Rollout evaluation on patient-specific PINN datasets. (a) Rollout trajectories against [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Evaluation of 3 RL algorithms with and without time adaptation on population-level PINN. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Distributional transfer evaluation of SAC and Lagrangian TRPO on 110 individual patient [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Evaluation of individual and population-level policies with adaptive time on a representative [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: PINN reconstruction: Population vs Individual on two representative ICU stays. Stars: [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Closed-loop trajectories under the four policies (Pop/Ind [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Distributional transfer evaluation under the same setting as Fig. 5, with additional policy [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: Supplementary patient-wise visualization of the distributional transfer evaluation in Fig. 9. [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

**Figure 11.** Figure 11: Distributional transfer evaluation of GCQL against the TRPOLag behavior policy under [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

**Figure 12.** Figure 12: Fixed-dt distributional transfer evaluation of TRPOLag, DQN, CQL, and GCQL on 110 [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗

**Figure 13.** Figure 13: OOD support-distance distribution for fixed-dt individual offline policies. For each [PITH_FULL_IMAGE:figures/full_fig_p027_13.png] view at source ↗

read the original abstract

Medical treatment recommendation poses several challenges to reinforcement learning (RL): patient physiology evolves in continuous time, measurements and interventions are performed at irregular intervals, and treatment effects vary substantially across individuals. Existing RL formulations and simulated environments, however, are based on discrete-time MDP or POMDP abstractions with fixed or pre-specified decision intervals. Thus, it remains difficult to evaluate whether RL methods can handle time-interval-dependent disease progression, personalized treatment response, and safety between consecutive measurement points. To address this gap, we introduce MedGym, a benchmark environment for dynamic treatment recommendation. MedGym models longitudinal patient evolution in a continuous-time framework and constructs a configurable medical RL benchmark from clinical data by using Physics-Informed Neural Networks. The resulting benchmark supports both offline and online RL, and enables direct comparison between discrete-time and continuous-time methods under irregular treatment timing and patient-specific dynamics. Besides, MedGym supports evaluation from clinically important perspectives, including personalization, trajectory-level safety, and the performance gap between model-based offline learning and online deployment. By providing a standardized and configurable benchmark for continuous-time dynamic treatment, MedGym aims to facilitate more realistic and informative evaluation of medical RL methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MedGym creates a continuous-time medical RL benchmark from clinical data via PINNs, but the abstract shows no validation of the learned dynamics.

read the letter

The paper introduces MedGym as a unified continuous-time benchmark for dynamic medical treatment reinforcement learning. It constructs the environment from clinical data using Physics-Informed Neural Networks to model patient evolution with irregular timing and individual differences.

This is new in the sense that prior work stayed with discrete MDP or POMDP setups. The benchmark allows testing RL methods under conditions closer to clinical practice, including comparisons between discrete and continuous approaches, and evaluations on personalization and safety.

The paper does well at identifying the limitations of existing simulated environments and outlining how a continuous-time framework could address them. The configurable nature and support for both offline and online RL are practical features.

However, the soundness is limited by the lack of any reported validation. There are no results on PINN training errors, reconstruction accuracy on held-out data, or confirmation that the dynamics support valid treatment effect estimates. The stress-test note correctly flags that without evidence of fidelity to irregular clinical trajectories, the benchmark's utility for comparisons remains unproven.

If the full manuscript includes those analyses, the contribution would be stronger. As presented, the central assumption about the PINNs faithfully reproducing the dynamics is not yet supported by evidence.

This paper is for researchers working on RL for healthcare applications who are looking for benchmarks that better reflect continuous-time patient dynamics. It could spark useful discussion in the subfield.

I would bring it to a reading group as a starting point for talking about continuous-time modeling in medical RL. It deserves peer review because the idea is relevant and the gap is real, even though more work is needed on the implementation validation.

Referee Report

2 major / 1 minor

Summary. The paper introduces MedGym, a benchmark environment for dynamic medical treatment recommendation in RL. It models longitudinal patient evolution in a continuous-time framework and constructs a configurable benchmark from clinical data using Physics-Informed Neural Networks (PINNs) to support offline and online RL, enable direct comparisons between discrete-time and continuous-time methods under irregular timing and patient-specific dynamics, and evaluate personalization, trajectory-level safety, and model-based offline vs. online performance gaps.

Significance. If the PINN models accurately capture patient-specific continuous-time dynamics and treatment effects from clinical data, MedGym would fill an important gap by providing a standardized, data-driven benchmark for evaluating RL methods on realistic medical challenges such as irregular measurement intervals and individual variability. The use of PINNs to derive the environments from real data is a promising direction for creating more faithful simulators than hand-crafted discrete MDPs.

major comments (2)

[Abstract] Abstract: The central claim that MedGym enables valid discrete-vs-continuous RL comparisons rests on the PINN-derived dynamics faithfully reproducing observed state evolution at irregular times and causal treatment effects, yet the manuscript supplies no quantitative validation metrics, held-out trajectory error analysis, residual norms, or stability checks between observation points.
[Abstract] Abstract: Without evidence that the learned vector fields match external ground truth on treatment effects or remain stable in intervals without measurements, the benchmark's utility for safety and personalization evaluations cannot be assessed, making this a load-bearing omission for the paper's contribution.

minor comments (1)

[Abstract] The abstract would benefit from naming the specific clinical datasets used to train the PINNs and the RL algorithms included in the initial evaluations.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for identifying the critical need for explicit validation of the PINN-derived dynamics. We agree that quantitative evidence of fidelity to observed trajectories and stability is necessary to support the benchmark's claims and will add these analyses in revision.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that MedGym enables valid discrete-vs-continuous RL comparisons rests on the PINN-derived dynamics faithfully reproducing observed state evolution at irregular times and causal treatment effects, yet the manuscript supplies no quantitative validation metrics, held-out trajectory error analysis, residual norms, or stability checks between observation points.

Authors: We accept this assessment. The current version does not report these metrics. In the revised manuscript we will include held-out trajectory error (MSE/MAE on state predictions), PINN residual norms, and interval stability checks (e.g., forward integration error between observations) to demonstrate faithful reproduction of observed evolution. revision: yes
Referee: [Abstract] Abstract: Without evidence that the learned vector fields match external ground truth on treatment effects or remain stable in intervals without measurements, the benchmark's utility for safety and personalization evaluations cannot be assessed, making this a load-bearing omission for the paper's contribution.

Authors: We will add the requested stability checks between measurements. Direct external ground truth for causal treatment effects is unavailable in observational clinical data; we will instead report fidelity to observed trajectories under treatment and sensitivity analyses, while explicitly noting this limitation. revision: partial

standing simulated objections not resolved

Direct external ground truth on causal treatment effects from observational data

Circularity Check

0 steps flagged

No circularity: benchmark is data-driven construction without self-referential derivations

full rationale

The paper presents MedGym as an environment constructed by training PINNs on clinical data to produce continuous-time patient dynamics. No equations, predictions, or uniqueness theorems are shown that reduce the benchmark outputs to fitted inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing premises. The work is self-contained as an empirical benchmark generator; its validity rests on external validation of PINN fidelity rather than internal redefinition of results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central construction relies on the unstated assumption that clinical data plus PINNs suffice to generate faithful continuous trajectories.

pith-pipeline@v0.9.1-grok · 5781 in / 1082 out tokens · 18630 ms · 2026-06-28T17:27:44.994291+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Bibhas Chakraborty and Susan A. Murphy. Dynamic treatment regimes.Annual Review of Statistics and Its Application, 1:447–464, 2014

2014
[2]

Kartik Choudhary, Dhawal Gupta, and Philip S. Thomas. Icu-sepsis: A benchmark mdp built from real medical data.arXiv preprint arXiv:2406.05646, 2024. doi: 10.48550/arXiv.2406. 05646. URLhttps://arxiv.org/abs/2406.05646

work page doi:10.48550/arxiv.2406 2024
[3]

Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez

Omer Gottesman, Fredrik Johansson, Joshua Meier, David Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xiaoxiao Peng, Jiayu Yao, Isaac Lage, Constantin Mosch, Li-wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational...

Pith/arXiv arXiv 2018
[4]

Guidelines for reinforcement learning in healthcare.Nature Medicine, 25:16–18, 2019

Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare.Nature Medicine, 25:16–18, 2019

2019
[5]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1861–1870. PMLR, 10–15 Jul 2018

2018
[6]

Epicare: A reinforcement learning benchmark for dynamic treatment regimes.Proceedings of the 36th Advances in Neural Information Processing Systems, 2024

Mason Hargrave, Alex Spaeth, and Logan Grosenick. Epicare: A reinforcement learning benchmark for dynamic treatment regimes.Proceedings of the 36th Advances in Neural Information Processing Systems, 2024

2024
[7]

Johnson, T.J

A.E.W. Johnson, T.J. Pollard, L. Shen, L.H. Hehman, M. Feng, M.Ghassemi, B. Moody, P. Szolovits, L.A. Celi, and R.G. Mark. Mimic-iii, a freely accessible critical care database. 3 (may. 2016), 2016. URLhttps://doi.org/10.1038/sdata.2016.35

work page doi:10.1038/sdata.2016.35 2016
[8]

Kidwell and Daniel Almirall

Kelley M. Kidwell and Daniel Almirall. Sequential, multiple assignment, randomized trial designs.JAMA, 329:336–337, 2023

2023
[9]

Komorowski, L.A

M. Komorowski, L.A. Celi, O. Badawi, A.C. Gordon, and A.A. Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care.Nature Medicine, 24: 1716–1720, 2018

2018
[10]

Kravitz, Naihua Duan, and Joel Braslow

Richard L. Kravitz, Naihua Duan, and Joel Braslow. Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages.The Milbank Quarterly, 82:661–687, 2004

2004
[11]

Kumar, A

A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative q-learning for offline reinforcement learning.Proceedings of the 34th International Conference on Neural Information Processing Systems, 33:1179–1191, 2020. 10

2020
[12]

Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

Pith/arXiv arXiv 2005
[14]

Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602, 2013

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602, 2013

Pith/arXiv arXiv 2013
[15]

Susan A. Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24:1455–1481, 2005

2005
[16]

Lagrangian duality in reinforcement learning.arXiv preprint arXiv:2007.09998,

Pranay Pasula. Lagrangian duality in reinforcement learning.arXiv preprint arXiv:2007.09998,

arXiv 2007
[17]

URLhttps://arxiv.org/abs/2007.09998

doi: 10.48550/arXiv.2007.09998. URLhttps://arxiv.org/abs/2007.09998

work page doi:10.48550/arxiv.2007.09998 2007
[18]

Fazlic, Alexander Schmeink, Gerd Ascheid, Christian Thiemermann, Andreas Schuppert, Richard Kin- dle, Leo Celi, Gernot Marx, and Lukas Martin

Alexander Peine, Ahmed Hallawa, Johannes Bickenbach, Gerrit Dartmann, Lejla B. Fazlic, Alexander Schmeink, Gerd Ascheid, Christian Thiemermann, Andreas Schuppert, Richard Kin- dle, Leo Celi, Gernot Marx, and Lukas Martin. Development and validation of a reinforcement learning algorithm to dynamically optimize mechanical ventilation in critical care.npj Di...

2021
[19]

Prudencio, Marcos R

Rafael F. Prudencio, Marcos R. O. A. Maximo, and Esther L. Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems.IEEE Transactions on Neural Networks and Learning Systems, pages 1–20, 2023

2023
[20]

Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach

Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. InProceedings of the 2nd Machine Learning for Healthcare Conference, volume 68 ofProceedings of Machine Learning Research, pages 147–163. PMLR, 2017

2017
[21]

Raissi, P

M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.Journal of Computational Physics, 378:686–707, 2019. doi: https: //doi.org/10.1016/j.jcp.2018.10.045

work page doi:10.1016/j.jcp.2018.10.045 2019
[22]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InProceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 1889–1897. PMLR, 07–09 Jul 2015

2015
[23]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. doi: 10.48550/arXiv. 1707.06347. URLhttps://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2017
[24]

Shortreed, Eric Laber, Daniel J

Susan M. Shortreed, Eric Laber, Daniel J. Lizotte, T. Scott Stroup, Joelle Pineau, and Susan A. Murphy. Informing sequential clinical decision-making through reinforcement learning: An empirical study.Machine Learning, 84:109–136, 2011

2011
[25]

Thomas and Emma Brunskill

Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforce- ment learning.arXiv preprint arXiv:1604.00923, 2016

Pith/arXiv arXiv 2016
[26]

When to sense and control? a time-adaptive approach for continuous-time rl.Proceedings of the 36th Advances in Neural Information Processing Systems, 2024

Lenart Treven, Bhavya Sukhija, Yarden As, Florian Dörfler, and Andreas Krause. When to sense and control? a time-adaptive approach for continuous-time rl.Proceedings of the 36th Advances in Neural Information Processing Systems, 2024

2024
[27]

Guardian-regularized safe offline reinforcement learning for smart weaning of mechanical circulatory devices.arXiv preprint arXiv:2511.06111, 2025

Aysın Tumay, Sophia Sun, Sonia Fereidooni, Aaron Dumas, Elise Jortberg, and Rose Yu. Guardian-regularized safe offline reinforcement learning for smart weaning of mechanical circulatory devices.arXiv preprint arXiv:2511.06111, 2025. doi: 10.48550/arXiv.2511.06111

work page doi:10.48550/arxiv.2511.06111 2025
[28]

A review of off-policy evaluation in reinforcement learning, 2022

Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. A review of off-policy evaluation in reinforcement learning, 2022. Review paper. 11

2022
[29]

Le, Nan Jiang, and Yisong Yue

Christoph V oloshin, Hoang M. Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning.arXiv preprint arXiv:1911.06854, 2021

arXiv 1911
[30]

Ran Xu and et al. Medagentgym: A scalable agentic training environment for code-centric reasoning in biomedical data science.Proceedings of The Fourteenth International Conference on Learning Representations, 2026

2026
[31]

Offline guarded safe reinforcement learning for medical treatment optimization strategies.Proceedings of the 39th Advances in Neural Information Processing Systems, 2025

Runze Yan*, Xun Shen*, Akifumi Wachi, Sebastien Gros, Anni Zhao, and Xiao Hu. Offline guarded safe reinforcement learning for medical treatment optimization strategies.Proceedings of the 39th Advances in Neural Information Processing Systems, 2025

2025
[32]

Reinforcement learning in healthcare: A survey.ACM Computing Surveys, 55:5:1–5:36, 2021

Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey.ACM Computing Surveys, 55:5:1–5:36, 2021

2021
[33]

Continuous-time decision transformer for healthcare applications

Zhihao Zhang, Haowei Mei, and Yang Xu. Continuous-time decision transformer for healthcare applications. InProceedings of The 26th International Conference on Artificial Intelligence and Statistics, pages 6245–6262. PMLR, 2023. 12 A Limitations MedGymhas several limitations that should be taken into account when interpreting the benchmark results. First, ...

2023

[1] [1]

Bibhas Chakraborty and Susan A. Murphy. Dynamic treatment regimes.Annual Review of Statistics and Its Application, 1:447–464, 2014

2014

[2] [2]

Kartik Choudhary, Dhawal Gupta, and Philip S. Thomas. Icu-sepsis: A benchmark mdp built from real medical data.arXiv preprint arXiv:2406.05646, 2024. doi: 10.48550/arXiv.2406. 05646. URLhttps://arxiv.org/abs/2406.05646

work page doi:10.48550/arxiv.2406 2024

[3] [3]

Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez

Omer Gottesman, Fredrik Johansson, Joshua Meier, David Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xiaoxiao Peng, Jiayu Yao, Isaac Lage, Constantin Mosch, Li-wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational...

Pith/arXiv arXiv 2018

[4] [4]

Guidelines for reinforcement learning in healthcare.Nature Medicine, 25:16–18, 2019

Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare.Nature Medicine, 25:16–18, 2019

2019

[5] [5]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1861–1870. PMLR, 10–15 Jul 2018

2018

[6] [6]

Epicare: A reinforcement learning benchmark for dynamic treatment regimes.Proceedings of the 36th Advances in Neural Information Processing Systems, 2024

Mason Hargrave, Alex Spaeth, and Logan Grosenick. Epicare: A reinforcement learning benchmark for dynamic treatment regimes.Proceedings of the 36th Advances in Neural Information Processing Systems, 2024

2024

[7] [7]

Johnson, T.J

A.E.W. Johnson, T.J. Pollard, L. Shen, L.H. Hehman, M. Feng, M.Ghassemi, B. Moody, P. Szolovits, L.A. Celi, and R.G. Mark. Mimic-iii, a freely accessible critical care database. 3 (may. 2016), 2016. URLhttps://doi.org/10.1038/sdata.2016.35

work page doi:10.1038/sdata.2016.35 2016

[8] [8]

Kidwell and Daniel Almirall

Kelley M. Kidwell and Daniel Almirall. Sequential, multiple assignment, randomized trial designs.JAMA, 329:336–337, 2023

2023

[9] [9]

Komorowski, L.A

M. Komorowski, L.A. Celi, O. Badawi, A.C. Gordon, and A.A. Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care.Nature Medicine, 24: 1716–1720, 2018

2018

[10] [10]

Kravitz, Naihua Duan, and Joel Braslow

Richard L. Kravitz, Naihua Duan, and Joel Braslow. Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages.The Milbank Quarterly, 82:661–687, 2004

2004

[11] [11]

Kumar, A

A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative q-learning for offline reinforcement learning.Proceedings of the 34th International Conference on Neural Information Processing Systems, 33:1179–1191, 2020. 10

2020

[12] [12]

Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

Pith/arXiv arXiv 2005

[13] [14]

Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602, 2013

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602, 2013

Pith/arXiv arXiv 2013

[14] [15]

Susan A. Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24:1455–1481, 2005

2005

[15] [16]

Lagrangian duality in reinforcement learning.arXiv preprint arXiv:2007.09998,

Pranay Pasula. Lagrangian duality in reinforcement learning.arXiv preprint arXiv:2007.09998,

arXiv 2007

[16] [17]

URLhttps://arxiv.org/abs/2007.09998

doi: 10.48550/arXiv.2007.09998. URLhttps://arxiv.org/abs/2007.09998

work page doi:10.48550/arxiv.2007.09998 2007

[17] [18]

Fazlic, Alexander Schmeink, Gerd Ascheid, Christian Thiemermann, Andreas Schuppert, Richard Kin- dle, Leo Celi, Gernot Marx, and Lukas Martin

Alexander Peine, Ahmed Hallawa, Johannes Bickenbach, Gerrit Dartmann, Lejla B. Fazlic, Alexander Schmeink, Gerd Ascheid, Christian Thiemermann, Andreas Schuppert, Richard Kin- dle, Leo Celi, Gernot Marx, and Lukas Martin. Development and validation of a reinforcement learning algorithm to dynamically optimize mechanical ventilation in critical care.npj Di...

2021

[18] [19]

Prudencio, Marcos R

Rafael F. Prudencio, Marcos R. O. A. Maximo, and Esther L. Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems.IEEE Transactions on Neural Networks and Learning Systems, pages 1–20, 2023

2023

[19] [20]

Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach

Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. InProceedings of the 2nd Machine Learning for Healthcare Conference, volume 68 ofProceedings of Machine Learning Research, pages 147–163. PMLR, 2017

2017

[20] [21]

Raissi, P

M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.Journal of Computational Physics, 378:686–707, 2019. doi: https: //doi.org/10.1016/j.jcp.2018.10.045

work page doi:10.1016/j.jcp.2018.10.045 2019

[21] [22]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InProceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 1889–1897. PMLR, 07–09 Jul 2015

2015

[22] [23]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. doi: 10.48550/arXiv. 1707.06347. URLhttps://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2017

[23] [24]

Shortreed, Eric Laber, Daniel J

Susan M. Shortreed, Eric Laber, Daniel J. Lizotte, T. Scott Stroup, Joelle Pineau, and Susan A. Murphy. Informing sequential clinical decision-making through reinforcement learning: An empirical study.Machine Learning, 84:109–136, 2011

2011

[24] [25]

Thomas and Emma Brunskill

Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforce- ment learning.arXiv preprint arXiv:1604.00923, 2016

Pith/arXiv arXiv 2016

[25] [26]

When to sense and control? a time-adaptive approach for continuous-time rl.Proceedings of the 36th Advances in Neural Information Processing Systems, 2024

Lenart Treven, Bhavya Sukhija, Yarden As, Florian Dörfler, and Andreas Krause. When to sense and control? a time-adaptive approach for continuous-time rl.Proceedings of the 36th Advances in Neural Information Processing Systems, 2024

2024

[26] [27]

Guardian-regularized safe offline reinforcement learning for smart weaning of mechanical circulatory devices.arXiv preprint arXiv:2511.06111, 2025

Aysın Tumay, Sophia Sun, Sonia Fereidooni, Aaron Dumas, Elise Jortberg, and Rose Yu. Guardian-regularized safe offline reinforcement learning for smart weaning of mechanical circulatory devices.arXiv preprint arXiv:2511.06111, 2025. doi: 10.48550/arXiv.2511.06111

work page doi:10.48550/arxiv.2511.06111 2025

[27] [28]

A review of off-policy evaluation in reinforcement learning, 2022

Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. A review of off-policy evaluation in reinforcement learning, 2022. Review paper. 11

2022

[28] [29]

Le, Nan Jiang, and Yisong Yue

Christoph V oloshin, Hoang M. Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning.arXiv preprint arXiv:1911.06854, 2021

arXiv 1911

[29] [30]

Ran Xu and et al. Medagentgym: A scalable agentic training environment for code-centric reasoning in biomedical data science.Proceedings of The Fourteenth International Conference on Learning Representations, 2026

2026

[30] [31]

Offline guarded safe reinforcement learning for medical treatment optimization strategies.Proceedings of the 39th Advances in Neural Information Processing Systems, 2025

Runze Yan*, Xun Shen*, Akifumi Wachi, Sebastien Gros, Anni Zhao, and Xiao Hu. Offline guarded safe reinforcement learning for medical treatment optimization strategies.Proceedings of the 39th Advances in Neural Information Processing Systems, 2025

2025

[31] [32]

Reinforcement learning in healthcare: A survey.ACM Computing Surveys, 55:5:1–5:36, 2021

Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey.ACM Computing Surveys, 55:5:1–5:36, 2021

2021

[32] [33]

Continuous-time decision transformer for healthcare applications

Zhihao Zhang, Haowei Mei, and Yang Xu. Continuous-time decision transformer for healthcare applications. InProceedings of The 26th International Conference on Artificial Intelligence and Statistics, pages 6245–6262. PMLR, 2023. 12 A Limitations MedGymhas several limitations that should be taken into account when interpreting the benchmark results. First, ...

2023