pith. machine review for the scientific record.

arxiv: 2605.02552 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links

Recurrent Deep Reinforcement Learning for Chemotherapy Control under Partial Observability

Firas Mohamed Elamine Kiram, Gian Antonio Susto, Imane Youkana, Laid Kahloul, Rachida Saouli

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI

keywords chemotherapy optimization · reinforcement learning · partial observability · recurrent neural networks · TD3 · LSTM · dynamic treatment regimes · tumor suppression

The pith

Recurrent reinforcement learning policies achieve steadier tumor control and better healthy-cell preservation when chemotherapy dosing decisions must be made from incomplete patient observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Chemotherapy optimization means choosing doses over time to shrink tumors while limiting harm to normal cells, yet real patient data are rarely complete. Most reinforcement learning approaches nonetheless assume the full state is known at every step, a condition seldom met in practice. The paper tests whether adding memory through recurrent networks lets agents cope with the missing information. On the AhnChemoEnv benchmark, recurrent TD3 agents show only modest gains when the full state is given, but deliver substantially more stable and effective performance when observations are partial and noisy, producing more reliable tumor reduction and less damage to healthy cells. This points to memory as a practical way to compensate for uncertainty in sequential treatment decisions.

Core claim

The paper shows that recurrent TD3 agents equipped with separate LSTM actor and critic networks outperform both feed-forward TD3 and Soft Actor-Critic baselines under partial observability on the AhnChemoEnv benchmark. Across ten random seeds, recurrence produces only modest improvement when the full state is available, yet yields markedly stronger and more stable results when observations contain noise and hidden-state uncertainty: tumor suppression becomes more consistent and normal-cell preservation improves. Pharmacokinetic and pharmacodynamic variability are held fixed throughout, so the gains are attributable to observation uncertainty alone.

What carries the argument

Recurrent TD3 with separate LSTM actor and critic networks that maintain hidden state across time steps to compensate for incomplete observations.
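
The paper's implementation is not reproduced on this page, so the following is a minimal PyTorch sketch of the kind of LSTM actor such an agent would use; layer sizes, the `max_dose` bound, and all names are illustrative assumptions rather than the authors' code. The critic, per the paper's design, would be a second, separate LSTM network that additionally consumes the action.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Deterministic recurrent policy for a TD3-style agent: maps a
    sequence of (possibly noisy, partial) observations to a continuous
    dose, carrying LSTM hidden state across time steps so that past
    observations can substitute for unobserved patient state."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128,
                 max_dose: float = 1.0):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # squashes to [-1, 1]
        )
        self.max_dose = max_dose

    def forward(self, obs_seq: torch.Tensor, state=None):
        # obs_seq: (batch, time, obs_dim); state: (h, c) from the previous
        # step, or None at the start of an episode.
        features, state = self.lstm(obs_seq, state)
        dose = 0.5 * self.max_dose * (self.head(features) + 1.0)  # [0, max_dose]
        return dose, state
```

At evaluation time the agent feeds one observation per step and carries `(h, c)` forward, resetting it at episode boundaries; during training, replayed subsequences let the network rebuild its hidden state, in the spirit of recurrent experience replay [25].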

If this is right

  • Recurrent policies deliver only modest gains when the full patient state is observable.
  • Under partial observability the same recurrent policies produce substantially stronger and more stable performance across random seeds.
  • Memory augmentation leads to more consistent tumor suppression across treatment episodes.
  • Recurrent agents preserve normal cells better than feed-forward counterparts when observations are noisy.
  • The performance difference is isolated to observation uncertainty because pharmacokinetic and pharmacodynamic parameters are held fixed.
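
Operationally, the last bullet describes an environment whose dynamics keep their parameters fixed while only the agent's view is degraded. DTR-Bench configures this internally; the wrapper below is a hypothetical Gymnasium sketch of the same idea, with the visible indices and noise scale as illustrative assumptions, not the benchmark's actual settings.

```python
import numpy as np
import gymnasium as gym

class PartialNoisyObs(gym.ObservationWrapper):
    """Expose only a subset of the state vector and corrupt it with
    Gaussian measurement noise. The PK/PD dynamics inside the wrapped
    env are untouched, so any performance gap between agents is
    attributable to observation uncertainty alone."""

    def __init__(self, env, visible_idx=(0, 1), noise_std=0.05, seed=None):
        super().__init__(env)
        self.visible_idx = np.asarray(visible_idx)
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)
        box = env.observation_space
        self.observation_space = gym.spaces.Box(
            low=box.low[self.visible_idx],
            high=box.high[self.visible_idx],
            dtype=np.float32,
        )

    def observation(self, obs):
        visible = obs[self.visible_idx]  # hide the remaining state variables
        noisy = visible + self.rng.normal(0.0, self.noise_std, size=visible.shape)
        return noisy.astype(np.float32)
```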

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar memory mechanisms could help reinforcement learning in other medical domains where key state variables are unobserved or delayed.
  • Real clinical systems might improve by feeding patient history directly into policies rather than relying on instantaneous measurements alone.
  • Testing the same recurrent architecture on environments that also vary patient-specific parameters would clarify whether the benefit persists beyond controlled benchmarks.

Load-bearing premise

The AhnChemoEnv benchmark with fixed pharmacokinetic and pharmacodynamic variability plus added observation noise adequately captures the partial observability and uncertainty found in actual clinical chemotherapy practice.

What would settle it

A follow-up experiment on real patient monitoring data or a simulation that includes realistic inter-patient variability showing no advantage in tumor control or toxicity for recurrent over non-recurrent policies.

Figures

Figures reproduced from arXiv: 2605.02552 by Firas Mohamed Elamine Kiram, Gian Antonio Susto, Imane Youkana, Laid Kahloul, Rachida Saouli.

Figure 1: Recurrent TD3 architecture adapted from [19] …
Figure 2: Trajectory-level evaluation under partial observability over 30 …
Figure 3: Evaluation performance under full and partial observability …
Figure 4: Action and immune-cell trajectories under partial observability over 30 …
Figure 5: Drug-concentration trajectories under partial observability …
Original abstract

Chemotherapy dose optimization can be formulated as a dynamic treatment regime, requiring sequential decisions under uncertainty that must balance tumor suppression against toxicity. However, most reinforcement learning approaches assume full observability of the patient state, a condition rarely met in clinical practice. We investigate whether memory-augmented policies can improve chemotherapy control under partial observability. To this end, we employ a recurrent TD3-based approach with separate LSTM actor-critic networks and evaluate it on the AhnChemoEnv benchmark from DTR-Bench, considering both off-policy and on-policy recurrent architectures against feed-forward TD3 and Soft Actor-Critic. Pharmacokinetic and pharmacodynamic variability are held fixed to isolate hidden-state uncertainty and observation noise and to avoid confounding effects from inter-patient variability. Across ten random seeds, recurrence yields modest benefit under full observability but substantially stronger and more stable performance under partial observability, with more consistent tumor suppression and improved normal-cell preservation. These findings indicate that memory-based policies are particularly beneficial when clinically relevant state information is incomplete or noisy.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper formulates chemotherapy dose optimization as a dynamic treatment regime under uncertainty and examines whether memory-augmented policies improve performance when patient state is only partially observable. It implements a recurrent TD3 variant using separate LSTM actor and critic networks, evaluates both off-policy and on-policy recurrent architectures against feed-forward TD3 and SAC on the AhnChemoEnv benchmark, and deliberately fixes pharmacokinetic/pharmacodynamic parameters while adding observation noise to isolate hidden-state effects. Across ten random seeds the results indicate modest gains from recurrence under full observability but substantially stronger and more stable tumor suppression together with better normal-cell preservation under partial observability.

Significance. If the empirical findings are robust, the work supplies concrete evidence that recurrent architectures can mitigate the performance degradation caused by incomplete or noisy state information in a clinically motivated control task. The explicit isolation of observation noise from inter-patient variability is a methodological strength that permits clear attribution of benefits to memory. The study therefore contributes to the growing literature on partial-observability handling in medical RL and offers a reproducible benchmark comparison that future work can extend.

major comments (2)
  1. [Abstract] The headline claim that recurrence produces 'substantially stronger and more stable performance under partial observability' is presented without effect sizes, confidence intervals, or statistical tests across the ten seeds; without them it is impossible to judge whether the reported stability is distinguishable from seed noise or from the feed-forward baselines.
  2. [Abstract] By holding PK/PD parameters fixed while adding only observation noise, the experimental design deliberately excludes inter-patient variability; the claim that the observed gains address 'clinically relevant state information' therefore rests on an untested assumption that real partial observability is dominated by additive noise rather than by the need to infer patient-specific parameters from noisy trajectories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] The headline claim that recurrence produces 'substantially stronger and more stable performance under partial observability' is presented without effect sizes, confidence intervals, or statistical tests across the ten seeds; without them it is impossible to judge whether the reported stability is distinguishable from seed noise or from the feed-forward baselines.

    Authors: We agree that quantitative support is needed to substantiate the headline claim. In the revised manuscript we will expand the abstract to report effect sizes (mean differences in tumor volume reduction and normal-cell preservation), standard deviations across the ten seeds, and p-values from paired statistical tests (t-tests or Wilcoxon signed-rank) comparing recurrent versus feed-forward agents under partial observability. These additions will allow readers to assess whether the observed improvements exceed what could be attributed to random variation. revision: yes

  2. Referee: [Abstract] By holding PK/PD parameters fixed while adding only observation noise, the experimental design deliberately excludes inter-patient variability; the claim that the observed gains address 'clinically relevant state information' therefore rests on an untested assumption that real partial observability is dominated by additive noise rather than by the need to infer patient-specific parameters from noisy trajectories.

    Authors: The referee correctly notes that our design fixes PK/PD parameters to isolate observation noise and hidden-state effects. This choice was intentional, as stated in the manuscript, to prevent confounding from inter-patient variability and to attribute performance differences specifically to the recurrent architecture's memory. We do not claim the setup captures every clinical source of partial observability. In the revision we will rephrase the abstract to avoid over-generalization, explicitly state that parameters are held fixed, and add a short limitations paragraph in the discussion acknowledging that future work should examine joint inference of patient-specific parameters from noisy trajectories. revision: partial
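
The seed-level analysis promised in response 1 is mechanical to run once per-seed evaluation scores are collected. A minimal sketch follows; the two arrays are random placeholders standing in for the agents' scores, not results from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-seed evaluation returns (10 seeds each); substitute the
# agents' actual scores. These numbers are synthetic, not from the paper.
recurrent = rng.normal(loc=1.0, scale=0.1, size=10)
feedforward = rng.normal(loc=0.8, scale=0.3, size=10)

diff = recurrent - feedforward
paired_d = diff.mean() / diff.std(ddof=1)               # paired Cohen's d
t_stat, t_p = stats.ttest_rel(recurrent, feedforward)   # paired t-test
w_stat, w_p = stats.wilcoxon(recurrent, feedforward)    # signed-rank check

print(f"mean diff {diff.mean():.3f} | paired d {paired_d:.2f} | "
      f"t-test p {t_p:.4f} | Wilcoxon p {w_p:.4f}")
```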

Circularity Check

0 steps flagged

Empirical RL benchmark evaluation with no derivation chain or self-referential reductions

Full rationale

The paper reports an empirical comparison of recurrent TD3 and other RL agents on the external AhnChemoEnv benchmark from DTR-Bench. It evaluates performance under full vs. partial observability by holding PK/PD parameters fixed and adding observation noise. No mathematical derivations, predictions, or first-principles results are claimed that could reduce to fitted inputs or self-citations. Training follows standard off-policy RL optimization; results are reported across random seeds without reuse of target metrics in the objective. The fixed-variability design is an explicit experimental control, not a circular definition. This is a self-contained empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard reinforcement-learning assumptions for POMDPs and the fidelity of the chosen benchmark environment; no new entities or ad-hoc parameters are introduced beyond typical RL hyperparameters.

axioms (2)
  • domain assumption The chemotherapy environment can be modeled as a partially observable Markov decision process with fixed pharmacokinetic and pharmacodynamic parameters.
    Invoked when isolating hidden-state uncertainty and observation noise while holding PK/PD variability fixed.
  • standard math Standard TD3 and Soft Actor-Critic training procedures remain valid when actor and critic are replaced by LSTM networks.
    Used without additional justification when comparing recurrent and feed-forward variants.
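
For readers who want the first axiom pinned down, one standard way to write it is below; the notation (fixed PK/PD parameters \theta, masking matrix M, additive Gaussian noise) is ours, chosen to match the additive-noise design the abstract describes, not taken from the paper.

```latex
% POMDP with fixed dynamics parameters \theta and noisy partial observations
(\mathcal{S}, \mathcal{A}, T_\theta, R, \Omega, O), \qquad
s_{t+1} \sim T_\theta(\cdot \mid s_t, a_t), \qquad
o_t = M s_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2 I).

% A memory-based policy conditions on the observation history, which the
% LSTM compresses into a hidden state h_t:
a_t = \pi(o_{1:t}) \approx \pi(o_t, h_{t-1}).
```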

pith-pipeline@v0.9.0 · 5491 in / 1438 out tokens · 57891 ms · 2026-05-08T18:45:44.270728+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

27 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1] Cancer statistics, 2026
     R. L. Siegel, K. D. Miller, N. S. Wagle, and A. Jemal, “Cancer statistics, 2026,” CA: A Cancer Journal for Clinicians, vol. 76, no. 1, pp. 17–48, 2026.

  2. [2] Optimal dosing of cancer chemotherapy using model predictive control and moving horizon state/parameter estimation
     T. Chen, N. F. Kirkby, and R. Jena, “Optimal dosing of cancer chemotherapy using model predictive control and moving horizon state/parameter estimation,” vol. 108, no. 3, pp. 973–983, 2012.

  3. [3] Personalized medicine: Progress and promise
     I. S. Chan and G. S. Ginsburg, “Personalized medicine: Progress and promise,” Annual Review of Genomics and Human Genetics, vol. 12, pp. 217–244, 2011.

  4. [4] Optimal dynamic treatment regimes
     S. A. Murphy, “Optimal dynamic treatment regimes,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 65, no. 2, pp. 331–355, 2003.

  5. [5] Dynamic treatment regimes
     B. Chakraborty and S. A. Murphy, “Dynamic treatment regimes,” Annual Review of Statistics and Its Application, vol. 1, pp. 447–464, 2014.

  6. [6] Reinforcement Learning: An Introduction
     R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., Cambridge, MA, USA, 2018.

  7. [7] Reinforcement learning for sequential decision making in population research
     N. Deliu, “Reinforcement learning for sequential decision making in population research,” Quality & Quantity, vol. 58, pp. 5057–5080. [Online]. Available: https://doi.org/10.1007/s11135-023-01755-z

  8. [9] Continuous control with deep reinforcement learning
     T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.

  9. [10] Deep reinforcement learning for personalized chemotherapy treatment
     H.-E. Tseng, C.-Y. Liao, C.-H. Hsu, and L.-C. Fu, “Deep reinforcement learning for personalized chemotherapy treatment,” in 2017 IEEE Healthcare Innovations and Point of Care Technologies (HI-POCT), 2017, pp. 176–179.

  10. [11] Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment
      R. Padmanabhan, N. Meskin, and W. M. Haddad, “Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment,” Mathematical Biosciences, vol. 293, pp. 11–20, 2017.

  11. [12] A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients
      J. D. Martín-Guerrero, F. Gomez, E. Soria-Olivas, J. Schmidhuber, M. Climente-Martí, and N. V. Jiménez-Torres, “A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients,” Expert Systems with Applications, no. 6, pp. 9737–9742.

  12. [13] DTR-Bench: An in silico environment and benchmark platform for reinforcement learning based dynamic treatment regime
      Z. Luo, M. Zhu, F. Liu, J. Li, Y. Pan, J. Zhou, and T. Zhu, “DTR-Bench: An in silico environment and benchmark platform for reinforcement learning based dynamic treatment regime,” 2024.

  13. [14] A systematic review of dynamic treatment regime methods in healthcare
      Y. Liang et al., “A systematic review of dynamic treatment regime methods in healthcare,” Computer Methods and Programs in Biomedicine, 2025. Available: https://dspace.library.uu.nl/server/api/core/bitstreams/e5067da0-a232-465a-b58a-729fa0890aa7/content

  14. [15] Deep reinforcement learning-based control of chemo-drug dose in cancer treatment
      H. Mashayekhi, M. Nazari, F. Jafarinejad, and N. Meskin, “Deep reinforcement learning-based control of chemo-drug dose in cancer treatment,” Computer Methods and Programs in Biomedicine, vol. 243, p. 107884, 2024.

  15. [16] An inverse reinforcement learning algorithm for partially observable domains with application on healthcare dialogue management
      H. R. Chinaei and B. Chaib-Draa, “An inverse reinforcement learning algorithm for partially observable domains with application on healthcare dialogue management,” in 2012 11th International Conference on Machine Learning and Applications, vol. 1, 2012, pp. 144–149.

  16. [17] Planning and acting in partially observable stochastic domains
      L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial Intelligence, pp. 99–134, 1998.

  17. [18] Ordinary differential equation models for adoptive immunotherapy
      A. Talkington, C. Dantoin, and R. Durrett, “Ordinary differential equation models for adoptive immunotherapy,” Bulletin of Mathematical Biology, vol. 80, pp. 1059–1083, 2018.

  18. [19] A mathematical tumor model with immune resistance and drug therapy: An optimal control approach
      L. G. De Pillis and A. Radunskaya, “A mathematical tumor model with immune resistance and drug therapy: An optimal control approach,” Computational and Mathematical Methods in Medicine, vol. 3, no. 2, p. 318436, 2001.

  19. [20] Recurrent model-free RL can be a strong baseline for many POMDPs
      T. Ni, B. Eysenbach, and R. Salakhutdinov, “Recurrent model-free RL can be a strong baseline for many POMDPs,” 2022. [Online]. Available: https://arxiv.org/abs/2110.05038

  20. [22]

  21. [23] Proximal Policy Optimization Algorithms
      J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

  22. [24] Stable-Baselines3: Reliable reinforcement learning implementations
      A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-Baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.org/papers/v22/20-1364.html

  23. [25] Recurrent experience replay in distributed reinforcement learning
      S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, “Recurrent experience replay in distributed reinforcement learning,” in International Conference on Learning Representations (ICLR), 2019. [Online]. Available: https://openreview.net/forum?id=r1lyTjAqYX

  24. [26] The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care
      M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal, “The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care,” Nature Medicine, vol. 24, no. 11, pp. 1716–1720, 2018.

  25. [27] Offline reinforcement learning: Tutorial, review, and perspectives on open problems
      S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint arXiv:2005.01643, 2020. [Online]. Available: https://arxiv.org/abs/2005.01643

  26. [28] Conservative Q-learning for offline reinforcement learning
      A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative Q-learning for offline reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1179–1191.