Healthcare AI GYM for Medical Agents

Minbyul Jeong

arxiv: 2605.02943 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-09 19:34 UTC · model grok-4.3


keywords medical AI agents · reinforcement learning · multi-turn agents · self-distillation · clinical reasoning · agent training stability · tool use · Healthcare AI GYM


The pith

A turn-level self-distillation method stabilizes multi-turn reinforcement learning for medical AI agents and raises accuracy on clinical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a gymnasium-style environment spanning ten clinical domains, thousands of tasks, domain tools, and a large medical knowledge base to train agents that reason over multiple turns like doctors gathering history and ordering tests. Standard RL training collapses into long, tool-free single answers because sparse final rewards misalign with the sequential nature of clinical work. Vanilla GRPO reaches decent accuracy on some tests but oscillates and converges slowly. The authors introduce Turn-level Truncated On-Policy Distillation, where an exponential-moving-average teacher supplies dense, outcome-aware regularization at every turn without gradients. This change produces the best scores on ten of eighteen benchmarks, an average gain of 3.9 points, quicker early progress, shorter controlled responses, and continued tool use.

Core claim

In the Healthcare AI GYM environment, agentic multi-turn RL degrades into verbose monologues with exploding response lengths and falling tool-use rates because sparse terminal rewards do not match sequential clinical trajectories. Vanilla GRPO exhibits training instability through oscillations and slow convergence. Turn-level Truncated On-Policy Distillation counters this by letting a gradient-free EMA teacher supply dense outcome-privileged KL regularization at each turn, delivering the strongest results on ten of eighteen benchmarks, a 3.9 percentage-point average lift over non-RL baselines, faster convergence, bounded response lengths, and preserved multi-turn tool use.
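The mismatch between a sparse terminal reward and a multi-turn trajectory can be made concrete with a small sketch. It assumes the commonly described GRPO recipe of normalizing rewards within a group of rollouts for the same task; the function names, the `eps` constant, and the reward values are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize terminal rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_turns(advantage, num_turns):
    """With only a terminal reward, every turn inherits the same scalar."""
    return [advantage] * num_turns

# Illustrative values only: four rollouts of one task, scored once at the end.
rewards = [1.0, 0.0, 0.0, 1.0]
turns_per_rollout = [5, 1, 2, 4]

advantages = group_relative_advantages(rewards)
per_turn = [broadcast_to_turns(a, n) for a, n in zip(advantages, turns_per_rollout)]
```

In this sketch a five-turn rollout and a one-turn rollout with the same outcome receive an identical per-turn signal, so nothing in the reward itself distinguishes careful tool use from a single long answer.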

What carries the argument

Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation process that applies dense, outcome-aware KL regularization from an EMA teacher at every conversation turn to align training signals with sequential clinical trajectories.
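A minimal sketch of the two ingredients named here, an EMA teacher and a per-turn KL term, may help fix ideas. Everything below is an assumption for illustration: the decay value, the KL direction (student relative to teacher), and the use of one small logit vector per turn. The text above does not specify how truncation or the outcome-privileged input is implemented, so neither is modeled; the teacher logits are simply passed in as numbers.

```python
import math

def ema_update(teacher, student, decay=0.99):
    """Move teacher weights toward student weights; no gradients involved."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def turn_level_kl(student_logits_per_turn, teacher_logits_per_turn):
    """One KL value per conversation turn, i.e. a dense signal."""
    return [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits_per_turn, teacher_logits_per_turn)
    ]

# Illustrative values only.
teacher = ema_update(teacher=[0.0, 1.0], student=[1.0, 1.0], decay=0.9)
kls = turn_level_kl(
    student_logits_per_turn=[[2.0, 0.0], [0.5, 0.5]],
    teacher_logits_per_turn=[[1.0, 0.0], [0.5, 0.5]],
)
```

The point of the sketch is only the shape of the signal: one regularization value per turn, rather than one reward per episode.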

If this is right

TT-OPD reaches top accuracy on ten of eighteen medical benchmarks while avoiding the length explosions and tool-use drop-off seen in standard RL.
Training reaches good performance faster and with more stable response lengths than vanilla GRPO.
Multi-turn tool use persists instead of collapsing into single-turn answers.
The same environment and method support controlled exploration across ten different clinical domains.
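The collapse described in these points could be tracked with two simple per-trajectory statistics. The sketch below assumes a hypothetical trajectory format (a list of turns, each a dict with `text` and an optional `tool` field) and uses word count as a stand-in for response length; the tool names are borrowed from the trajectory examples quoted further down the page.

```python
def trajectory_metrics(trajectory):
    """Summarize one trajectory: turn count, tool-use rate, total words."""
    num_turns = len(trajectory)
    tool_turns = sum(1 for turn in trajectory if turn.get("tool"))
    total_words = sum(len(turn["text"].split()) for turn in trajectory)
    return {
        "num_turns": num_turns,
        "tool_use_rate": tool_turns / num_turns if num_turns else 0.0,
        "response_words": total_words,
    }

# Illustrative trajectories only (hypothetical format).
multi_turn = [
    {"text": "ARB versus ACE inhibitor, check bradykinin", "tool": None},
    {"text": "query evidence", "tool": "retrieve_evidence"},
    {"text": "query wiki", "tool": "search_medical_wiki"},
    {"text": "answer A", "tool": "submit_answer"},
]
monologue = [{"text": "long " * 50 + "answer A", "tool": None}]

healthy = trajectory_metrics(multi_turn)
collapsed = trajectory_metrics(monologue)
```

A run drifting toward the failure mode would show `num_turns` and `tool_use_rate` falling while `response_words` grows.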

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same turn-level regularization pattern could stabilize multi-turn agents in other sequential domains such as legal research or technical troubleshooting.
The open gym environment offers a reusable testbed for comparing future medical-agent methods without building new simulators from scratch.
If the stability benefit holds, fewer rounds of expensive human preference data may be needed to train reliable medical agents.

Load-bearing premise

Gains measured on the simulated clinical tasks and benchmarks will translate into better real-world medical reasoning and decision quality.

What would settle it

A direct comparison on real patient cases or by practicing clinicians in which agents trained with TT-OPD show no accuracy or safety advantage over simpler single-turn baselines.

Figures

Figures reproduced from arXiv: 2605.02943 by Minbyul Jeong.

**Figure 1.** Overview of the HEALTHCARE AI GYM architecture. The framework is composed of four integrated layers designed for medical agent reinforcement learning.

Adjacent text (Section 3, HEALTHCARE AI GYM: Environment Design): HEALTHCARE AI GYM is a standardized, high-fidelity reinforcement learning environment designed to bridge the gap between static medical knowledge retrieval and agentic clinical execution. Built on the Gymnasium (Tower…

**Figure 2.** Overview of the HEALTHCARE AI GYM architecture. The framework is composed of four integrated layers designed for medical agent reinforcement learning.

Adjacent text: … horizon $T$. The complete trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ is evaluated by a sparse terminal reward $R(\tau)$ computed only at the episode's end. Sparse terminal rewards in multi-turn settings induce a severe credit assignment problem. While process rewar…

**Figure 3.** Training dynamics comparison across 60 steps. (a) Both TT-OPD and GRPO converge …

**Figure 4.** Ablation of distillation components across training.

read the original abstract

Clinical reasoning demands multi-step interactions (gathering patient history, ordering tests, interpreting results, and making safe treatment decisions), yet a unified training environment that provides the breadth of clinical domains and specialized tools needed to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on Healthcare AI GYM, a Gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by significant oscillations in response length and prolonged convergence periods. To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks, with an average +3.9 pp improvement over the non-RL baseline, faster early convergence, controlled response length, and sustained multi-turn tool use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper ships a new multi-domain medical RL gym and a turn-level distillation fix for the verbose-monologue collapse, but all gains sit inside the authors' own simulation with no external clinical anchor.


The concrete contribution is the Healthcare AI GYM itself: ten clinical domains, 3.6K tasks, 135 tools, and an 828K-passage knowledge base, all wrapped as a standard Gym environment. That is a usable artifact for anyone who wants to run agentic RL experiments in medicine without building the scaffolding from scratch. The authors also document a clear failure mode (multi-turn RL agents drift into long, tool-free monologues) and offer TT-OPD, an EMA-based self-distillation scheme that supplies dense, outcome-aware regularization at every turn. On their internal benchmarks it produces the reported +3.9 pp lift, quicker early convergence, shorter responses, and steadier tool use compared with the non-RL baseline and vanilla GRPO. Those are the parts that are actually new and potentially reusable.

The soft spot is that every number comes from tasks and reward signals the authors defined. There are no clinician ratings of trajectories, no head-to-head with existing medical QA sets, and no prospective check on whether better gym scores produce better real decisions. The abstract and stress-test note give no details on benchmark construction, statistical testing, or controls for prompt or scale confounds, so the performance edge is hard to interpret outside the simulation. The environment could still be worth adopting as a testbed even if the specific TT-OPD numbers need more scrutiny.

This is for groups already working on medical agents or RL for sequential decision-making who need a shared starting point. It deserves a serious referee because a public, multi-domain medical gym is scarce and the degradation phenomenon is worth confirming, but reviewers will need to see external validation and clearer method documentation before the claims travel.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Healthcare AI GYM, a Gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and 828K medical passages, for training multi-turn medical AI agents via reinforcement learning. It identifies degradation phenomena in agentic RL training (verbose monologues, monotonic length explosion, and erosion of tool-use frequency) attributed to misalignment between sparse terminal rewards and sequential trajectories. The authors propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation method using a gradient-free EMA teacher for dense, outcome-aware KL regularization at each turn. Empirical results claim TT-OPD achieves the best performance on 10 of 18 benchmarks with an average +3.9 pp gain over the non-RL baseline, along with faster convergence, controlled response length, and sustained multi-turn tool use.
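For readers unfamiliar with the Gymnasium convention the summary refers to, the interaction loop looks roughly like the toy below. This is a hypothetical stand-in written without importing any library, not the Healthcare AI GYM API: the class name, the action format, and the reward values are assumptions. It only mirrors the Gymnasium call shape, where `reset` returns an observation and an info dict and `step` returns observation, reward, terminated, truncated, and info.

```python
class ToyClinicalEnv:
    """Hypothetical toy environment; reward arrives only on submission."""

    def __init__(self, correct_answer, max_turns=5):
        self.correct_answer = correct_answer
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0
        return "case presented", {}

    def step(self, action):
        """action: dict with a 'tool' name and an 'arg' string."""
        self.turn += 1
        if action["tool"] == "submit_answer":
            reward = 1.0 if action["arg"] == self.correct_answer else 0.0
            return "episode over", reward, True, False, {}
        # Intermediate tool calls return an observation but no reward.
        truncated = self.turn >= self.max_turns
        return f"result of {action['tool']}", 0.0, False, truncated, {}

env = ToyClinicalEnv(correct_answer="B")
obs, info = env.reset()
actions = [
    {"tool": "search_pubmed", "arg": "penetrating injury clavicle"},
    {"tool": "retrieve_evidence", "arg": "pleural injury"},
    {"tool": "submit_answer", "arg": "B"},
]
rewards = []
for action in actions:
    obs, reward, terminated, truncated, info = env.step(action)
    rewards.append(reward)
    if terminated or truncated:
        break
```

In this toy, the two tool calls earn nothing and the whole episode is scored on the final submission, which is the sparse terminal reward structure the manuscript identifies as the source of the collapse.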

Significance. If the GYM tasks and rewards faithfully model real clinical reasoning and the reported gains prove robust and transferable, the work would supply a much-needed standardized platform and training recipe for stable multi-turn medical agents. The explicit diagnosis of collapse modes and the turn-level distillation fix constitute a constructive step beyond vanilla GRPO. The scale of the environment (domains, tasks, tools) is a clear strength that could enable reproducible research in agentic healthcare AI.

major comments (3)

[Abstract and experimental section] The central claim that TT-OPD wins on 10 of 18 benchmarks with +3.9 pp average improvement lacks any reported statistical testing, run-to-run variance, or controls for confounds such as prompt engineering, model scale, or temperature settings. Without these, the performance advantage cannot be confidently attributed to the proposed method rather than implementation details.
[Environment construction (presumably §3)] All 3.6K tasks, reward signals, and tool-use metrics are defined internally by the authors with no external anchor such as expert clinician ratings of trajectories, comparison against established medical QA datasets (MedQA, PubMedQA, etc.), or prospective deployment data. This renders the translation from GYM scores to clinical utility an unverified assumption that underpins every performance claim.
[Degradation analysis and TT-OPD evaluation] The reported collapse into verbose monologues and the stabilization achieved by TT-OPD are measured exclusively against the authors' own reward structure and knowledge base. No ablation on alternative reward designs, cross-environment transfer, or human-expert trajectory comparisons is provided, leaving open the possibility that the observed phenomena are artifacts of the GYM formulation rather than general properties of multi-turn medical RL.

minor comments (3)

Define all acronyms (GRPO, EMA, TT-OPD, OPD) at first use and ensure consistent notation for turn-level quantities throughout.
Clarify the precise construction of the 18 benchmarks: how they are sampled from the 3.6K tasks, which domains they cover, and whether they are held-out or in-distribution.
Specify the exact implementation of the non-RL baseline and the vanilla GRPO baseline (hyperparameters, prompt templates, stopping criteria) so that the +3.9 pp delta can be reproduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas to strengthen the claims in our manuscript. We address each major comment point by point below, with clear indications of revisions.


Referee: [Abstract and experimental section] The central claim that TT-OPD wins on 10 of 18 benchmarks with +3.9 pp average improvement lacks any reported statistical testing, run-to-run variance, or controls for confounds such as prompt engineering, model scale, or temperature settings. Without these, the performance advantage cannot be confidently attributed to the proposed method rather than implementation details.

Authors: We agree that the absence of statistical testing and variance reporting weakens the attribution of gains to TT-OPD. In the revised version, we will rerun all experiments across at least 5 random seeds, report means with standard deviations, and include paired t-tests or Wilcoxon tests for the +3.9 pp average improvement. We will also expand the experimental section and appendix with full details on prompt templates, temperature values, model scales, and hyperparameter settings to rule out confounds from implementation choices. revision: yes
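The kind of seed-level reporting promised in this response could look like the sketch below. It uses only the Python standard library and computes a paired t statistic by hand; the scores are placeholder numbers invented for illustration and are not results from the paper. In practice a statistics package would also supply the p-value and a Wilcoxon alternative.

```python
from math import sqrt
from statistics import mean, stdev

def paired_summary(method_scores, baseline_scores):
    """Mean and sample std of per-seed differences, plus a paired t statistic."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    d_mean = mean(diffs)
    d_std = stdev(diffs)
    t_stat = d_mean / (d_std / sqrt(n)) if d_std > 0 else float("inf")
    return {"mean_diff": d_mean, "std_diff": d_std, "t_stat": t_stat, "dof": n - 1}

# Placeholder accuracies for 5 seeds (NOT from the paper).
method = [62.0, 63.5, 61.0, 64.0, 62.5]
baseline = [59.0, 60.0, 58.5, 59.5, 59.0]

summary = paired_summary(method, baseline)
```

With 5 seeds there are 4 degrees of freedom, so the two-sided 5% critical value for the t statistic is about 2.776; a reported average gain is only informative alongside this kind of variance estimate.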
Referee: [Environment construction (presumably §3)] All 3.6K tasks, reward signals, and tool-use metrics are defined internally by the authors with no external anchor such as expert clinician ratings of trajectories, comparison against established medical QA datasets (MedQA, PubMedQA, etc.), or prospective deployment data. This renders the translation from GYM scores to clinical utility an unverified assumption that underpins every performance claim.

Authors: The tasks and rewards are grounded in established medical literature, guidelines, and knowledge bases rather than arbitrary definitions. To provide external anchors, we will add new experiments mapping a subset of GYM tasks to MedQA and PubMedQA for direct performance comparison. Comprehensive expert clinician ratings across all 3.6K tasks would require extensive clinical partnerships and ethical review that exceed the scope of the current study; we will explicitly discuss this limitation and include qualitative trajectory examples evaluated against clinical standards. revision: partial
Referee: [Degradation analysis and TT-OPD evaluation] The reported collapse into verbose monologues and the stabilization achieved by TT-OPD are measured exclusively against the authors' own reward structure and knowledge base. No ablation on alternative reward designs, cross-environment transfer, or human-expert trajectory comparisons is provided, leaving open the possibility that the observed phenomena are artifacts of the GYM formulation rather than general properties of multi-turn medical RL.

Authors: We will add ablations using alternative reward structures, including dense per-turn correctness rewards, to demonstrate that verbose monologues and tool-use erosion occur across different formulations. We will also evaluate TT-OPD on a non-medical multi-turn environment (e.g., a standard text-based game) to test cross-domain applicability. Human-expert trajectory comparisons are noted as important future work requiring clinical input; in the revision we will provide additional qualitative analysis of agent trajectories against clinical reasoning patterns. revision: yes

Circularity Check

0 steps flagged

No circularity: claims are direct empirical measurements


The paper presents an empirical study introducing Healthcare AI GYM and evaluating RL methods including the proposed TT-OPD. No derivation chain, mathematical predictions, or first-principles results exist that could reduce to inputs by construction. Performance claims (e.g., best on 10/18 benchmarks, +3.9 pp gain) are reported as direct experimental outcomes in the authors' environment. No equations, self-definitional constructs, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The analysis of training collapse and the TT-OPD proposal are observational and algorithmic, not tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the constructed environment faithfully captures clinical reasoning trajectories and that benchmark scores measure meaningful medical capability; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption The 3.6K+ tasks and 135 tools in the environment produce trajectories that are representative of real clinical reasoning sequences.
Required for the reported training dynamics and benchmark improvements to be meaningful beyond the simulation.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...
A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

Reference graph

Works this paper leans on

14 extracted references · cited by 1 Pith paper

[1]

PathVQA (He et al., 2020) shows TT-OPD at 45.3%, outperforming both base text (40.5%) and GRPO (41.5%)

On VQA-RAD (Lau et al., 2018), Base+AR leads (63.2%) with TT-OPD close behind (63.1%). PathVQA (He et al., 2020) shows TT-OPD at 45.3%, outperforming both base text (40.5%) and GRPO (41.5%). SLAKE (Liu et al., 2021) and PMC-VQA exhibit large gaps between text-based evaluation (79.0%, 57.9%†) and multi-turn agentic evaluation (30.6%, 35.1%), consistent wit...

2018
[2]

It reasons that relative to losartan, lisinopril would increase bradykinin and decrease angiotensin II

Turn 1 (Reasoning): The agent identifies that losartan is an ARB (blocks AT1 receptors) while lisinopril is an ACE inhibitor (blocks angiotensin I → angiotensin II conversion and bradykinin degradation). It reasons that relative to losartan, lisinopril would increase bradykinin and decrease angiotensin II
[3]

ACE inhibitor vs ARB bradykinin angiotensin II mechanism

Turn 2 (retrieve_evidence): Searches for “ACE inhibitor vs ARB bradykinin angiotensin II mechanism” to verify its reasoning
[4]

lisinopril mechanism bradykinin

Turn 3 (search_medical_wiki): Queries “lisinopril mechanism bradykinin” for additional confirmation
[5]

Turn 4 (analyze_answer_options): Systematically evaluates all four options against RAAS pharmacology
[6]

ACE inhibitors prevent bradykinin degradation (increase) and block Ang I→Ang II conversion (decrease)

Turn 5 (submit_answer): Submits A with reasoning: “ACE inhibitors prevent bradykinin degradation (increase) and block Ang I→Ang II conversion (decrease).” Score: 1.18
J.2 Example 2: Emergency Medicine (MedQA, Correct) Question: A 56-year-old man is brought to the ED 30 minutes after falling from a height of 3 feet onto a sharp metal fence pole. He is unconsci...
[7]

It eliminates rotator cuff (too lateral), spinal accessory nerve (too posterior), and internal carotid (too lateral/deep)

Turn 1 (Reasoning): The agent maps the wound location (parasternal clavicular region) to underlying anatomy: the lung apex extends above the clavicle in this area, making pleural injury the primary concern. It eliminates rotator cuff (too lateral), spinal accessory nerve (too posterior), and internal carotid (too lateral/deep)
[8]

penetrating injury clavicle parasternal pleural injury

Turns 2–3 (search_pubmed, retrieve_evidence): Searches for “penetrating injury clavicle parasternal pleural injury” and retrieves medical literature on supraclavicular penetrating trauma complications
[9]

Turn 4 (think): Synthesizes retrieved evidence with anatomical knowledge, confirming that the pleural dome extends 2.5 cm above the medial third of the clavicle. Turn 5 (submit_answer): Submits B (pneumothorax due to pleural injury). Score: 1.18
J.3 Example 3: Transplant Medicine (MedQA, Correct) Question: A 37-year-old man, 7 years post-liver transplant on i...
[10]

Turn 1 (think): The agent identifies this as chronic transplant rejection based on: 7-year post-transplant timeline (late onset), recurrent jaundice despite adherence, reduced vascular flow on Doppler (suggesting vascular pathology), and severely elevated enzymes indicating ongoing graft injury
[11]

Turn 2 (analyze_answer_options): Systematically compares options: bile duct proliferation (acute rejection), hepatocyte ballooning (fatty liver disease), granulomatous inflammation (sarcoidosis/TB), obliterative arteritis with fibrosis (chronic rejection hallmark)
[12]

chronic liver transplant rejection obliterative arteritis histology

Turn 3 (search_medical_wiki): Searches “chronic liver transplant rejection obliterative arteritis histology” for confirmation
[13]

Turn 4 (Reasoning): Integrates clinical and histological knowledge: obliterative arteritis is pathognomonic for chronic rejection, explaining reduced Doppler flow
[14]

Turn 5 (submit_answer): Submits D with detailed reasoning linking clinical presentation to histopathology. Score: 0.80
Key observations. These trajectories reveal three consistent patterns in TT-OPD-trained agents: (1) Reason-first: the agent formulates a hypothesis before searching, reducing irrelevant tool calls; (2) Graceful degradation: when specific search ...
