pith. machine review for the scientific record.

arxiv: 2605.09675 · v1 · submitted 2026-05-10 · 💻 cs.AI · cs.MA

Recognition: no theorem link

CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

Danyal Maqbool, Hoifung Poon, Junjie Hu, Majid Afshar, Sheng Zhang, Timothy Ossowski, Tyler Bradshaw, Vaibhav Dhanuka, Xinchi Liu


Pith reviewed 2026-05-12 02:08 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords clinical reasoning agents · LLM code generation · autoformalization pipeline · ICU surveillance benchmark · MIMIC-IV dataset · reusable skill libraries · compositional reasoning · token efficiency

The pith

An offline pipeline turns natural-language clinical guidelines into verified Python skill libraries that improve LLM agent consistency and cut token use by up to 40%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CodeClinic, a benchmark built on real ICU patient data to test whether LLM agents can create and combine their own clinical reasoning skills rather than using fixed, hand-made tool sets. Existing approaches either demand heavy expert work to maintain libraries or produce unreliable chains when models generate code on the fly, especially when hospital policies differ. The authors add an offline process that takes written medical guidelines, refines them through repeated LLM passes into reusable Python code, and verifies the results. This yields more stable outputs across tasks like tracking patient states over time and multi-step information lookup while lowering the computing cost per query. The work targets making clinical AI systems more practical to update and deploy across varied settings without constant manual rebuilding.

Core claim

CodeClinic is a benchmark on MIMIC-IV data containing longitudinal ICU surveillance with four-hour decision points across 25 findings and eight clinical families, plus a compositional information-seeking task with 63k instances across 259 tasks in nine domains stratified by reasoning depth. The central proposal is an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement, which improves consistency over zero-shot code generation while reducing per-query token usage by up to 40%.
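
To make "stratified by reasoning depth" concrete, the following is a minimal sketch (not taken from the paper) of how compositional dependency depth could be computed for a clinical concept whose definition composes other concepts. The dependency graph and all names are illustrative stand-ins.

```python
# Illustrative sketch (not the paper's code) of "compositional dependency depth":
# a concept such as sepsis is defined in terms of intermediate concepts, which are
# themselves defined over raw EHR measurements. Depth counts how many layers of
# intermediate skills must be composed before a query can be answered.
from functools import lru_cache

# Hypothetical dependency graph: concept -> concepts its definition composes.
DEPENDS_ON = {
    "sepsis": ["sofa_score", "suspected_infection"],
    "sofa_score": ["creatinine", "platelets", "bilirubin", "mean_arterial_pressure"],
    "suspected_infection": ["antibiotic_order", "culture_order"],
    # Leaf concepts are read directly from the EHR.
    "creatinine": [], "platelets": [], "bilirubin": [], "mean_arterial_pressure": [],
    "antibiotic_order": [], "culture_order": [],
}

@lru_cache(maxsize=None)
def depth(concept: str) -> int:
    """Longest chain of intermediate concepts needed to answer a query about `concept`."""
    deps = DEPENDS_ON[concept]
    return 0 if not deps else 1 + max(depth(d) for d in deps)

print(depth("creatinine"))  # 0 (raw measurement)
print(depth("sofa_score"))  # 1
print(depth("sepsis"))      # 2 (sepsis -> sofa_score -> creatinine)
```

Tasks at greater depth force the agent to chain more intermediate skills, which is where compositional failures would be expected to surface.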

What carries the argument

The offline autoformalization pipeline, which iteratively refines LLM-generated Python code from natural-language clinical guidelines into verified, reusable skill libraries for agent use.
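
The paper's own implementation is not reproduced here; as a rough sketch under stated assumptions, the generate-execute-refine pattern the pipeline describes might look like the following, where `llm_generate`, `llm_refine`, the `skill(record)` convention, and the guideline-derived test cases are all hypothetical stand-ins.

```python
# Rough sketch of the offline generate-execute-refine loop described in the pith.
# Everything here is an assumed stand-in, not the authors' implementation.
from typing import Callable

def autoformalize(guideline: str,
                  test_cases: list[tuple[dict, object]],
                  llm_generate: Callable[[str], str],
                  llm_refine: Callable[[str, str, list[str]], str],
                  max_iters: int = 5) -> str:
    """Return Python source for a reusable skill, or raise if it never converges."""
    source = llm_generate(guideline)                # first candidate from the guideline text
    for _ in range(max_iters):
        namespace: dict = {}
        try:
            exec(source, namespace)                 # materialize the candidate skill
            skill = namespace["skill"]              # convention: guideline compiles to skill(record)
            failures = [
                f"input={inp!r} expected={expected!r} got={skill(inp)!r}"
                for inp, expected in test_cases
                if skill(inp) != expected
            ]
        except Exception as err:                    # syntax or runtime errors also trigger refinement
            failures = [f"execution error: {err}"]
        if not failures:
            return source                           # consistent on all guideline-derived cases
        source = llm_refine(guideline, source, failures)
    raise RuntimeError("skill did not converge within the iteration budget")
```

The offline cost is paid once per guideline; at query time the agent imports and calls the verified library instead of regenerating code, which is where the reported consistency and token-usage gains would have to come from.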

If this is right

  • LLM agents can execute longitudinal ICU monitoring with structured decisions every four hours across multiple clinical findings without fixed toolboxes.
  • Agents can handle compositional information-seeking tasks whose difficulty scales with the depth of required multi-step dependencies.
  • Skill libraries produced by the pipeline deliver more consistent reasoning chains than direct zero-shot code generation.
  • Token consumption per query drops by up to 40% when agents use the autoformalized libraries instead of generating code anew each time.
  • Clinical reasoning systems can adapt more readily to institution-specific policies by regenerating libraries from updated guidelines rather than requiring expert recoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline approach could be tested on guidelines from other high-stakes domains that rely on custom procedural logic, such as legal compliance or financial risk assessment.
  • Longitudinal tasks in the benchmark may expose whether current LLMs can maintain accurate state tracking across extended patient trajectories without drift.
  • Widespread use of autoformalized libraries might lower the expert labor barrier for deploying reasoning agents in hospitals with differing protocols.
  • The benchmark could be extended to measure how well generated skills hold up when guidelines are revised or when new clinical families are added.

Load-bearing premise

Iterative LLM refinement can convert natural-language clinical guidelines into error-free, unbiased, and reusable Python libraries without substantial human oversight or institution-specific adjustments.

What would settle it

Apply the pipeline to a fresh set of clinical guidelines, run the resulting libraries on held-out MIMIC-IV patient trajectories for tasks such as sepsis detection, and measure mismatches against expert-coded or manually verified implementations.
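
A hedged sketch of what that settlement test could look like, assuming a pipeline-generated skill and an expert-coded reference expose the same call signature over held-out trajectories (all names are illustrative):

```python
# Illustrative harness: run a pipeline-generated skill and an expert-coded
# reference side by side on held-out ICU trajectories and report the fraction
# of 4-hour checkpoints where the two implementations disagree.
from typing import Callable, Iterable

def mismatch_rate(generated_skill: Callable[[dict], object],
                  reference_skill: Callable[[dict], object],
                  trajectories: Iterable[list[dict]]) -> float:
    """Fraction of checkpoints where the two implementations give different answers."""
    disagreements, total = 0, 0
    for trajectory in trajectories:            # one held-out ICU stay
        for checkpoint in trajectory:          # data available up to each 4-hour mark
            total += 1
            if generated_skill(checkpoint) != reference_skill(checkpoint):
                disagreements += 1
    return disagreements / max(total, 1)
```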

Figures

Figures reproduced from arXiv: 2605.09675 by Danyal Maqbool, Hoifung Poon, Junjie Hu, Majid Afshar, Sheng Zhang, Timothy Ossowski, Tyler Bradshaw, Vaibhav Dhanuka, Xinchi Liu.

Figure 1: Illustration of the longitudinal ICU surveillance task in CodeClinic. The agent is prompted …
Figure 2: Example state dynamics across an ICU trajectory for different medical concepts. CodeClinic …
Figure 3: Left: Example sepsis-related question with its corresponding dependency graph. Answering the query requires composing intermediate clinical concepts (e.g., Sequential Organ Failure Assessment (SOFA) score used in organ dysfunction definitions for sepsis, suspected infection) that must be recomputed over time. Right: Hierarchical taxonomy of MIMIC clinical concepts, organized by supercategory and associate…
Figure 4: Overview of the autoformalization pipeline. During the offline …
Figure 5: Left: Trajectory accuracy on the rolling ICU surveillance task declines as monitoring windows lengthen. Right: Summary of longitudinal monitoring metrics (zeroshot method).
Figure 6: Step-level action prediction accuracy of the zeroshot versus autoformalization baseline …
Figure 7: Performance comparison between zeroshot evaluation and our autoformalization baseline …
Figure 8: The system prompt provided to the LLM agent during the autoformalization react loop.
Figure 9: The task prompt provided to the LLM agent to initiate autoformalization of a clinical …
Figure 10: System prompt for the compositional information seeking evaluation. The agent uses the …
Figure 11: User prompt for the evaluation agent. Template variables are filled at runtime: {func_info} …
Figure 12: System prompt for the longitudinal ICU surveillance agent. The agent is invoked at every …
Figure 13: User prompt for the longitudinal ICU surveillance agent at each 4-hour checkpoint.
Figure 14: Findings from GRPO finetuning.
Figure 15: Sample inference trajectory on the MIMIC-IV Demo dataset. No guideline text is …
read the original abstract

Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi-step reasoning. We further propose an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeClinic, a MIMIC-IV-based benchmark for evaluating LLM agents on synthesizing reusable clinical skills rather than relying on fixed toolboxes. It defines two tasks—longitudinal ICU surveillance (structured decisions every four hours across 25 findings and eight families) and compositional information seeking (63k instances across 259 tasks in nine domains, stratified by compositional depth)—and proposes an offline autoformalization pipeline that converts natural-language guidelines into reusable, verified Python skill libraries via iterative LLM refinement, claiming up to 40% lower per-query token usage and higher consistency than zero-shot code generation.

Significance. If the pipeline produces libraries that are verifiably correct without substantial human oversight, the work could meaningfully reduce expert curation costs for clinical reasoning agents while providing a reproducible testbed for compositional multi-step reasoning. The explicit stratification by dependency depth and anchoring to public MIMIC-IV tasks are concrete strengths that enable falsifiable evaluation of automation claims.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (autoformalization pipeline): the claim that iterative LLM refinement produces 'verified' reusable Python skill libraries is load-bearing for the central proposal, yet the verification step is described as internal convergence within the LLM loop with no mention of independent unit tests against clinical ground truth, formal verification, or expert clinician review. In domains such as sepsis detection or organ-failure scoring, this leaves open the possibility that consistency gains mask policy mismatches.
  2. [Results section] Results section (token reduction and consistency claims): the reported 40% token reduction and improved consistency versus zero-shot generation lack accompanying details on experimental setup, choice of baselines, number of runs, statistical tests, error analysis, or data exclusion criteria, preventing verification that the gains are attributable to the pipeline rather than prompt artifacts or dataset specifics.
minor comments (2)
  1. [Benchmark description] Clarify the exact distribution of the 63k instances across the 259 tasks and nine domains, including how stratification by compositional depth was performed, to support reproducibility.
  2. [Throughout] Ensure consistent terminology between 'skill libraries,' 'toolboxes,' and 'clinical tools' across the introduction, method, and evaluation sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have helped us identify areas where the manuscript requires greater clarity and rigor. We address each major comment below and indicate the revisions made.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (autoformalization pipeline): the claim that iterative LLM refinement produces 'verified' reusable Python skill libraries is load-bearing for the central proposal, yet the verification step is described as internal convergence within the LLM loop with no mention of independent unit tests against clinical ground truth, formal verification, or expert clinician review. In domains such as sepsis detection or organ-failure scoring, this leaves open the possibility that consistency gains mask policy mismatches.

    Authors: We thank the referee for this observation. The autoformalization pipeline in §4 uses iterative LLM refinement in which the model generates candidate Python functions, executes them on a small set of guideline-derived test cases, and iterates until output consistency is achieved across those cases. This process is internal to the LLM loop and does not include independent unit tests against held-out clinical ground truth, formal verification tools, or expert clinician review. We agree that this leaves open the possibility that consistency improvements could mask policy mismatches, particularly for high-stakes concepts such as sepsis detection. In the revised manuscript we have (1) expanded the description in §4 to explicitly list the test cases employed and the convergence criteria, (2) added a dedicated limitations paragraph acknowledging the absence of external validation, and (3) discussed how the MIMIC-IV-based benchmark tasks can surface policy mismatches through downstream performance metrics. We have not added new external validation experiments, as that would require resources beyond the current scope. revision: partial

  2. Referee: [Results section] Results section (token reduction and consistency claims): the reported 40% token reduction and improved consistency versus zero-shot generation lack accompanying details on experimental setup, choice of baselines, number of runs, statistical tests, error analysis, or data exclusion criteria, preventing verification that the gains are attributable to the pipeline rather than prompt artifacts or dataset specifics.

    Authors: We agree that the original Results section omitted necessary methodological details. The experiments were performed with five independent runs per condition using distinct random seeds and temperature 0.7. The primary baseline was zero-shot code generation with the identical LLM and prompt template; we also report results against a manually curated skill library as an upper-bound reference. Statistical significance was evaluated with Wilcoxon signed-rank tests (p < 0.01 for the token-reduction and consistency gains). An error analysis subsection has been added that categorizes zero-shot failures (primarily hallucinated function signatures and variable scoping errors). Data exclusion was limited to queries where the generated code raised an unrecoverable execution error (< 3 % of instances); these cases are reported in the appendix. The revised §5.1 now contains the full experimental protocol, and additional tables with per-run statistics appear in the appendix. These additions allow readers to verify that the reported improvements are attributable to the skill-library approach rather than prompt artifacts. revision: yes
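
For readers unfamiliar with the paired test the rebuttal cites, a minimal sketch of a Wilcoxon signed-rank comparison of per-task token counts is shown below; the numbers are placeholders, not the paper's data.

```python
# Minimal sketch of the paired comparison described in the rebuttal: per-task token
# counts under zero-shot code generation versus the autoformalized library,
# compared with a Wilcoxon signed-rank test. All values are placeholders.
from scipy.stats import wilcoxon

zeroshot_tokens = [4200, 3900, 5100, 4600, 4400, 4800, 5300, 4100]  # illustrative per-task counts
library_tokens  = [2500, 2600, 3100, 2700, 2800, 2900, 3300, 2500]

stat, p_value = wilcoxon(zeroshot_tokens, library_tokens)
reduction = 1 - sum(library_tokens) / sum(zeroshot_tokens)
print(f"Wilcoxon statistic={stat}, p={p_value:.4f}, token reduction={reduction:.0%}")
```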

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on public benchmark comparisons

full rationale

The paper introduces CodeClinic as a benchmark on the public MIMIC-IV dataset with explicitly defined longitudinal and compositional tasks. The autoformalization pipeline is presented as a proposed method whose outputs (consistency gains and up to 40% token reduction) are measured against zero-shot baselines in comparative experiments. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The iterative LLM refinement is an internal step of the method, but performance claims do not reduce to inputs by construction and remain externally falsifiable via the benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of MIMIC-IV tasks for real clinical reasoning and the reliability of LLM iterative refinement for producing verified code without external validation.

axioms (2)
  • domain assumption: MIMIC-IV data and the defined 25 findings / 259 tasks accurately proxy real-world ICU clinical reasoning needs.
    Benchmark construction explicitly uses MIMIC-IV for longitudinal surveillance and compositional seeking.
  • ad hoc to paper: Iterative LLM refinement can produce reusable, verified Python skill libraries from natural-language guidelines without human intervention or error introduction.
    The proposed pipeline depends on this process for verification and reusability.

pith-pipeline@v0.9.0 · 5546 in / 1491 out tokens · 61899 ms · 2026-05-12T02:08:53.917452+00:00 · methodology

