Recognition: unknown
Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus
Pith reviewed 2026-05-08 03:33 UTC · model grok-4.3
The pith
An agentic reasoning system reaches 79.6 percent concordance with expert consensus on longitudinal myeloma records, outperforming retrieval and full-context baselines, with the largest gains on complex questions and long treatment histories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An agentic reasoning system, which decomposes queries, retrieves targeted segments from longitudinal records, and iterates until a synthesized answer is reached, achieved 79.6 percent concordance with double-annotated expert consensus on 469 patient-question pairs. This exceeded the 75.4-75.8 percent ceiling shared by iterative RAG and full-context baselines, with gains of 9.4 percentage points on criteria-based synthesis and 13.5 percentage points in the longest-record decile. While the system's 12.2 percent error rate was comparable to the 13.6 percent expert disagreement rate, 57.8 percent of system errors were clinically significant versus 18.8 percent of expert disagreements.
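As a rough plausibility check on these figures, 79.6 percent concordance on 469 pairs corresponds to about 373 concordant answers, and a simple binomial interval lands near the reported 76.4-82.8 percent. The sketch below is illustrative only: the count of 373 is inferred rather than reported, and the paper may have computed its interval differently (for example by bootstrap or patient-level clustering).

```python
import math

# Illustrative sanity check, not the paper's actual analysis.
# 79.6% concordance on 469 patient-question pairs implies ~373 concordant answers (inferred).
n = 469
concordant = round(0.796 * n)
p_hat = concordant / n

# Simple normal-approximation (Wald) 95% confidence interval.
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci_lo, ci_hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"concordance ~{p_hat:.1%}, 95% CI ~({ci_lo:.1%}, {ci_hi:.1%})")
# Prints roughly 79.5%, (75.9%, 83.2%) - in the same ballpark as the reported
# 76.4-82.8%, which may come from a different interval method.
```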
What carries the argument
Agentic reasoning loop that plans sub-steps, retrieves from distributed clinical documents and lab values, and iterates to produce a final synthesis for each patient question.
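The summary does not expose the authors' implementation, so the following is a minimal sketch of what such a plan-retrieve-iterate loop typically looks like; every name here (plan_next_substep, record_index.search, synthesize_answer) is hypothetical, not the paper's API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch; none of these objects correspond to the paper's implementation.

@dataclass
class AgentState:
    question: str
    evidence: list[str] = field(default_factory=list)
    answer: str | None = None

def answer_patient_question(question: str, record_index, llm, max_steps: int = 8) -> str:
    """Decompose the question, retrieve targeted record segments, iterate, then synthesize."""
    state = AgentState(question=question)
    for _ in range(max_steps):
        # Plan: ask the model which sub-question to resolve next, given evidence gathered so far.
        substep = llm.plan_next_substep(state.question, state.evidence)
        if substep is None:                     # the model judges the evidence sufficient
            break
        # Retrieve: targeted search over clinical documents and laboratory values.
        state.evidence.extend(record_index.search(substep, top_k=5))
    # Synthesize: final answer grounded in the accumulated evidence.
    state.answer = llm.synthesize_answer(state.question, state.evidence)
    return state.answer
```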
If this is right
- Agentic decomposition and iteration allow performance to improve with question complexity where flat retrieval methods plateau.
- Gains concentrate in the longest records, indicating the approach scales to patients with the most extensive treatment histories.
- Only the agentic method crossed the shared performance ceiling of the other tested architectures.
- Similar overall error rates to experts do not imply equal safety, given the higher proportion of clinically significant system mistakes.
Where Pith is reading between the lines
- The pattern of larger gains on complex synthesis tasks suggests the same agentic structure could be tested on other chronic conditions with multi-year, multi-document records.
- The inversion in error severity implies that any deployment would need additional safeguards such as clinician review flags for high-stakes outputs; a gating sketch follows this list.
- External validation on MIMIC-IV hints at possible generalization beyond a single tertiary center, but center-specific documentation practices remain a variable to measure.
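As flagged in the second point above, one plausible safeguard is a gating policy that routes high-stakes or weakly supported outputs to clinician review before they are surfaced. The sketch below is a hypothetical illustration; the categories, thresholds, and field names are assumptions, not anything described in the paper.

```python
# Hypothetical gating policy; categories and thresholds are illustrative assumptions.
HIGH_STAKES_CATEGORIES = {"treatment_recommendation", "eligibility_assessment", "dose_adjustment"}

def needs_clinician_review(question_category: str, model_confidence: float,
                           n_cited_segments: int) -> bool:
    """Return True if the answer should be held for clinician review before display."""
    if question_category in HIGH_STAKES_CATEGORIES:
        return True                      # always review outputs that could drive therapy decisions
    if model_confidence < 0.8:           # calibrated confidence below an agreed threshold
        return True
    if n_cited_segments < 2:             # answer rests on a single retrieved segment
        return True
    return False
```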
Load-bearing premise
Double-annotated expert consensus on the 469 questions provides a stable and unbiased ground truth, even though experts disagreed on 13.6 percent of cases and system errors proved more clinically consequential.
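For concreteness, a double-annotation-with-adjudication reference of this kind is usually assembled along the lines of the sketch below, where the 13.6 percent figure would be the share of pairs on which the two primary annotators disagreed before the senior haematologist resolved them. This is an illustrative reconstruction, not the paper's described pipeline.

```python
# Illustrative reconstruction of double annotation with senior adjudication.
def build_reference_labels(annotator_a: dict, annotator_b: dict, adjudicate) -> tuple[dict, float]:
    """Return consensus labels plus the raw disagreement rate between the two annotators."""
    labels, disagreements = {}, 0
    for item_id, label_a in annotator_a.items():
        label_b = annotator_b[item_id]
        if label_a == label_b:
            labels[item_id] = label_a
        else:
            disagreements += 1
            labels[item_id] = adjudicate(item_id, label_a, label_b)  # senior haematologist decides
    return labels, disagreements / len(annotator_a)  # the paper reports ~13.6% disagreement
```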
What would settle it
A prospective trial in which the agentic system assists live clinical decisions for new myeloma patients and the rate of clinically significant errors is measured against expert-led care.
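To give a sense of scale for such a trial, the retrospective figures imply roughly 12.2% × 57.8% ≈ 7.1 percent clinically significant system errors per answered question versus 13.6% × 18.8% ≈ 2.6 percent for experts. A standard two-proportion sample-size calculation on those back-of-envelope rates is sketched below; the rates are illustrative assumptions derived from this summary, not endpoints defined by the paper.

```python
import math

# Back-of-envelope power calculation; event rates are rough products of figures in this
# summary, not endpoints defined by the paper.
p_system = 0.122 * 0.578   # ~7.1% clinically significant errors per question
p_expert = 0.136 * 0.188   # ~2.6% clinically significant expert disagreements per question

def n_per_arm(p1: float, p2: float) -> int:
    """Questions per arm for a two-proportion comparison at two-sided alpha 0.05, 80% power."""
    z_a, z_b = 1.96, 0.8416
    pooled = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * pooled * (1 - pooled))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

print(n_per_arm(p_system, p_expert))   # on the order of a few hundred questions per arm
```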
Figures
read the original abstract
Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates an agentic clinical reasoning system against single-pass RAG, iterative RAG, and full-context baselines on 469 patient-question pairs drawn from longitudinal myeloma records (811 patients, 44,962 documents). Using double-annotated expert consensus (four oncologists with senior adjudication) as reference, it reports 79.6% concordance (95% CI 76.4-82.8) for the agentic system versus 75.4-75.8% for the converged baselines (p=0.006 and 0.007), with larger gains on criteria-based synthesis (+9.4 pp) and longest records (+13.5 pp). Error rates are comparable to expert disagreement (12.2% vs 13.6%), but 57.8% of system errors are clinically significant versus 18.8% of expert disagreements. External validation on MIMIC-IV is mentioned.
Significance. If the evaluation holds, the work shows that agentic multi-step reasoning can exceed the performance plateau of standard retrieval and context-window approaches for complex longitudinal synthesis in oncology. The concentration of gains on high-complexity questions and long records provides concrete evidence for the value of agentic designs in handling real-world clinical data distributions. The statistical reporting (CIs, p-values) and head-to-head design against independent expert labels are strengths that make the empirical comparison falsifiable and reproducible in principle.
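Because all systems answer the same 469 items, the head-to-head comparison is paired, and a test on the discordant pairs (McNemar-style) is one standard way such p-values are obtained. The summary does not state which test the authors used, and the counts in the sketch below are placeholders chosen only to match the rough +4 pp margin, not reported values.

```python
from math import erf, sqrt

# Illustrative McNemar-style paired comparison; the counts are placeholders, not reported values.
def mcnemar_p(b: int, c: int) -> float:
    """Two-sided p-value (normal approximation with continuity correction).
    b = items the agentic system answered correctly and the baseline missed; c = the reverse."""
    z = max(abs(b - c) - 1, 0) / sqrt(b + c)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

print(mcnemar_p(b=40, c=20))   # a net gain of 20/469 items is roughly +4.3 pp
```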
major comments (2)
- [Abstract] Abstract: The central superiority claim (79.6% vs 75.4-75.8%, p<0.01) rests on expert consensus as an unbiased ground truth, yet the manuscript reports 13.6% expert disagreement and an inversion in clinical significance (57.8% of system errors clinically significant vs 18.8% of expert disagreements). This raises a load-bearing concern that the reference standard may systematically understate the practical cost of agent errors; additional stratified analysis of error types by severity and question complexity is required to interpret the 3.8-4.2 pp margin.
- [Methods] Methods/evaluation setup: No details are provided on the precise agent architecture (tool use, memory, planning loop), retrieval implementation, prompt templates, or the construction and validation of the 48 question templates across complexity levels. Without these, the performance differential cannot be audited or attributed to specific design choices versus implementation artifacts.
minor comments (2)
- Consider adding a supplementary table or figure breaking down concordance by the three complexity levels and by record-length deciles to make the reported trends (+9.4 pp and +13.5 pp) directly verifiable; a stratification sketch follows this list.
- The external validation on MIMIC-IV is mentioned but not quantified; a brief summary of concordance or error patterns on that cohort would strengthen generalizability claims.
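The first suggestion above amounts to a straightforward stratified tabulation. A minimal sketch, assuming a per-item results table with hypothetical column names, might look like this:

```python
import pandas as pd

# Sketch of the stratified breakdown suggested above; column names are hypothetical and
# `results` is expected to hold one row per patient-question pair.
def stratified_concordance(results: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Mean concordance per complexity level and per record-length decile."""
    results = results.copy()
    results["length_decile"] = pd.qcut(
        results["record_length"], 10, labels=False, duplicates="drop"
    )
    systems = ["agentic_correct", "baseline_correct"]   # 1 = concordant with expert consensus
    by_complexity = results.groupby("complexity")[systems].mean()
    by_decile = results.groupby("length_decile")[systems].mean()
    return by_complexity, by_decile
```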
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript evaluating an agentic clinical reasoning system for longitudinal myeloma records. We address each major comment below and outline the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central superiority claim (79.6% vs 75.4-75.8%, p<0.01) rests on expert consensus as an unbiased ground truth, yet the manuscript reports 13.6% expert disagreement and an inversion in clinical significance (57.8% of system errors clinically significant vs 18.8% of expert disagreements). This raises a load-bearing concern that the reference standard may systematically understate the practical cost of agent errors; additional stratified analysis of error types by severity and question complexity is required to interpret the 3.8-4.2 pp margin.
Authors: We agree that a more detailed breakdown of errors is valuable for interpreting the results. The manuscript already presents the overall expert disagreement rate and the proportion of clinically significant errors for both the system and experts. To address the referee's concern, we will include additional stratified analyses in the revised version, specifically cross-tabulating error severity with question complexity levels and record length. This will clarify whether the performance gains are accompanied by acceptable error profiles in the most challenging cases. revision: yes
-
Referee: [Methods] Methods/evaluation setup: No details are provided on the precise agent architecture (tool use, memory, planning loop), retrieval implementation, prompt templates, or the construction and validation of the 48 question templates across complexity levels. Without these, the performance differential cannot be audited or attributed to specific design choices versus implementation artifacts.
Authors: We appreciate this feedback on reproducibility. Although the Methods section describes the overall evaluation setup, we acknowledge that specific implementation details such as the agent's planning loop, tool definitions, retrieval parameters, and prompt templates were not fully expanded. In the revision, we will add a comprehensive description of the agent architecture, including pseudocode for the reasoning loop, the retrieval system (embedding model, indexing strategy), all prompt templates, and the methodology for creating and validating the 48 question templates, including how complexity levels were assigned and any validation steps performed. revision: yes
Circularity Check
No circularity: purely empirical head-to-head evaluation against external labels
full rationale
The paper reports a retrospective comparison of an agentic LLM system versus RAG baselines on 469 patient-question pairs, with performance measured directly against double-annotated expert consensus labels obtained independently of the system. No equations, parameter fits, uniqueness theorems, or self-citations are invoked to derive the 79.6% concordance figure or the reported gains; the result is a straightforward accuracy count against an external reference standard. Even though the paper notes 13.6% expert disagreement and severity differences, these are empirical observations about the ground truth rather than any reduction of the claimed superiority to the inputs by construction. The evaluation is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Double annotation by four oncologists with senior adjudication produces a stable ground-truth label for each patient-question pair.
- domain assumption The 48 question templates at three complexity levels adequately sample the space of clinically relevant decisions in myeloma care.
Reference graph
Works this paper leans on
-
[1]
Multiple myeloma: 2022 update on diagnosis, risk stratification, and management
Rajkumar SV. Multiple myeloma: 2022 update on diagnosis, risk stratification, and management. American journal of hematology. 2022;97(8):1086-107
2022
-
[2]
Timer: Temporal instruction modeling and evaluation for longitudinal clinical records
Cui H, Unell A, Chen B, Fries JA, Alsentzer E, Koyejo S, et al. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. npj Digital Medicine. 2025;8(1):577
2025
-
[3]
National trends in oncology specialists’ EHR inbox work, 2019–2022
Holmgren AJ, Apathy NC, Crews J, Shanafelt T. National trends in oncology specialists’ EHR inbox work, 2019–2022. JNCI: Journal of the National Cancer Institute. 2025;117(6):1253-9
2025
-
[4]
Performance and improvement strategies for adapting generative large language models for electronic health record applications: a systematic review
Du X, Zhou Z, Wang Y, Chuang YW, Li Y, Yang R, et al. Performance and improvement strategies for adapting generative large language models for electronic health record applications: a systematic review. International Journal of Medical Informatics. 2025:106091
2025
-
[5]
Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines
Liu S, McCoy AB, Wright A. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. Journal of the American Medical Informatics Association. 2025;32(4):605-15
2025
-
[6]
LLM-based agentic systems in medicine and healthcare
Qiu J, Lam K, Li G, Acharya A, Wong TY, Darzi A, et al. LLM-based agentic systems in medicine and healthcare. Nature Machine Intelligence. 2024;6(12):1418-20
2024
-
[7]
Artificial intelligence agents in cancer research and oncology
Truhn D, Azizi S, Zou J, Cerda-Alberich L, Mahmood F, Kather JN. Artificial intelligence agents in cancer research and oncology. Nature Reviews Cancer. 2026:1-14
2026
-
[8]
GPT-4 for information retrieval and comparison of medical oncology guidelines
Ferber D, Wiest IC, Wölflein G, Ebert MP, Beutel G, Eckardt JN, et al. GPT-4 for information retrieval and comparison of medical oncology guidelines. Nejm Ai. 2024;1(6):AIcs2300235
2024
-
[9]
Enhancing Oncology-Specific Question Answering With Large Language Models Through Fine-Tuned Embeddings With Synthetic Data
Lu KH, Mehdinia S, Man K, Wong CW, Mao A, Eftekhari Z. Enhancing Oncology-Specific Question Answering With Large Language Models Through Fine-Tuned Embeddings With Synthetic Data. JCO Clinical Cancer Informatics. 2025;9:e2500011
2025
-
[10]
Verifact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts
Liu X, Zhang L, Munir S, Gu Y, Wang L. Verifact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; 2025. p. 17919-36
2025
-
[11]
Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records
Chung P, Swaminathan A, Goodell AJ, Kim Y, Momsen Reincke S, Han L, et al. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI. 2025;3(1):AIdbp2500418
2025
-
[12]
Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs
Myers S, Dligach D, Miller TA, Barr S, Gao Y, Churpek M, et al. Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs. arXiv [preprint] arXiv:250814817. 2025
2025
-
[13]
Towards conversational diagnostic artificial intelligence
Tu T, Schaekermann M, Palepu A, Saab K, Freyberg J, Tanno R, et al. Towards conversational diagnostic artificial intelligence. Nature. 2025;642(8067):442-50
2025
-
[14]
Agentclinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
Schmidgall S, Ziaei R, Harris C, Reis E, Jopling J, Moor M. Agentclinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv [preprint] arXiv:240507960. 2024
2024
-
[15]
Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records
Shi W, Xu R, Zhuang Y, Yu Y, Zhang J, Wu H, et al. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; 2024. p. 22315-39
2024
-
[16]
MedAgentBench: a virtual EHR environment to benchmark medical LLM agents
Jiang Y, Black KC, Geng G, Park D, Zou J, Ng AY, et al. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. Nejm Ai. 2025;2(9):AIdbp2500144
2025
-
[17]
Fhir-agentbench: Benchmarking llm agents for realistic interoperable ehr question answering
Lee G, Bach E, Yang E, Pollard T, Johnson A, Choi E, et al. Fhir-agentbench: Benchmarking llm agents for realistic interoperable ehr question answering. arXiv [preprint] arXiv:250919319. 2025
2025
-
[18]
MIMIC-IV-Note: Deidentified free-text clinical notes
Johnson A, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV-Note: Deidentified free-text clinical notes. PhysioNet. 2023 Jan. Version 2.2. Available from: https://doi.org/10.13026/1n74-ne17
2023
-
[19]
MIMIC-IV, a freely accessible electronic health record dataset
Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific data. 2023;10(1):1
2023
-
[20]
PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals
Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. circulation. 2000;101(23):e215-20
2000
-
[21]
gpt-oss-120b & gpt-oss-20b model card
Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, Arbus E, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv [preprint] arXiv:250810925. 2025
2025
-
[22]
Retrieval-augmented generation for knowledge-intensive nlp tasks
Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems. 2020;33:9459-74
2020
-
[23]
A hybrid retrieval approach for advancing retrieval-augmented generation systems
Doan NN, Härmä A, Celebi R, Gottardo V. A hybrid retrieval approach for advancing retrieval-augmented generation systems. In: Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024); 2024. p. 397-409
2024
-
[24]
Lost in the middle: How language models use long contexts
Liu NF, Lin K, Hewitt J, Paranjape A, Bevilacqua M, Petroni F, et al. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics. 2024;12:157-73
2024
Figure 3 (caption): Agentic system enables structured, traceable clinical reasoning across longitudinal patient records. (a) User-facing workflow illustrating query input and generation of citation-backed answers grounded in patient records.