Recognition: unknown
Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus
Pith reviewed 2026-05-08 03:33 UTC · model grok-4.3
The pith
An agentic reasoning system reaches 79.6 percent concordance with expert consensus on longitudinal myeloma records, outperforming retrieval and full-context baselines, with the largest gains on complex questions and long treatment histories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An agentic reasoning system, which decomposes queries, retrieves targeted segments from longitudinal records, and iterates until a synthesized answer is reached, achieved 79.6 percent concordance with double-annotated expert consensus on 469 patient-question pairs. This exceeded the 75.4-75.8 percent ceiling shared by iterative RAG and full-context baselines, with gains of 9.4 percentage points on criteria-based synthesis and 13.5 percentage points in the longest-record decile. While the system's 12.2 percent error rate was comparable to the 13.6 percent expert disagreement rate, 57.8 percent of system errors were clinically significant versus 18.8 percent of expert disagreements.
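As a rough plausibility check on these figures, 79.6 percent concordance on 469 pairs corresponds to about 373 concordant answers, and a simple binomial interval lands near the reported 76.4-82.8 percent. The sketch below is illustrative only: the count of 373 is inferred rather than reported, and the paper may have computed its interval differently (for example by bootstrap or patient-level clustering).

```python
import math

# Illustrative sanity check, not the paper's actual analysis.
# 79.6% concordance on 469 patient-question pairs implies ~373 concordant answers (inferred).
n = 469
concordant = round(0.796 * n)
p_hat = concordant / n

# Simple normal-approximation (Wald) 95% confidence interval.
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci_lo, ci_hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"concordance ~{p_hat:.1%}, 95% CI ~({ci_lo:.1%}, {ci_hi:.1%})")
# Prints roughly 79.5%, (75.9%, 83.2%) - in the same ballpark as the reported
# 76.4-82.8%, which may come from a different interval method.
```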
What carries the argument
Agentic reasoning loop that plans sub-steps, retrieves from distributed clinical documents and lab values, and iterates to produce a final synthesis for each patient question.
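The summary does not expose the authors' implementation, so the following is a minimal sketch of what such a plan-retrieve-iterate loop typically looks like; every name here (plan_next_substep, record_index.search, synthesize_answer) is hypothetical, not the paper's API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch; none of these objects correspond to the paper's implementation.

@dataclass
class AgentState:
    question: str
    evidence: list[str] = field(default_factory=list)
    answer: str | None = None

def answer_patient_question(question: str, record_index, llm, max_steps: int = 8) -> str:
    """Decompose the question, retrieve targeted record segments, iterate, then synthesize."""
    state = AgentState(question=question)
    for _ in range(max_steps):
        # Plan: ask the model which sub-question to resolve next, given evidence gathered so far.
        substep = llm.plan_next_substep(state.question, state.evidence)
        if substep is None:                     # the model judges the evidence sufficient
            break
        # Retrieve: targeted search over clinical documents and laboratory values.
        state.evidence.extend(record_index.search(substep, top_k=5))
    # Synthesize: final answer grounded in the accumulated evidence.
    state.answer = llm.synthesize_answer(state.question, state.evidence)
    return state.answer
```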
If this is right
- Agentic decomposition and iteration allow performance to improve with question complexity where flat retrieval methods plateau.
- Gains concentrate in the longest records, indicating the approach scales to patients with the most extensive treatment histories.
- Only the agentic method crossed the shared performance ceiling of the other tested architectures.
- Similar overall error rates to experts do not imply equal safety, given the higher proportion of clinically significant system mistakes.
Where Pith is reading between the lines
- The pattern of larger gains on complex synthesis tasks suggests the same agentic structure could be tested on other chronic conditions with multi-year, multi-document records.
- The inversion in error severity implies that any deployment would need additional safeguards such as clinician review flags for high-stakes outputs; a gating sketch follows this list.
- External validation on MIMIC-IV hints at possible generalization beyond a single tertiary center, but center-specific documentation practices remain a variable to measure.
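As flagged in the second point above, one plausible safeguard is a gating policy that routes high-stakes or weakly supported outputs to clinician review before they are surfaced. The sketch below is a hypothetical illustration; the categories, thresholds, and field names are assumptions, not anything described in the paper.

```python
# Hypothetical gating policy; categories and thresholds are illustrative assumptions.
HIGH_STAKES_CATEGORIES = {"treatment_recommendation", "eligibility_assessment", "dose_adjustment"}

def needs_clinician_review(question_category: str, model_confidence: float,
                           n_cited_segments: int) -> bool:
    """Return True if the answer should be held for clinician review before display."""
    if question_category in HIGH_STAKES_CATEGORIES:
        return True                      # always review outputs that could drive therapy decisions
    if model_confidence < 0.8:           # calibrated confidence below an agreed threshold
        return True
    if n_cited_segments < 2:             # answer rests on a single retrieved segment
        return True
    return False
```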
Load-bearing premise
Double-annotated expert consensus on the 469 questions provides a stable and unbiased ground truth, even though experts disagreed on 13.6 percent of cases and system errors proved more clinically consequential.
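For concreteness, a double-annotation-with-adjudication reference of this kind is usually assembled along the lines of the sketch below, where the 13.6 percent figure would be the share of pairs on which the two primary annotators disagreed before the senior haematologist resolved them. This is an illustrative reconstruction, not the paper's described pipeline.

```python
# Illustrative reconstruction of double annotation with senior adjudication.
def build_reference_labels(annotator_a: dict, annotator_b: dict, adjudicate) -> tuple[dict, float]:
    """Return consensus labels plus the raw disagreement rate between the two annotators."""
    labels, disagreements = {}, 0
    for item_id, label_a in annotator_a.items():
        label_b = annotator_b[item_id]
        if label_a == label_b:
            labels[item_id] = label_a
        else:
            disagreements += 1
            labels[item_id] = adjudicate(item_id, label_a, label_b)  # senior haematologist decides
    return labels, disagreements / len(annotator_a)  # the paper reports ~13.6% disagreement
```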
What would settle it
A prospective trial in which the agentic system assists live clinical decisions for new myeloma patients and the rate of clinically significant errors is measured against expert-led care.
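To give a sense of scale for such a trial, the retrospective figures imply roughly 12.2% × 57.8% ≈ 7.1 percent clinically significant system errors per answered question versus 13.6% × 18.8% ≈ 2.6 percent for experts. A standard two-proportion sample-size calculation on those back-of-envelope rates is sketched below; the rates are illustrative assumptions derived from this summary, not endpoints defined by the paper.

```python
import math

# Back-of-envelope power calculation; event rates are rough products of figures in this
# summary, not endpoints defined by the paper.
p_system = 0.122 * 0.578   # ~7.1% clinically significant errors per question
p_expert = 0.136 * 0.188   # ~2.6% clinically significant expert disagreements per question

def n_per_arm(p1: float, p2: float) -> int:
    """Questions per arm for a two-proportion comparison at two-sided alpha 0.05, 80% power."""
    z_a, z_b = 1.96, 0.8416
    pooled = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * pooled * (1 - pooled))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

print(n_per_arm(p_system, p_expert))   # on the order of a few hundred questions per arm
```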
Figures
read the original abstract
Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates an agentic clinical reasoning system against single-pass RAG, iterative RAG, and full-context baselines on 469 patient-question pairs drawn from longitudinal myeloma records (811 patients, 44,962 documents). Using double-annotated expert consensus (four oncologists with senior adjudication) as reference, it reports 79.6% concordance (95% CI 76.4-82.8) for the agentic system versus 75.4-75.8% for the converged baselines (p=0.006 and 0.007), with larger gains on criteria-based synthesis (+9.4 pp) and longest records (+13.5 pp). Error rates are comparable to expert disagreement (12.2% vs 13.6%), but 57.8% of system errors are clinically significant versus 18.8% of expert disagreements. External validation on MIMIC-IV is mentioned.
Significance. If the evaluation holds, the work shows that agentic multi-step reasoning can exceed the performance plateau of standard retrieval and context-window approaches for complex longitudinal synthesis in oncology. The concentration of gains on high-complexity questions and long records provides concrete evidence for the value of agentic designs in handling real-world clinical data distributions. The statistical reporting (CIs, p-values) and head-to-head design against independent expert labels are strengths that make the empirical comparison falsifiable and reproducible in principle.
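Because all systems answer the same 469 items, the head-to-head comparison is paired, and a test on the discordant pairs (McNemar-style) is one standard way such p-values are obtained. The summary does not state which test the authors used, and the counts in the sketch below are placeholders chosen only to match the rough +4 pp margin, not reported values.

```python
from math import erf, sqrt

# Illustrative McNemar-style paired comparison; the counts are placeholders, not reported values.
def mcnemar_p(b: int, c: int) -> float:
    """Two-sided p-value (normal approximation with continuity correction).
    b = items the agentic system answered correctly and the baseline missed; c = the reverse."""
    z = max(abs(b - c) - 1, 0) / sqrt(b + c)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

print(mcnemar_p(b=40, c=20))   # a net gain of 20/469 items is roughly +4.3 pp
```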
major comments (2)
- [Abstract] Abstract: The central superiority claim (79.6% vs 75.4-75.8%, p<0.01) rests on expert consensus as an unbiased ground truth, yet the manuscript reports 13.6% expert disagreement and an inversion in clinical significance (57.8% of system errors clinically significant vs 18.8% of expert disagreements). This raises a load-bearing concern that the reference standard may systematically understate the practical cost of agent errors; additional stratified analysis of error types by severity and question complexity is required to interpret the 3.8-4.2 pp margin.
- [Methods] Methods/evaluation setup: No details are provided on the precise agent architecture (tool use, memory, planning loop), retrieval implementation, prompt templates, or the construction and validation of the 48 question templates across complexity levels. Without these, the performance differential cannot be audited or attributed to specific design choices versus implementation artifacts.
minor comments (2)
- Consider adding a supplementary table or figure breaking down concordance by the three complexity levels and by record-length deciles to make the reported trends (+9.4 pp and +13.5 pp) directly verifiable; a stratification sketch follows this list.
- The external validation on MIMIC-IV is mentioned but not quantified; a brief summary of concordance or error patterns on that cohort would strengthen generalizability claims.
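The first suggestion above amounts to a straightforward stratified tabulation. A minimal sketch, assuming a per-item results table with hypothetical column names, might look like this:

```python
import pandas as pd

# Sketch of the stratified breakdown suggested above; column names are hypothetical and
# `results` is expected to hold one row per patient-question pair.
def stratified_concordance(results: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Mean concordance per complexity level and per record-length decile."""
    results = results.copy()
    results["length_decile"] = pd.qcut(
        results["record_length"], 10, labels=False, duplicates="drop"
    )
    systems = ["agentic_correct", "baseline_correct"]   # 1 = concordant with expert consensus
    by_complexity = results.groupby("complexity")[systems].mean()
    by_decile = results.groupby("length_decile")[systems].mean()
    return by_complexity, by_decile
```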
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript evaluating an agentic clinical reasoning system for longitudinal myeloma records. We address each major comment below and outline the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central superiority claim (79.6% vs 75.4-75.8%, p<0.01) rests on expert consensus as an unbiased ground truth, yet the manuscript reports 13.6% expert disagreement and an inversion in clinical significance (57.8% of system errors clinically significant vs 18.8% of expert disagreements). This raises a load-bearing concern that the reference standard may systematically understate the practical cost of agent errors; additional stratified analysis of error types by severity and question complexity is required to interpret the 3.8-4.2 pp margin.
Authors: We agree that a more detailed breakdown of errors is valuable for interpreting the results. The manuscript already presents the overall expert disagreement rate and the proportion of clinically significant errors for both the system and experts. To address the referee's concern, we will include additional stratified analyses in the revised version, specifically cross-tabulating error severity with question complexity levels and record length. This will clarify whether the performance gains are accompanied by acceptable error profiles in the most challenging cases. revision: yes
-
Referee: [Methods] Methods/evaluation setup: No details are provided on the precise agent architecture (tool use, memory, planning loop), retrieval implementation, prompt templates, or the construction and validation of the 48 question templates across complexity levels. Without these, the performance differential cannot be audited or attributed to specific design choices versus implementation artifacts.
Authors: We appreciate this feedback on reproducibility. Although the Methods section describes the overall evaluation setup, we acknowledge that specific implementation details such as the agent's planning loop, tool definitions, retrieval parameters, and prompt templates were not fully expanded. In the revision, we will add a comprehensive description of the agent architecture, including pseudocode for the reasoning loop, the retrieval system (embedding model, indexing strategy), all prompt templates, and the methodology for creating and validating the 48 question templates, including how complexity levels were assigned and any validation steps performed. revision: yes
Circularity Check
No circularity: purely empirical head-to-head evaluation against external labels
full rationale
The paper reports a retrospective comparison of an agentic LLM system versus RAG baselines on 469 patient-question pairs, with performance measured directly against double-annotated expert consensus labels obtained independently of the system. No equations, parameter fits, uniqueness theorems, or self-citations are invoked to derive the 79.6% concordance figure or the reported gains; the result is a straightforward accuracy count against an external reference standard. Even though the paper notes 13.6% expert disagreement and severity differences, these are empirical observations about the ground truth rather than any reduction of the claimed superiority to the inputs by construction. The evaluation is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Double annotation by four oncologists with senior adjudication produces a stable ground-truth label for each patient-question pair.
- domain assumption The 48 question templates at three complexity levels adequately sample the space of clinically relevant decisions in myeloma care.
Reference graph
Works this paper leans on
-
[1]
Multiple myeloma: 2022 update on diagnosis, risk stratification, and management
Rajkumar SV. Multiple myeloma: 2022 update on diagnosis, risk stratification, and management. American journal of hematology. 2022;97(8):1086-107
2022
-
[2]
Timer: Temporal instruction modeling and evaluation for longitudinal clinical records
Cui H, Unell A, Chen B, Fries JA, Alsentzer E, Koyejo S, et al. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. npj Digital Medicine. 2025;8(1):577
2025
-
[3]
National trends in oncology specialists’ EHR inbox work, 2019–2022
Holmgren AJ, Apathy NC, Crews J, Shanafelt T. National trends in oncology specialists’ EHR inbox work, 2019–2022. JNCI: Journal of the National Cancer Institute. 2025;117(6):1253-9
2025
-
[4]
Performance and improvement strategies for adapting generative large language models for electronic health record applications: a systematic review
Du X, Zhou Z, Wang Y, Chuang YW, Li Y, Yang R, et al. Performance and improvement strategies for adapting generative large language models for electronic health record applications: a systematic review. International Journal of Medical Informatics. 2025:106091
2025
-
[5]
Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines
Liu S, McCoy AB, Wright A. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. Journal of the American Medical Informatics Association. 2025;32(4):605-15
2025
-
[6]
LLM-based agentic systems in medicine and healthcare
Qiu J, Lam K, Li G, Acharya A, Wong TY, Darzi A, et al. LLM-based agentic systems in medicine and healthcare. Nature Machine Intelligence. 2024;6(12):1418-20
2024
-
[7]
Artificial intelligence agents in cancer research and oncology
Truhn D, Azizi S, Zou J, Cerda-Alberich L, Mahmood F, Kather JN. Artificial intelligence agents in cancer research and oncology. Nature Reviews Cancer. 2026:1-14
2026
-
[8]
GPT-4 for information retrieval and comparison of medical oncology guidelines
Ferber D, Wiest IC, Wölflein G, Ebert MP, Beutel G, Eckardt JN, et al. GPT-4 for information retrieval and comparison of medical oncology guidelines. Nejm Ai. 2024;1(6):AIcs2300235
2024
-
[9]
Enhancing Oncology-Specific Question Answering With Large Language Models Through Fine-Tuned Embeddings With Synthetic Data
Lu KH, Mehdinia S, Man K, Wong CW, Mao A, Eftekhari Z. Enhancing Oncology-Specific Question Answering With Large Language Models Through Fine-Tuned Embeddings With Synthetic Data. JCO Clinical Cancer Informatics. 2025;9:e2500011
2025
-
[10]
Verifact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts
Liu X, Zhang L, Munir S, Gu Y, Wang L. Verifact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; 2025. p. 17919-36
2025
-
[11]
Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records
Chung P, Swaminathan A, Goodell AJ, Kim Y, Momsen Reincke S, Han L, et al. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI. 2025;3(1):AIdbp2500418
2025
-
[12]
Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs
Myers S, Dligach D, Miller TA, Barr S, Gao Y, Churpek M, et al. Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs. arXiv [preprint] arXiv:250814817. 2025
2025
-
[13]
Towards conversational diagnostic artificial intelligence
Tu T, Schaekermann M, Palepu A, Saab K, Freyberg J, Tanno R, et al. Towards conversational diagnostic artificial intelligence. Nature. 2025;642(8067):442-50
2025
-
[14]
Agentclinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
Schmidgall S, Ziaei R, Harris C, Reis E, Jopling J, Moor M. Agentclinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv [preprint] arXiv:240507960. 2024
2024
-
[15]
Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records
Shi W, Xu R, Zhuang Y, Yu Y, Zhang J, Wu H, et al. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; 2024. p. 22315-39
2024
-
[16]
MedAgentBench: a virtual EHR environment to benchmark medical LLM agents
Jiang Y, Black KC, Geng G, Park D, Zou J, Ng AY, et al. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. Nejm Ai. 2025;2(9):AIdbp2500144
2025
-
[17]
Fhir-agentbench: Benchmarking llm agents for realistic interoperable ehr question answering
Lee G, Bach E, Yang E, Pollard T, Johnson A, Choi E, et al. Fhir-agentbench: Benchmarking llm agents for realistic interoperable ehr question answering. arXiv [preprint] arXiv:250919319. 2025
2025
-
[18]
MIMIC-IV-Note: Deidentified free-text clinical notes
Johnson A, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV-Note: Deidentified free-text clinical notes. PhysioNet. 2023 Jan. Version 2.2. Available from: https://doi.org/10.13026/1n74-ne17
2023
-
[19]
MIMIC-IV, a freely accessible electronic health record dataset
Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific data. 2023;10(1):1
2023
-
[20]
PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals
Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. circulation. 2000;101(23):e215-20
2000
-
[21]
gpt-oss-120b & gpt-oss-20b model card
Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, Arbus E, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv [preprint] arXiv:250810925. 2025
2025
-
[22]
Retrieval-augmented generation for knowledge-intensive nlp tasks
Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems. 2020;33:9459-74
2020
-
[23]
A hybrid retrieval approach for advancing retrieval-augmented generation systems
Doan NN, Härmä A, Celebi R, Gottardo V. A hybrid retrieval approach for advancing retrieval-augmented generation systems. In: Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024); 2024. p. 397-409
2024
-
[24]
Lost in the middle: How language models use long contexts
Liu NF, Lin K, Hewitt J, Paranjape A, Bevilacqua M, Petroni F, et al. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics. 2024;12:157-73
2024
Figure 3 (caption): Agentic system enables structured, traceable clinical reasoning across longitudinal patient records. (a) User-facing workflow illustrating query input and generation of citation-backed answers grounded in patient records.