RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Benjamin Gundersen; Bjoern Menze; Christian Bluethgen; Farhad Nooralahzadeh; Jean-Benoit Delbrouck; Jiwoong Sohn; Julia E. Vogt; Kenneth Styppa; M\'elanie Roschewitz; Michael Krauthammer

arxiv: 2604.15231 · v2 · pith:VXWVNR5Snew · submitted 2026-04-16 · 💻 cs.AI

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

M\'elanie Roschewitz , Kenneth Styppa , Yitian Tao , Jiwoong Sohn , Jean-Benoit Delbrouck , Benjamin Gundersen , Nicolas Deperrois , Christian Bluethgen

show 5 more authors

Julia E. Vogt Bjoern Menze Farhad Nooralahzadeh Michael Krauthammer Michael Moor

This is my paper

Pith reviewed 2026-05-10 10:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords chest CTAI agentreport generationinterpretable AIvision-language modelsmedical imagingtool usefaithfulness

0 comments

The pith

A tool-using AI agent generates more accurate, robust, and faithful chest CT reports by producing explicit stepwise reasoning traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RadAgent, a tool-using AI agent that generates chest CT reports through a series of inspectable steps involving tool interactions and decisions. This is evaluated against a standard 3D vision-language model, CT-Chat, on clinical report generation tasks. The agent approach delivers higher clinical accuracy, improved robustness to adversarial inputs, and introduces a faithfulness metric that the baseline cannot achieve. A sympathetic reader cares because this provides a way for clinicians to review the reasoning behind AI-generated reports in complex medical imaging, potentially increasing trust and usability.

Core claim

RadAgent generates CT reports through a stepwise and interpretable process, with each report accompanied by a fully inspectable trace of intermediate decisions and tool interactions. This results in clinical accuracy improvements of 6.0 points in macro-F1 and 5.4 points in micro-F1 over CT-Chat, 24.7 points higher robustness under adversarial conditions, and 37.0% faithfulness, a capability absent in the 3D VLM counterpart.

What carries the argument

The tool-using stepwise agent that structures chest CT interpretation as an explicit, iterative reasoning trace with recorded tool interactions.

If this is right

Clinicians can inspect and validate the derivation of each reported finding.
AI outputs become more resistant to adversarial attacks that could mislead direct models.
A new faithfulness metric quantifies alignment between reasoning trace and final report.
Radiology AI gains transparency for clinical inspection, validation, and refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar agent architectures could extend to other medical imaging types such as MRI to improve interpretability there as well.
In practice, these traces might allow clinicians to intervene at intermediate steps to correct potential errors before the final report.
Testing in live clinical environments could reveal whether the added transparency actually changes diagnostic outcomes or clinician confidence.

Load-bearing premise

The gains in accuracy, robustness, and faithfulness result from the tool-using stepwise agent architecture rather than differences in training data, model size, or evaluation methods.

What would settle it

Train a version of the baseline model with the same data and parameters but without the agent structure, and verify whether the performance advantages disappear.

Figures

Figures reproduced from arXiv: 2604.15231 by Benjamin Gundersen, Bjoern Menze, Christian Bluethgen, Farhad Nooralahzadeh, Jean-Benoit Delbrouck, Jiwoong Sohn, Julia E. Vogt, Kenneth Styppa, M\'elanie Roschewitz, Michael Krauthammer, Michael Moor, Nicolas Deperrois, Yitian Tao.

**Figure 1.** Figure 1: Overview of RadAgent. a Given a 3D CT volume and a user query, RadAgent first produces an initial report draft and then enters an agent loop guided by a clinician-inspired diagnostic checklist. At each step, the agent plans the next diagnostic action, selects an appropriate tool from a toolbox, updates its memory with tool outputs, and refines its findings until sufficient evidence is collected for generat… view at source ↗

**Figure 2.** Figure 2: Report generation quality comparison between the trained RadAgent system and the CT-Chat report generation baseline. A Results on the CT-RATE validation set, B results on the CT-RATE test set, C results on RadChestCT, and D per-pathology F1 scores on the CT-RATE test set. The trained RadAgent system significantly outperforms the baseline across the validation, test, and external test datasets. On the CT-RA… view at source ↗

**Figure 3.** Figure 3: Faithfulness and robustness of RadAgent under injected prompt hints. a, Standard CT report generation with a 3D VLM baseline, where the CT volume and instruction are mapped directly to a report through a largely intransparent inference process. b, CT report generation with RadAgent, which adds an agentic diagnostic trajectory that iteratively uses specialized tools and yields a traceable intermediate reaso… view at source ↗

**Figure 4.** Figure 4: The RadAgent toolbox. Panel A-I illustrate the individual tools added to the toolbox. 3D and 2D visual question answering For image understanding and interactive reasoning, the toolbox includes visual question answering tools that enable direct querying of CT data at both volume and slice level. First, we leverage CT-Chat 2 in VQA-mode, as our ct_vqa() tool. This tool accepts a volumetric CT scan together … view at source ↗

read the original abstract

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 5.8 points (35.4% relative) in macro-F1 and 5.1 points (18.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces RadAgent, a tool-using AI agent for generating chest CT reports via an explicit, stepwise, and inspectable reasoning trace with tool interactions. It claims that RadAgent outperforms the 3D VLM baseline CT-Chat on clinical accuracy (macro-F1 +6.0 points / 36.4% relative; micro-F1 +5.4 points / 19.6% relative), robustness under adversarial conditions (+24.7 points / 41.9% relative), and introduces a new faithfulness metric (37.0%) that is absent in the baseline.

Significance. If the reported gains can be shown to arise from the agent architecture rather than differences in base model, training data, or evaluation protocol, the work would provide a concrete advance in interpretable medical AI by enabling clinicians to inspect intermediate decisions. The introduction of a faithfulness metric is a positive step toward verifiable outputs, but its value depends on rigorous validation.

major comments (3)

[Abstract] Abstract and experimental section: the central claim that performance deltas are due to the tool-using stepwise agent requires explicit controls showing that RadAgent and CT-Chat use identical base 3D VLM backbones, identical pre-training/fine-tuning data, and matched parameter counts; without these, the 6.0-point macro-F1 and 24.7-point robustness gains cannot be attributed to the agentic trace.
[Abstract] Abstract: no datasets, statistical tests, or evaluation protocols are described for the reported F1 scores, adversarial robustness, or faithfulness metric, so the quantitative improvements lack verifiable support and the weakest assumption (gains attributable to architecture) remains untested.
[Methods] Methods or experimental setup: the faithfulness metric (37.0%) is presented as a new capability, but its definition, computation, and human or automated validation protocol are not detailed, making it impossible to assess whether it genuinely measures inspectable reasoning or is an artifact of the evaluation design.

minor comments (1)

[Abstract] The abstract uses relative percentage improvements without stating the absolute baseline values for CT-Chat, which would aid interpretation of the practical significance of the deltas.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to clarify controls, evaluation details, and the faithfulness metric.

read point-by-point responses

Referee: [Abstract] Abstract and experimental section: the central claim that performance deltas are due to the tool-using stepwise agent requires explicit controls showing that RadAgent and CT-Chat use identical base 3D VLM backbones, identical pre-training/fine-tuning data, and matched parameter counts; without these, the 6.0-point macro-F1 and 24.7-point robustness gains cannot be attributed to the agentic trace.

Authors: We agree that explicit controls are essential to attribute gains to the agent architecture. In the revised manuscript, we will add a new paragraph in the Experimental Setup section (cross-referenced from the abstract) explicitly confirming that RadAgent and CT-Chat share the identical 3D VLM backbone, the same pre-training and fine-tuning datasets, and matched parameter counts. This was the case in our experiments, as CT-Chat was used as the unmodified 3D VLM counterpart. revision: yes
Referee: [Abstract] Abstract: no datasets, statistical tests, or evaluation protocols are described for the reported F1 scores, adversarial robustness, or faithfulness metric, so the quantitative improvements lack verifiable support and the weakest assumption (gains attributable to architecture) remains untested.

Authors: The abstract prioritizes brevity, but we will revise it to briefly reference the key chest CT datasets, note that statistical significance was assessed with appropriate tests (e.g., paired t-tests), and direct readers to the Experimental section for full protocols on F1, adversarial robustness, and the faithfulness metric. These details are already present in the full experimental setup but will be more clearly signposted. revision: yes
Referee: [Methods] Methods or experimental setup: the faithfulness metric (37.0%) is presented as a new capability, but its definition, computation, and human or automated validation protocol are not detailed, making it impossible to assess whether it genuinely measures inspectable reasoning or is an artifact of the evaluation design.

Authors: We will add a dedicated subsection in Methods titled 'Faithfulness Metric' that defines the metric (proportion of traceable reasoning steps aligning with final findings via tool interactions), specifies its computation (scoring intermediate decisions against ground-truth), and details the validation protocol combining automated alignment checks with human review by radiologists to confirm it captures inspectable reasoning. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison to external baseline with no derivations or self-referential reductions

full rationale

The paper presents an empirical agent architecture for CT report generation and reports performance deltas versus the external CT-Chat 3D VLM baseline. No equations, derivations, fitted parameters renamed as predictions, or self-definitional steps exist. The central claims rest on observed metric improvements (macro-F1, micro-F1, robustness, faithfulness) rather than any chain that reduces to author-defined inputs by construction. The CT-Chat comparison is to a separately published counterpart model and does not rely on load-bearing self-citations or uniqueness theorems imported from the authors' prior work. This is a standard experimental evaluation without circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, free parameters, axioms, or invented entities; the contribution is an empirical system description and comparison.

pith-pipeline@v0.9.0 · 5592 in / 1086 out tokens · 33175 ms · 2026-05-10T10:50:55.404179+00:00 · methodology

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)