pith. machine review for the scientific record.

arxiv: 2605.08439 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: no theorem link

Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?

Danielle S. Bitterman, Daphna Spiegel, Natalie Seah, Thomas Hartvigsen

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:34 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · side effects · breast cancer · radiation therapy · oncology · informed consent · survivorship · prompt engineering

The pith

Grounding large language models in clinician-curated lists improves reliability when listing side effects of breast cancer radiation treatments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates seven instruction-tuned large language models on generating side-effect lists for breast cancer radiation regimens, using 21 paired patient profiles. Outputs are compared against a reference built by a team including more than seven breast radiation oncologists from informed consent documents at two academic centers; the reference maps radiation dose-fractionation, fields, and locations to toxicities by frequency and temporal onset. Models prove sensitive to small input changes, trade off precision against recall, and systematically under-recall rare and long-term effects. Grounding model outputs directly in the curated list raises reliability and robustness, while constraining the number of listed effects, used alone, lowers precision. These patterns matter for safer use of language models in informed consent and survivorship discussions, where incomplete information can affect patient decisions.

Core claim

Large language models can assist in listing radiation side effects for breast cancer but remain limited by sensitivity to documentation changes, precision-recall trade-offs, and under-recall of rare and long-term toxicities; grounding outputs in a clinician-curated reference substantially improves reliability and robustness.

What carries the argument

The deployment-oriented stress-testing framework that constructs paired clinical scenarios differing only in radiotherapy regimens and evaluates outputs against a clinician-curated reference mapping dose-fractionation, fields, and locations to toxicities by frequency and temporal onset.
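The evaluation logic this framework implies can be sketched in a few lines. Everything below is a hypothetical illustration, not the paper's actual code or metrics: side-effect names are invented, and the paper's exact scoring procedure is not reproduced here.

```python
# Hypothetical sketch of the stress-testing idea: score a model-generated
# side-effect list against a clinician-curated reference (precision/recall),
# and measure robustness as output overlap across a paired base/specified
# scenario that differs only in documentation specificity.

def precision_recall(generated: set[str], reference: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of a generated side-effect list."""
    if not generated or not reference:
        return 0.0, 0.0
    hits = generated & reference
    return len(hits) / len(generated), len(hits) / len(reference)

def robustness(base_output: set[str], specified_output: set[str]) -> float:
    """Jaccard overlap between outputs for a paired base/specified profile."""
    union = base_output | specified_output
    return len(base_output & specified_output) / len(union) if union else 1.0

# Toy example with invented toxicity labels:
reference = {"skin erythema", "fatigue", "breast edema", "rib fracture"}
base = {"skin erythema", "fatigue", "nausea"}
specified = {"skin erythema", "fatigue", "breast edema"}

p, r = precision_recall(base, reference)  # 2/3 precision, 2/4 recall
print(round(p, 2), round(r, 2), round(robustness(base, specified), 2))
```

Under this framing, under-recall of rare effects shows up as low recall against the reference's low-frequency entries, and prompt sensitivity shows up as low Jaccard overlap between paired outputs.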

If this is right

  • Grounding outputs in curated lists should be adopted as a standard design choice for oncology applications of language models.
  • Constraints on the number of generated side effects should be avoided because they reduce precision.
  • Additional safeguards are needed to address systematic under-recall of rare and long-term toxicities.
  • Prompting strategies must be tested for robustness against minor changes in clinical documentation.
  • The framework provides a repeatable method for stress-testing language models before use in informed consent or survivorship settings.
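One way the grounding recommendation could be realized in practice is to restrict the model to selecting from the curated list rather than generating toxicities freely. The sketch below is illustrative only: the function, profile text, and list entries are assumptions, and the paper's actual prompt wording lives in its supplement, not here.

```python
# Hypothetical grounded-prompt construction: constrain the model to choose
# from a clinician-curated side-effect list instead of free generation.

def grounded_prompt(profile: str, curated_effects: list[str]) -> str:
    """Build a prompt that restricts answers to a curated side-effect list."""
    menu = "\n".join(f"- {effect}" for effect in curated_effects)
    return (
        f"Patient profile:\n{profile}\n\n"
        "From the following clinician-curated list ONLY, select the side "
        "effects relevant to this patient's radiation regimen. Do not add "
        "items that are not on the list.\n"
        f"{menu}"
    )

prompt = grounded_prompt(
    "55F, left-sided breast cancer, whole-breast irradiation, 40 Gy in 15 fractions",
    ["skin erythema", "fatigue", "breast edema", "radiation pneumonitis"],
)
print(prompt.splitlines()[0])  # "Patient profile:"
```

The design choice here is that grounding converts an open-ended generation task into a closed selection task, which bounds what the model can hallucinate at the cost of inheriting any omissions in the curated list.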

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar grounding techniques could be tested on other cancer sites and treatment modalities to create broader patient-education tools.
  • Embedding the reference lists into electronic health record systems might help overcome the fragmentation problem noted in the paper.
  • Fine-tuning models on the curated side-effect mappings could reduce dependence on post-hoc grounding in future work.

Load-bearing premise

The clinician-curated reference derived from informed consent documents at two academic centers accurately and comprehensively captures all relevant toxicities broken down by frequency and temporal onset.

What would settle it

A larger multi-center review by additional oncologists that identifies clinically important toxicities missing from the reference list, or real-world deployment data showing that grounded LLM lists still produce harmful omissions or inaccuracies for patients.

Figures

Figures reproduced from arXiv: 2605.08439 by Danielle S. Bitterman, Daphna Spiegel, Natalie Seah, Thomas Hartvigsen.

Figure 1
Figure 1. Deployment-oriented evaluation framework. Breast cancer patient profiles are constructed in paired base and specified forms that differ only in radiation documentation specificity. Profiles are converted into prompts and passed to large language models, which generate side-effect lists. Outputs are evaluated along two axes: robustness to documentation perturbations and accuracy relative to a clinician-cura…
Original abstract

Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a deployment-oriented evaluation of seven instruction-tuned LLMs for generating radiation side-effect lists in breast cancer, using 21 patient profiles that vary only in radiotherapy regimens. LLM outputs are compared against a clinician-curated reference constructed from informed consent documents at two academic centers by a team of more than seven breast radiation oncologists; the reference maps radiation dose-fractionation, fields, and locations to toxicities by frequency and temporal onset. The study reports prompt sensitivity, systematic under-recall of rare and long-term effects, precision-recall trade-offs when constraining output length, and substantial gains in reliability when outputs are grounded in the clinician-curated lists.

Significance. If the core empirical findings hold after addressing reference validation, the work supplies a concrete stress-testing framework for LLM use in oncology survivorship and informed consent. It supplies actionable evidence on the value of grounding and the risks of unconstrained generation, which could inform safer clinical deployment of LLMs for patient communication.

major comments (2)
  1. [Abstract / reference construction] The headline claims of systematic under-recall and the reliability gains from grounding both treat the clinician-curated reference as an exhaustive gold standard. The manuscript provides no evidence of inter-rater reliability among the >7 curators, cross-validation against CTCAE/QUANTEC/meta-analyses, or external review by oncologists outside the two institutions. Informed consent forms are known to omit low-incidence events and can lag behind current data; without such checks the measured under-recall and grounding improvements are only relative to this particular list rather than true clinical coverage. (Abstract and Methods section describing reference construction.)
  2. [Abstract / evaluation] The abstract states that results are compared to a multi-oncologist reference and reports under-recall and prompt sensitivity, yet no details are supplied on exact prompting templates, statistical tests for differences, or inter-rater agreement metrics between LLM outputs and the reference. These omissions leave the quantitative claims only partially supported and hinder reproducibility. (Abstract and evaluation/results sections.)
minor comments (1)
  1. [Methods] The paper would benefit from a table or appendix listing the precise prompting regimes and the exact wording of the grounding instructions.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to clarify the scope of our reference and to improve reproducibility of the evaluation. Below we respond to each major comment.

Point-by-point responses
  1. Referee: [Abstract / reference construction] The headline claims of systematic under-recall and the reliability gains from grounding both treat the clinician-curated reference as an exhaustive gold standard. The manuscript provides no evidence of inter-rater reliability among the >7 curators, cross-validation against CTCAE/QUANTEC/meta-analyses, or external review by oncologists outside the two institutions. Informed consent forms are known to omit low-incidence events and can lag behind current data; without such checks the measured under-recall and grounding improvements are only relative to this particular list rather than true clinical coverage. (Abstract and Methods section describing reference construction.)

    Authors: We agree that the reference is not presented with formal inter-rater reliability metrics, external validation against CTCAE/QUANTEC, or review by oncologists outside the two centers. We have revised the abstract and methods to state explicitly that under-recall and grounding gains are measured relative to this clinician-curated list derived from informed consent documents. A limitations section has been added acknowledging that informed consent forms may omit rare events and that the reference is not claimed to be exhaustive. We retain the reference because it directly reflects materials used in patient communication at the participating institutions, providing a deployment-relevant benchmark. revision: partial

  2. Referee: [Abstract / evaluation] The abstract states that results are compared to a multi-oncologist reference and reports under-recall and prompt sensitivity, yet no details are supplied on exact prompting templates, statistical tests for differences, or inter-rater agreement metrics between LLM outputs and the reference. These omissions leave the quantitative claims only partially supported and hinder reproducibility. (Abstract and evaluation/results sections.)

    Authors: We have updated the abstract to note the prompting regimes and added a concise description of the statistical comparisons (paired tests for prompt sensitivity and precision-recall differences). The exact templates are now referenced in the main text with full versions in the supplement. We have also included the overlap-based agreement metrics used to compare LLM outputs against the reference list. These additions support the quantitative claims while preserving abstract length. revision: yes
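The "overlap-based agreement metrics" the authors mention are not specified here; one common choice for comparing two sets of labels is a set-level F1 score. The following is an assumed illustration of that metric, not the paper's confirmed procedure.

```python
# One standard overlap-based agreement metric (assumed here, not confirmed
# by the paper): F1 between a model's side-effect set and the reference set.

def set_f1(generated: set[str], reference: set[str]) -> float:
    """Harmonic mean of set precision and recall, computed directly."""
    overlap = len(generated & reference)
    denom = len(generated) + len(reference)
    return 2 * overlap / denom if denom else 0.0

print(set_f1({"fatigue", "skin erythema"}, {"fatigue", "breast edema"}))  # 0.5
```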

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external clinician-curated reference

full rationale

The paper performs an empirical stress-test of LLMs by generating side-effect lists for 21 patient profiles and comparing outputs to an independently constructed clinician-curated reference list derived from informed consent documents at two external academic centers. No equations, fitted parameters, predictions derived from the same data, or self-citations are used to establish the central claims. The reference serves as an external benchmark rather than being defined in terms of the LLM outputs or vice versa, and the reported precision/recall trade-offs and under-recall observations are direct measurements against this benchmark. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the domain assumption that the clinician-curated reference is complete and authoritative; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The clinician-curated reference from informed consent documents accurately maps radiation regimens to toxicities by frequency and onset.
    This reference serves as the sole ground truth for all precision, recall, and under-recall claims.

pith-pipeline@v0.9.0 · 5571 in / 1189 out tokens · 45629 ms · 2026-05-12T02:34:30.458924+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages
