Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?
Pith reviewed 2026-05-20 22:27 UTC · model grok-4.3
The pith
Large language models under-recall rare and long-term side effects when listing breast cancer radiation toxicities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When prompted to list radiation side effects for breast cancer, large language models systematically under-recall rare and long-term toxicities relative to a clinician-curated reference derived from informed consent documents. They are also sensitive to minor input variations, and constraints on the number of side effects generated reduce precision, whereas grounding outputs directly in the clinician-curated side effect lists measurably improves reliability and robustness.
What carries the argument
A deployment-oriented stress-testing framework that constructs paired patient scenarios differing only in radiotherapy regimen and scores LLM outputs against the clinician-curated reference, with toxicities broken down by frequency and temporal onset.
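The scoring step can be illustrated with a minimal sketch. It assumes exact, case-insensitive string matching between model output and reference entries, which is a simplification; the paper's actual matching procedure is not described on this page. The function names, the reference entries, and their frequency and onset labels are hypothetical placeholders, not taken from the clinician-curated reference.

```python
from collections import defaultdict

def stratified_recall(predicted, reference):
    """Recall per (frequency, onset) stratum.

    `reference` maps a side effect name to a (frequency, onset) tuple.
    Matching is exact and case-insensitive (an assumption of this sketch).
    """
    predicted = {p.strip().lower() for p in predicted}
    hits = defaultdict(int)
    totals = defaultdict(int)
    for effect, stratum in reference.items():
        totals[stratum] += 1
        if effect.lower() in predicted:
            hits[stratum] += 1
    return {s: hits[s] / totals[s] for s in totals}

def precision(predicted, reference):
    """Fraction of predicted side effects that appear in the reference."""
    predicted = {p.strip().lower() for p in predicted}
    if not predicted:
        return 0.0
    ref = {e.lower() for e in reference}
    return len(predicted & ref) / len(predicted)

# Hypothetical reference entries; labels are illustrative only.
reference = {
    "fatigue": ("common", "acute"),
    "skin redness": ("common", "acute"),
    "lymphedema": ("uncommon", "late"),
    "secondary malignancy": ("rare", "late"),
}
model_output = ["Fatigue", "Skin redness", "Nausea"]

recall = stratified_recall(model_output, reference)
prec = precision(model_output, reference)
```

Reporting recall per stratum rather than as a single number is what makes an under-recall pattern for rare or late effects visible: in this toy case the common, acute stratum is fully recalled while both late strata are missed.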
If this is right
- Grounding model outputs in clinician-curated lists offers a concrete way to raise reliability for survivorship tasks.
- Models should not be deployed alone for comprehensive side-effect communication because of consistent under-recall of rare events.
- Prompt designs must avoid hard limits on output size to preserve precision.
- Small changes in documentation can alter model behavior, so input standardization matters for consistent use.
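One way to quantify the input sensitivity mentioned in the last bullet is to compare the side-effect sets a model produces for two versions of the same scenario. The sketch below uses Jaccard similarity for that purpose; this metric is an assumption chosen for illustration, not necessarily the one used in the paper, and the example outputs are invented.

```python
def jaccard(a, b):
    """Jaccard similarity between two lists of side effect names."""
    a = {x.strip().lower() for x in a}
    b = {x.strip().lower() for x in b}
    if not a and not b:
        return 1.0  # two empty outputs are treated as identical
    return len(a & b) / len(a | b)

# Invented outputs for the same patient profile with slightly reworded notes.
out_original = ["fatigue", "skin redness", "breast swelling"]
out_reworded = ["Fatigue", "skin redness", "rib fracture"]

stability = jaccard(out_original, out_reworded)
```

A value of 1.0 means the two documentation variants produced the same list; lower values indicate that a minor wording change altered the model's answer.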
Where Pith is reading between the lines
- The same evaluation approach could be applied to other cancer types or treatment modalities to check whether the under-recall pattern holds.
- Hybrid workflows that combine model generation with clinician review might address the gaps without requiring perfect standalone performance.
- Testing the framework on real electronic health record excerpts rather than constructed profiles would reveal how well it transfers outside controlled settings.
Load-bearing premise
The clinician-curated reference list derived from informed consent documents is treated as a complete and accurate gold standard for all toxicities linked to the tested regimens.
What would settle it
A controlled review of actual patient records or expert consensus that finds a documented toxicity absent from the reference list, or shows models correctly surfacing a side effect the reference omitted, would undermine the evaluation results.
Original abstract
Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates seven instruction-tuned LLMs on generating side-effect lists for breast cancer radiotherapy using 21 patient profiles that form paired scenarios differing only in dose-fractionation, fields, and locations. Outputs are compared against a clinician-curated reference list derived from informed consent documents at two academic medical centers and developed with input from more than seven breast radiation oncologists; the reference maps regimens to toxicities broken down by frequency and temporal onset. The study reports model sensitivity to minor documentation changes, precision-recall trade-offs, systematic under-recall of rare and long-term effects, and substantial reliability gains when outputs are grounded in the clinician-curated lists.
Significance. If the central findings hold, the work supplies a practical, deployment-oriented stress-testing framework for LLM use in oncology survivorship and informed-consent settings. The explicit demonstration that grounding in an independently constructed clinician reference improves robustness, together with the identification of under-recall patterns for rare/long-term toxicities, offers actionable design guidance for safer clinical applications. The use of paired scenarios that isolate regimen differences and the multi-center clinician reference are notable strengths that enhance the evaluation's relevance.
major comments (2)
- [Methods (reference construction) and Results (under-recall analysis)] The claim of systematic under-recall of rare and long-term side effects (abstract and results) rests on treating the clinician-curated reference as a complete gold standard. Because the reference is constructed from informed consent documents, which are required to emphasize common, actionable risks rather than exhaustively enumerate every literature-reported toxicity (e.g., very rare cardiac, pulmonary, or secondary-malignancy effects with long latency), any observed under-recall may be partly an artifact of reference incompleteness rather than a pure model limitation. A concrete validation step against broader oncology literature or additional expert review is needed to separate these effects.
- [Evaluation setup and Results sections] The reported improvements from grounding and the precision-recall trade-offs under different prompting regimes lack accompanying details on inter-rater reliability for the reference list, the exact prompting templates, and statistical tests for the observed differences. Without these, the strength of evidence for the central claims about model behavior and the benefits of grounding cannot be fully assessed.
minor comments (2)
- [Abstract] The abstract would benefit from explicitly stating the number of models and the precise prompting regimes tested.
- [Methods] Notation for frequency and temporal-onset categories in the reference mapping could be clarified with a small example table.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below, indicating where we agree and the revisions we will make to strengthen the manuscript.
- Referee: [Methods (reference construction) and Results (under-recall analysis)] The claim of systematic under-recall of rare and long-term side effects (abstract and results) rests on treating the clinician-curated reference as a complete gold standard. Because the reference is constructed from informed consent documents, which are required to emphasize common, actionable risks rather than exhaustively enumerate every literature-reported toxicity (e.g., very rare cardiac, pulmonary, or secondary-malignancy effects with long latency), any observed under-recall may be partly an artifact of reference incompleteness rather than a pure model limitation. A concrete validation step against broader oncology literature or additional expert review is needed to separate these effects.
  Authors: We agree that informed consent documents prioritize common, actionable risks and may not exhaustively list every rare or long-latency toxicity reported in the broader literature. Our reference was deliberately constructed from these documents to mirror the information actually conveyed in clinical informed-consent settings, which is the deployment context we target. To address the concern, we will revise the manuscript to explicitly acknowledge this scope limitation and add a supplementary analysis that cross-references the clinician-curated list against toxicities extracted from recent comprehensive oncology reviews. This will help readers distinguish reference scope from model behavior.
  Revision: partial
- Referee: [Evaluation setup and Results sections] The reported improvements from grounding and the precision-recall trade-offs under different prompting regimes lack accompanying details on inter-rater reliability for the reference list, the exact prompting templates, and statistical tests for the observed differences. Without these, the strength of evidence for the central claims about model behavior and the benefits of grounding cannot be fully assessed.
  Authors: We concur that these details are necessary to fully evaluate the evidence. In the revised version we will report inter-rater reliability metrics (e.g., percentage agreement and Cohen’s kappa) from the multi-oncologist reference construction process, include the complete prompting templates as an appendix, and add statistical tests (paired t-tests or bootstrap confidence intervals) for the reported differences in precision, recall, and grounding improvements.
  Revision: yes
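The statistics named in this response can be sketched in a few lines of standard-library Python. This is an illustration only, not the authors' analysis: the function names are hypothetical, the rater labels and per-scenario scores are invented, and Cohen's kappa as written covers exactly two raters (a panel of more than seven oncologists would need pairwise kappas or a multi-rater statistic such as Fleiss' kappa).

```python
import random
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def paired_bootstrap_ci(scores_x, scores_y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (y - x)."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented frequency labels from two hypothetical raters.
rater_a = ["common", "common", "rare", "rare"]
rater_b = ["common", "common", "rare", "common"]
kappa = cohens_kappa(rater_a, rater_b)

# Invented per-scenario recall without and with grounding.
ungrounded = [0.4, 0.5, 0.3, 0.6, 0.5]
grounded = [0.7, 0.8, 0.6, 0.8, 0.9]
lo, hi = paired_bootstrap_ci(ungrounded, grounded)
```

Resampling the paired differences, rather than each condition separately, keeps each scenario matched with itself, which fits the paired design the paper describes. An interval that excludes zero would support a real difference between conditions.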
Circularity Check
No significant circularity: empirical evaluation against independent external reference
The paper conducts an empirical stress-test of LLMs on side-effect list generation for breast cancer radiotherapy, comparing outputs to a clinician-curated reference list built from informed consent documents at two academic centers by a team of more than seven breast radiation oncologists. This reference is presented as an external gold standard mapping dose-fractionation, fields, and locations to toxicities by frequency and onset. No equations, derivations, fitted parameters, or self-citations are invoked to define the reference, the metrics (precision/recall), or the reported findings in a way that reduces them to the authors' own inputs by construction. The evaluation setup is self-contained against this independently created benchmark, with no self-definitional loops, renamed known results, or load-bearing self-citations. Claims of under-recall and grounding benefits rest on direct comparison to the external list rather than any internal reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The clinician-curated reference accurately and comprehensively maps radiation dose-fractionation, fields, and locations to associated toxicities broken down by frequency and temporal onset.
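To make the shape of this assumption concrete, the mapping can be pictured as a lookup from a regimen description to a list of labelled toxicities. The sketch below is hypothetical throughout: the class names, field names, regimen descriptions, and frequency and onset labels are illustrative placeholders and do not reproduce the paper's reference.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Regimen:
    dose_fractionation: str
    field: str
    location: str

@dataclass(frozen=True)
class Toxicity:
    name: str
    frequency: str  # e.g. "common", "uncommon", "rare"
    onset: str      # e.g. "acute", "late"

# Placeholder entries; not the clinician-curated reference itself.
whole_breast = Regimen("hypofractionated", "whole breast", "left")
with_nodes = Regimen("hypofractionated", "whole breast + regional nodes", "left")

reference_map = {
    whole_breast: [
        Toxicity("fatigue", "common", "acute"),
        Toxicity("skin redness", "common", "acute"),
    ],
    with_nodes: [
        Toxicity("fatigue", "common", "acute"),
        Toxicity("skin redness", "common", "acute"),
        Toxicity("arm lymphedema", "uncommon", "late"),
    ],
}

def regimen_specific(ref_map, base, variant):
    """Toxicities listed for `variant` but not for `base`."""
    base_names = {t.name for t in ref_map[base]}
    return [t for t in ref_map[variant] if t.name not in base_names]

extra = regimen_specific(reference_map, whole_breast, with_nodes)
```

Under a structure like this, a paired scenario that differs only in regimen has a well-defined expected difference in side effects, which is what a model's paired outputs can be checked against.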
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care... compare LLM outputs to a clinician-curated reference derived from informed consent documents... maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.