arxiv: 2605.00904 · v1 · submitted 2026-04-28 · 💻 cs.CV


Robustness of Transformer-Based Fluence Map Prediction Under Clinically Realistic Perturbations

Ujunwa Mgboh , Rafi Ibn Sultan , Joshua Kim , Kundan Thind , Dongxiao Zhu


Pith reviewed 2026-05-09 19:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords: transformer, fluence map, IMRT, robustness, perturbation, physics-informed loss, radiation therapy, deep learning


The pith

Hierarchical attention transformers maintain lower energy errors in fluence map prediction under moderate clinical perturbations than global or hybrid models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a two-stage transformer system that turns patient CT scans and contours into predicted radiation doses and then into beamlet fluence maps for IMRT planning. It trains the models with a loss term that enforces physical energy consistency and evaluates them on a prostate dataset under controlled changes in geometry, noise, data volume, and imaging domain. Hierarchical attention backbones show slower increases in upper-quartile energy error as perturbations grow, while all models break under large rotations or strong noise. The work also demonstrates that the standard image metric SSIM does not track the clinically and physically relevant errors, so evaluation must incorporate the underlying physics.
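As a rough illustration of the radiometric perturbations described above, the sketch below adds Gaussian noise and a smooth multiplicative bias field to an image at inference time. The function names, the linear-ramp bias model, and the assumption that intensities are normalized to [0, 1] are illustrative choices, not details taken from the paper. The noise levels σ = 0.05 and σ = 0.15 are the ones quoted in the Figure 3 caption.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma.

    Assumes `image` is normalized to [0, 1] (an assumption of this sketch).
    """
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def apply_bias_field(image, strength):
    """Multiply by a smooth left-to-right ramp in [1 - strength, 1 + strength].

    A linear ramp is a hypothetical stand-in for a real scanner bias field.
    """
    ramp = np.linspace(1.0 - strength, 1.0 + strength, image.shape[-1])
    return np.clip(image * ramp, 0.0, 1.0)

rng = np.random.default_rng(0)
ct_slice = rng.random((64, 64))  # placeholder for a normalized CT slice
low_noise = add_gaussian_noise(ct_slice, 0.05, rng)
high_noise = add_gaussian_noise(ct_slice, 0.15, rng)
biased = apply_bias_field(ct_slice, 0.1)
```

Applying such perturbations only to the model input, while leaving the ground truth untouched, matches the inference-only setup described in the Figure 1 caption.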

Core claim

In the studied prostate IMRT setting, hierarchical transformer backbones exhibit slower growth in upper-quartile energy error than global or hybrid attention models when subjected to geometric perturbations, radiometric noise, reduced training data, and domain shifts; the same experiments show that SSIM alone fails to reflect clinically relevant errors and therefore physics-informed evaluation is required.
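The "upper-quartile energy error" is not defined on this page. One plausible reading, sketched below, takes the per-case relative difference in total fluence energy and reports the 75th percentile across test cases. The energy definition (sum of squared beamlet intensities) and all helper names are assumptions made for illustration.

```python
import numpy as np

def energy(fluence):
    """Total energy of a fluence map, taken here as the sum of squared intensities."""
    return float(np.sum(np.square(fluence)))

def relative_energy_error(pred, target, eps=1e-8):
    """Relative absolute difference in energy between prediction and ground truth."""
    return abs(energy(pred) - energy(target)) / (energy(target) + eps)

def upper_quartile_error(preds, targets):
    """75th percentile of the per-case relative energy error."""
    errors = [relative_energy_error(p, t) for p, t in zip(preds, targets)]
    return float(np.percentile(errors, 75))

# Toy example: four cases whose predictions are scaled copies of the target.
target = np.ones((8, 8))
scales = [1.0, 1.1, 1.2, 1.3]
preds = [s * target for s in scales]
q75 = upper_quartile_error(preds, [target] * len(scales))
```

Reporting a tail statistic rather than the mean puts the emphasis on the worst-behaved cases, which is where robustness differences between backbones would be expected to show up.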

What carries the argument

Two-stage transformer pipeline that first maps anatomy to dose and then dose to fluence, equipped with hierarchical, global, or hybrid attention and trained under a physics-informed loss that enforces energy consistency.
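The exact form of the physics-informed loss is not given on this page. A minimal sketch of one way to enforce energy consistency is to add a penalty on the relative mismatch in total energy to a pixelwise reconstruction term. The weight `lam`, the squared-intensity energy definition, and the function name are assumptions. In a training framework the same expression would be written with differentiable tensors rather than NumPy arrays.

```python
import numpy as np

def energy_consistent_loss(pred, target, lam=0.1, eps=1e-8):
    """Pixelwise MSE plus a penalty on relative total-energy mismatch.

    Hypothetical form for illustration; the paper's actual loss may differ.
    """
    mse = np.mean((pred - target) ** 2)
    e_pred = np.sum(pred ** 2)
    e_true = np.sum(target ** 2)
    energy_penalty = np.abs(e_pred - e_true) / (e_true + eps)
    return float(mse + lam * energy_penalty)

target = np.full((4, 4), 2.0)
perfect = energy_consistent_loss(target, target)
scaled = energy_consistent_loss(1.5 * target, target)
```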

If this is right

Moderate perturbations produce gradual performance loss while severe rotations and high noise cause abrupt failures across all tested architectures.
Hierarchical attention models keep upper-quartile energy error lower as perturbation strength increases compared with global and hybrid attention.
SSIM values can remain high even when energy consistency and clinical acceptability are lost.
Any reliable deployment of learned fluence prediction must include physics-informed metrics rather than relying on image-similarity scores alone.
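To see how SSIM can stay high while energy drifts, the toy sketch below compares a fluence map with a copy scaled by 10%. It uses a simplified single-window (global) SSIM rather than the usual sliding-window version, and the squared-intensity energy definition is an assumption. The numbers illustrate the mechanism only and are not results from the paper.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (simplified; no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

def relative_energy_error(pred, target):
    """Relative difference in total energy (sum of squared intensities)."""
    e_pred, e_true = np.sum(pred ** 2), np.sum(target ** 2)
    return float(abs(e_pred - e_true) / e_true)

rng = np.random.default_rng(0)
fluence = rng.random((32, 32))   # synthetic stand-in for a fluence map
scaled = 1.1 * fluence           # uniform 10% over-prediction

ssim_value = global_ssim(fluence, scaled)
energy_err = relative_energy_error(scaled, fluence)
print(f"SSIM = {ssim_value:.4f}, relative energy error = {energy_err:.2f}")
```

A uniform intensity scaling preserves structure almost perfectly, so the similarity score barely moves, while the delivered energy under this definition changes by roughly a fifth.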

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the robustness advantage holds on broader multi-site data, hierarchical attention could become the default backbone for learned IMRT planning tools.
The results suggest that future models should be stress-tested against measured clinical distribution shifts rather than synthetic perturbations alone.
Combining the two-stage pipeline with explicit beam-angle or leaf-sequencing constraints may further reduce the sharp failures observed at high perturbation levels.

Load-bearing premise

The geometric perturbations, noise levels, and domain shifts chosen for testing are representative of the variations that actually occur in real patient scans and clinical practice.

What would settle it

Direct measurement on real clinical cases with recorded positioning errors or scanner differences showing that hierarchical models do not exhibit slower upper-quartile energy error growth than other attention types.

Figures

Figures reproduced from arXiv: 2605.00904 by Dongxiao Zhu, Joshua Kim, Kundan Thind, Rafi Ibn Sultan, Ujunwa Mgboh.

**Figure 1.** Qualitative geometric robustness. Ground-truth fluence maps (repeated for visual reference) remain invariant, as geometric perturbations are applied only at inference.

**Figure 2.** Data efficiency and radiometric robustness.

**Figure 3.** DVH robustness on public datasets. DVH curves for a representative organ-at-risk on the OpenKBP head-and-neck cohort (top row) and a matched prostate ROI on the CORT dataset (bottom row) under (a) mild bias-field perturbation, (b) low Gaussian noise (σ = 0.05), and (c) higher noise (σ = 0.15). Shaded regions denote ground-truth mean ± std; colored curves show different backbones. … the lowest tail energy error…

Abstract

Learning-based fluence map prediction offers a fast alternative to iterative inverse planning in intensity-modulated radiation therapy (IMRT), but its robustness under realistic distribution shifts remains unclear. We study a two-stage transformer pipeline that maps anatomy (CT and contours) to dose and then to beamlet fluence maps. We compare fluence-stage transformer backbones with hierarchical, global, and hybrid attention, trained with a physics-informed loss enforcing energy consistency. Robustness is evaluated under geometric perturbations, radiometric noise, reduced training data, and domain shifts using a prostate IMRT dataset, with additional evaluation of the dose stage on public datasets. Results show smooth degradation under moderate perturbations but sharp failures under severe rotations and noise. Hierarchical transformers (e.g., SwinUNETR) exhibit slower growth in upper-quartile energy error, indicating improved robustness. We further show that SSIM alone fails to capture clinically relevant errors, highlighting the need for physics-informed evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Hierarchical transformers degrade more slowly under the tested perturbations for fluence prediction, and SSIM misses clinically relevant errors, but the synthetic shifts may not match real clinical variation.


Hierarchical transformers show slower error growth under perturbations in this fluence map prediction task, and the work makes a clear case that physics-based checks are needed beyond SSIM. The paper sets up a two-stage transformer pipeline from CT and contours to dose to fluence maps. The authors train it with a loss that enforces energy consistency and then test several attention backbones on a prostate dataset. The tests cover geometric changes, noise, reduced data, and domain shifts. Hierarchical models such as SwinUNETR show better robustness in the upper quartile of energy errors, while all models fail sharply at high perturbation levels. The authors also show that SSIM does not track the energy errors that matter.

This is a straightforward empirical study that adds a targeted robustness evaluation to the literature on learning-based IMRT planning. The comparison across attention types and the metric critique are the useful parts. The soft spot is the gap between the synthetic perturbations and real clinical variation. The abstract gives no numbers on how the rotation ranges or noise levels compare to observed inter-patient or inter-scanner differences, so the robustness advantage may be tied to this particular test distribution. More detail on the exact setups and any statistical support would help too.

Readers working on AI for radiation oncology would find this relevant as a benchmark. It is worth a full referee review to check the methods and see whether the findings hold with more realistic shift data.

Referee Report

2 major / 1 minor

Summary. The manuscript studies robustness of a two-stage transformer pipeline for IMRT fluence map prediction, mapping CT/contours to dose then to fluence using hierarchical, global, or hybrid attention backbones trained with a physics-informed energy-consistency loss. On a prostate IMRT dataset it evaluates degradation under geometric perturbations, radiometric noise, reduced training data, and domain shifts, reporting smooth degradation for moderate perturbations, sharp failures for severe rotations/noise, slower upper-quartile energy-error growth for hierarchical models such as SwinUNETR, and the inadequacy of SSIM for clinically relevant errors.

Significance. If the reported robustness ordering and metric limitations hold under validated conditions, the work would be significant for AI-assisted radiotherapy planning by showing that hierarchical attention can improve robustness to distribution shift and by demonstrating the value of physics-informed losses and evaluation over standard image metrics. The empirical comparisons across attention types and the use of energy consistency are concrete strengths.

major comments (2)

[Evaluation / Abstract] The central claim that hierarchical transformers exhibit improved robustness (slower growth in upper-quartile energy error) rests on the test perturbations being a faithful proxy for clinical distribution shifts. The manuscript applies geometric (rotations, translations), radiometric noise, and domain-shift perturbations but provides no quantitative comparison of their magnitudes or statistics against observed inter-patient, inter-scanner, or intra-fraction variations in a multi-center cohort (see Evaluation section and abstract). If the chosen ranges are narrower or lack the correlated structure of real data, the observed ordering could be an artifact of the test regime.
[Abstract / Results] Key experimental details required to support the reported trends (smooth vs. sharp degradation, hierarchical advantage) are missing: exact perturbation magnitudes, data-split sizes, statistical tests, and quantitative tables. The abstract outlines qualitative trends but leaves the comparisons only partially supported, undermining reproducibility and verification of the robustness conclusions.

minor comments (1)

[Abstract] The abstract states that the dose stage is additionally evaluated on public datasets but does not name the datasets or summarize the key quantitative outcomes, which would strengthen context for the fluence-stage claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript studying the robustness of transformer-based fluence map prediction for IMRT. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the support for our claims without overstating the current results.


Referee: [Evaluation / Abstract] The central claim that hierarchical transformers exhibit improved robustness (slower growth in upper-quartile energy error) rests on the test perturbations being a faithful proxy for clinical distribution shifts. The manuscript applies geometric (rotations, translations), radiometric noise, and domain-shift perturbations but provides no quantitative comparison of their magnitudes or statistics against observed inter-patient, inter-scanner, or intra-fraction variations in a multi-center cohort (see Evaluation section and abstract). If the chosen ranges are narrower or lack the correlated structure of real data, the observed ordering could be an artifact of the test regime.

Authors: We agree that linking the chosen perturbations more explicitly to real clinical variations would better substantiate the central claim. Our perturbations were selected to span mild to severe regimes based on typical prostate IMRT setup uncertainties (e.g., 3-10 mm translations and 2-10 degree rotations drawn from standard literature ranges), but the manuscript does not include a direct side-by-side quantitative mapping to multi-center statistics. In revision we will add a paragraph and supporting table in the Evaluation section that cites representative values from inter-patient and intra-fraction studies and shows how our test magnitudes align with or exceed those ranges. This addition will clarify that moderate perturbations correspond to common clinical shifts while severe cases probe failure boundaries, thereby reducing the risk that the hierarchical advantage is an artifact of the test design. revision: partial
Referee: [Abstract / Results] Key experimental details required to support the reported trends (smooth vs. sharp degradation, hierarchical advantage) are missing: exact perturbation magnitudes, data-split sizes, statistical tests, and quantitative tables. The abstract outlines qualitative trends but leaves the comparisons only partially supported, undermining reproducibility and verification of the robustness conclusions.

Authors: We concur that these specifics are necessary for full reproducibility and verification. The current manuscript reports trends at a high level but omits the precise parameter values, split sizes, and supporting tables. In the revised manuscript we will insert: (i) a table listing exact perturbation magnitudes (rotation angles, translation distances, noise standard deviations), (ii) the patient counts for each data split, (iii) any statistical comparisons performed across models, and (iv) expanded quantitative tables of energy-error metrics (including upper-quartile values) for every condition and backbone. These additions will directly underpin the abstract statements on smooth versus sharp degradation and the relative robustness of hierarchical attention. revision: yes
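One form the statistical comparison across models could take is a paired bootstrap over test cases for the difference in upper-quartile energy error between two backbones. The sketch below is illustrative only: the error arrays are synthetic, and the function name, resample count, and 95% interval are assumptions rather than the authors' procedure.

```python
import numpy as np

def paired_bootstrap_q75_diff(err_a, err_b, n_resamples=2000, seed=0):
    """95% bootstrap interval for q75(err_b) - q75(err_a), resampling cases jointly."""
    err_a, err_b = np.asarray(err_a), np.asarray(err_b)
    rng = np.random.default_rng(seed)
    n = len(err_a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # same resampled cases for both models
        diffs[i] = np.percentile(err_b[idx], 75) - np.percentile(err_a[idx], 75)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(lo), float(hi)

# Synthetic per-case energy errors for two hypothetical backbones.
rng = np.random.default_rng(1)
hierarchical = rng.uniform(0.0, 0.2, size=50)
global_attn = hierarchical + rng.uniform(0.01, 0.05, size=50)
lo, hi = paired_bootstrap_q75_diff(hierarchical, global_attn)
```

Resampling the same case indices for both models keeps the comparison paired, so case-to-case difficulty does not inflate the interval. An interval that excludes zero would support a difference in tail error at that perturbation level.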

Circularity Check

0 steps flagged

No circularity: empirical robustness evaluation is self-contained


The paper is an experimental study that trains transformer models on a prostate IMRT dataset, applies geometric/radiometric/domain perturbations, and measures degradation in energy error and other metrics. No derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps exist. Claims about hierarchical attention (e.g., SwinUNETR) showing slower error growth rest on direct held-out test results, not on any reduction to inputs by construction. The physics-informed loss and multi-metric evaluation are independent of the reported robustness ordering.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about perturbation realism and pipeline fidelity rather than new mathematical axioms or invented entities.

axioms (2)

domain assumption: The two-stage transformer pipeline (anatomy to dose to fluence) accurately models the clinical IMRT inverse planning workflow.
This underpins the entire evaluation setup described in the abstract.
domain assumption: The applied geometric, radiometric, and domain-shift perturbations are representative of real clinical distribution shifts.
This is required for the robustness conclusions to transfer to practice.

pith-pipeline@v0.9.0 · 5473 in / 1421 out tokens · 50582 ms · 2026-05-09T19:49:56.408489+00:00 · methodology

