Robustness of Transformer-Based Fluence Map Prediction Under Clinically Realistic Perturbations
Pith reviewed 2026-05-09 19:49 UTC · model grok-4.3
The pith
Hierarchical-attention transformers show slower growth in upper-quartile energy error for fluence map prediction under clinically motivated perturbations than global- or hybrid-attention models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the studied prostate IMRT setting, hierarchical transformer backbones exhibit slower growth in upper-quartile energy error than global or hybrid attention models when subjected to geometric perturbations, radiometric noise, reduced training data, and domain shifts. The same experiments show that SSIM alone fails to reflect clinically relevant errors, so physics-informed evaluation is required.
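The core claim is phrased in terms of per-case energy error, its upper quartile, and how fast that quartile grows with perturbation strength. The review gives no formulas, so what follows is a minimal sketch under stated assumptions: "energy" is taken to be the total summed fluence of a case, the upper quartile is the 75th percentile over test cases, and "growth" is read as the slope of a least-squares line through (perturbation level, upper-quartile error) pairs. All function names are hypothetical and not taken from the paper.

```python
import numpy as np

def relative_energy_error(pred, target, eps=1e-8):
    """Per-case relative error in total fluence ("energy").

    Assumption: energy = sum of all beamlet fluence values of a case.
    pred, target: arrays whose first axis indexes test cases.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    axes = tuple(range(1, pred.ndim))  # sum over every axis except the case axis
    e_pred = pred.sum(axis=axes)
    e_true = target.sum(axis=axes)
    return np.abs(e_pred - e_true) / (np.abs(e_true) + eps)

def upper_quartile(errors):
    """75th percentile of per-case errors (the upper-quartile statistic)."""
    return float(np.percentile(errors, 75))

def q3_growth_slope(levels, q3_values):
    """Slope of a least-squares line through (perturbation level, Q3 error).

    Under this reading, a smaller slope means slower growth in upper-quartile error.
    """
    slope, _intercept = np.polyfit(np.asarray(levels, dtype=float),
                                   np.asarray(q3_values, dtype=float), 1)
    return float(slope)
```

For example, `q3_growth_slope([0, 5, 10], [0.02, 0.04, 0.06])` gives a slope of about 0.004 per unit of perturbation; comparing such slopes across backbones is one simple way to operationalize "slower growth".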
What carries the argument
Two-stage transformer pipeline that first maps anatomy to dose and then dose to fluence, equipped with hierarchical, global, or hybrid attention and trained under a physics-informed loss that enforces energy consistency.
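The review says the loss "enforces energy consistency" but does not give its form. A minimal sketch, assuming a pixelwise L1 data term plus a weighted penalty on the relative mismatch of total fluence per case; the function name, the weight `lam`, and the toy arrays are illustrative assumptions, not the authors' implementation. NumPy is used only to keep the example self-contained; in training the same expression would be written in an autodiff framework.

```python
import numpy as np

def energy_consistency_loss(pred, target, lam=0.1, eps=1e-8):
    """Hypothetical physics-informed loss: pixelwise L1 plus an energy penalty.

    pred, target: arrays shaped (batch, beams, H, W) of beamlet fluence.
    lam: weight of the energy term (illustrative, not taken from the paper).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Data term: mean absolute error over every beamlet.
    l1 = np.mean(np.abs(pred - target))
    # Energy term: relative mismatch of total fluence, computed per case.
    axes = tuple(range(1, pred.ndim))
    e_pred = pred.sum(axis=axes)
    e_true = target.sum(axis=axes)
    energy = np.mean(np.abs(e_pred - e_true) / (np.abs(e_true) + eps))
    return float(l1 + lam * energy)

# Toy example: one case, one beam, a 2x2 fluence map of ones.
target = np.ones((1, 1, 2, 2))
# Zero-mean error: total fluence matches the target.
pred_balanced = target + np.array([[0.1, -0.1], [0.1, -0.1]]).reshape(1, 1, 2, 2)
# Uniform bias: 10% more total fluence than the target.
pred_biased = target + 0.1

loss_balanced = energy_consistency_loss(pred_balanced, target)
loss_biased = energy_consistency_loss(pred_biased, target)
print(loss_balanced, loss_biased)
```

Both predictions have the same mean absolute error (0.1), but only the uniformly biased one changes total fluence, so only it picks up the energy penalty (about 0.11 versus 0.1 with `lam=0.1`). This is the kind of error an image-space loss alone cannot distinguish.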
If this is right
- Moderate perturbations produce gradual performance loss, while severe rotations and high noise cause abrupt failures across all tested architectures.
- As perturbation strength increases, hierarchical attention models keep upper-quartile energy error lower than global and hybrid attention models do.
- SSIM values can remain high even when energy consistency and clinical acceptability are lost.
- Any reliable deployment of learned fluence prediction must include physics-informed metrics rather than relying on image-similarity scores alone.
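The point that SSIM can stay high while energy consistency is lost can be illustrated with a synthetic map that is a uniformly scaled copy of the target. This sketch assumes a single-window (global) SSIM rather than the usual sliding-window version, and takes "energy" to be summed fluence; the data are random numbers, not clinical fluence maps.

```python
import numpy as np

def global_ssim(x, y, data_range):
    """Single-window SSIM over the whole image (a simplification of the usual
    sliding-window SSIM), with the standard constants K1 = 0.01, K2 = 0.03."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def relative_total_energy_error(pred, target):
    """Relative difference in summed fluence (assumed definition of energy)."""
    return float(abs(np.sum(pred) - np.sum(target)) / np.sum(target))

rng = np.random.default_rng(0)
target = rng.random((64, 64))   # synthetic fluence map, not clinical data
pred = 1.1 * target             # uniform 10% over-delivery, structure unchanged

ssim_value = global_ssim(pred, target, data_range=float(target.max() - target.min()))
energy_err = relative_total_energy_error(pred, target)
print(f"SSIM = {ssim_value:.3f}, energy error = {energy_err:.1%}")
```

Because the prediction is a scaled copy of the target, the structure term of SSIM is exactly 1 and the luminance and contrast terms each stay near 2.2/2.21, so the printed SSIM is above 0.99 while total fluence is off by 10%.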
Where Pith is reading between the lines
- If the robustness advantage holds on broader multi-site data, hierarchical attention could become the default backbone for learned IMRT planning tools.
- The results suggest that future models should be stress-tested against measured clinical distribution shifts rather than synthetic perturbations alone.
- Combining the two-stage pipeline with explicit beam-angle or leaf-sequencing constraints may further reduce the sharp failures observed at high perturbation levels.
Load-bearing premise
The geometric perturbations, noise levels, and domain shifts chosen for testing are representative of the variations that actually occur in real patient scans and clinical practice.
What would settle it
Direct measurement on real clinical cases with recorded positioning errors or scanner differences showing that hierarchical models do not exhibit slower upper-quartile energy error growth than other attention types.
Original abstract
Learning-based fluence map prediction offers a fast alternative to iterative inverse planning in intensity-modulated radiation therapy (IMRT), but its robustness under realistic distribution shifts remains unclear. We study a two-stage transformer pipeline that maps anatomy (CT and contours) to dose and then to beamlet fluence maps. We compare fluence-stage transformer backbones with hierarchical, global, and hybrid attention, trained with a physics-informed loss enforcing energy consistency. Robustness is evaluated under geometric perturbations, radiometric noise, reduced training data, and domain shifts using a prostate IMRT dataset, with additional evaluation of the dose stage on public datasets. Results show smooth degradation under moderate perturbations but sharp failures under severe rotations and noise. Hierarchical transformers (e.g., SwinUNETR) exhibit slower growth in upper-quartile energy error, indicating improved robustness. We further show that SSIM alone fails to capture clinically relevant errors, highlighting the need for physics-informed evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies robustness of a two-stage transformer pipeline for IMRT fluence map prediction, mapping CT/contours to dose then to fluence using hierarchical, global, or hybrid attention backbones trained with a physics-informed energy-consistency loss. On a prostate IMRT dataset it evaluates degradation under geometric perturbations, radiometric noise, reduced training data, and domain shifts, reporting smooth degradation for moderate perturbations, sharp failures for severe rotations/noise, slower upper-quartile energy-error growth for hierarchical models such as SwinUNETR, and the inadequacy of SSIM for clinically relevant errors.
Significance. If the reported robustness ordering and metric limitations hold under validated conditions, the work would be significant for AI-assisted radiotherapy planning by showing that hierarchical attention can improve robustness to distribution shift and by demonstrating the value of physics-informed losses and evaluation over standard image metrics. The empirical comparisons across attention types and the use of energy consistency are concrete strengths.
major comments (2)
- [Evaluation / Abstract] The central claim that hierarchical transformers exhibit improved robustness (slower growth in upper-quartile energy error) rests on the test perturbations being a faithful proxy for clinical distribution shifts. The manuscript applies geometric (rotations, translations), radiometric noise, and domain-shift perturbations but provides no quantitative comparison of their magnitudes or statistics against observed inter-patient, inter-scanner, or intra-fraction variations in a multi-center cohort (see Evaluation section and abstract). If the chosen ranges are narrower or lack the correlated structure of real data, the observed ordering could be an artifact of the test regime.
- [Abstract / Results] Key experimental details required to support the reported trends (smooth vs. sharp degradation, hierarchical advantage) are missing: exact perturbation magnitudes, data-split sizes, statistical tests, and quantitative tables. The abstract outlines qualitative trends but leaves the comparisons only partially supported, undermining reproducibility and verification of the robustness conclusions.
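The second major comment asks for statistical tests. The review does not say which test the authors use. One common choice for per-case errors from two backbones evaluated on the same test patients is a paired Wilcoxon signed-rank test, optionally with a paired bootstrap interval for the difference in upper-quartile error. The sketch below uses `scipy.stats.wilcoxon`; the function name `compare_backbones`, the synthetic error values, and the bootstrap settings are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_backbones(err_a, err_b, n_boot=2000, seed=0):
    """Paired comparison of per-case errors from two models on the same cases."""
    err_a = np.asarray(err_a, dtype=float)
    err_b = np.asarray(err_b, dtype=float)
    if err_a.shape != err_b.shape:
        raise ValueError("errors must be paired per test case")
    # Non-parametric paired test on the per-case differences.
    p_value = float(wilcoxon(err_a, err_b).pvalue)
    # Paired bootstrap: resample case indices so each pair stays together.
    rng = np.random.default_rng(seed)
    n = err_a.size
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[i] = np.percentile(err_a[idx], 75) - np.percentile(err_b[idx], 75)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    q3_diff = float(np.percentile(err_a, 75) - np.percentile(err_b, 75))
    return {"wilcoxon_p": p_value, "q3_diff": q3_diff,
            "q3_diff_ci95": (float(lo), float(hi))}

# Synthetic usage: per-case energy errors for 40 hypothetical test patients.
rng = np.random.default_rng(1)
err_global = rng.gamma(shape=2.0, scale=0.02, size=40)
err_hier = 0.8 * err_global  # synthetic: 20% lower error on every case
result = compare_backbones(err_hier, err_global)
print(result)
```

Reporting both the p-value and an interval on the upper-quartile difference addresses the tail behaviour the robustness claim is about, rather than only the median.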
minor comments (1)
- [Abstract] The abstract states that the dose stage is additionally evaluated on public datasets but does not name the datasets or summarize the key quantitative outcomes, which would strengthen context for the fluence-stage claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript studying the robustness of transformer-based fluence map prediction for IMRT. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the support for our claims without overstating the current results.
Point-by-point responses
Referee: [Evaluation / Abstract] The central claim that hierarchical transformers exhibit improved robustness (slower growth in upper-quartile energy error) rests on the test perturbations being a faithful proxy for clinical distribution shifts. The manuscript applies geometric (rotations, translations), radiometric noise, and domain-shift perturbations but provides no quantitative comparison of their magnitudes or statistics against observed inter-patient, inter-scanner, or intra-fraction variations in a multi-center cohort (see Evaluation section and abstract). If the chosen ranges are narrower or lack the correlated structure of real data, the observed ordering could be an artifact of the test regime.
Authors: We agree that linking the chosen perturbations more explicitly to real clinical variations would better substantiate the central claim. Our perturbations were selected to span mild to severe regimes based on typical prostate IMRT setup uncertainties (e.g., 3-10 mm translations and 2-10 degree rotations drawn from standard literature ranges), but the manuscript does not include a direct side-by-side quantitative mapping to multi-center statistics. In revision we will add a paragraph and supporting table in the Evaluation section that cites representative values from inter-patient and intra-fraction studies and shows how our test magnitudes align with or exceed those ranges. This addition will clarify that moderate perturbations correspond to common clinical shifts while severe cases probe failure boundaries, thereby reducing the risk that the hierarchical advantage is an artifact of the test design. revision: partial
Referee: [Abstract / Results] Key experimental details required to support the reported trends (smooth vs. sharp degradation, hierarchical advantage) are missing: exact perturbation magnitudes, data-split sizes, statistical tests, and quantitative tables. The abstract outlines qualitative trends but leaves the comparisons only partially supported, undermining reproducibility and verification of the robustness conclusions.
Authors: We concur that these specifics are necessary for full reproducibility and verification. The current manuscript reports trends at a high level but omits the precise parameter values, split sizes, and supporting tables. In the revised manuscript we will insert: (i) a table listing exact perturbation magnitudes (rotation angles, translation distances, noise standard deviations), (ii) the patient counts for each data split, (iii) any statistical comparisons performed across models, and (iv) expanded quantitative tables of energy-error metrics (including upper-quartile values) for every condition and backbone. These additions will directly underpin the abstract statements on smooth versus sharp degradation and the relative robustness of hierarchical attention. revision: yes
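The simulated rebuttal quotes 3-10 mm translations and 2-10 degree rotations plus radiometric noise. A minimal sketch of how such perturbations might be applied to a CT volume with `scipy.ndimage`, assuming (z, y, x) ordering, in-plane rotation, linear interpolation, air (-1000 HU) fill, and white Gaussian noise; the function name and all parameter values are hypothetical. Contour masks would need the same geometric transform with nearest-neighbour interpolation (`order=0`). Real CT noise has scanner-dependent, spatially correlated texture, which white noise does not capture; that is the gap the first major comment raises.

```python
import numpy as np
from scipy import ndimage

def perturb_ct(volume, spacing_mm, rotation_deg=0.0,
               translation_mm=(0.0, 0.0, 0.0), noise_sd_hu=0.0, seed=None):
    """Apply an in-plane rotation, a translation, and additive Gaussian noise.

    volume: CT array ordered (z, y, x), values in HU.
    spacing_mm: voxel spacing (dz, dy, dx) in millimetres.
    """
    vol = np.asarray(volume, dtype=float)
    air = -1000.0  # fill value for voxels entering from outside the field of view
    if rotation_deg:
        # Rotate in the axial (y, x) plane and keep the original array shape.
        vol = ndimage.rotate(vol, rotation_deg, axes=(1, 2), reshape=False,
                             order=1, mode="constant", cval=air)
    # Convert the translation from millimetres to (possibly fractional) voxels.
    shift_vox = [t / s for t, s in zip(translation_mm, spacing_mm)]
    if any(shift_vox):
        vol = ndimage.shift(vol, shift_vox, order=1, mode="constant", cval=air)
    if noise_sd_hu:
        rng = np.random.default_rng(seed)
        vol = vol + rng.normal(0.0, noise_sd_hu, size=vol.shape)
    return vol

# Toy phantom: air background with a soft-tissue block (values in HU).
ct = np.full((8, 32, 32), -1000.0)
ct[:, 8:24, 8:24] = 40.0
for deg, mm in [(2.0, 3.0), (5.0, 5.0), (10.0, 10.0)]:  # illustrative mild-to-severe levels
    out = perturb_ct(ct, spacing_mm=(3.0, 1.0, 1.0), rotation_deg=deg,
                     translation_mm=(0.0, mm, 0.0), noise_sd_hu=20.0, seed=0)
    print(deg, mm, out.shape)
```

Tabulating the exact levels fed into a function like this, as the rebuttal commits to, is what makes the mild versus severe regimes reproducible.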
Circularity Check
No circularity: empirical robustness evaluation is self-contained
Full rationale
The paper is an experimental study that trains transformer models on a prostate IMRT dataset, applies geometric/radiometric/domain perturbations, and measures degradation in energy error and other metrics. No derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps exist. Claims about hierarchical attention (e.g., SwinUNETR) showing slower error growth rest on direct held-out test results, not on any reduction to inputs by construction. The physics-informed loss and multi-metric evaluation are independent of the reported robustness ordering.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The two-stage transformer pipeline (anatomy to dose to fluence) accurately models the clinical IMRT inverse planning workflow.
- Domain assumption: The applied geometric, radiometric, and domain-shift perturbations are representative of real clinical distribution shifts.
Reference graph
Works this paper leans on
- [1] J. Unkelbach, M. Alber, M. Bangert, R. Bokrantz, T. C. Y. Chan, J. O. Deasy, A. Fredriksson, B. L. Gorissen, M. van Herk, W. Liu, et al., "Robust radiotherapy planning," Physics in Medicine & Biology, 2018.
- [2] C. Hu, H. Wang, W. Zhang, Y. Xie, L. Jiao, and S. Cui, "TrDosePred: A deep learning dose prediction algorithm based on transformers for head and neck cancer radiotherapy," Journal of Applied Clinical Medical Physics, 2023.
- [3] U. Mgboh, R. Sultan, D. Zhu, and J. Kim, "Fluence map prediction with deep learning: A transformer-based approach," arXiv preprint arXiv:2511.08645, 2025.
- [4] H. Lee and K. Sheng, "Deep learning of fluence map patterns for deliverable IMRT planning," Medical Physics, 2019.
- [5] B. Glocker, R. Robinson, D. C. Castro, Q. Dou, and E. Konukoglu, "Machine learning with multi-site imaging data: An empirical study on the impact of scanner effects," arXiv preprint arXiv:1910.04597, 2019.
- [6] J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann, "Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study," PLoS Medicine, 2018.
- [7] D. Hendrycks and T. Dietterich, "Benchmarking neural network robustness to common corruptions and perturbations," in International Conference on Learning Representations (ICLR), 2019.
- [8] H. Guan and M. Liu, "Domain adaptation for medical image analysis: A survey," IEEE Transactions on Biomedical Engineering, 2021.
- [9] V. Rogowski, A. Svalkvist, M. Maspero, T. Janssen, F. C. Maruccio, J. Gorgisyan, J. Scherman, I. Häggström, V. Wåhlstrand, A. Gunnlaugsson, et al., "Impact of deep learning model uncertainty on manual corrections to auto-segmentation in prostate cancer radiotherapy," arXiv preprint arXiv:2502.18973, 2025.
- [10] C. Elith, S. E. Dempsey, N. Findlay, and H. M. Warren-Forward, "An introduction to the intensity-modulated radiation therapy (IMRT) techniques, tomotherapy, and VMAT," Journal of Medical Imaging and Radiation Sciences, 2011.
- [11] U. Mgboh, R. I. Sultan, J. Kim, K. Thind, and D. Zhu, "Fluenceformer: Transformer-driven multi-beam fluence map regression for radiotherapy planning," arXiv preprint arXiv:2512.22425, 2025.
- [12] A. Babier, B. Zhang, R. Mahmood, K. L. Moore, T. G. Purdie, A. L. McNiven, and T. C. Chan, "OpenKBP: The open-access knowledge-based planning grand challenge and dataset," Medical Physics, 2021.
- [13] D. Craft et al., "Shared data for intensity modulated radiation therapy (IMRT) optimization research: The CORT dataset," GigaScience, 2014.
- [14] M. van Herk, "Errors and margins in radiotherapy," Seminars in Radiation Oncology, 2004.
- [15] J. Solomon et al., "Quantitative comparison of noise texture across CT scanners from different manufacturers," Medical Physics, 2012.
- [16] W. Wang et al., "Deep learning–based fluence map prediction for pancreas stereotactic body radiation therapy with simultaneous integrated boost," Advances in Radiation Oncology, 2021.
- [17] H. S. Tan, K. Wang, and R. McBeth, "Deep evidential learning for radiotherapy dose prediction," Computers in Biology and Medicine, 182:109172, 2024.