TRACE: A Concept Bottleneck Model for Longitudinal 3D Glioblastoma Response Assessment
Pith reviewed 2026-06-30 06:49 UTC · model grok-4.3
The pith
TRACE frames glioblastoma response assessment as structured concept reasoning, using a RANO 2.0-aligned concept bottleneck on longitudinal 3D MRI.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRACE processes paired baseline and follow-up multimodal MRI scans with a shared 3D vision encoder to predict tumor measurements as root concepts, computes downstream RANO-derived concepts through deterministic rules, and incorporates scan interval and new-lesion information as passthrough concepts. Under 5-fold patient-wise cross-validation on the LUMIERE dataset, it achieves a 4-class macro F1 of 0.4769 and a binary progression-versus-non-progression macro F1 of 0.7085. The paper reports ablations indicating that the expert RANO graph and intervention-consistency training are important for performance, and intervention experiments showing that correcting concepts can improve downstream predictions.
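The split between predicted root concepts and a fixed reasoning layer can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `RootConcepts` fields, the function names, and the thresholds (the widely used RANO bidimensional cut-offs of a 50% decrease for partial response and a 25% increase for progression). The paper's actual concept set and rule graph are not given in this summary, and the scan-interval passthrough concept is omitted.

```python
from dataclasses import dataclass

# Assumed thresholds (not taken from the paper): common RANO 2D cut-offs.
PR_DECREASE = 0.50   # >= 50% SPD decrease -> partial response
PD_INCREASE = 0.25   # >= 25% SPD increase -> progressive disease


@dataclass
class RootConcepts:
    """Hypothetical root concepts; the paper's real concept set may differ."""
    baseline_spd_mm2: float   # would come from the encoder's concept head
    followup_spd_mm2: float   # would come from the encoder's concept head
    new_lesion: bool          # passthrough concept, not predicted from images


def relative_change(c: RootConcepts) -> float:
    """Downstream concept: fractional SPD change from baseline to follow-up."""
    if c.baseline_spd_mm2 <= 0:
        # No measurable baseline disease: any follow-up disease counts as growth.
        return float("inf") if c.followup_spd_mm2 > 0 else 0.0
    return (c.followup_spd_mm2 - c.baseline_spd_mm2) / c.baseline_spd_mm2


def rano_label(c: RootConcepts) -> str:
    """Map concepts to one of four labels using fixed, inspectable rules."""
    change = relative_change(c)
    if c.new_lesion or change >= PD_INCREASE:
        return "PD"
    if c.followup_spd_mm2 == 0 and c.baseline_spd_mm2 > 0:
        return "CR"
    if change <= -PR_DECREASE:
        return "PR"
    return "SD"
```

Because `rano_label` has no learned parameters, any error in the final label can be traced either to a root concept or to the rule set itself, which is the property the summary attributes to the design.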
What carries the argument
The RANO 2.0-aligned concept bottleneck that separates root tumor measurement concepts from deterministic downstream reasoning and passthrough concepts.
If this is right
- The expert RANO graph and intervention-consistency training are important for performance.
- Correcting predicted concepts can improve downstream response predictions.
- Structured concept bottlenecks offer a transparent direction for longitudinal response assessment.
- Larger protocol-aligned datasets and external validation are needed.
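The second point, that correcting a concept can change the downstream label, can be sketched as a concept intervention. This is an illustration under stated assumptions, not the paper's procedure: `predict_with_intervention`, `toy_rule`, the concept names, and the numbers are all hypothetical, and the 25% growth threshold is a stand-in for whatever rule graph the model actually uses.

```python
from typing import Callable, Dict, Optional

Concepts = Dict[str, float]


def predict_with_intervention(
    predicted: Concepts,
    rule: Callable[[Concepts], str],
    corrections: Optional[Concepts] = None,
) -> str:
    """Apply a fixed rule after overwriting any concepts a clinician corrected."""
    concepts = dict(predicted)          # copy so the model output is untouched
    concepts.update(corrections or {})  # corrected values replace predictions
    return rule(concepts)


def toy_rule(c: Concepts) -> str:
    """Stand-in rule (assumption): progression if SPD grows by >= 25%."""
    growth = (c["followup_spd"] - c["baseline_spd"]) / c["baseline_spd"]
    return "progression" if growth >= 0.25 else "non-progression"


# Hypothetical case: the encoder underestimates the follow-up measurement.
predicted = {"baseline_spd": 800.0, "followup_spd": 900.0}
before = predict_with_intervention(predicted, toy_rule)
after = predict_with_intervention(predicted, toy_rule, {"followup_spd": 1100.0})
```

In this toy case the label flips once the follow-up measurement is corrected, and the model's original concept vector is left unchanged for auditing.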
Where Pith is reading between the lines
- This structured approach could extend to other standardized response criteria such as RECIST in different cancers.
- Improving root concept accuracy with better segmentation could raise final label performance without altering the reasoning layer.
- The model enables clinician corrections at the measurement level rather than only at the final label.
- It demonstrates value in aligning AI outputs with existing clinical workflows instead of bypassing them.
Load-bearing premise
The predicted tumor measurements from imaging are accurate enough that applying the deterministic RANO rules yields clinically valid response labels.
What would settle it
A validation set where the model's concept predictions match expert tumor measurements but the resulting response labels still disagree with expert consensus would show that the deterministic RANO rules do not fully capture clinical judgment.
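Such a check could be run as a simple audit over a validation set. The sketch below is hypothetical throughout: the `Case` fields, the use of SPD as the compared concept, and the 10% relative tolerance are assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Case(NamedTuple):
    case_id: str
    predicted_spd: float   # model's root concept
    expert_spd: float      # expert measurement of the same quantity
    derived_label: str     # label produced by the deterministic rules
    expert_label: str      # expert consensus label


def rule_gap_cases(cases: List[Case], rel_tol: float = 0.10) -> List[str]:
    """Return ids where the concept matches the expert but the label does not."""
    flagged = []
    for c in cases:
        denom = max(abs(c.expert_spd), 1e-9)  # avoid division by zero
        concept_ok = abs(c.predicted_spd - c.expert_spd) / denom <= rel_tol
        if concept_ok and c.derived_label != c.expert_label:
            flagged.append(c.case_id)
    return flagged
```

A non-empty result would point at the rule layer rather than the encoder, since the measurement itself agrees with the expert in every flagged case.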
Original abstract
Longitudinal glioblastoma response assessment requires comparing subtle tumor changes across MRI time points using structured clinical criteria such as RANO. However, most deep learning methods predict response labels directly from imaging features, which limits clinical inspection, verification, and correction. We introduce TRACE, a RANO 2.0-aligned concept bottleneck model for interpretable 4-class glioblastoma response classification on longitudinal 3D MRI. TRACE processes paired baseline and follow-up multimodal MRI scans with a shared 3D vision encoder, predicts clinically meaningful tumor measurements as root concepts, computes downstream RANO-derived concepts through deterministic rules, and incorporates scan interval and new-lesion information as passthrough concepts. This design frames response assessment as structured concept reasoning rather than direct image-to-label prediction. Using 5-fold patient-wise cross-validation on the LUMIERE dataset, TRACE achieves a 4-class macro F1 of 0.4769 and a binary progression-versus-non-progression macro F1 of 0.7085. It improves over a concept bottleneck baseline and remains within the range of published non-interpretable deep learning approaches. Ablation studies show that the expert RANO graph and intervention-consistency training are important for performance, while intervention experiments demonstrate that correcting concepts can improve downstream predictions. These results suggest that structured concept bottlenecks offer a transparent and clinically aligned direction for longitudinal glioblastoma response assessment, while highlighting the need for larger protocol-aligned datasets and external validation.
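For readers comparing the two headline numbers, the following sketch shows how a 4-class macro F1 and a binary progression-versus-non-progression macro F1 can be computed from the same predictions. The label names (CR, PR, SD, PD), the collapse of every non-PD class into one group, and the convention of scoring an absent class as 0 are assumptions; the abstract does not state how the paper handles these details.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def to_binary(labels: Sequence[str]) -> List[str]:
    """Collapse the four response classes into PD versus everything else."""
    return ["PD" if label == "PD" else "non-PD" for label in labels]


# Toy labels for illustration only, not data from the paper.
y_true = ["CR", "PR", "SD", "PD", "PD", "SD"]
y_pred = ["PR", "PR", "SD", "PD", "SD", "SD"]

four_class = macro_f1(y_true, y_pred, ["CR", "PR", "SD", "PD"])
binary = macro_f1(to_binary(y_true), to_binary(y_pred), ["PD", "non-PD"])
```

Macro averaging weights each class equally, so a single rare class that is never predicted correctly (CR in the toy data) pulls the 4-class score down much more than the binary one.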
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TRACE, a RANO 2.0-aligned concept bottleneck model for 4-class glioblastoma response classification on longitudinal 3D MRI. A shared 3D vision encoder predicts root concepts (tumor measurements), deterministic rules compute the downstream RANO-derived concepts, and scan interval and new-lesion information enter as passthrough concepts. Under 5-fold patient-wise cross-validation on LUMIERE, the paper reports a 4-class macro F1 of 0.4769 and a binary progression macro F1 of 0.7085. It also claims, on the basis of ablations, that the expert graph and intervention-consistency training are important, and presents intervention experiments in which correcting concepts improves predictions.
Significance. If root-concept fidelity is verified, the approach supplies a clinically aligned, inspectable alternative to direct image-to-label models in a domain where RANO criteria are standard; the deterministic rule pathway and intervention mechanism are genuine strengths that could enable verification and correction. The reported F1 values sit within published ranges for non-interpretable methods, but the absence of concept-level metrics prevents assessing whether the gains arise from structured reasoning.
major comments (3)
- [Results] Results section: no quantitative metrics (volume MAE, diameter error, Dice overlap, or equivalent) are supplied for the root-concept predictions of tumor measurements on the same folds used for the final F1 scores. Because the central claim is that clinically valid labels are produced by accurate concept prediction followed by deterministic RANO rules, the lack of these numbers leaves open the possibility that the vision encoder is learning a direct image-to-label mapping that merely correlates with the derived labels.
- [Methods and Experiments] Methods and Experiments: the intervention-consistency training objective and the ablation studies that supposedly demonstrate the importance of the expert RANO graph are described only at high level; no table or figure quantifies the performance drop when either component is removed, so their claimed contribution to the reported 0.4769 / 0.7085 F1 scores cannot be evaluated.
- [Abstract and Results] Abstract and Results: the 4-class and binary macro F1 figures are given without error bars, standard deviations across folds, or statistical tests against the concept-bottleneck baseline, making it impossible to judge whether the stated improvement is reliable or within the variability of the 5-fold patient-wise split.
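The concept-level metrics requested in the first major comment are simple to compute once predicted and reference measurements are available per fold. The sketch below assumes measurements arrive as plain lists and masks as flattened 0/1 sequences; whether TRACE produces masks at all is not stated in this summary, so the Dice function applies only if it does.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """MAE between predicted and reference measurements (same units)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def dice(pred_mask: Sequence[int], true_mask: Sequence[int]) -> float:
    """Dice overlap on flattened binary masks; two empty masks score 1.0 by convention."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    return 1.0 if total == 0 else 2 * intersection / total
```

Reporting these on the same held-out folds as the F1 scores would show whether label accuracy tracks measurement accuracy, which is the question the comment raises.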
minor comments (2)
- [Figures and Captions] Figure captions and text should explicitly state how concept-prediction accuracy was (or was not) measured during training and evaluation.
- [Dataset] The LUMIERE dataset description would benefit from a table summarizing the distribution of response classes and scan intervals to allow readers to assess class balance and temporal coverage.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of validating the concept-bottleneck design. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
-
Referee: [Results] Results section: no quantitative metrics (volume MAE, diameter error, Dice overlap, or equivalent) are supplied for the root-concept predictions of tumor measurements on the same folds used for the final F1 scores. Because the central claim is that clinically valid labels are produced by accurate concept prediction followed by deterministic RANO rules, the lack of these numbers leaves open the possibility that the vision encoder is learning a direct image-to-label mapping that merely correlates with the derived labels.
Authors: We agree that root-concept fidelity metrics are necessary to substantiate the claim that performance derives from structured reasoning rather than direct image-to-label correlation. The current manuscript reports only downstream classification F1 and does not include volume MAE, diameter error, or Dice scores for the root tumor measurements on the same 5-fold splits. We will add these metrics (computed on held-out folds) to the Results section and an expanded supplementary table in the revision.
Revision: yes
-
Referee: [Methods and Experiments] Methods and Experiments: the intervention-consistency training objective and the ablation studies that supposedly demonstrate the importance of the expert RANO graph are described only at high level; no table or figure quantifies the performance drop when either component is removed, so their claimed contribution to the reported 0.4769 / 0.7085 F1 scores cannot be evaluated.
Authors: The manuscript states that ablation studies show the expert RANO graph and intervention-consistency training are important, yet provides no numerical performance drops. We will expand the Experiments section with a dedicated ablation table reporting 4-class and binary macro F1 for the full model versus variants without the graph and without the consistency loss, using the same 5-fold splits.
Revision: yes
-
Referee: [Abstract and Results] Abstract and Results: the 4-class and binary macro F1 figures are given without error bars, standard deviations across folds, or statistical tests against the concept-bottleneck baseline, making it impossible to judge whether the stated improvement is reliable or within the variability of the 5-fold patient-wise split.
Authors: We acknowledge that the reported F1 scores lack fold-wise variability measures and statistical comparison. In the revision we will add standard deviations across the five patient-wise folds to all reported metrics, include error bars on relevant figures, and report paired statistical tests (e.g., McNemar or Wilcoxon) against the concept-bottleneck baseline.
Revision: yes
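The fold-wise reporting promised in the last response can be sketched with the standard library alone. Two caveats apply. First, the code uses an exact sign-flip permutation test on paired fold differences as a stand-in for the Wilcoxon test the authors mention; it is not their stated procedure. Second, the fold scores are placeholders invented for illustration, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def fold_summary(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across folds."""
    return mean(scores), stdev(scores)


def paired_sign_flip_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate every assignment of +/- signs to the paired differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-fold macro F1 values (hypothetical, not from the paper).
model_f1 = [0.50, 0.46, 0.49, 0.45, 0.48]
baseline_f1 = [0.44, 0.43, 0.45, 0.41, 0.46]

model_mean, model_sd = fold_summary(model_f1)
p_value = paired_sign_flip_p(model_f1, baseline_f1)
```

With only five folds there are 2^5 = 32 sign assignments, so an exact paired test of this kind, the Wilcoxon signed-rank test included, cannot return a two-sided p-value below 2/32 = 0.0625 even when the model wins on every fold. A per-case test such as McNemar, computed on pooled out-of-fold predictions, does not share this floor.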
Circularity Check
No significant circularity; derivation uses external deterministic rules
The paper's central chain predicts root concepts (tumor measurements) from a 3D vision encoder, then applies fixed external RANO 2.0 deterministic rules to produce downstream concepts and final labels. This separation means the output labels are not equivalent to the model inputs by construction, nor are any fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatzes smuggled via citation are present in the provided text. The intervention-consistency objective is noted at high level but does not create a self-definitional loop. The reported F1 scores therefore reflect an independent evaluation of the structured pipeline rather than a tautological reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: RANO response categories can be faithfully recovered from a small set of tumor measurements via deterministic rules without loss of clinical validity.
Reference graph
Works this paper leans on
-
[1]
J. P. Thakkar, T. A. Dolecek, C. Horbinski, Q. T. Ostrom, D. D. Lightner, J. S. Barnholtz-Sloan, J. L. Villano, Epidemiologic and molecular prognostic review of glioblastoma, Cancer Epidemiology, Biomarkers & Prevention 23 (2014) 1985–1996. doi:10.1158/1055-9965.EPI-14-0275
-
[2]
P. Y. Wen, M. van den Bent, G. Youssef, T. F. Cloughesy, B. M. Ellingson, M. Weller, E. Galanis, D. P. Barboriak, J. de Groot, M. R. Gilbert, R. Huang, A. B. Lassman, M. Mehta, A. M. Molinaro, M. Preusser, R. Rahman, L. K. Shankar, R. Stupp, J. E. Villanueva-Meyer, W. Wick, D. R. Macdonald, D. A. Reardon, M. A. Vogelbaum, S. M. Chang, RANO 2.0: Update to ...
- [3]
-
[4]
M. Moassefi, S. Faghani, G. M. Conte, R. O. Kowalchuk, S. Vahdati, D. J. Crompton, C. Perez-Vega, R. A. D. Cabreja, S. A. Vora, A. Quiñones-Hinojosa, I. F. Parney, D. M. Trifiletti, B. J. Erickson, A deep learning model for discriminating true progression from pseudoprogression in glioblastoma patients, Journal of Neuro-Oncology 159 (2022) 447–455. doi: 1...
-
[5]
S. Khalighi, K. Reddy, A. Midya, K. B. Pandav, A. Madabhushi, M. Abedalthagafi, Artificial intelligence in neuro-oncology: advances and challenges in brain tumor diagnosis, prognosis, and precision treatment, npj Precision Oncology 8 (2024) 80. doi:10.1038/s41698-024-00575-0
-
[6]
A. Rončević, N. Koruga, A. S. Koruga, R. Rončević, Artificial intelligence in glioblastoma — transforming diagnosis and treatment, Chinese Neurosurgical Journal 11 (2025) 6. doi: 10.1186/ s41016-025-00399-2
-
[7]
D. J. Ghadimi, A. M. Vahdani, H. Karimi, P. Ebrahimi, M. Fathi, F. Moodi, A. Habibzadeh, F. Khodadadi Shoushtari, G. Valizadeh, H. Mobarak Salari, H. Saligheh Rad, Deep Learning-Based Techniques in Glioma Brain Tumor Segmentation Using Multi-Parametric MRI: A Review on Clinical Applications and Future Outlooks, Journal of Magnetic Resonance Imaging 61 (...
-
[8]
M. Hagenbuchner, The black box problem of ai in oncology, Journal of Physics: Conference Series 1662 (2020) 012012. doi:10.1088/1742-6596/1662/1/012012
-
[9]
M. A. Gulum, C. M. Trombley, M. Kantardzic, A review of explainable deep learning cancer detection models in medical imaging, Applied Sciences 11 (2021) 4573. doi:10.3390/app11104573
-
[10]
H. Charaabi, H. Mzoughi, R. E. Hamdi, M. Njah, EXplainable Artificial Intelligence (XAI) for MRI Brain Tumor Diagnosis: A Survey, in: Proceedings of the International Conference on Cyberworlds, 2023. doi:10.1109/CW58918.2023.00033
-
[11]
K. Desai, P. K. Patel, A. Barve, Enhancing trust in ai-driven diagnostics: A review of brain tumor classification using cnns with a hybrid grad-cam and counterfactual xai framework, in: 2025 4th International Conference on Applied Artificial Intelligence and Computing (ICAAIC), IEEE, Salem, India, 2025, pp. 1592–1598. doi:10.1109/ICAAIC64647.2025.11330252
-
[12]
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626. doi:10.1109/ICCV.2017.74
-
[13]
P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, P. Liang, Concept bottleneck models, in: Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 5338–5348. URL: https://proceedings.mlr.press/v119/koh20a.html
-
[14]
H. M. T. Alam, D. Srivastav, M. A. Kadir, D. Sonntag, Towards interpretable radiology report generation via concept bottlenecks using a multi-agentic RAG, in: C. Hauff, C. Macdonald, D. Jannach, G. Kazai, F. M. Nardini, F. Pinelli, F. Silvestri, N. Tonellotto (Eds.), Advances in Information Retrieval - 47th European Conference on Information Retrieval, EC...
-
[15]
S. Shin, Y. Jo, S. Ahn, N. Lee, A closer look at the intervention procedure of concept bottleneck models, in: Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022. URL: https://openreview.net/forum?id=PUspzfGsgY
-
[17]
H. M. T. Alam, D. Srivastav, A. Mohamed Selim, M. A. Kadir, M. M. H. Shuvo, D. Sonntag, Cbm-rag: Demonstrating enhanced interpretability in radiology report generation with multi-agent rag and concept bottleneck models, in: Companion Proceedings of the 17th ACM SIGCHI Symposium on Engineering Interactive Computing Systems, EICS ’25 Companion, Association ...
-
[18]
G. D. Felice, A. C. Flores, F. D. Santis, S. Santini, J. Schneider, P. Barbiero, A. Termine, Causally reliable concept bottleneck models, in: The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL: https://openreview.net/forum?id=UX143QGvb8
-
[19]
Y. Suter, U. Knecht, W. Valenzuela, M. Notter, E. Hewer, P. Schucht, R. Wiest, M. Reyes, The lumiere dataset: Longitudinal glioblastoma mri with expert rano evaluation, Scientific Data 9 (2022) 768. doi:10.1038/s41597-022-01881-7
-
[20]
N. Abu Khalaf, A. Desjardins, J. J. Vredenburgh, D. P. Barboriak, Repeatability of automated image segmentation with BraTumIA in patients with recurrent glioblastoma, AJNR. American Journal of Neuroradiology 42 (2021) 1080–1086. doi:10.3174/ajnr.A7071
-
[21]
P. Kickingereder, F. Isensee, I. Tursunova, J. Petersen, U. Neuberger, D. Bonekamp, G. Brugnara, M. Schell, T. Kessler, M. Foltyn, et al., Automated quantitative tumour response assessment of mri in neuro-oncology with artificial neural networks: a multicentre, retrospective study, The Lancet Oncology 20 (2019) 728–740. doi:10.1016/S1470-2045(19)30098-1
-
[22]
F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, K. H. Maier-Hein, nnu-net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2020) 203–211. doi:10.1038/s41592-020-01008-z
-
[23]
F. Isensee, M. Schell, I. Pflueger, G. Brugnara, D. Bonekamp, U. Neuberger, A. Wick, H.-P. Schlemmer, S. Heiland, W. Wick, M. Bendszus, K. H. Maier-Hein, P. Kickingereder, Automated brain extraction of multisequence MRI using artificial neural networks, Human Brain Mapping 40 (2019) 4952–4964. doi:10.1002/hbm.24750
-
[24]
Y. Suter, M. Notter, R. Meier, T. Loosli, P. Schucht, R. Wiest, M. Reyes, U. Knecht, Evaluating automated longitudinal tumor measurements for glioblastoma response assessment, Frontiers in Radiology 3 (2023). doi:10.3389/fradi.2023.1211859
- [25]
-
[26]
D. Amato, S. Calderaro, L. D. Reitano, G. Lo Bosco, R. Rizzo, F. Vella, Integrating deep learning and radiomic features for glioblastoma treatment response classification, in: 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2025. doi:10.1109/BIBM66473.2025.11356532
-
[27]
D. Tikhonov, M. Scatolin, M. Banerjee, Q. Ji, A. Jaheen, M. Salem, A. Elsayed, H. Wang, S. Hashmi, M. Yaqub, Predicting Brain Tumor Response to Therapy using a Hybrid Deep Learning and Radiomics Approach, 2025. doi:10.48550/arXiv.2509.06511
-
[28]
J. V. Jeyakumar, A. Sarker, L. A. Garcia, M. Srivastava, X-CHAR: A Concept-based Explainable Complex Human Activity Recognition Model, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 7 (2023) 17:1–17:28. doi:10.1145/3580804
-
[29]
P. Knab, S. Marton, P. J. Schubert, D. Guggiana, C. Bartelt, Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification, 2026. doi:10.48550/arXiv.2509.20899, arXiv:2509.20899 [cs.CV] version: 3
-
[30]
F. Bai, Y. Du, T. Huang, M. Q.-H. Meng, B. Zhao, M3d: Advancing 3d medical image analysis with multi-modal large language models, 2024. doi:10.48550/arXiv.2404.00578, arXiv:2404.00578
- [31]
-
[32]
A. Bunnell, Y. Glaser, D. Valdez, T. Wolfgruber, A. Altamirano, C. Zamora González, B. Y. Hernandez, P. Sadowski, J. A. Shepherd, Learning a Clinically-Relevant Concept Bottleneck for Lesion Detection in Breast Ultrasound, in: M. G. Linguraru, Q. Dou, A. Feragen, S. Giannarou, B. Glocker, K. Lekadir, J. A. Schnabel (Eds.), Medical Image Computing and ...
-
[33]
S. J. Magny, R. Shikhman, A. L. Keppke, Breast Imaging Reporting and Data System, 2023. URL: http://www.ncbi.nlm.nih.gov/books/NBK459169/
-
[34]
J. Kim, Z. Wang, Q. Qiu, Constructing Concept-Based Models to Mitigate Spurious Correlations with Minimal Human Effort, in: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXX, Springer-Verlag, Berlin, Heidelberg, 2024, pp. 137–153. doi:10.1007/978-3-031-72989-8_8
-
[35]
H. Fokkema, T. van Erven, S. Magliacane, Sample-efficient learning of concepts with theoretical guarantees: from data to concepts without interventions, in: The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL: https://openreview.net/forum?id=RCXF0UEmuE
-
[36]
Y. Wu, Y. Liu, Y. Yang, M. S. Yao, W. Yang, X. Shi, L. Yang, D. Li, Y. Liu, S. Yin, C. Lei, M. Zhang, J. C. Gee, X. Yang, W. Wei, S. Gu, A concept-based interpretable model for the diagnosis of choroid neoplasias using multimodal data, Nature Communications 16 (2025) 3504. doi:10.1038/s41467-025-58801-7
-
[37]
T. Oikarinen, S. Das, L. M. Nguyen, T.-W. Weng, Label-free concept bottleneck models, in: The Eleventh International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=FlCg47MNvBA
-
[38]
K. Prasse, P. Knab, S. Marton, C. Bartelt, M. Keuper, Dcbm: Data-efficient visual concept bottleneck models, in: Forty-second International Conference on Machine Learning, 2025. URL: https://openreview.net/forum?id=BdO4R6XxUH
-
[39]
I. Shrier, R. W. Platt, Reducing bias through directed acyclic graphs, BMC Medical Research Methodology 8 (2008) 70. doi:10.1186/1471-2288-8-70
-
[40]
D. Evans, B. Chaix, T. Lobbedez, C. Verger, A. Flahault, Combining directed acyclic graphs and the change-in-estimate procedure as a novel approach to adjustment-variable selection in epidemiology, BMC Medical Research Methodology 12 (2012) 156. doi:10.1186/1471-2288-12-156
-
[41]
M. Piccininni, S. Konigorski, J. L. Rohmann, T. Kurth, Directed acyclic graphs and causal thinking in clinical risk prediction modeling, BMC Medical Research Methodology 20 (2020) 179. doi:10.1186/s12874-020-01058-z
-
[42]
P. Barbiero, M. E. Zarlenga, F. Giannini, A. Termine, F. Bonchi, M. Jamnik, G. Marra, Actionable Interpretability Must Be Defined in Terms of Symmetries, 2026. doi:10.48550/arXiv.2601.12913, arXiv:2601.12913 [cs.AI]
- [43]
-
[44]
S. Chen, K. Ma, Y. Zheng, Med3D: Transfer Learning for 3D Medical Image Analysis, 2019. doi:10.48550/arXiv.1904.00625, arXiv:1904.00625 [cs.CV]
- [45]
-
[46]
M. Havasi, S. Parbhoo, F. Doshi-Velez, Addressing leakage in concept bottleneck models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Curran Associates Inc., Red Hook, NY, USA, 2022, pp. 23386–23397. URL: https://openreview.net/forum?id=tglniD_fn9
-
[47]
L. Gagnon, D. Gupta, U. Nguyen, M. Correia de Verdier, R. Saluja, G. Mastorakos, N. White, V. Goodwill, C. R. McDonald, T. Beaumont, C. Conlin, T. M. Seibert, J. Hattangadi-Gluth, S. Kesari, J. D. Schulte, D. Piccioni, K. M. Schmainda, N. Farid, A. M. Dale, J. D. Rudie, The University of California San Diego Post-Treatment Glioblastoma (UCSD-PTGBM) annota...
-
[48]
B. K. K. Fields, E. Calabrese, J. Mongan, S. Cha, C. P. Hess, L. P. Sugrue, S. M. Chang, T. L. Luks, J. E. Villanueva-Meyer, A. M. Rauschecker, J. D. Rudie, The university of california san francisco adult longitudinal post-treatment diffuse glioma mri dataset, Radiology: Artificial Intelligence 6 (2024) e230182. doi:10.1148/ryai.230182
- [49]
-
[50]
I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7
-
[51]
Y. Goyal, A. Feder, U. Shalit, B. Kim, Explaining Classifiers with Causal Concept Effect (CaCE), doi:10.48550/arXiv.1907.07165, arXiv:1907.07165 [cs.LG].
A. Additional Implementation Details
A.1. Clinical Background on SPD
The Sum of Products of Diameters (SPD) is defined as the sum, across measured lesions, of the product of the two largest perpendicular tumor diameters measured on a single imaging slice. It is retained in RANO-based response assessment because the response thresholds ...
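As a worked illustration of the SPD definition, the sketch below sums the bidimensional products over lesions and computes the percent change between two time points. The function names and the example diameters are hypothetical, and the paper's own measurement pipeline is not described in this excerpt.

```python
from typing import Iterable, Tuple


def sum_of_products(lesions_mm: Iterable[Tuple[float, float]]) -> float:
    """SPD in mm^2: sum over lesions of longest diameter times perpendicular diameter."""
    total = 0.0
    for longest, perpendicular in lesions_mm:
        if longest < 0 or perpendicular < 0:
            raise ValueError("diameters must be non-negative")
        total += longest * perpendicular
    return total


def percent_change(baseline_spd: float, followup_spd: float) -> float:
    """Percent SPD change relative to baseline."""
    if baseline_spd <= 0:
        raise ValueError("baseline SPD must be positive")
    return 100.0 * (followup_spd - baseline_spd) / baseline_spd


# Hypothetical example: two lesions at baseline, measured in millimetres.
baseline = sum_of_products([(30.0, 20.0), (12.0, 10.0)])  # 600 + 120
change = percent_change(baseline, 900.0)
```

In this example the baseline SPD is 720 mm^2 and a follow-up SPD of 900 mm^2 corresponds to a 25% increase.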