QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging
Pith reviewed 2026-06-26 18:24 UTC · model grok-4.3
The pith
QG-MIL uses four transformer components to reduce attention concentration and improve stability in medical multiple instance learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QG-MIL is a gated transformer aggregator for multiple instance learning that combines RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. These components together stabilize training and produce more uniform attention weights across instances without auxiliary losses, masking, or multi-stage regularization. When evaluated on six benchmarks spanning whole-slide pathology and cell-level hematology, the best QG-MIL variants outperform leading baselines on all tasks with an average gain of 6.1 mean macro F1 points and show more distributed instance weighting in attention analysis.
What carries the argument
The QG-MIL gated transformer aggregator, which applies the four listed normalization and gating mechanisms inside each transformer block to control how attention is computed and passed forward.
If this is right
- The best QG-MIL variants exceed leading baselines on all six benchmarks with an average +6.1 mean macro F1 improvement.
- Attention overlays and mass analysis show more uniform weighting across instances than in prior aggregators.
- The full combination of the four components yields the most consistent cross-domain results and the smallest variance compared with selected baselines.
- Individual components can reach the full model's score on some single datasets, yet only the complete design maintains performance across all six tasks.
Where Pith is reading between the lines
- The same normalization and gating pattern could be tested in non-medical MIL settings where attention concentration is also observed.
- The design suggests that internal transformer adjustments can sometimes substitute for external regularization techniques in attention models.
- Releasing the configurable code makes it straightforward to apply the aggregator to new medical imaging datasets or to ablate the components further.
Load-bearing premise
The observed gains in accuracy and stability come from the four architectural components rather than from differences in hyperparameter tuning or training schedules.
What would settle it
Retraining the baseline models under identical hyperparameter, schedule, and regularization conditions as the QG-MIL runs and finding that the performance gap disappears would falsify the central claim.
Figures
read the original abstract
Attention-based Multiple Instance Learning aggregators in medical imaging are prone to attention concentration, producing overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL across six benchmarks spanning whole-slide pathology and cell-level hematology, covering two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average improvement of +6.1 mean macro F1 points. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and tightest variance when compared to selected baselines. We release a configurable implementation to support reproducibility at: https://github.com/unica-visual-intelligence-lab/QG-MIL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces QG-MIL, a gated transformer aggregator for multiple instance learning in medical imaging. It proposes four components—RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style FFN modules—to stabilize training and promote distributed attention without auxiliary losses, masking, or multi-stage regularization. Evaluated on six benchmarks spanning whole-slide pathology and cell-level hematology, the best QG-MIL variants are claimed to outperform leading baselines on all datasets with an average +6.1 mean macro F1 improvement; attention overlays and mass analysis support more uniform instance weighting, while ablations show consistent cross-domain performance. A configurable implementation is released on GitHub.
Significance. If the gains prove attributable to the architectural choices under matched optimization, the work could provide a useful domain-agnostic MIL aggregator that improves stability and interpretability in medical imaging without extra training stages. The release of reproducible code is a clear strength that supports verification and extension.
major comments (2)
- [Abstract] Abstract: the central claim of +6.1 mean macro F1 improvement and reduced variance due to the four components lacks error bars, standard deviations across runs, or statistical significance tests, preventing assessment of whether the reported lift exceeds baseline variability.
- [Abstract] Abstract and implied Experiments section: the attribution of performance and stability gains to RMSNorm pre-norm, per-head QK normalization, output gating, and SwiGLU FFN is load-bearing for the headline result, yet no evidence is provided that the leading baselines received equivalent hyperparameter search budgets, learning-rate schedules, or regularization strength; internal ablations do not address this confound.
minor comments (2)
- [Abstract] The abstract should name the specific leading baselines and report the number of independent runs underlying the mean F1 scores.
- Notation for the four components would benefit from explicit equations in the methods to allow precise reproduction.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential utility of QG-MIL as a domain-agnostic aggregator. We address each major comment below and commit to revisions that strengthen the empirical support for our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of +6.1 mean macro F1 improvement and reduced variance due to the four components lacks error bars, standard deviations across runs, or statistical significance tests, preventing assessment of whether the reported lift exceeds baseline variability.
Authors: We agree that the absence of error bars, run-wise standard deviations, and formal significance tests limits the strength of the headline claims. Although the manuscript already states that QG-MIL yields the tightest variance among compared methods, we will add these elements in the revision: results from five independent random seeds per method, mean ± std tables, and paired statistical tests (Wilcoxon signed-rank or t-tests with Bonferroni correction) to quantify whether the +6.1 average improvement exceeds baseline variability. revision: yes
-
Referee: [Abstract] Abstract and implied Experiments section: the attribution of performance and stability gains to RMSNorm pre-norm, per-head QK normalization, output gating, and SwiGLU FFN is load-bearing for the headline result, yet no evidence is provided that the leading baselines received equivalent hyperparameter search budgets, learning-rate schedules, or regularization strength; internal ablations do not address this confound.
Authors: We acknowledge that the manuscript does not demonstrate exhaustive, matched hyperparameter search budgets across all baselines. Baselines were re-implemented from their original papers and tuned using the hyperparameter ranges and schedules reported in those works plus standard MIL practices for the respective datasets. To address the concern, the revised manuscript will include an expanded experimental appendix that enumerates every hyperparameter, optimizer setting, and regularization choice used for each baseline and QG-MIL variant. We will also clarify that the internal component ablations isolate the contribution of each architectural choice inside the same training pipeline, while the cross-domain consistency and variance reduction remain the primary empirical arguments. Full re-optimization of every baseline under an identical search budget would require substantial additional compute and is not feasible within the current study; we will therefore note this limitation explicitly. revision: partial
Circularity Check
No circularity: empirical claims rest on external benchmarks
full rationale
The paper advances an empirical architecture (QG-MIL) and reports performance on six held-out medical imaging benchmarks. No derivation chain, first-principles prediction, or fitted quantity is presented that reduces the claimed +6.1 macro-F1 improvement or attention-distribution results to a quantity defined inside the model by construction. The four components are introduced as design choices and validated via ablation and external comparison; no self-citation supplies a uniqueness theorem or ansatz that the main result depends upon. The reader's assessment that no equations collapse the improvement to internal fitted parameters is accurate, placing the work in the normal non-circular category for an empirical ML paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard assumptions of supervised deep learning (i.i.d. train/test splits, gradient-based optimization, macro F1 as appropriate metric)
Reference graph
Works this paper leans on
-
[1]
Attention-based Deep Multiple Instance Learning
Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based Deep Multiple Instance Learning. In Proceedings of the 35th International Conference on Machine Learning, July 2018. URLhttps://proceedings. mlr.press/v80/ilse18a.html
2018
-
[2]
Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification
Yunlong Zhang, Honglin Li, Yunxuan Sun, Sunyi Zheng, Chenglu Zhu, and Lin Yang. Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-Octo...
-
[3]
Bin Li, Yin Li, and Kevin W. Eliceiri. Dual-stream Multiple Instance Learning Network for Whole Slide Image Classification with Self-supervised Contrastive Learning. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14313–14323, Nashville, TN, USA, June 2021. IEEE. ISBN 978-1-6654- 4509-2.10.1109/CVPR46437.2021.01409. URL...
-
[4]
Bi-directional weakly supervised knowledge distillation for whole slide image classification.Advances in Neural Information Processing Systems, 35:15368–15381, 2022
Linhao Qu, Manning Wang, Zhijian Song, et al. Bi-directional weakly supervised knowledge distillation for whole slide image classification.Advances in Neural Information Processing Systems, 35:15368–15381, 2022
2022
-
[5]
Wenhao Tang, Sheng Huang, Xiaoxian Zhang, Fengtao Zhou, Yi Zhang, and Bo Liu. Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4055–4064, Paris, France, October 2023. IEEE. ISBN 979-8-3503-0718-4. 10.1109/ICCV51070.2023.0037...
-
[6]
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, May 2025. URL http://arxiv.org/abs/2505.06708. arXiv:2505.06708 [cs]
Pith/arXiv arXiv 2025
-
[7]
Clinical-grade computational pathology using weakly supervised deep learning on whole slide images.Nature medicine, 25(8):1301–1309, 2019
Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images.Nature medicine, 25(8):1301–1309, 2019
2019
-
[8]
Jorge Diosdado, Pere Gilabert, Santi Seguí, and Henar Borrego. Lunghist700: A dataset of histological images for deep learning in pulmonary pathology.Scientific Data, 11(1):1088, 2024. 10.1038/s41597-024-03944-3 . URLhttps://doi.org/10.1038/s41597-024-03944-3
-
[9]
A clinical prostate biopsy dataset with undetected cancer.Scientific Data, 12(1):423, 2025
Eduard Chelebian, Christophe Avenel, Helena Järemo, Pernilla Andersson, Carolina Wählby, and An- ders Bergh. A clinical prostate biopsy dataset with undetected cancer.Scientific Data, 12(1):423, 2025. 10.1038/s41597-025-04758-7. URLhttps://doi.org/10.1038/s41597-025-04758-7
-
[10]
Richard J. Chen, Tong Ding, Ming Y . Lu, and et al. Towards a general-purpose foundation model for computational pathology.Nature Medicine, 30:850–862, 2024. 10.1038/s41591-024-02857-3 . URL https://doi.org/ 10.1038/s41591-024-02857-3
-
[11]
A whole-slide foundation model for digital pathology from real-world data
Haotian Xu, Naoto Usuyama, Jatin Bagga, and et al. A whole-slide foundation model for digital pathology from real-world data.Nature, 630:181–188, 2024. 10.1038/s41586-024-07441-w . URL https://doi.org/10. 1038/s41586-024-07441-w
-
[12]
and Chen, Bowen and Williamson, Drew F
Ming Y . Lu, Bowen Chen, Drew F. K. Williamson, and et al. A visual-language foundation model for computational pathology.Nature Medicine, 30:863–874, 2024. 10.1038/s41591-024-02856-4 . URL https://doi.org/ 10.1038/s41591-024-02856-4
-
[13]
Deep residual learning for image recognition, 2015
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URLhttps://arxiv.org/abs/1512.03385
Pith/arXiv arXiv 2015
-
[14]
Explainable ai identifies diagnostic cells of genetic aml subtypes.PLOS Digital Health, 2, 2023
Matthias Hehr, Ario Sadafi, Christian Matek, Peter Lienemann, Christian Pohlkamp, Torsten Haferlach, Karsten Spiekermann, and Carsten Marr. Explainable ai identifies diagnostic cells of genetic aml subtypes.PLOS Digital Health, 2, 2023. URLhttps://api.semanticscholar.org/CorpusID:257534165. 7 PRIME AI paper
2023
-
[15]
John W. Sidhom, I. J. Siddarthan, Brian S. Lai, and et al. Deep learning for diagnosis of acute promyelocytic leukemia via recognition of genomically imprinted morphologic features.npj Precision Oncology, 5:38, 2021. 10.1038/s41698-021-00179-y
-
[16]
Transformer-based hematological malignancy prediction from peripheral blood smears in a real-world cohort, 2025
Muhammed Furkan Dasdelen, Ivan Kukuljan, Peter Lienemann, Fatih Ozlugedik, Ario Sadafi, Matthias Hehr, Karsten Spiekermann, Christian Pohlkamp, and Carsten Marr. Transformer-based hematological malignancy prediction from peripheral blood smears in a real-world cohort, 2025. URL https://arxiv.org/abs/2509. 20402
2025
-
[17]
Wagner, Salome Kazeminia, Ece Sancar, Matthias Hehr, Julia A
Valentin Koch, Sophia J. Wagner, Salome Kazeminia, Ece Sancar, Matthias Hehr, Julia A. Schnabel, Tingying Peng, and Carsten Marr. DinoBloom: A Foundation Model for Generalizable Cell Embeddings in Hematology . Inproceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, volume LNCS 15012. Springer Nature Switzerland, October 2024
2024
-
[18]
Lu, Tong Ding, , and Faisal Mahmood
Daniel Shao, Richard J Chen, Andrew H Song, Joel Runevic, Ming Y . Lu, Tong Ding, , and Faisal Mahmood. Do multiple instance learning models transfer? InInternational conference on machine learning, 2025
2025
-
[19]
Liang Yixuan. Cost-sensitive multi-kernel elm based on reduced expectation kernel auto-encoder.PLOS ONE, 20 (2):1–20, 02 2025. 10.1371/journal.pone.0314851. URL https://doi.org/10.1371/journal.pone. 0314851
-
[20]
Explainable AI identifies diagnostic cells of genetic AML subtypes.PLOS digital health, 2023
Matthias Hehr, Ario Sadafi, Christian Matek, Peter Lienemann, Christian Pohlkamp, Torsten Haferlach, Karsten Spiekermann, and Carsten Marr. Explainable AI identifies diagnostic cells of genetic AML subtypes.PLOS digital health, 2023. URLhttps://pmc.ncbi.nlm.nih.gov/articles/PMC10016704/. 8
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.