pith · machine review for the scientific record

arxiv: 2605.12575 · v1 · submitted 2026-05-12 · 📡 eess.IV · cs.AI · cs.CV

Recognition: no theorem link

Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL


Pith reviewed 2026-05-14 20:35 UTC · model grok-4.3

classification 📡 eess.IV · cs.AI · cs.CV

keywords WSI-MIL · multiple instance learning · model interpretability · tile selection · rationales · frozen backbones · attention mechanisms · histopathology imaging

The pith

FOCI reveals that compact rationales for frozen WSI-MIL predictions depend on the choice of backbone aggregator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether a trained WSI-MIL classifier's slide-level prediction can be recovered from a small, model-consistent tile subset without any retraining. It tests this by adding FOCI, a lightweight layer that learns to select or drop tiles under sufficiency and exclusion objectives. Evaluation with a sequential reveal protocol across multiple benchmarks and models shows that some backbones, especially transformers, support compact rationales, while others quickly saturate or conflict with the external selection. This matters for interpretability because it offers a way to audit when a black-box MIL decision can be localized to a reviewable number of tiles.

Core claim

Across three WSI benchmarks and seven MIL backbones, FOCI shows that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, FOCI reduces the Minimum Sufficient K tile count by 32-56% relative to CLS-proxy ranking, and ACMIL+FOCI attains the highest mean SHI of +0.465.
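The Minimum Sufficient K statistic can be read operationally as the smallest top-ranked tile budget whose restricted prediction matches the full-slide prediction. A minimal sketch of that reading, with a hypothetical `predict` callable and a precomputed `ranking` standing in for the backbone and FOCI's selection (the paper's exact matching criterion is not reproduced here):

```python
def minimum_sufficient_k(predict, tiles, ranking, tol=0.05):
    """Smallest K such that the prediction from the top-K ranked tiles
    matches the full-bag prediction within `tol`. Returns
    len(tiles) + 1 if no prefix is sufficient (a saturation regime)."""
    full = predict(tiles)
    for k in range(1, len(tiles) + 1):
        subset = [tiles[i] for i in ranking[:k]]
        if abs(predict(subset) - full) <= tol:
            return k
    return len(tiles) + 1

# Toy model: slide score = max over tile scores (classic MIL max-pooling).
tiles = [0.1, 0.9, 0.3, 0.8, 0.2]
predict = lambda ts: max(ts)
ranking = sorted(range(len(tiles)), key=lambda i: tiles[i], reverse=True)
print(minimum_sufficient_k(predict, tiles, ranking))  # → 1: the top tile suffices
```

A worse ranking inflates MSK, which is the sense in which FOCI's 32-56% reductions over CLS-proxy ranking are a property of the selection, not of the frozen model.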

What carries the argument

FOCI, a lightweight rationale-readout layer trained over a frozen MIL backbone with model-output sufficiency and exclusion objectives on keep/drop tile subsets.
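The two objectives can be sketched as scalar losses on a frozen scorer: sufficiency pulls the keep-subset prediction toward the full-bag prediction, and exclusion pushes the drop-subset prediction toward an uninformative prior. This is a toy rendering under assumed names (`f`, `foci_losses`, the squared-error form, and the 0.5 prior are all illustrative, not the paper's formulation):

```python
def foci_losses(f, bag, keep, drop, uninformative=0.5):
    """Sufficiency: kept tiles alone should reproduce the full-bag
    prediction. Exclusion: dropped tiles alone should carry no
    decision signal. `f` is the frozen backbone plus readout;
    `keep`/`drop` are index lists over the bag."""
    full = f(bag)
    suff = (f([bag[i] for i in keep]) - full) ** 2
    excl = (f([bag[i] for i in drop]) - uninformative) ** 2
    return suff, excl

# Toy frozen scorer: mean tile evidence in [0, 1].
f = lambda ts: sum(ts) / len(ts)
bag = [0.9, 0.8, 0.5, 0.5]
suff, excl = foci_losses(f, bag, keep=[0, 1], drop=[2, 3])
```

In this toy, the dropped tiles sit exactly at the uninformative prior, so the exclusion loss is zero while the sufficiency loss penalizes the gap between the kept-tile mean and the full-bag mean.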

Load-bearing premise

That the sufficiency and exclusion objectives produce tile subsets that are genuinely sufficient for the original model without introducing readout artifacts.

What would settle it

A direct test showing that, for a backbone with high reported SHI, the FOCI-selected minimal tiles fail to match the full-slide prediction accuracy while random same-sized subsets succeed.
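That control can be sketched as a small harness comparing the selected subset against random subsets of the same size, again with a hypothetical `predict` callable:

```python
import random

def subset_control(predict, tiles, selected, trials=200, tol=0.05, seed=0):
    """Return (selected_ok, random_success_rate). The headroom claim
    would be undermined if selected_ok is False while the random
    same-sized rate is high."""
    rng = random.Random(seed)
    full = predict(tiles)
    k = len(selected)
    sel_ok = abs(predict([tiles[i] for i in selected]) - full) <= tol
    hits = 0
    for _ in range(trials):
        idx = rng.sample(range(len(tiles)), k)
        if abs(predict([tiles[i] for i in idx]) - full) <= tol:
            hits += 1
    return sel_ok, hits / trials

# Toy: max-pooling model, selection that found the decisive tile.
sel_ok, rate = subset_control(lambda ts: max(ts), [0.1, 0.9, 0.3], [1], trials=50)
```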

Figures

Figures reproduced from arXiv: 2605.12575 by Hwiyoung Kim, Hyun Do Jung, Jungwon Choi, Soojung Choi, Yujin Oh.

Figure 1. Selection headroom for post-hoc rationale highlighting in frozen WSI-MIL: a frozen MIL …

Figure 2. FOCI as a frozen rationale-readout probe. The frozen encoder maps WSI tiles to features …

Figure 3. Slide-level AUC and SHI are decoupled. Each point is a (backbone, dataset) pair; color …

Figure 4. Qualitative illustration of FOCI selections on two LUSC slides. Each slide is shown twice. Top row of each pair: WSI thumbnail with FOCI’s top-32 selected tiles outlined in yellow and the top-3 highlighted in orange (#1, #2, #3), plus three zoom-in crops at 20× magnification. Bottom row of each pair: the same WSI rendered three times with each method’s top-32 ranked tiles outlined; cyan = TransMIL CLS-proxy ranking …

Figure 5. Extended SRP reveal curves for seven backbones.

Figure 6. Cross-method SRP confidence curves on TCGA-NSCLC, TCGA-BRCA, and PANDA.
original abstract

Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.
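The abstract's SRP and SHI machinery can be sketched end to end. The reveal curve below is a direct reading of an insertion-style protocol; the headroom summary is only one plausible area-gap reading of SHI, since the paper's exact normalization is not reproduced here:

```python
def srp_curve(predict, tiles, ranking, budgets):
    """Insertion-style Sequential Reveal Protocol: score the model on
    growing top-K prefixes of the ranked tiles."""
    return [predict([tiles[i] for i in ranking[:k]]) for k in budgets]

def headroom_index(curve, baseline_curve):
    """Mean gap between a candidate reveal curve and a baseline curve;
    positive values mean the candidate ranking reaches the decision
    with fewer revealed tiles (an SHI-style summary, assumed form)."""
    return sum(c - b for c, b in zip(curve, baseline_curve)) / len(curve)

tiles = [0.1, 0.9, 0.3, 0.8]
predict = lambda ts: max(ts)
budgets = [1, 2, 3, 4]
good = srp_curve(predict, tiles, [1, 3, 2, 0], budgets)  # → [0.9, 0.9, 0.9, 0.9]
weak = srp_curve(predict, tiles, [0, 2, 3, 1], budgets)  # → [0.1, 0.3, 0.8, 0.9]
shi_like = headroom_index(good, weak)
```

A saturation regime in this picture is a backbone whose curves stay flat and low until nearly the whole bag is revealed, leaving no ranking much headroom over another.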

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FOCI, a lightweight rationale-readout layer trained on frozen WSI-MIL backbones using sufficiency and exclusion objectives over keep/drop tile subsets. It evaluates these via an adapted Sequential Reveal Protocol (SRP) and the Selection Headroom Index (SHI) across three benchmarks and seven MIL architectures, claiming architecture-dependent selection headroom: transformers admit compact rationales (e.g., 32-56% MSK reduction for TransMIL vs. CLS-proxy), attention-pooling baselines saturate, and hard-selection models conflict with external readouts, with ACMIL+FOCI yielding the highest mean SHI (+0.465). Complementary deletion perturbations and selected-only downstream checks are included.

Significance. If the central claims hold, this provides a practical model-level interpretability and audit tool for WSI-MIL, quantifying when slide-level predictions can be recovered from compact, output-consistent tile subsets without retraining. The multi-backbone, multi-benchmark scope plus deletion and downstream checks constitute a strength, offering a falsifiable protocol for distinguishing architectures by inherent selection headroom in computational pathology.

major comments (2)
  1. FOCI training procedure (Section 3) and SRP evaluation (Section 4): the joint optimization of FOCI on the exact keep/drop subsets later used in SRP creates a risk that MSK reductions (32-56% for TransMIL) and SHI gains (+0.465 for ACMIL) partly reflect objective-induced biases rather than intrinsic backbone headroom. While deletion checks and selected-only evaluation are noted as mitigations, no dedicated ablation on sensitivity to the keep/drop training procedure is reported; this is load-bearing for the architecture-dependent claim.
  2. Experimental results (Section 5): the manuscript reports specific quantitative improvements (e.g., 32-56% MSK reductions, +0.465 SHI) but provides no details on statistical testing, error bars across runs, or sensitivity to FOCI hyperparameters and random seeds. This weakens confidence in the cross-architecture comparisons given the empirical protocol.
minor comments (2)
  1. Abstract: the three WSI benchmarks are referenced but not named; specifying them (e.g., CAMELYON16, TCGA-LUAD) would improve immediate readability.
  2. Notation and definitions (Sections 2-3): SHI, MSK, and the precise formulation of the sufficiency/exclusion losses would benefit from an explicit notation table or expanded initial presentation to aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review, as well as the positive assessment of the work's significance. We address each major comment point-by-point below, with clarifications on the design choices and revisions to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: FOCI training procedure (Section 3) and SRP evaluation (Section 4): the joint optimization of FOCI on the exact keep/drop subsets later used in SRP creates a risk that MSK reductions (32-56% for TransMIL) and SHI gains (+0.465 for ACMIL) partly reflect objective-induced biases rather than intrinsic backbone headroom. While deletion checks and selected-only evaluation are noted as mitigations, no dedicated ablation on sensitivity to the keep/drop training procedure is reported; this is load-bearing for the architecture-dependent claim.

    Authors: We appreciate the referee highlighting this potential circularity. The joint use of keep/drop subsets is by design: FOCI optimizes a readout to recover the frozen backbone's output from minimal sufficient subsets (sufficiency objective) while penalizing reliance on excluded tiles (exclusion objective), directly quantifying selection headroom. SRP then evaluates the resulting minimal K in an insertion-style protocol. The reported MSK reductions and SHI values thus measure how compactly each backbone's decision can be localized, rather than claiming independence from the readout. Deletion perturbations and selected-only downstream checks were included precisely as orthogonal validations that the subsets remain predictive outside the training distribution. Nevertheless, to further isolate any sensitivity, we have added a dedicated ablation in the revised Section 5 varying keep/drop sampling ratios, loss weighting, and subset generation strategies; the relative architecture ordering by SHI is preserved, supporting that the headroom differences are backbone-intrinsic. revision: yes

  2. Referee: Experimental results (Section 5): the manuscript reports specific quantitative improvements (e.g., 32-56% MSK reductions, +0.465 SHI) but provides no details on statistical testing, error bars across runs, or sensitivity to FOCI hyperparameters and random seeds. This weakens confidence in the cross-architecture comparisons given the empirical protocol.

    Authors: We agree that explicit variability and statistical reporting are necessary to support the cross-architecture claims. The original manuscript focused on mean trends across benchmarks but omitted these details. In the revised version, we now report standard deviations over five independent random seeds for FOCI training, SRP evaluation, and hyperparameter sweeps (including sufficiency/exclusion loss coefficients and subset sampling temperature). We additionally include paired t-test p-values for key SHI and MSK differences between architectures, confirming statistical significance of the reported gaps (e.g., TransMIL vs. attention-pooling baselines). These additions appear in the updated Section 5 and supplementary material. revision: yes
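The promised paired tests can be sketched in a few lines. The t statistic below is the standard paired form over per-seed metric pairs; the numbers are hypothetical stand-ins, not values from the paper, and a p-value would come from a t table with `n - 1` degrees of freedom:

```python
import math

def paired_t(a, b):
    """Paired t statistic for matched per-seed metric values from two
    methods: t = mean(d) / (sd(d) / sqrt(n)) over differences d."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased variance
    return mean / math.sqrt(var / n)

# Hypothetical per-seed SHI values for two backbones (five seeds each).
t = paired_t([0.46, 0.47, 0.45, 0.48, 0.44],
             [0.30, 0.32, 0.29, 0.31, 0.33])
```

Pairing by seed and dataset matters here: it removes shared run-to-run variation that an unpaired comparison across architectures would count as noise.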

Circularity Check

0 steps flagged

Empirical protocol with new objectives; no circular reduction by construction detected

full rationale

The paper defines FOCI as a new lightweight readout trained on sufficiency/exclusion objectives over keep/drop subsets and evaluates via the newly introduced SRP and SHI metrics. No equations, self-citations, or claims reduce the reported MSK reductions (32-56%) or SHI gains (+0.465) to quantities that are tautologically equivalent to the training inputs. Complementary deletion checks and selected-only evaluation are presented as independent verifications. This is a standard empirical measurement setup on frozen backbones; the central claims about architecture-dependent headroom rest on observable performance differences rather than definitional loops or fitted-input predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard MIL bag assumptions and the new sufficiency/exclusion objectives; no independent evidence for any postulated entities is provided.

pith-pipeline@v0.9.0 · 5638 in / 1264 out tokens · 49137 ms · 2026-05-14T20:35:07.746936+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages

  1. [1]

    Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J

    Gabriele Campanella, Matthew G. Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J. Busam, Edi Brogi, Victor E. Reuter, David S. Klimstra, and Thomas J. Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images.Nature Medicine, 25(8):1301–1309, 2019

  2. [2]

    Attention-based deep multiple instance learning

    Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In Proceedings of the 35th International Conference on Machine Learning, pages 2127–2136. PMLR, 2018

  3. [3]

    Data-efficient and weakly supervised computational pathology on whole-slide images.Nature biomedical engineering, 5(6):555–570, 2021

    Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images.Nature biomedical engineering, 5(6):555–570, 2021

  4. [4]

    Towards a general-purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

    Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

  5. [5]

    Deep learning for whole slide image analysis: an overview.Frontiers in medicine, 6:264, 2019

    Neofytos Dimitriou, Ognjen Arandjelovi´c, and Peter D Caie. Deep learning for whole slide image analysis: an overview.Frontiers in medicine, 6:264, 2019

  6. [6]

    Multiple instance learning for digital pathology: A review of the state-of-the-art, limitations & future potential.Computerized Medical Imaging and Graphics, 112:102337, 2024

    Michael Gadermayr and Maximilian Tschuchnig. Multiple instance learning for digital pathology: A review of the state-of-the-art, limitations & future potential.Computerized Medical Imaging and Graphics, 112:102337, 2024

  7. [7]

    Sofia Serrano and Noah A. Smith. Is attention interpretable? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy, July 2019. Association for Computational Linguistics

  8. [8]

    Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary C. Lipton. Learning to deceive with attention-based explanations. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4782–4793. Association for Computational Linguistics, July 2020

  9. [9]

    Multiple instance learning for wsi: A comparative analysis of attention-based approaches.Journal of Pathology Informatics, 15:100403, 2024

    Martim Afonso, Praphulla MS Bhawsar, Monjoy Saha, Jonas S Almeida, and Arlindo L Oliveira. Multiple instance learning for wsi: A comparative analysis of attention-based approaches.Journal of Pathology Informatics, 15:100403, 2024

  10. [10]

    Interpretability of deep learning models: A survey of results

    Supriyo Chakraborty, Richard Tomsett, Ramya Raghavendra, Daniel Harborne, Moustafa Alzantot, Fed- erico Cerutti, Mani Srivastava, Alun Preece, Simon Julier, Raghuveer M Rao, et al. Interpretability of deep learning models: A survey of results. In2017 IEEE smartworld, ubiquitous intelligence & computing, advanced & trusted computed, scalable computing & co...

  11. [11]

    How effective can dropout be in multiple instance learning ? InForty-second International Conference on Machine Learning, 2025

    Wenhui Zhu, Peijie Qiu, Xiwen Chen, Zhangsihao Yang, Aristeidis Sotiras, Abolfazl Razi, and Yalin Wang. How effective can dropout be in multiple instance learning ? InForty-second International Conference on Machine Learning, 2025

  12. [12]

    The cancer genome atlas pan-cancer analysis project

    John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113–1120, 2013

  13. [13]

    Artificial intelligence for diagnosis and gleason grading of prostate cancer: the panda challenge.Nature medicine, 28(1):154–163, 2022

    Wouter Bulten, Kimmo Kartasalo, Po-Hsuan Cameron Chen, Peter Ström, Hans Pinckaers, Kunal Nagpal, Yuannan Cai, David F Steiner, Hester Van Boven, Robert Vink, et al. Artificial intelligence for diagnosis and gleason grading of prostate cancer: the panda challenge.Nature medicine, 28(1):154–163, 2022

  14. [14]

    Dietterich, Richard H

    Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles.Artificial Intelligence, 89(1):31–71, 1997

  15. [15]

    Transmil: Transformer based correlated multiple instance learning for whole slide image classification

    Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, and Yongbing Zhang. Transmil: Transformer based correlated multiple instance learning for whole slide image classification. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 213...

  16. [16]

    Chen, Chengkuan Chen, Yicong Li, Tiffany Y

    Richard J. Chen, Chengkuan Chen, Yicong Li, Tiffany Y . Chen, Andrew D. Trister, Rahul G. Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16144–16155, June 2022. 10

  17. [17]

    Multiple instance learning framework with masked hard instance mining for whole slide image classification

    Wenhao Tang, Sheng Huang, Xiaoxian Zhang, Fengtao Zhou, Yi Zhang, and Bo Liu. Multiple instance learning framework with masked hard instance mining for whole slide image classification. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4078–4087, October 2023

  18. [18]

    Linghan Cai, Shenjin Huang, Ye Zhang, Jinpeng Lu, and Yongbing Zhang. Attrimil: Revisiting attention- based multiple instance learning for whole-slide pathological image classification from a perspective of instance attributes.Medical Image Analysis, 103:103631, 2025

  19. [19]

    Attention- challenging multiple instance learning for whole slide image classification

    Yunlong Zhang, Honglin Li, Yunxuan Sun, Sunyi Zheng, Chenglu Zhu, and Lin Yang. Attention- challenging multiple instance learning for whole slide image classification. InEuropean conference on computer vision, pages 125–143. Springer, 2024

  20. [20]

    Lu, Bowen Chen, Drew F

    Ming Y . Lu, Bowen Chen, Drew F. K. Williamson, Richard J. Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, Anil V . Parwani, Andrew Zhang, and Faisal Mahmood. A visual-language foundation model for computational pathology.Nature Medicine, 30(3):863–874, 2024

  21. [21]

    Wright, Ari Robicsek, Brian Piening, Carlo Bifulco, Sheng Wang, and Hoifung Poon

    Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, Yanbo Xu, Mu Wei, Wenhui Wang, Shuming Ma, Furu Wei, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Jaylen Rosemon, Tucker Bower, Soohee Lee, Roshanthi Weerasinghe, Bill J. Wright, Ari Robicsek, Brian Piening, Carlo Bifulco, Shen...

  22. [22]

    Hongyi Wang, Luyang Luo, Fang Wang, Ruofeng Tong, Yen-Wei Chen, Hongjie Hu, Lanfen Lin, and Hao Chen. Rethinking multiple instance learning for whole slide image classification: A bag-level classifier is a good instance-level teacher.IEEE Transactions on Medical Imaging, 43(11):3964–3976, 2024

  23. [23]

    Concept bottleneck models

    Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 5338–5348. PMLR, 2020

  24. [24]

    Grad-cam: Visual explanations from deep networks via gradient-based localization

    Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017

  25. [25]

    Additive MIL: Intrinsically interpretable multiple instance learning for pathology

    Syed Ashar Javed, Dinkar Juyal, Harshith Padigela, Amaro Taylor-Weiner, Limin Yu, and aaditya prakash. Additive MIL: Intrinsically interpretable multiple instance learning for pathology. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022

  26. [26]

    Gupta, and Prateek Prasanna

    Saarthak Kapse, Pushpak Pati, Srijan Das, Jingwei Zhang, Chao Chen, Maria Vakalopoulou, Joel Saltz, Dimitris Samaras, Rajarsi R. Gupta, and Prateek Prasanna. SI-MIL: Taming deep MIL for self- interpretability in gigapixel histopathology. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11226–11237, 2024

  27. [27]

    Evaluating the visualization of what a deep neural network has learned.IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2017

    Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned.IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2017

  28. [28]

    xMIL: Insightful explanations for multiple instance learning in histopathology

    Julius Hense, Mina Jamshidi Idaji, Oliver Eberle, Thomas Schnake, Jonas Dippel, Laure Ciernik, Oliver Buchstab, Andreas Mock, Frederick Klauschen, and Klaus Robert Müller. xMIL: Insightful explanations for multiple instance learning in histopathology. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  29. [29]

    Batarseh

    Sai Gurrapu, Ajay Kulkarni, Lifu Huang, Ismini Lourentzou, and Feras A. Batarseh. Rationalization for explainable nlp: a survey.Frontiers in Artificial Intelligence, V olume 6 - 2023, 2023

  30. [30]

    Rationalizing neural predictions

    Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, Austin, Texas, November 2016. Association for Computational Linguistics

  31. [31]

    Interpretable neural predictions with differentiable binary variables

    Jasmijn Bastings, Wilker Aziz, and Ivan Titov. Interpretable neural predictions with differentiable binary variables. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977, Florence, Italy, July 2019. Association for Computational Linguistics

  32. [32]

    Boosting explainability through selective rationalization in pre-trained language models

    Libing Yuan, Shuaibo Hu, Kui Yu, and Le Wu. Boosting explainability through selective rationalization in pre-trained language models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V .1, page 1867–1878, 2025. 11

  33. [33]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  34. [34]

    Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In2016 23rd International Conference on Pattern Recognition (ICPR), pages 2464–2469. IEEE, 2016

  35. [35]

    Rethinking cooperative rationalization: Introspec- tive extraction and complement control

    Mo Yu, Shiyu Chang, Yang Zhang, and Tommi Jaakkola. Rethinking cooperative rationalization: Introspec- tive extraction and complement control. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4094–4103, Hong Kong, China, ...

  36. [36]

    Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

  37. [37]

    Plataniotis

    Linfeng Ye, Shayan Mohajer Hamidi, Zhixiang Chi, Guang Li, Mert Pilanci, Takahiro Ogawa, Miki Haseyama, and Konstantinos N. Plataniotis. ASMIL: Attention-stabilized multiple instance learning for whole-slide imaging. InThe Fourteenth International Conference on Learning Representations, 2026

  38. [38]

    Reamil: Reasoning- and evidence-aware multiple instance learning for whole-slide histopathology

    Hyun Do Jung, Jungwon Choi, and Hwiyoung Kim. Reamil: Reasoning- and evidence-aware multiple instance learning for whole-slide histopathology. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 40–45, March 2026

  39. [39]

    Maddison, Andriy Mnih, and Yee Whye Teh

    Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. InInternational Conference on Learning Representations, 2017

  40. [40]

    Categorical reparameterization with gumbel-softmax

    Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2017

  41. [41]

    sufficiency objective

    Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models. InBritish Machine Vision Conference (BMVC), 2018. A Qualitative Illustration This appendix shows where FOCI-selected tiles appear, in WSI context, relative to two attention/selection baselines on the same input bag. The figure is illustrative an...