Weakly Supervised Multicenter Nancy Index Scoring in Ulcerative Colitis Using Foundation Models
Pith reviewed 2026-05-08 06:28 UTC · model grok-4.3
The pith
A weakly supervised model using foundation models predicts the full five-grade Nancy histological index for ulcerative colitis biopsies from only slide- and case-level labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Weakly supervised multiple instance learning that leverages foundation model representations can learn to predict the complete five-grade Nancy histological index, along with related endpoints such as neutrophilic activity, when trained only on slide-level and case-level labels in a realistic multicenter cohort of colon biopsies.
What carries the argument
Weakly supervised multiple instance learning that aggregates patch embeddings from foundation model encoders into slide-level predictions, with ensembling to improve five-grade accuracy.
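The aggregation step can be illustrated with a minimal sketch of attention pooling over patch embeddings. This is a simplified linear-attention variant written for illustration only. The paper's actual aggregators, embedding sizes, and learned weights are not given here, so `attn_vector`, `class_weights`, and the toy numbers are all assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(patch_embeddings, attn_vector):
    """Aggregate patch embeddings into one slide embedding.

    patch_embeddings: list of equal-length float lists, one per patch.
    attn_vector: hypothetical learned vector that scores each patch.
    Returns (slide_embedding, attention_weights).
    """
    scores = [sum(w * x for w, x in zip(attn_vector, patch))
              for patch in patch_embeddings]
    weights = softmax(scores)
    dim = len(patch_embeddings[0])
    slide = [sum(a * patch[d] for a, patch in zip(weights, patch_embeddings))
             for d in range(dim)]
    return slide, weights

def predict_grade(slide_embedding, class_weights):
    """Linear head over the slide embedding; returns probabilities for grades 0-4."""
    logits = [sum(w * x for w, x in zip(row, slide_embedding))
              for row in class_weights]
    return softmax(logits)

# Toy bag: three patches with 2-dimensional embeddings (real encoders output far more).
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = [0.5, -0.5]                               # illustrative values, not learned
head = [[0.1 * g, -0.1 * g] for g in range(5)]   # illustrative five-class head
slide_vec, attn_weights = attention_pool(bag, attn)
grade_probs = predict_grade(slide_vec, head)
```

Only the slide-level label is needed to train such a model: the loss is computed on `grade_probs`, and the per-patch `attn_weights` fall out as a by-product that can be rendered as an attention map.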
If this is right
- Histologic assessment of ulcerative colitis can proceed with far less annotation effort than region-level labeling requires.
- The same pipeline yields both the full Nancy index and clinically useful groupings such as neutrophilic activity.
- Performance holds across data from multiple hospitals, supporting deployment in varied clinical environments.
- A simple ensembling step improves accuracy on the five-grade task relative to a hierarchical gating approach.
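The contrast in the last point can be made concrete with a small sketch. Neither rule is specified in this summary, so both functions below are assumptions: the gate is taken to be a binary neutrophilic-activity decision that routes to a low-grade or high-grade branch, and the ensemble is taken to be a plain average of five-class probabilities. The numbers are invented for illustration.

```python
def hierarchical_gate(p_active, low_probs, high_probs, threshold=0.5):
    """Hard gating: a binary model picks a branch, then the branch model picks the grade.

    p_active: probability of neutrophilic activity (assumed here to mean grade >= 2).
    low_probs: probabilities over grades [0, 1]; high_probs: over grades [2, 3, 4].
    """
    if p_active >= threshold:
        return 2 + max(range(3), key=lambda i: high_probs[i])
    return max(range(2), key=lambda i: low_probs[i])

def average_ensemble(prob_vectors):
    """Soft ensembling: average five-class probabilities across models, then argmax."""
    n = len(prob_vectors)
    mean = [sum(p[g] for p in prob_vectors) / n for g in range(5)]
    return max(range(5), key=lambda g: mean[g]), mean

# A borderline gate decision (0.49) commits to the low branch and cannot recover.
gated = hierarchical_gate(0.49, [0.3, 0.7], [0.8, 0.1, 0.1])

# Averaging keeps all five grades in play until the final argmax.
ensembled, mean_probs = average_ensemble([
    [0.05, 0.30, 0.45, 0.15, 0.05],
    [0.05, 0.35, 0.40, 0.15, 0.05],
])
```

One plausible reading of the reported result is that hard gating propagates early binary errors, whereas averaging does not. The sketch shows the mechanism only and says nothing about the size of the effect in the paper.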
Where Pith is reading between the lines
- The same weak-supervision pattern could be tested on other inflammatory bowel disease scoring systems or on biopsies from additional organs.
- Larger or pathology-specific foundation models might further reduce the performance gap to fully supervised methods.
- Integration into existing digital pathology systems could make standardized Nancy scoring available at the point of care.
Load-bearing premise
Slide-level and case-level NHI labels alone contain enough signal for the model to learn and predict the fine-grained histological features such as neutrophilic activity.
What would settle it
Testing the trained model on an independent set of biopsies from a fourth hospital that uses different staining protocols or scanners and finding that five-grade NHI accuracy falls substantially below the reported multicenter level.
Original abstract
Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of H&E-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a weakly supervised multiple instance learning (MIL) approach for whole-slide H&E images of ulcerative colitis biopsies that trains on slide- and case-level Nancy histological index (NHI) labels, leverages foundation-model encoders, and targets full five-grade NHI prediction along with neutrophilic activity and derived low/high groupings. It evaluates multiple encoders and aggregation strategies on a three-center multicenter cohort (2019-2025) and reports that encoder choice (e.g., Virchow2) and a simple ensembling rule improve performance over a hierarchical gating baseline.
Significance. If the quantitative results and interpretability analyses hold, the work would be significant for computational pathology: it shows that modern foundation models combined with standard MIL can deliver clinically relevant UC histology scoring without dense region-level annotations, which is especially valuable in heterogeneous multicenter settings where annotation cost and observer variability are high.
major comments (2)
- [Abstract] Abstract: the claims that foundation-model choice 'substantially affect[s] performance' and that ensembling 'improves five-grade NHI prediction' are load-bearing for the robustness conclusion, yet the abstract supplies no numerical metrics, error bars, data-split details, or baseline numbers, making independent verification of the multicenter claims impossible.
- [Results] Results (or equivalent evaluation section): the central claim that the MIL aggregator recovers the specific histological patterns (neutrophil density, distribution, intensity) defining each NHI grade rests on the assumption that slide- and case-level labels alone suffice; without reported evidence that attention/instance scores correlate with neutrophilic activity (rather than center-specific staining or scanner artifacts), the interpretability and cross-center robustness assertions cannot be assessed.
minor comments (1)
- [Abstract] Abstract: the phrase 'Nancy-low/high groupings' is used without a brief definition or reference to how the five-grade NHI is collapsed, which may reduce accessibility for readers outside the UC histology community.
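For readers unfamiliar with the grouping the referee asks about, the sketch below shows one way a five-grade NHI could be collapsed. In the Nancy index, grades 2 to 4 involve an acute inflammatory infiltrate or ulceration, so neutrophilic activity is mapped to grade 2 or higher. The cut-off for the low/high grouping is not defined in this summary, so `high_from=2` is an assumption and is exposed as a parameter.

```python
def _check(grade):
    """Reject anything that is not a valid Nancy grade (integers 0 through 4)."""
    if grade not in range(5):
        raise ValueError(f"Nancy grade must be an integer 0-4, got {grade!r}")

def neutrophilic_activity(grade):
    """Return True when the grade implies acute (neutrophilic) inflammation."""
    _check(grade)
    return grade >= 2

def nancy_group(grade, high_from=2):
    """Collapse a five-grade NHI into 'low' or 'high'.

    high_from is an assumed cut-off; the paper's definition may differ.
    """
    _check(grade)
    return "high" if grade >= high_from else "low"

groups = [nancy_group(g) for g in range(5)]
```

Keeping the cut-off as a parameter makes it easy to swap in the paper's definition once it is stated.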
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have made revisions to improve clarity and support for our claims.
- Referee: [Abstract] Abstract: the claims that foundation-model choice 'substantially affect[s] performance' and that ensembling 'improves five-grade NHI prediction' are load-bearing for the robustness conclusion, yet the abstract supplies no numerical metrics, error bars, data-split details, or baseline numbers, making independent verification of the multicenter claims impossible.
  Authors: We agree that the abstract should provide sufficient quantitative detail to substantiate the claims. In the revised manuscript, we have updated the abstract to include key performance metrics (e.g., accuracy and macro-F1 for five-grade NHI prediction with Virchow2 and ensembling versus the hierarchical gating baseline), along with a brief description of the patient-level multicenter data splits and the use of multiple random seeds for error bars. These additions enable independent verification while preserving the abstract's conciseness.
  Revision: yes
- Referee: [Results] Results (or equivalent evaluation section): the central claim that the MIL aggregator recovers the specific histological patterns (neutrophil density, distribution, intensity) defining each NHI grade rests on the assumption that slide- and case-level labels alone suffice; without reported evidence that attention/instance scores correlate with neutrophilic activity (rather than center-specific staining or scanner artifacts), the interpretability and cross-center robustness assertions cannot be assessed.
  Authors: This is a fair critique of the interpretability claims. The original manuscript provided qualitative attention visualizations but did not include quantitative correlation with neutrophilic features or explicit controls for center-specific artifacts. We have added a new paragraph in the Results section with per-center performance breakdowns to demonstrate robustness, plus expanded qualitative examples of attention maps overlaid on slides where high-attention regions align with pathologist-identified neutrophilic areas. A full quantitative correlation study would require dense pixel-level neutrophil annotations beyond the slide-level labels available in our dataset; we have therefore noted this as a limitation and outlined it as future work.
  Revision: partial
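The per-center breakdown mentioned in the rebuttal could be computed along the following lines. This is a sketch under stated assumptions: the center names, the record layout, and the toy predictions are hypothetical, and classes that never appear in a center are scored as 0, which is one convention among several.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred, labels=(0, 1, 2, 3, 4)):
    """Unweighted mean of per-class F1 over the five NHI grades.

    A class with no true and no predicted samples contributes 0.
    """
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def per_center_macro_f1(records):
    """records: iterable of (center, true_grade, predicted_grade) tuples."""
    by_center = defaultdict(lambda: ([], []))
    for center, true_grade, pred_grade in records:
        by_center[center][0].append(true_grade)
        by_center[center][1].append(pred_grade)
    return {c: macro_f1(t, p) for c, (t, p) in by_center.items()}

# Hypothetical records: center "A" is predicted perfectly, center "B" confuses grade 1 with 0.
records = [("A", g, g) for g in range(5)] + [
    ("B", 0, 0), ("B", 1, 0), ("B", 2, 2), ("B", 3, 3), ("B", 4, 4),
]
breakdown = per_center_macro_f1(records)
```

A large gap between centers in such a table would point toward the staining or scanner effects the referee raises. Similar scores across centers are consistent with robustness but do not by themselves show that attention tracks neutrophils.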
Circularity Check
No circularity; standard MIL evaluation on external labels
The paper describes a weakly supervised MIL pipeline that ingests pre-trained foundation-model embeddings of whole-slide images and aggregates them to predict slide- and case-level NHI grades. All reported metrics are computed against held-out human-assigned NHI labels from three independent centers; no equation, aggregation rule, or performance number is obtained by fitting a parameter to the target quantity and then re-labeling it a prediction. No self-citation is invoked to establish uniqueness or to forbid alternative architectures. The derivation chain therefore remains self-contained and externally falsifiable.
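Whether labels are truly held out depends on how slides are split, because several slides can come from the same case. A minimal sketch of a case-grouped split follows. The field names `slide_id` and `case_id`, the 20% test fraction, and the seed are assumptions for illustration, not the paper's protocol.

```python
import random

def split_by_case(slides, test_fraction=0.2, seed=0):
    """Split slides so that every slide from a given case lands on the same side.

    slides: list of dicts with hypothetical keys 'slide_id' and 'case_id'.
    """
    case_ids = sorted({s["case_id"] for s in slides})
    rng = random.Random(seed)          # seeded for reproducible splits
    rng.shuffle(case_ids)
    n_test = max(1, round(len(case_ids) * test_fraction))
    test_cases = set(case_ids[:n_test])
    train = [s for s in slides if s["case_id"] not in test_cases]
    test = [s for s in slides if s["case_id"] in test_cases]
    return train, test

# Ten toy slides from five cases, two slides per case.
slides = [{"slide_id": f"s{i}", "case_id": f"c{i // 2}"} for i in range(10)]
train_slides, test_slides = split_by_case(slides)
```

Repeating the split with different seeds gives the spread behind error bars of the kind the authors describe. A leave-one-center-out variant would group by hospital instead of by case.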
Axiom & Free-Parameter Ledger
free parameters (2)
- Foundation model encoder selection
- MIL aggregation strategy
axioms (1)
- Domain assumption: Multiple instance learning can learn accurate slide-level predictions from bag-level labels alone when using rich patch embeddings from foundation models.