Recognition: 2 theorem links · Lean Theorem
Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction
Pith reviewed 2026-05-12 00:53 UTC · model grok-4.3
The pith
A convolutional masked-diffusion model outperforms ViT-based pathology foundation models on cell-level dense prediction tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. The resulting model consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods across multiple pathology dense prediction tasks, while fine-tuning only a small number of task-specific parameters; the advantage is most pronounced under limited annotation settings.
What carries the argument
The CMD framework: a ConvNeXt-UNet backbone that conducts masked diffusion pretraining directly in pixel space while using adaptive normalization to integrate features from frozen pathology foundation models.
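As an illustration of the adaptive-normalization mechanism, the sketch below shows an AdaLN-style layer in which a frozen foundation-model embedding modulates normalized convolutional features through learned per-channel scale and shift. All names and shapes here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def adaptive_norm(conv_feats, guidance, w_scale, w_shift, eps=1e-5):
    """AdaLN-style feature injection (illustrative sketch, not CMD's exact layer).

    conv_feats: (C, H, W) convolutional feature map.
    guidance:   (D,) frozen foundation-model embedding for the same region.
    w_scale, w_shift: (C, D) learned projections mapping the guidance vector
    to per-channel scale/shift; these are the only trained parameters here.
    """
    # Normalize each channel of the convolutional features.
    mu = conv_feats.mean(axis=(1, 2), keepdims=True)
    var = conv_feats.var(axis=(1, 2), keepdims=True)
    normed = (conv_feats - mu) / np.sqrt(var + eps)
    # The frozen embedding modulates each channel via scale and shift.
    gamma = (w_scale @ guidance)[:, None, None]  # (C, 1, 1)
    beta = (w_shift @ guidance)[:, None, None]
    return (1.0 + gamma) * normed + beta
```

With the scale and shift projections at zero, the layer reduces to plain channel normalization, which is consistent with the paper's claim that only a small number of task-specific parameters need fine-tuning.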
If this is right
- CMD achieves leading performance on multiple cell-level dense prediction tasks while requiring only minimal task-specific fine-tuning.
- The performance gap widens under limited annotation regimes, demonstrating improved robustness and generalization.
- Purely convolutional architectures can function as competitive pathology foundation models within the current ViT-dominated setting.
- The approach supplies a scalable pretraining recipe that maintains spatial continuity for fine-grained histological understanding.
Where Pith is reading between the lines
- The same pixel-space masked diffusion strategy could be tested on other dense-prediction domains where spatial continuity matters, such as electron microscopy or satellite imagery.
- Hybrid models that combine CMD-style pretraining with selective ViT components might further improve results on tasks that need both local detail and long-range context.
- If the convolutional advantage holds, future pathology foundation models may shift away from exclusive reliance on transformer tokenization for segmentation-heavy applications.
Load-bearing premise
That masked-diffusion pretraining performed in pixel space with a convolutional backbone preserves histological structural priors and local morphological details better than the patch tokenization used by vision transformers.
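The distinction this premise draws can be made concrete: in pixel-space masked diffusion, corruption is applied pointwise, so every unmasked pixel survives exactly and local morphology outside the mask is untouched, whereas patch tokenization re-encodes the entire image. A minimal sketch, with the mask granularity and noise schedule as assumptions:

```python
import numpy as np

def masked_diffusion_corrupt(img, mask_ratio=0.6, sigma=0.5, patch=16, rng=None):
    """Pixel-space masked-diffusion corruption (illustrative; the paper's
    exact schedule and mask granularity are assumptions here).

    img: (H, W) array in [0, 1]. Masked regions are replaced by a noised
    version drawn from a diffusion-style forward process; unmasked pixels
    stay bit-exact, preserving local structure outside the mask.
    """
    rng = np.random.default_rng(rng)
    h, w = img.shape
    # Block-wise mask, expanded to pixel resolution.
    grid = rng.random((h // patch, w // patch)) < mask_ratio
    mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
    # Forward-process corruption applied only under the mask.
    noised = np.sqrt(1 - sigma**2) * img + sigma * rng.standard_normal(img.shape)
    return np.where(mask, noised, img), mask
```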
What would settle it
On a held-out pathology dataset with fine cell boundaries and strong domain shift, fine-tune both CMD and a comparable ViT model with the same number of task-specific parameters and measure whether CMD still yields higher segmentation accuracy.
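The matched-budget comparison described above presupposes a way to verify that both models fine-tune the same number of task-specific parameters. A minimal sketch of that check (function names and the tolerance are illustrative):

```python
import numpy as np

def head_param_count(shapes):
    """Parameter count of a fine-tuning head, given its weight-tensor shapes."""
    return int(sum(np.prod(s) for s in shapes))

def matched_budget(shapes_a, shapes_b, tol=0.05):
    """True if two heads are within `tol` relative size of each other,
    the precondition for a fair CMD-vs-ViT comparison."""
    a, b = head_param_count(shapes_a), head_param_count(shapes_b)
    return abs(a - b) / max(a, b) <= tol
```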
Original abstract
Cell-level dense prediction is central to computational pathology, but remains challenging due to fine-grained histological structures, strong domain shifts, and costly dense annotations. Existing ViT-based pathology foundation models rely on patch tokenization, which can disrupt spatial continuity and weaken local morphological details needed for cell-level prediction. To address this, we propose Masked-Diffusion Convolutional Foundation Models, termed ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework for dense pathology representation learning. CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. Experimental results demonstrate that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters across multiple pathology dense prediction tasks. The advantage is particularly pronounced under limited annotation settings, where CMD exhibits stronger robustness and generalization ability. Our findings suggest that purely convolutional architectures can also serve as competitive pathology foundation models for cell-level dense prediction, achieving leading performance within the current ViT-dominated paradigm and providing a scalable, high-performance solution that better preserves histological structural priors for fine-grained pathology understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework using a fully convolutional ConvNeXt-UNet backbone. It performs masked-diffusion pretraining directly in pixel space and incorporates frozen pathology foundation model features via adaptive normalization layers. The central claim is that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods across multiple cell-level dense prediction tasks in pathology, with particular advantages under limited annotations, while suggesting that purely convolutional architectures can serve as competitive foundation models that better preserve histological structural priors.
Significance. If the performance gains can be isolated to the proposed convolutional masked-diffusion pretraining, the work would be significant for computational pathology by providing an alternative to the dominant ViT/patch-tokenization paradigm for fine-grained tasks. It highlights potential benefits of pixel-space generative pretraining and convolutional inductive biases for spatial continuity and low-data robustness, offering a scalable path for dense prediction without heavy reliance on patch-based tokenization.
Major comments (1)
- [Abstract and Methods] The claim that CMD demonstrates 'purely convolutional architectures can also serve as competitive pathology foundation models' is undermined by the explicit incorporation of frozen pathology foundation model features (almost certainly ViT-derived) through adaptive normalization. Without an ablation that removes the ViT-feature injection, retrains the pure ConvNeXt masked-diffusion backbone, and re-evaluates on the dense prediction tasks, it is impossible to attribute the reported outperformance specifically to the masked-diffusion objective and convolutional backbone rather than to distillation of ViT priors. This directly affects the central attribution of gains and the paper's positioning against ViT-based models.
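The requested ablation can be framed as a three-arm grid; the configuration names below are hypothetical, intended only to show what would settle the attribution:

```python
# Hypothetical ablation grid for the attribution study; the configuration
# keys and variant names are assumptions, not the paper's API.
ABLATIONS = [
    {"name": "full_cmd",       "pretrain": "masked_diffusion", "vit_injection": True},
    {"name": "no_injection",   "pretrain": "masked_diffusion", "vit_injection": False},
    {"name": "injection_only", "pretrain": "none",             "vit_injection": True},
]

def attribution_resolved(results):
    """The gains attribute to the convolutional masked-diffusion objective
    only if the no-injection variant still beats the injection-only variant.
    `results` maps variant name -> task metric (e.g., Dice), higher is better."""
    return results["no_injection"] > results["injection_only"]
```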
Minor comments (1)
- [Abstract] The abstract would benefit from including at least one or two key quantitative metrics (e.g., Dice scores or mIoU improvements on specific datasets) to substantiate the claims of consistent outperformance and robustness under limited annotations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comment point by point below.
Point-by-point responses
- Referee: [Abstract and Methods] The claim that CMD demonstrates 'purely convolutional architectures can also serve as competitive pathology foundation models' is undermined by the explicit incorporation of frozen pathology foundation model features (almost certainly ViT-derived) through adaptive normalization. Without an ablation that removes the ViT-feature injection, retrains the pure ConvNeXt masked-diffusion backbone, and re-evaluates on the dense prediction tasks, it is impossible to attribute the reported outperformance specifically to the masked-diffusion objective and convolutional backbone rather than to distillation of ViT priors. This directly affects the central attribution of gains and the paper's positioning against ViT-based models.
Authors: We appreciate the referee's observation that the incorporation of frozen pathology foundation model features (typically ViT-derived) via adaptive normalization layers means the model is not entirely isolated from ViT priors. The CMD framework is nevertheless built around a fully convolutional ConvNeXt-UNet backbone whose masked-diffusion pretraining occurs directly in pixel space. This choice is motivated by the need to preserve spatial continuity and fine-grained morphological details that patch tokenization can disrupt. The adaptive normalization layers provide a lightweight mechanism for injecting high-level semantic guidance into the convolutional feature maps without replacing the backbone's core representational and predictive pathway. We agree that the current evidence does not fully isolate the contribution of the convolutional masked-diffusion objective from the injected priors. In the revised manuscript we will therefore add the requested ablation: a version of CMD trained without the ViT-feature injection, followed by re-evaluation on the cell-level dense prediction tasks. This will allow clearer attribution of gains to the proposed pretraining and architecture while refining the paper's positioning.
Revision: yes
Circularity Check
No circularity; purely empirical claims with no derivation chain
Full rationale
The manuscript presents a new pretraining method (ConvNeXt-UNet with masked diffusion in pixel space plus adaptive normalization from frozen external features) and supports its claims exclusively via experimental comparisons on pathology dense-prediction benchmarks. No equations, parameter-fitting steps, or mathematical derivations appear in the abstract or described text. The central performance claims therefore cannot reduce to self-definitional inputs, fitted quantities renamed as predictions, or load-bearing self-citations. While the hybrid use of frozen ViT-derived features raises separate questions of attribution, that issue lies outside the circularity criteria (no reduction by construction is exhibited). The paper is self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Masked-diffusion pretraining in pixel space preserves spatial continuity better than patch-based tokenization for histological structures.
- Domain assumption: Frozen features from existing pathology foundation models can be effectively integrated via adaptive normalization without domain-shift issues.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization."
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "We propose Masked-Diffusion Convolutional Foundation Models... for cell-level dense prediction in pathology."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [2] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
- [3] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- [4] Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering, 5(6):555–570, 2021.
- [5] Xitong Ling, Minxi Ouyang, Yizhi Wang, Xinrui Chen, Renao Yan, Hongbo Chu, Junru Cheng, Tian Guan, Sufang Tian, Xiaoping Liu, et al. Agent aggregator with mask denoise mechanism for histopathology whole slide image analysis. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 2795–2803, 2024.
- [6] Xiangde Luo, Jinxi Xiang, Yuanfeng Ji, and Ruijiang Li. nnMIL: A generalizable multiple instance learning framework for computational pathology. arXiv preprint arXiv:2511.14907, 2025.
- [7] Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3):850–862, 2024.
- [8] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine, 30(10):2924–2935, 2024.
- [9] Eric Zimmermann, Eugene Vorontsov, Julian Viret, Adam Casson, Michal Zelechowski, George Shaikovski, Neil Tenenholtz, James Hall, David Klimstra, Razik Yousfi, et al. Virchow2: Scaling self-supervised mixed magnification models in pathology. arXiv preprint arXiv:2408.00738, 2024.
- [10] Fang Yan, Jianfeng Wu, Jiawen Li, Wei Wang, Yirong Chen, Linda Wei, Jiaxuan Lu, Wen Chen, Zizhao Gao, Jianan Li, et al. PathOrchestra: A comprehensive foundation model for computational pathology with over 100 diverse clinical-grade tasks. npj Digital Medicine, 8(1):695, 2025.
- [11] Alexandre Filiot, Ridouane Ghermi, Antoine Olivier, Paul Jacob, Lucas Fidon, Alice Mac Kain, Charlie Saillard, and Jean-Baptiste Schiratti. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv, 2023.
- [12] Alexandre Filiot, Paul Jacob, Alice Mac Kain, and Charlie Saillard. Phikon-v2, a large and public feature extractor for biomarker prediction. arXiv preprint arXiv:2409.09173, 2024.
- [13] Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature, 630(8015):181–188, 2024.
- [14] Dmitry Nechaev, Alexey Pchelnikov, and Ekaterina Ivanova. Hibou: A family of foundational vision transformers for pathology. arXiv preprint arXiv:2406.05074, 2024.
- [15] Nanne Aben, Edwin D de Jong, Ioannis Gatopoulos, Nicolas Känzig, Mikhail Karasikov, Axel Lagré, Roman Moser, Joost van Doorn, Fei Tang, et al. Towards large-scale training of pathology foundation models. arXiv preprint arXiv:2404.15217, 2024.
- [16] Lianghui Zhu, Xitong Ling, Minxi Ouyang, Xiaoping Liu, Tian Guan, Mingxi Fu, Zhiqiang Cheng, Fanglei Fu, Maomao Zeng, Liming Liu, et al. Subspecialty-specific foundation model for intelligent gastrointestinal pathology. arXiv preprint arXiv:2505.21928, 2025.
- [17] Jiawen Li, Jiali Hu, Xitong Ling, Yongqiang Lv, Yuxuan Chen, Yizhi Wang, Tian Guan, Yifei Liu, and Yonghong He. StainNet: A special staining self-supervised vision transformer for computational pathology. arXiv preprint arXiv:2512.10326, 2025.
- [18] Jiabo Ma, Zhengrui Guo, Fengtao Zhou, Yihui Wang, Yingxue Xu, Jinbang Li, Fang Yan, Yu Cai, Zhengjie Zhu, Cheng Jin, et al. A generalizable pathology foundation model using a unified knowledge distillation pretraining framework. Nature Biomedical Engineering, pages 1–20, 2025.
- [19] Mikhail Karasikov, Joost van Doorn, Nicolas Känzig, Melis Erdal Cesur, Hugo Mark Horlings, Robert Berke, Fei Tang, and Sebastian Otálora. Training state-of-the-art pathology foundation models with orders of magnitude less data. arXiv preprint arXiv:2504.05186, 2025.
- [20] Saarthak Kapse, Mehmet Aygün, Elijah Cole, Emma Lundberg, Le Song, and Eric P Xing. GenBio-PathFM: A state-of-the-art foundation model for histopathology. bioRxiv, pages 2026–03, 2026.
- [21] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical Twitter. Nature Medicine, 29(9):2307–2316, 2023.
- [22] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, 2024.
- [23] Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology. Nature Medicine, pages 1–13, 2025.
- [24] Jinxi Xiang, Xiyue Wang, Xiaoming Zhang, Yinghua Xi, Feyisope Eweje, Yijiang Chen, Yuchen Li, Colin Bergstrom, Matthew Gopaulchan, Ted Kim, et al. A vision–language foundation model for precision oncology. Nature, 638(8051):769–778, 2025.
- [25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [26] Shangke Liu, Mohamed Amgad, Deeptej More, Muhammad A Rathore, Roberto Salgado, and Lee AD Cooper. A panoptic segmentation dataset and deep-learning approach for explainable scoring of tumor-infiltrating lymphocytes. NPJ Breast Cancer, 10(1):52, 2024.
- [27] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
- [28] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022.
- [29] Zixuan Pan, Jianxu Chen, and Yiyu Shi. Masked diffusion as self-supervised representation learner. arXiv preprint arXiv:2308.05695, 2023.
- [30] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.
- [31] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
- [32] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [33] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
- [34] Quoc Dang Vu, Simon Graham, Tahsin Kurc, Minh Nguyen Nhat To, Muhammad Shaban, Talha Qaiser, Navid Alemi Koohbanani, Syed Ali Khurram, Jayashree Kalpathy-Cramer, Tianhao Zhao, et al. Methods for segmentation and classification of digital microscopy tissue images. Frontiers in Bioengineering and Biotechnology, 7:53, 2019.
- [35] Andrew Janowczyk and Anant Madabhushi. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of Pathology Informatics, 7(1):29, 2016.
- [36] Simon Graham, David Epstein, and Nasir Rajpoot. Dense steerable filter CNNs for exploiting rotational symmetry in histology images. IEEE Transactions on Medical Imaging, 39(12):4124–4136, 2020.
- [37] Peter Naylor, Marick Laé, Fabien Reyal, and Thomas Walter. Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Transactions on Medical Imaging, 38(2):448–459, 2018.
- [38] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [39] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [40] Peixian Liang, Songhao Li, Shunsuke Koga, Yutong Li, Zahra Alipour, Yucheng Tang, Daguang Xu, and Zhi Huang. VISTA-Path: An interactive foundation model for pathology image segmentation and quantitative analysis in computational pathology. arXiv preprint arXiv:2601.16451, 2026.
- [41] Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Open-vocabulary object segmentation with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7667–7676, 2023.
- [42] Dmitry Nechaev, Alexey Pchelnikov, and Ekaterina Ivanova. HistAI: an open-source, large-scale whole slide image dataset for computational pathology. arXiv preprint arXiv:2505.12120, 2025.

A Theoretical Overview of ConvNeXt Masked-Diffusion Models: Masked-diffusion pretraining can be viewed as a self-supervised relaxation of denoising diffusion models for...
Discussion (0)