Recognition: 2 theorem links · Lean Theorem
Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction
Pith reviewed 2026-05-12 00:53 UTC · model grok-4.3
The pith
A convolutional masked-diffusion model outperforms ViT-based pathology foundation models on cell-level dense prediction tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. The resulting model consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods across multiple pathology dense prediction tasks, while fine-tuning only a small number of task-specific parameters; the advantage is most pronounced under limited annotation settings.
What carries the argument
The CMD framework: a ConvNeXt-UNet backbone that conducts masked diffusion pretraining directly in pixel space while using adaptive normalization to integrate features from frozen pathology foundation models.
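As an illustration of the adaptive-normalization mechanism, the sketch below shows an AdaLN-style layer in which a frozen foundation-model embedding modulates normalized convolutional features through learned per-channel scale and shift. All names and shapes here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def adaptive_norm(conv_feats, guidance, w_scale, w_shift, eps=1e-5):
    """AdaLN-style feature injection (illustrative sketch, not CMD's exact layer).

    conv_feats: (C, H, W) convolutional feature map.
    guidance:   (D,) frozen foundation-model embedding for the same region.
    w_scale, w_shift: (C, D) learned projections mapping the guidance vector
    to per-channel scale/shift; these are the only trained parameters here.
    """
    # Normalize each channel of the convolutional features.
    mu = conv_feats.mean(axis=(1, 2), keepdims=True)
    var = conv_feats.var(axis=(1, 2), keepdims=True)
    normed = (conv_feats - mu) / np.sqrt(var + eps)
    # The frozen embedding modulates each channel via scale and shift.
    gamma = (w_scale @ guidance)[:, None, None]  # (C, 1, 1)
    beta = (w_shift @ guidance)[:, None, None]
    return (1.0 + gamma) * normed + beta
```

With the scale and shift projections at zero, the layer reduces to plain channel normalization, which is consistent with the paper's claim that only a small number of task-specific parameters need fine-tuning.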
If this is right
- CMD achieves leading performance on multiple cell-level dense prediction tasks while requiring only minimal task-specific fine-tuning.
- The performance gap widens under limited annotation regimes, demonstrating improved robustness and generalization.
- Purely convolutional architectures can function as competitive pathology foundation models within the current ViT-dominated setting.
- The approach supplies a scalable pretraining recipe that maintains spatial continuity for fine-grained histological understanding.
Where Pith is reading between the lines
- The same pixel-space masked diffusion strategy could be tested on other dense-prediction domains where spatial continuity matters, such as electron microscopy or satellite imagery.
- Hybrid models that combine CMD-style pretraining with selective ViT components might further improve results on tasks that need both local detail and long-range context.
- If the convolutional advantage holds, future pathology foundation models may shift away from exclusive reliance on transformer tokenization for segmentation-heavy applications.
Load-bearing premise
That masked-diffusion pretraining performed in pixel space with a convolutional backbone preserves histological structural priors and local morphological details better than the patch tokenization used by vision transformers.
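The distinction this premise draws can be made concrete: in pixel-space masked diffusion, corruption is applied pointwise, so every unmasked pixel survives exactly and local morphology outside the mask is untouched, whereas patch tokenization re-encodes the entire image. A minimal sketch, with the mask granularity and noise schedule as assumptions:

```python
import numpy as np

def masked_diffusion_corrupt(img, mask_ratio=0.6, sigma=0.5, patch=16, rng=None):
    """Pixel-space masked-diffusion corruption (illustrative; the paper's
    exact schedule and mask granularity are assumptions here).

    img: (H, W) array in [0, 1]. Masked regions are replaced by a noised
    version drawn from a diffusion-style forward process; unmasked pixels
    stay bit-exact, preserving local structure outside the mask.
    """
    rng = np.random.default_rng(rng)
    h, w = img.shape
    # Block-wise mask, expanded to pixel resolution.
    grid = rng.random((h // patch, w // patch)) < mask_ratio
    mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
    # Forward-process corruption applied only under the mask.
    noised = np.sqrt(1 - sigma**2) * img + sigma * rng.standard_normal(img.shape)
    return np.where(mask, noised, img), mask
```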
What would settle it
On a held-out pathology dataset with fine cell boundaries and strong domain shift, fine-tune both CMD and a comparable ViT model with the same number of task-specific parameters and measure whether CMD still yields higher segmentation accuracy.
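The matched-budget comparison described above presupposes a way to verify that both models fine-tune the same number of task-specific parameters. A minimal sketch of that check (function names and the tolerance are illustrative):

```python
import numpy as np

def head_param_count(shapes):
    """Parameter count of a fine-tuning head, given its weight-tensor shapes."""
    return int(sum(np.prod(s) for s in shapes))

def matched_budget(shapes_a, shapes_b, tol=0.05):
    """True if two heads are within `tol` relative size of each other,
    the precondition for a fair CMD-vs-ViT comparison."""
    a, b = head_param_count(shapes_a), head_param_count(shapes_b)
    return abs(a - b) / max(a, b) <= tol
```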
Original abstract
Cell-level dense prediction is central to computational pathology, but remains challenging due to fine-grained histological structures, strong domain shifts, and costly dense annotations. Existing ViT-based pathology foundation models rely on patch tokenization, which can disrupt spatial continuity and weaken local morphological details needed for cell-level prediction. To address this, we propose Masked-Diffusion Convolutional Foundation Models, termed ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework for dense pathology representation learning. CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. Experimental results demonstrate that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters across multiple pathology dense prediction tasks. The advantage is particularly pronounced under limited annotation settings, where CMD exhibits stronger robustness and generalization ability. Our findings suggest that purely convolutional architectures can also serve as competitive pathology foundation models for cell-level dense prediction, achieving leading performance within the current ViT-dominated paradigm and providing a scalable, high-performance solution that better preserves histological structural priors for fine-grained pathology understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework using a fully convolutional ConvNeXt-UNet backbone. It performs masked-diffusion pretraining directly in pixel space and incorporates frozen pathology foundation model features via adaptive normalization layers. The central claim is that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods across multiple cell-level dense prediction tasks in pathology, with particular advantages under limited annotations, while suggesting that purely convolutional architectures can serve as competitive foundation models that better preserve histological structural priors.
Significance. If the performance gains can be isolated to the proposed convolutional masked-diffusion pretraining, the work would be significant for computational pathology by providing an alternative to the dominant ViT/patch-tokenization paradigm for fine-grained tasks. It highlights potential benefits of pixel-space generative pretraining and convolutional inductive biases for spatial continuity and low-data robustness, offering a scalable path for dense prediction without heavy reliance on patch-based tokenization.
Major comments (1)
- [Abstract and Methods] The claim that CMD demonstrates 'purely convolutional architectures can also serve as competitive pathology foundation models' is undermined by the explicit incorporation of frozen pathology foundation model features (almost certainly ViT-derived) through adaptive normalization. Without an ablation that removes the ViT-feature injection, retrains the pure ConvNeXt masked-diffusion backbone, and re-evaluates on the dense prediction tasks, it is impossible to attribute the reported outperformance specifically to the masked-diffusion objective and convolutional backbone rather than to distillation of ViT priors. This directly affects the central attribution of gains and the paper's positioning against ViT-based models.
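The requested ablation can be framed as a three-arm grid; the configuration names below are hypothetical, intended only to show what would settle the attribution:

```python
# Hypothetical ablation grid for the attribution study; the configuration
# keys and variant names are assumptions, not the paper's API.
ABLATIONS = [
    {"name": "full_cmd",       "pretrain": "masked_diffusion", "vit_injection": True},
    {"name": "no_injection",   "pretrain": "masked_diffusion", "vit_injection": False},
    {"name": "injection_only", "pretrain": "none",             "vit_injection": True},
]

def attribution_resolved(results):
    """The gains attribute to the convolutional masked-diffusion objective
    only if the no-injection variant still beats the injection-only variant.
    `results` maps variant name -> task metric (e.g., Dice), higher is better."""
    return results["no_injection"] > results["injection_only"]
```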
Minor comments (1)
- [Abstract] The abstract would benefit from including at least one or two key quantitative metrics (e.g., Dice scores or mIoU improvements on specific datasets) to substantiate the claims of consistent outperformance and robustness under limited annotations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comment point by point below.
Point-by-point responses
- Referee: [Abstract and Methods] The claim that CMD demonstrates 'purely convolutional architectures can also serve as competitive pathology foundation models' is undermined by the explicit incorporation of frozen pathology foundation model features (almost certainly ViT-derived) through adaptive normalization. Without an ablation that removes the ViT-feature injection, retrains the pure ConvNeXt masked-diffusion backbone, and re-evaluates on the dense prediction tasks, it is impossible to attribute the reported outperformance specifically to the masked-diffusion objective and convolutional backbone rather than to distillation of ViT priors. This directly affects the central attribution of gains and the paper's positioning against ViT-based models.
Authors: We appreciate the referee's observation that the incorporation of frozen pathology foundation model features (typically ViT-derived) via adaptive normalization layers means the model is not entirely isolated from ViT priors. The CMD framework is nevertheless built around a fully convolutional ConvNeXt-UNet backbone whose masked-diffusion pretraining occurs directly in pixel space. This choice is motivated by the need to preserve spatial continuity and fine-grained morphological details that patch tokenization can disrupt. The adaptive normalization layers provide a lightweight mechanism for injecting high-level semantic guidance into the convolutional feature maps without replacing the backbone's core representational and predictive pathway. We agree that the current evidence does not fully isolate the contribution of the convolutional masked-diffusion objective from the injected priors. In the revised manuscript we will therefore add the requested ablation: a version of CMD trained without the ViT-feature injection, followed by re-evaluation on the cell-level dense prediction tasks. This will allow clearer attribution of gains to the proposed pretraining and architecture while refining the paper's positioning.
Revision: yes
Circularity Check
No circularity; purely empirical claims with no derivation chain
Full rationale
The manuscript presents a new pretraining method (ConvNeXt-UNet with masked diffusion in pixel space plus adaptive normalization from frozen external features) and supports its claims exclusively via experimental comparisons on pathology dense-prediction benchmarks. No equations, parameter-fitting steps, or mathematical derivations appear in the abstract or described text. The central performance claims therefore cannot reduce to self-definitional inputs, fitted quantities renamed as predictions, or load-bearing self-citations. While the hybrid use of frozen ViT-derived features raises separate questions of attribution, that issue lies outside the circularity criteria (no reduction by construction is exhibited). The paper is self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Masked-diffusion pretraining in pixel space preserves spatial continuity better than patch-based tokenization for histological structures.
- Domain assumption: Frozen features from existing pathology foundation models can be effectively integrated via adaptive normalization without domain-shift issues.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization."
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "We propose Masked-Diffusion Convolutional Foundation Models... for cell-level dense prediction in pathology."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [2] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
- [3] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- [4] Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering, 5(6):555–570, 2021.
- [5] Xitong Ling, Minxi Ouyang, Yizhi Wang, Xinrui Chen, Renao Yan, Hongbo Chu, Junru Cheng, Tian Guan, Sufang Tian, Xiaoping Liu, et al. Agent aggregator with mask denoise mechanism for histopathology whole slide image analysis. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 2795–2803, 2024.
- [6] Xiangde Luo, Jinxi Xiang, Yuanfeng Ji, and Ruijiang Li. nnMIL: A generalizable multiple instance learning framework for computational pathology. arXiv preprint arXiv:2511.14907, 2025.
- [7] Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3):850–862, 2024.
- [8] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine, 30(10):2924–2935, 2024.
- [9] Eric Zimmermann, Eugene Vorontsov, Julian Viret, Adam Casson, Michal Zelechowski, George Shaikovski, Neil Tenenholtz, James Hall, David Klimstra, Razik Yousfi, et al. Virchow2: Scaling self-supervised mixed magnification models in pathology. arXiv preprint arXiv:2408.00738, 2024.
- [10] Fang Yan, Jianfeng Wu, Jiawen Li, Wei Wang, Yirong Chen, Linda Wei, Jiaxuan Lu, Wen Chen, Zizhao Gao, Jianan Li, et al. PathOrchestra: A comprehensive foundation model for computational pathology with over 100 diverse clinical-grade tasks. npj Digital Medicine, 8(1):695, 2025.
- [11] Alexandre Filiot, Ridouane Ghermi, Antoine Olivier, Paul Jacob, Lucas Fidon, Alice Mac Kain, Charlie Saillard, and Jean-Baptiste Schiratti. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv, 2023.
- [12] Alexandre Filiot, Paul Jacob, Alice Mac Kain, and Charlie Saillard. Phikon-v2, a large and public feature extractor for biomarker prediction. arXiv preprint arXiv:2409.09173, 2024.
- [13] Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature, 630(8015):181–188, 2024.
- [14] Dmitry Nechaev, Alexey Pchelnikov, and Ekaterina Ivanova. Hibou: A family of foundational vision transformers for pathology. arXiv preprint arXiv:2406.05074, 2024.
- [15] Nanne Aben, Edwin D de Jong, Ioannis Gatopoulos, Nicolas Känzig, Mikhail Karasikov, Axel Lagré, Roman Moser, Joost van Doorn, Fei Tang, et al. Towards large-scale training of pathology foundation models. arXiv preprint arXiv:2404.15217, 2024.
- [16] Lianghui Zhu, Xitong Ling, Minxi Ouyang, Xiaoping Liu, Tian Guan, Mingxi Fu, Zhiqiang Cheng, Fanglei Fu, Maomao Zeng, Liming Liu, et al. Subspecialty-specific foundation model for intelligent gastrointestinal pathology. arXiv preprint arXiv:2505.21928, 2025.
- [17] Jiawen Li, Jiali Hu, Xitong Ling, Yongqiang Lv, Yuxuan Chen, Yizhi Wang, Tian Guan, Yifei Liu, and Yonghong He. StainNet: A special staining self-supervised vision transformer for computational pathology. arXiv preprint arXiv:2512.10326, 2025.
- [18] Jiabo Ma, Zhengrui Guo, Fengtao Zhou, Yihui Wang, Yingxue Xu, Jinbang Li, Fang Yan, Yu Cai, Zhengjie Zhu, Cheng Jin, et al. A generalizable pathology foundation model using a unified knowledge distillation pretraining framework. Nature Biomedical Engineering, pages 1–20, 2025.
- [19] Mikhail Karasikov, Joost van Doorn, Nicolas Känzig, Melis Erdal Cesur, Hugo Mark Horlings, Robert Berke, Fei Tang, and Sebastian Otálora. Training state-of-the-art pathology foundation models with orders of magnitude less data. arXiv preprint arXiv:2504.05186, 2025.
- [20] Saarthak Kapse, Mehmet Aygün, Elijah Cole, Emma Lundberg, Le Song, and Eric P Xing. GenBio-PathFM: A state-of-the-art foundation model for histopathology. bioRxiv, pages 2026–03, 2026.
- [21] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical Twitter. Nature Medicine, 29(9):2307–2316, 2023.
- [22] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, 2024.
- [23] Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology. Nature Medicine, pages 1–13, 2025.
- [24] Jinxi Xiang, Xiyue Wang, Xiaoming Zhang, Yinghua Xi, Feyisope Eweje, Yijiang Chen, Yuchen Li, Colin Bergstrom, Matthew Gopaulchan, Ted Kim, et al. A vision–language foundation model for precision oncology. Nature, 638(8051):769–778, 2025.
- [25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [26] Shangke Liu, Mohamed Amgad, Deeptej More, Muhammad A Rathore, Roberto Salgado, and Lee AD Cooper. A panoptic segmentation dataset and deep-learning approach for explainable scoring of tumor-infiltrating lymphocytes. NPJ Breast Cancer, 10(1):52, 2024.
- [27] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
- [28] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022.
- [29] Zixuan Pan, Jianxu Chen, and Yiyu Shi. Masked diffusion as self-supervised representation learner. arXiv preprint arXiv:2308.05695, 2023.
- [30] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.
- [31] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
- [32] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [33] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
- [34] Quoc Dang Vu, Simon Graham, Tahsin Kurc, Minh Nguyen Nhat To, Muhammad Shaban, Talha Qaiser, Navid Alemi Koohbanani, Syed Ali Khurram, Jayashree Kalpathy-Cramer, Tianhao Zhao, et al. Methods for segmentation and classification of digital microscopy tissue images. Frontiers in Bioengineering and Biotechnology, 7:53, 2019.
- [35] Andrew Janowczyk and Anant Madabhushi. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of Pathology Informatics, 7(1):29, 2016.
- [36] Simon Graham, David Epstein, and Nasir Rajpoot. Dense steerable filter CNNs for exploiting rotational symmetry in histology images. IEEE Transactions on Medical Imaging, 39(12):4124–4136, 2020.
- [37] Peter Naylor, Marick Laé, Fabien Reyal, and Thomas Walter. Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Transactions on Medical Imaging, 38(2):448–459, 2018.
- [38] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [39] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [40] Peixian Liang, Songhao Li, Shunsuke Koga, Yutong Li, Zahra Alipour, Yucheng Tang, Daguang Xu, and Zhi Huang. VISTA-Path: An interactive foundation model for pathology image segmentation and quantitative analysis in computational pathology. arXiv preprint arXiv:2601.16451, 2026.
- [41] Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Open-vocabulary object segmentation with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7667–7676, 2023.
- [42] Dmitry Nechaev, Alexey Pchelnikov, and Ekaterina Ivanova. HistAI: an open-source, large-scale whole slide image dataset for computational pathology. arXiv preprint arXiv:2505.12120, 2025.

A Theoretical Overview of ConvNeXt Masked-Diffusion Models: Masked-diffusion pretraining can be viewed as a self-supervised relaxation of denoising diffusion models for...
Discussion (0)