arxiv: 2605.00894 · v1 · submitted 2026-04-27 · 💻 cs.CV

Dino-NestedUNet: Unlocking Foundation Vision Encoders for Pathology Tumor Bulk Segmentation via Dense Decoding

Tianyang Wang, Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Usman Afzaal, Lina Gokhale, Charles Rabolli, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi

Pith reviewed 2026-05-09 20:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords Dino-NestedUNet · pathology segmentation · dense decoder · DINOv3 · tumor bulk segmentation · zero-shot evaluation · histopathology · feature reuse

The pith

A dense grid of pathways in the decoder lets a frozen DINOv3 encoder recover fine tumor boundaries in pathology slides by reusing features across scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Dino-NestedUNet to address the mismatch between rich but frozen vision foundation encoders and the need for precise boundary reconstruction in tumor segmentation. It replaces sparse skip connections with a Nested Dense Decoder that creates many interconnected pathways, allowing ongoing reuse and recalibration of features from high-level semantics down to low-level tissue textures. This setup is tested on three different histopathology collections and shows gains over both classic UNet++ and simpler Dino-UNet baselines, including when the model is applied directly to entirely new cohorts without retraining.

Core claim

Dino-NestedUNet couples a pre-trained DINOv3 encoder with a Nested Dense Decoder that forms a dense grid of intermediate pathways to enable continuous feature reuse and multi-scale recalibration, aligning high-level semantics with low-level morphological textures during reconstruction and yielding consistent improvements over UNet++ and standard Dino-UNet variants on three histopathology cohorts.

What carries the argument

The Nested Dense Decoder, which replaces sparse skips with a dense grid of pathways for repeated feature fusion and multi-resolution recalibration during upsampling.
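
To make the "dense grid of pathways" concrete, the sketch below enumerates which nodes feed each decoder node. It assumes UNet++-style wiring, since the paper's exact connectivity is not given on this page; the function names `nested_grid` and `sparse_unet` are illustrative and do not come from the paper.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (scale index i, column index j); column 0 = encoder-derived features


def nested_grid(depth: int) -> Dict[Node, List[Node]]:
    """Map each decoder node (i, j), j >= 1, to the nodes that feed it.

    Assumed UNet++-style wiring: node (i, j) fuses every earlier node at
    the same scale, (i, 0) .. (i, j-1), plus the upsampled node (i+1, j-1).
    """
    wiring: Dict[Node, List[Node]] = {}
    for j in range(1, depth):
        for i in range(depth - j):
            same_scale = [(i, k) for k in range(j)]
            from_below = (i + 1, j - 1)
            wiring[(i, j)] = same_scale + [from_below]
    return wiring


def sparse_unet(depth: int) -> Dict[Node, List[Node]]:
    """Plain U-Net wiring for comparison: one skip connection per scale."""
    wiring: Dict[Node, List[Node]] = {}
    for j in range(1, depth):
        i = depth - 1 - j
        wiring[(i, j)] = [(i, 0), (i + 1, j - 1)]
    return wiring


if __name__ == "__main__":
    for name, build in (("nested", nested_grid), ("sparse", sparse_unet)):
        grid = build(4)
        edges = sum(len(inputs) for inputs in grid.values())
        print(name, "nodes:", len(grid), "connections:", edges)
```

Under this assumed wiring, a four-scale grid has 6 decoder nodes and 16 incoming connections, against 3 nodes and 6 connections for the sparse layout. The extra fusion steps are where both the feature reuse and the added compute come from.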

If this is right

The architecture produces higher segmentation accuracy than UNet++ or plain Dino-UNet on multi-center and single-institution pathology data.
Performance remains higher when the model encounters images from different sources or staining protocols.
Zero-shot testing on unseen cohorts such as TIGER WSIBULK and OSU CRC shows usable results without any further training.
Boundary fidelity improves for infiltrative tumors because semantic features stay aligned with local morphological detail throughout reconstruction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dense-pathway idea could be tried with other frozen vision foundation models to check whether the benefit is specific to DINOv3.
If the gains hold, the method might lower the amount of labeled pathology data needed for new segmentation tasks.
The approach may apply to other medical imaging problems where precise edge recovery matters more than overall classification.
Varying the density of the intermediate pathways could reveal trade-offs between accuracy and computational cost on large whole-slide images.

Load-bearing premise

That the reported gains come from the dense decoding structure itself rather than from training choices, hyperparameter tuning, or the particular makeup of the three evaluation cohorts.

What would settle it

Re-train the identical DINOv3 encoder on the same data using a standard UNet-style decoder versus the Nested Dense Decoder while holding every other setting fixed, then measure whether Dice or boundary-error metrics differ consistently on the held-out test portions of the CHTN, OSU, and CAMELYON16 sets.
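
The comparison described above reduces to scoring both decoders on the same held-out images and looking at the per-image difference. A minimal sketch of the Dice part follows, assuming flattened binary masks; `paired_dice_delta` is a hypothetical helper, and boundary-error metrics are not covered here.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for flattened binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / total


def paired_dice_delta(
    preds_a: Sequence[Sequence[int]],
    preds_b: Sequence[Sequence[int]],
    targets: Sequence[Sequence[int]],
) -> float:
    """Mean per-image Dice difference (model A minus model B) on shared test images."""
    deltas = [dice(a, t) - dice(b, t) for a, b, t in zip(preds_a, preds_b, targets)]
    return sum(deltas) / len(deltas)
```

A consistently positive delta across cohorts, with every other training setting held fixed, would point to the decoder rather than to incidental differences.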

Figures

Figures reproduced from arXiv: 2605.00894 by Abdul Rehman Akbar, Anil Parwani, Charles Rabolli, Lina Gokhale, Muhammad Khalid Khan Niazi, Tianyang Wang, Usama Sajjad, Usman Afzaal, Wei Chen, Ziyu Su.

**Figure 2.** Comparative segmentation results on CHTN (rows 1–2), OSU (rows …

**Figure 3.** Representative whole-slide segmentation results on the CHTN co…

**Figure 4.** Distribution of Dice Similarity Coefficients.

Original abstract

Vision foundation models (VFMs), such as DINOv3, provide rich semantic representations that are promising for computational pathology. However, many current adaptations pair frozen VFMs with lightweight decoders, creating a capacity mismatch that often limits boundary fidelity for infiltrative tumor bulk segmentation. This paper presents Dino-NestedUNet, a framework that couples a pre-trained DINOv3 encoder with a Nested Dense Decoder. Instead of sparse skip connections and linear upsampling, the proposed decoder forms a dense grid of intermediate pathways to enable continuous feature reuse and multi-scale recalibration, aligning high-level semantics with low-level morphological textures during reconstruction. We evaluate Dino-NestedUNet on three histopathology cohorts (multi-center CHTN, institutional OSU, and CAMELYON16) and observe consistent improvements over UNet++ and standard Dino-UNet variants, particularly under cross-domain shift. To further assess external generalization, we perform zero-shot evaluation by training on CHTN and directly testing on unseen TIGER WSIBULK and OSU CRC cohorts without fine-tuning. These results suggest that dense decoding is a key ingredient for unlocking foundation encoders in boundary-sensitive pathology segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Dino-NestedUNet, which couples a frozen pre-trained DINOv3 vision foundation model encoder with a Nested Dense Decoder. The decoder replaces sparse skip connections with a dense grid of intermediate pathways to enable continuous feature reuse and multi-scale recalibration, aligning high-level semantics with low-level morphological textures. The framework is evaluated on three histopathology cohorts (multi-center CHTN, institutional OSU, CAMELYON16) with reported consistent improvements over UNet++ and standard Dino-UNet variants, plus zero-shot transfer by training on CHTN and testing on unseen TIGER WSIBULK and OSU CRC cohorts without fine-tuning. The central claim is that dense decoding is a key ingredient for unlocking foundation encoders in boundary-sensitive pathology tumor bulk segmentation.

Significance. If the performance gains can be rigorously attributed to the dense decoding architecture, the work would be significant for computational pathology by addressing the capacity mismatch between powerful frozen VFMs and lightweight decoders, potentially improving boundary fidelity for infiltrative tumors. The zero-shot cross-cohort evaluation is a notable strength that supports claims of external generalization. However, the absence of isolated ablations means the result's attribution remains unverified, limiting immediate impact on the field.

major comments (1)

[Abstract and Experimental Evaluation] The central claim states that the Nested Dense Decoder's dense grid of intermediate pathways (continuous reuse + multi-scale recalibration) produces the observed improvements over UNet++ and Dino-UNet baselines, especially under cross-domain shift and zero-shot transfer (Abstract). No ablation is described that holds all other variables fixed—including loss function, optimizer, augmentations, learning-rate schedule, batch size, and precise DINOv3 layer selection—while varying only decoder connectivity. Without this isolation, the performance deltas cannot be confidently attributed to the claimed feature-reuse mechanism rather than incidental implementation or hyperparameter differences. This directly undermines the assertion that dense decoding is the 'key ingredient'.

minor comments (1)

[Abstract] The abstract describes 'consistent improvements' and 'zero-shot evaluation' but reports no quantitative metrics, Dice/IoU scores, error bars, or statistical tests. Including these (with cohort-specific values) would strengthen the summary and allow immediate assessment of effect sizes.
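
One way to report the effect sizes this comment asks for is a per-cohort mean with a spread and a bootstrap interval over per-image Dice scores. The sketch below is an assumption about how such a summary could be computed, not the authors' procedure; `summarize` and its parameters are hypothetical.

```python
import random
from statistics import fmean, stdev
from typing import Dict, Sequence


def summarize(
    scores: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Dict[str, object]:
    """Mean, sample standard deviation, and percentile-bootstrap CI of the mean.

    Requires at least two scores (the sample standard deviation is undefined otherwise).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample images with replacement and record the mean of each resample.
    boots = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return {"mean": fmean(scores), "std": stdev(scores), "ci": (lo, hi)}
```

Calling `summarize` once per cohort and per model gives cohort-specific values that can be placed next to each other in a results table.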

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of dense decoding for foundation encoders in pathology segmentation. We address the major comment below and will strengthen the experimental section accordingly.

Referee: [Abstract and Experimental Evaluation] The central claim states that the Nested Dense Decoder's dense grid of intermediate pathways (continuous reuse + multi-scale recalibration) produces the observed improvements over UNet++ and Dino-UNet baselines, especially under cross-domain shift and zero-shot transfer (Abstract). No ablation is described that holds all other variables fixed—including loss function, optimizer, augmentations, learning-rate schedule, batch size, and precise DINOv3 layer selection—while varying only decoder connectivity. Without this isolation, the performance deltas cannot be confidently attributed to the claimed feature-reuse mechanism rather than incidental implementation or hyperparameter differences. This directly undermines the assertion that dense decoding is the 'key ingredient'.

Authors: We agree that rigorous isolation of the decoder connectivity is essential to substantiate the central claim. Our current Dino-UNet baseline shares the identical frozen DINOv3 encoder and training protocol with Dino-NestedUNet, providing partial evidence that the gains arise from the decoder design rather than encoder differences. However, we acknowledge that not every hyperparameter (e.g., exact layer selection or augmentation details) was exhaustively matched in a single controlled experiment. In the revised manuscript we will add a dedicated ablation that fixes the encoder, loss function, optimizer, augmentations, learning-rate schedule, batch size, and DINOv3 layer selection while varying only the decoder architecture (standard sparse skip connections versus the proposed dense grid). This will directly test whether the continuous feature reuse and multi-scale recalibration account for the observed improvements, including under cross-domain and zero-shot settings. We believe this addition will address the attribution concern without altering the overall conclusions.

revision: yes
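
The controlled ablation described in this exchange can be expressed as two run configurations that differ in exactly one field. The sketch below shows that idea with a frozen dataclass. Every field value is a placeholder chosen for illustration, not a setting reported by the paper, and `RunConfig`, `ablation_pair`, and `differing_fields` are hypothetical names.

```python
from dataclasses import asdict, dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class RunConfig:
    decoder: str                                   # the only field the ablation varies
    encoder: str = "dinov3-frozen"                 # placeholder label
    loss: str = "dice+bce"                         # placeholder
    optimizer: str = "adam"                        # placeholder
    lr: float = 1e-4                               # placeholder
    lr_schedule: str = "cosine"                    # placeholder
    batch_size: int = 8                            # placeholder
    augmentations: Tuple[str, ...] = ("flip", "rotate")   # placeholder
    encoder_layers: Tuple[int, ...] = (2, 5, 8, 11)       # placeholder
    seed: int = 0


def ablation_pair(base: RunConfig) -> Tuple[RunConfig, RunConfig]:
    """Return the baseline and a copy that changes only the decoder."""
    return base, replace(base, decoder="nested_dense")


def differing_fields(a: RunConfig, b: RunConfig) -> List[str]:
    """List the field names on which two configurations disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


if __name__ == "__main__":
    sparse, dense = ablation_pair(RunConfig(decoder="sparse_skip"))
    # Guard against accidental drift in any other setting before training starts.
    assert differing_fields(sparse, dense) == ["decoder"]
```

Checking `differing_fields` before launching runs is a cheap way to guarantee that any metric gap is attributable to decoder connectivity alone.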

Circularity Check

0 steps flagged

No circularity detected in derivation chain

The manuscript proposes an architectural modification (Nested Dense Decoder coupled to a frozen DINOv3 encoder) and supports its utility solely through empirical comparisons on three histopathology cohorts plus zero-shot transfer tests. No equations, first-principles derivations, fitted-parameter predictions, or self-citation chains appear in the provided text; performance deltas are presented as direct experimental outcomes rather than reductions to inputs by construction. The central claim therefore remains self-contained and externally falsifiable via replication on the stated datasets.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that pre-trained DINOv3 representations transfer effectively to pathology when paired with dense decoding; no new physical entities or mathematical axioms are introduced beyond standard deep-learning assumptions.

free parameters (1)

Nested Dense Decoder connection density and scale choices
Architecture hyperparameters that define the dense grid of pathways and recalibration operations, tuned to achieve the reported performance.

axioms (1)

domain assumption: Frozen DINOv3 encoder provides rich semantic representations suitable for histopathology segmentation
Invoked in the abstract as the basis for coupling with the decoder.

Reference graph

Works this paper leans on

13 extracted references · 6 canonical work pages · 3 internal anchors

[1] Balezo, G., Feki, H., Bourgade, R., Monnier, L., Blons, M., Blondel, A., Decencière, E., Planas, A.P., Walter, T.: Efficient fine-tuning of dinov3 pretrained on natural images for atypical mitotic figure classification (midog 2025 task 2 winner). arXiv preprint arXiv:2508.21041 (2025)

[2] Bejnordi, B.E., Veta, M., Van Diest, P.J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J.A., Hermsen, M., Manson, Q.F., Balkenhol, M., et al.: Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318(22), 2199–2210 (2017)

[3] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F., Jaume, G., Song, A.H., Chen, B., Zhang, A., Shao, D., Shaban, M., et al.: Towards a general-purpose foundation model for computational pathology. Nature Medicine 30(3), 850–862 (2024)

[4] Gao, Y., Li, H., Yuan, F., Wang, X., Gao, X.: Dino u-net: Exploiting high-fidelity dense features from foundation models for medical image segmentation. arXiv preprint arXiv:2508.20909 (2025)

[5] Graham, S., Vu, Q.D., Jahanifar, M., Weigert, M., Schmidt, U., Zhang, W., Zhang, J., Yang, S., Xiang, J., Wang, X., et al.: Conic challenge: Pushing the frontiers of nuclear detection, segmentation, classification and counting. Medical Image Analysis 92, 103047 (2024)

[6] Kingma, D.P.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

[7] National Cancer Institute: Cooperative human tissue network (CHTN). https://www.chtn.org (2024)

[8] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)

[9] Salgado, R., Denkert, C., Demaria, S., Sirtaine, N., Klauschen, F., Pruneri, G., Wienert, S., Van den Eynden, G., Baehner, F.L., Pénault-Llorca, F., et al.: The evaluation of tumor-infiltrating lymphocytes (tils) in breast cancer: recommendations by an international tils working group 2014. Annals of Oncology 26(2), 259–271 (2015)

[10] Shephard, A., Jahanifar, M., Wang, R., Dawood, M., Graham, S., Sidlauskas, K., Khurram, S.A., Rajpoot, N., Raza, S.E.A.: Tiager: Tumor-infiltrating lymphocyte scoring in breast cancer for the tiger challenge. arXiv preprint arXiv:2206.11943 (2022)

[11] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

[12] Xu, K., Chiou, E., Varamesh, A., Acqualagna, L., Rajpoot, N.: Tissue aware nuclei detection and classification model for histopathology images. arXiv preprint arXiv:2511.13615 (2025)

[13] Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: International workshop on deep learning in medical image analysis. pp. 3–11. Springer (2018)