arxiv: 2606.00928 · v1 · pith:ZD4EMMIKnew · submitted 2026-05-30 · 💻 cs.CV · cs.LG

Single-Channel Tissue Segmentation via Cross-Modal Distillation from Foundation Models

Pith reviewed 2026-06-28 18:37 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords knowledge distillation · tissue segmentation · single-channel imaging · multiplexed fluorescence · foundation models · SAM · U-Net students · Dice score

The pith

Cross-modal distillation from multiplexed foundation models raises single-channel tissue segmentation Dice scores by roughly 12 to 13 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a frozen foundation model teacher processing both nuclear and membrane fluorescence channels can transfer useful semantic information to a lightweight student model that uses only the nuclear channel. This is done through a distillation objective that matches the teacher's probability outputs, adds boundary-aware supervision, and balances the loss terms with learnable uncertainty weights. On the TissueNet dataset, this leads to a Swin-Tiny student reaching a Dice score of 78.36, up from 65.31 without distillation, while using 23 times fewer parameters than the teacher and recovering nearly 88 percent of the teacher's performance. The improvement is consistent across the four student architectures and holds on a second dataset, BBBC038.

Core claim

A cross-modal knowledge distillation framework transfers semantic information from a multiplexed foundation model teacher to single-channel students by combining MSE probability matching, boundary-aware supervision, and learnable uncertainty weighting, enabling the students to achieve substantial performance gains on tissue segmentation tasks.

What carries the argument

The distillation objective that uses MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting to bridge multiplexed teacher inputs and single-channel student inputs.
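The three ingredients named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper's exact boundary term and weight parameterization are not given on this page, so `boundary_bce` (cross-entropy restricted to ground-truth boundary pixels) and the `exp(-s) * L + s` weighting (a common form of multi-task uncertainty weighting) are stand-ins, and all function names are hypothetical.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized 2D maps (lists of rows)."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def boundary_map(mask):
    """Return 1 for foreground pixels with a background or out-of-image 4-neighbour."""
    h, w = len(mask), len(mask[0])
    edge = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] < 0.5:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                outside = not (0 <= ny < h and 0 <= nx < w)
                if outside or mask[ny][nx] < 0.5:
                    edge[y][x] = 1
                    break
    return edge

def boundary_bce(prob, gt_mask, eps=1e-7):
    """ASSUMED boundary term: binary cross-entropy averaged over boundary pixels only."""
    edge = boundary_map(gt_mask)
    losses = []
    for y, row in enumerate(gt_mask):
        for x, g in enumerate(row):
            if edge[y][x]:
                p = min(max(prob[y][x], eps), 1 - eps)
                losses.append(-(g * math.log(p) + (1 - g) * math.log(1 - p)))
    return sum(losses) / len(losses) if losses else 0.0

def combined_loss(student_prob, teacher_prob, gt_mask, log_vars=(0.0, 0.0)):
    """ASSUMED weighting: sum over terms of exp(-s_i) * L_i + s_i.

    In training the log-variances s_i would be learnable; here they are plain floats.
    """
    terms = (mse(student_prob, teacher_prob), boundary_bce(student_prob, gt_mask))
    total = sum(math.exp(-s) * l + s for l, s in zip(terms, log_vars))
    return total, terms

# Toy 4x4 example: the teacher is confident, the student less so.
gt = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
teacher = [[0.9 if g else 0.1 for g in row] for row in gt]
student = [[0.7 if g else 0.2 for g in row] for row in gt]
total, (l_kd, l_bnd) = combined_loss(student, teacher, gt)
print(f"kd={l_kd:.4f} boundary={l_bnd:.4f} total={total:.4f}")
```

The teacher's multiplexed input never appears in the loss itself; only its probability map does, which is why the student can be trained and deployed on the nuclear channel alone.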

If this is right

  • KD improves Dice scores by about 12 points for all four tested student models on TissueNet.
  • SAM ViT-H outperforms CellSAM as a teacher across all student architectures and datasets.
  • The gains from distillation persist in cross-dataset evaluation on BBBC038 without any teacher retraining.
  • The best student recovers 87.9% of the teacher oracle performance at a 23x parameter reduction.
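
The headline figures can be checked directly from the three Dice scores quoted in the abstract. The snippet below only redoes that arithmetic; the variable names are illustrative.

```python
# Dice scores reported for TissueNet (Swin-Tiny student, SAM ViT-H teacher).
student_kd = 78.36     # distilled student
student_base = 65.31   # no-KD baseline
teacher_oracle = 89.12 # multiplexed teacher

gain = round(student_kd - student_base, 2)
recovery = round(100 * student_kd / teacher_oracle, 1)

print(f"absolute gain: {gain} Dice points")
print(f"teacher recovery: {recovery}%")
```

Both values agree with the reported 13.05-point improvement and 87.9% recovery.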

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that multi-channel foundation models can be leveraged to improve many single-channel medical imaging tasks where full multiplexing is not available at deployment.
  • The framework could be extended to other modalities or tasks by adjusting the distillation losses to match domain differences.
  • Smaller models like MobileNetV3 might see even larger relative benefits if further optimized for the distilled knowledge.

Load-bearing premise

The multiplexed teacher provides complementary semantic information from non-nuclear channels that can be transferred to the nuclear-only student without major loss due to input domain differences.

What would settle it

A lack of Dice score improvement in the distilled students compared to no-KD baselines on TissueNet would indicate that the cross-modal transfer is not effective.
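
A rough sense of how far the distilled and baseline scores sit apart can be had by expressing the gap in units of the combined spread. This sketch assumes the reported ± values are standard deviations over independent runs; the number of runs is not stated on this page, so this is a descriptive ratio and not a formal significance test. The helper name `gap_in_sigma` is hypothetical.

```python
import math

def gap_in_sigma(mean_a, sd_a, mean_b, sd_b):
    """Difference of two means divided by the root-sum-square of their spreads."""
    return (mean_a - mean_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)

# Distilled Swin-Tiny (78.36 ± 1.44) versus no-KD baseline (65.31 ± 1.35).
z = gap_in_sigma(78.36, 1.44, 65.31, 1.35)
print(f"gap is about {z:.1f} combined standard deviations")
```

A gap near zero here would be the outcome described above as unfavorable to the claim.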

Figures

Figures reproduced from arXiv: 2606.00928 by Jarin Ritu, Md Sakhawat Hossain, Sakib Mohammad.

Figure 1: Overview of the proposed cross-modal knowledge distillation framework. A frozen foundation model teacher …
Figure 2: Representative samples from both datasets.
Figure 3: Qualitative segmentation results on TissueNet validation set. Each row shows a different sample. Columns show nuclear channel input, ground truth …
Figure 3: The top row presents TissueNet examples, while …
Original abstract

Multiplexed fluorescence microscopy improves tissue segmentation by providing complementary channels including nuclear (DAPI) and membrane (E-cadherin), that together encode richer spatial context than single-channel imaging alone. However, multiplexed models require all channels at inference, limiting deployment where only a subset is available. This work proposes a cross-modal knowledge distillation framework that transfers semantic information from a frozen foundation model teacher processing multiplexed input to a lightweight student operating on the nuclear channel only. The distillation objective combines MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting. SAM ViT-H and CellSAM are evaluated as teachers across four U-Net students: Swin-Tiny (27M), ResNet18 (11M), EfficientNet-B0 (5.3M), and MobileNetV3 (1.5M), on TissueNet and BBBC038. On TissueNet, the SAM-distilled Swin-Tiny student achieves Dice 78.36 (plus or minus 1.44), a 13.05-point improvement over the no-KD baseline (65.31 plus or minus 1.35) and 87.9% recovery of teacher oracle performance (89.12 plus or minus 1.21) at a 23x parameter reduction. KD consistently improves all four students by approximately 12 Dice points, confirming architecture-agnostic distillation. SAM ViT-H outperforms CellSAM as teacher across all settings. Cross-dataset evaluation on BBBC038 shows consistent gains without teacher retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a cross-modal knowledge distillation framework that transfers semantic information from frozen multiplexed-input foundation model teachers (SAM ViT-H and CellSAM) to lightweight U-Net students that see only the nuclear channel (Swin-Tiny, ResNet18, EfficientNet-B0, MobileNetV3) for tissue segmentation. On TissueNet the best distilled student reaches Dice 78.36 (±1.44), a 13.05-point gain over the no-KD baseline and 87.9% recovery of the teacher oracle at 23× fewer parameters. Gains of roughly 12 points are reported for all four students, and the improvement persists on BBBC038 without teacher retraining.

Significance. If the central claim holds, the work offers a practical route to high-performance single-channel inference by leveraging multiplexed teachers only at training time. The architecture-agnostic gains, cross-dataset generalization without retraining, and use of public datasets with reported standard deviations are concrete strengths that would support deployment in settings where only nuclear staining is available.

major comments (3)
  1. [Experiments / TissueNet results (abstract and corresponding table)] The headline attribution of the 13.05-point Dice lift (TissueNet, Swin-Tiny row) to cross-modal transfer from non-nuclear channels rests on the assumption that the multiplexed teacher supplies complementary information unavailable to a nuclear-only model. No ablation is presented that holds teacher architecture, distillation losses (MSE probability matching + boundary supervision + uncertainty weighting), and training protocol fixed while restricting the teacher to the nuclear channel only. Without this control the reported gains cannot be separated from generic regularization effects of knowledge distillation.
  2. [Method (distillation objective)] The method description does not supply the precise mathematical form of the boundary-aware supervision term or the parameterization of the learnable uncertainty weights. These details are load-bearing for reproducing the reported Dice scores and for understanding whether the framework is truly parameter-light beyond the listed free parameters.
  3. [Experiments (baseline and training protocol)] Baseline definitions are incompletely specified: it is unclear whether the no-KD baseline uses identical data augmentation, optimizer schedule, and training epochs as the distilled students, or whether any post-hoc hyper-parameter tuning was performed only on the distilled runs. This affects interpretation of the 12-point average improvement.
minor comments (2)
  1. [Abstract] Abstract uses the literal phrase 'plus or minus' instead of the ± symbol; this should be corrected for readability.
  2. [Method] The manuscript would benefit from an explicit statement of the exact input-channel configuration used for each teacher during distillation (all channels vs. a subset) and from a reference to the precise SAM/CellSAM checkpoint versions employed.
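
The control requested in major comment 1 amounts to a small experiment grid in which only the teacher's input channels vary. The sketch below enumerates such a grid; the condition names, the `SHARED` settings, and the `build_runs` helper are hypothetical and only illustrate the design, not the authors' protocol.

```python
from itertools import product

STUDENTS = ("Swin-Tiny", "ResNet18", "EfficientNet-B0", "MobileNetV3")

# Only the teacher's input changes between the two KD conditions.
CONDITIONS = {
    "no_kd": None,
    "kd_nuclear_only_teacher": ("nuclear",),
    "kd_multiplexed_teacher": ("nuclear", "membrane"),
}

# Held fixed across every run (illustrative values).
SHARED = {
    "dataset": "TissueNet",
    "losses": ("mse_prob", "boundary", "uncertainty_weighting"),
}

def build_runs(teacher="SAM ViT-H"):
    """Return one run description per (student, condition) pair."""
    runs = []
    for student, (name, channels) in product(STUDENTS, CONDITIONS.items()):
        runs.append({
            **SHARED,
            "student": student,
            "condition": name,
            "teacher": teacher if channels else None,
            "teacher_channels": channels,
        })
    return runs

runs = build_runs()
print(len(runs), "runs")
```

If the nuclear-only teacher closes most of the gap to the multiplexed teacher, the gain would look more like generic distillation regularization than cross-modal transfer.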

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each of the major comments below.

  1. Referee: [Experiments / TissueNet results (abstract and corresponding table)] The headline attribution of the 13.05-point Dice lift (TissueNet, Swin-Tiny row) to cross-modal transfer from non-nuclear channels rests on the assumption that the multiplexed teacher supplies complementary information unavailable to a nuclear-only model. No ablation is presented that holds teacher architecture, distillation losses (MSE probability matching + boundary supervision + uncertainty weighting), and training protocol fixed while restricting the teacher to the nuclear channel only. Without this control the reported gains cannot be separated from generic regularization effects of knowledge distillation.

    Authors: We agree that an ablation with the teacher restricted to the nuclear channel is necessary to isolate the contribution of cross-modal information. We will perform this control experiment in the revision, maintaining identical teacher architecture, losses, and protocol, and include the results to demonstrate that the gains are indeed attributable to the additional channels. revision: yes

  2. Referee: [Method (distillation objective)] The method description does not supply the precise mathematical form of the boundary-aware supervision term or the parameterization of the learnable uncertainty weights. These details are load-bearing for reproducing the reported Dice scores and for understanding whether the framework is truly parameter-light beyond the listed free parameters.

    Authors: The full mathematical definitions of these terms were omitted from the main text for brevity. We will expand the Methods section in the revised manuscript to include the precise equations for the boundary-aware supervision and the learnable uncertainty weighting parameterization. revision: yes

  3. Referee: [Experiments (baseline and training protocol)] Baseline definitions are incompletely specified: it is unclear whether the no-KD baseline uses identical data augmentation, optimizer schedule, and training epochs as the distilled students, or whether any post-hoc hyper-parameter tuning was performed only on the distilled runs. This affects interpretation of the 12-point average improvement.

    Authors: The no-KD baselines were trained using the identical data augmentation, optimizer schedule, and number of epochs as the distilled students. No selective hyper-parameter tuning was applied to the distilled models. We will add explicit statements to this effect in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation on public benchmarks

The manuscript presents an empirical knowledge-distillation pipeline evaluated via Dice scores on TissueNet and BBBC038. No equations, fitted parameters, or self-citations are used to derive the reported performance numbers; all metrics are obtained by direct measurement against held-out test sets. The central result (approximately 12-point Dice lift) is therefore not reducible to any input quantity defined inside the paper itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the teacher models being reliable sources of semantic information and on the distillation losses successfully transferring non-nuclear channel cues; one learnable component is introduced in the loss.

free parameters (1)
  • learnable uncertainty weights
    The distillation objective includes learnable uncertainty weighting that is optimized during training to balance the loss terms.
axioms (1)
  • domain assumption: Foundation models (SAM ViT-H, CellSAM) produce high-quality semantic predictions on multiplexed fluorescence input that are worth distilling.
    The method assumes the frozen teachers provide useful targets; this is invoked when defining the distillation objective.
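
For orientation, one widely used parameterization of learnable uncertainty weights is shown below. It is an assumption about what the free parameter might look like, since the paper's own equations are not reproduced on this page. Here $\mathcal{L}_i$ are the individual loss terms and $s_i$ are learnable log-variances.

```latex
\mathcal{L}_{\text{total}}
  = \sum_{i} \left( e^{-s_i}\, \mathcal{L}_i + s_i \right),
\qquad s_i = \log \sigma_i^{2}
```

The $e^{-s_i}$ factor down-weights a term whose estimated uncertainty is high, while the additive $s_i$ keeps the optimizer from driving every weight to zero. Under this form the ledger's single free-parameter entry would correspond to one scalar $s_i$ per loss term.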
