pith. machine review for the scientific record.

arxiv: 2604.24793 · v1 · submitted 2026-04-25 · 📡 eess.IV · cs.CV

Recognition: unknown

CRC-SAM: SAM-Based Multi-Modal Segmentation and Quantification of Colorectal Cancer in CT, Colonoscopy, and Histology Images


Pith reviewed 2026-05-08 06:45 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords colorectal cancer · image segmentation · multi-modal imaging · foundation model adaptation · LoRA · CT · colonoscopy · histopathology

The pith

CRC-SAM adds low-rank adaptation layers to a frozen MedSAM encoder to deliver consistent colorectal cancer segmentation across CT, colonoscopy, and histology images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a single framework called CRC-SAM that segments colorectal cancer lesions in three different imaging types without building separate models for each. It starts from the MedSAM foundation model, inserts low-rank adaptation layers into the encoder, and keeps the base model frozen so only a small set of new parameters needs training. This setup targets efficient transfer to modalities that have fewer labeled examples, and the authors test it on standard datasets to show gains over prior single-modality approaches. If the method works as described, clinicians could use one system for tumor outlining at screening, staging, and pathology review stages.

Core claim

CRC-SAM achieves superior multi-modal segmentation by incorporating LoRA layers into a frozen MedSAM encoder, which enables effective domain transfer to CT, colonoscopy, and histopathology images using only minimal trainable parameters, as validated on the MSD-Colon, CVC-ClinicDB, and EBHI-Seg datasets where it outperforms state-of-the-art baselines.

What carries the argument

Low-rank adaptation (LoRA) layers inserted into the frozen encoder of the MedSAM foundation model, which perform the domain transfer to new imaging modalities while keeping the number of updated parameters small.
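
The mechanism is compact enough to sketch. The following is a minimal, illustrative rendering of the LoRA update rule, y = Wx + (α/r)·BAx with W frozen, not the authors' implementation; the dimensions, rank, and scaling values are made up.

```python
# Minimal LoRA sketch: a frozen weight W plus a trainable low-rank
# correction (alpha / r) * B @ A. All numbers are illustrative.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); only A and B are trained."""
    base = matvec(W, x)              # frozen pretrained path
    low = matvec(B, matvec(A, x))    # rank-r correction path
    s = alpha / r
    return [b + s * l for b, l in zip(base, low)]

# 3x3 frozen identity weight with a rank-1 update
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[1, 1, 1]]                      # r x d  (1 x 3)
B = [[0.5], [0.0], [0.0]]            # d x r  (3 x 1)
y = lora_forward(W, A, B, [1.0, 2.0, 3.0], alpha=2, r=1)
# base path returns x itself; the update adds 2 * 0.5 * (1+2+3) = 6
# to the first coordinate only, giving [7.0, 2.0, 3.0]
```

Freezing W and training only A and B is what keeps the trainable-parameter count small: a d×d weight costs d² parameters to fine-tune, while its rank-r correction costs only 2·d·r.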

If this is right

  • One model can replace multiple specialized ones for segmentation across the clinical workflow from CT staging to colonoscopy detection to histology review.
  • Adaptation to additional imaging types or cancer sites becomes feasible with far fewer labeled examples and compute resources than full model retraining.
  • Segmentation accuracy improves on the tested public datasets relative to prior methods that handle only one modality at a time.
  • Tumor quantification remains consistent across modalities, supporting unified measurement of lesion size and extent.
  • The lightweight nature of the added layers lowers the barrier to deploying foundation models in settings with limited data per modality.
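
The parameter economy can be checked against the counts the paper's experiments section reports (excerpted under reference [5] in the reference graph): 93,735,472 total parameters, 4,280,036 trainable, of which 221,184 sit in the encoder LoRA layers.

```python
# Trainable-parameter fractions implied by the counts in the paper's
# parameter table; only the arithmetic is ours.
total = 93_735_472      # full MedSAM-based model
trainable = 4_280_036   # LoRA layers + mask decoder
lora_encoder = 221_184  # LoRA layers in the frozen encoder alone

print(f"trainable:    {100 * trainable / total:.2f}% of all parameters")
print(f"encoder LoRA: {100 * lora_encoder / total:.2f}% of all parameters")
# trainable:    4.57% of all parameters
# encoder LoRA: 0.24% of all parameters
```

The 4.57% figure matches the paper's table, and the encoder-side LoRA layers alone account for under a quarter of one percent of the model.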

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LoRA insertion pattern could be tried on other foundation models or for different organs where imaging modalities also differ sharply.
  • Hospitals could maintain a smaller set of models rather than retraining separate ones when new scanner types are introduced.
  • Real-time use during procedures might become practical if the low parameter count translates to faster inference on standard hardware.
  • The approach invites direct comparison of adaptation cost versus performance when scaling to rarer cancer subtypes or rarer modalities.

Load-bearing premise

The differences between CT, colonoscopy, and histology images can be bridged well enough by low-rank adaptation on a frozen foundation-model encoder without needing larger architectural changes or more extensive retraining.

What would settle it

Applying CRC-SAM to a new modality or to a larger, more diverse test set would settle it: if the model no longer outperforms single-modality baselines, or needs far more than the reported parameter count to match their performance, the claim that this minimal adaptation suffices is falsified.

read the original abstract

We present CRC-SAM, a unified framework for colorectal cancer segmentation across colonoscopy, CT, and histopathology images. Unlike prior single-modality methods, CRC-SAM provides consistent, modality-agnostic segmentation throughout the clinical workflow. Built on MedSAM, it incorporates low-rank adaptation (LoRA) layers into a frozen encoder, enabling efficient domain transfer to underrepresented modalities with minimal trainable parameters. Experiments on MSD-Colon, CVC-ClinicDB, and EBHI-Seg demonstrate superior performance across modalities, outperforming state-of-the-art baselines and highlighting the effectiveness of lightweight LoRA adaptation for foundation-model-based colorectal cancer analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CRC-SAM, a unified multi-modal framework for colorectal cancer segmentation that adapts the MedSAM foundation model by inserting low-rank adaptation (LoRA) layers into a frozen encoder. It claims this enables efficient domain transfer and consistent performance across CT, colonoscopy, and histology images, with experiments on MSD-Colon, CVC-ClinicDB, and EBHI-Seg demonstrating superiority over state-of-the-art baselines.

Significance. If the reported gains prove robust, the work would illustrate a practical, low-parameter route for extending medical foundation models to heterogeneous modalities whose physics, resolution, and dimensionality differ substantially. The emphasis on frozen-encoder LoRA adaptation is a pragmatic strength for resource-constrained clinical settings.

major comments (2)
  1. [Methods] Methods section: the central claim that a single LoRA configuration on the frozen MedSAM encoder suffices for effective segmentation across 3D CT volumes and native 2D colonoscopy/histology images is not accompanied by any modality-specific input handling description, feature-space distance analysis, or ablation that isolates adaptation from preprocessing choices.
  2. [Experiments] Experiments section: superiority is asserted over baselines on the three named datasets, yet no error bars, statistical significance tests, or cross-validation protocol are referenced, leaving open whether observed differences exceed dataset-specific variability.
minor comments (2)
  1. [Abstract] Abstract: the statement of 'superior performance' would be more informative if accompanied by at least the primary quantitative metrics (Dice, IoU) for each modality.
  2. [Figures] Figure captions and architecture diagram: the placement of LoRA layers relative to the image encoder and prompt encoder is not visually clarified, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. These have prompted us to strengthen the presentation of our methods and the statistical robustness of our experiments. We address each major comment point by point below, indicating the specific revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section: the central claim that a single LoRA configuration on the frozen MedSAM encoder suffices for effective segmentation across 3D CT volumes and native 2D colonoscopy/histology images is not accompanied by any modality-specific input handling description, feature-space distance analysis, or ablation that isolates adaptation from preprocessing choices.

    Authors: We agree that the original Methods section provided only a high-level description of the unified LoRA-adapted MedSAM framework and did not sufficiently detail modality-specific input handling. In the revised manuscript we have added an explicit subsection (Section 3.2) describing slice-wise processing of 3D CT volumes (with 3D connected-component post-processing), intensity normalization for CT, color deconvolution for histology, and resizing/padding protocols for colonoscopy images. We have also inserted a new ablation study (Table 4) that isolates the contribution of LoRA adaptation from preprocessing variations by comparing (i) frozen MedSAM with preprocessing only, (ii) LoRA with standard preprocessing, and (iii) LoRA with modality-specific preprocessing. Finally, we include a brief feature-space analysis using t-SNE projections of encoder embeddings before and after LoRA adaptation to illustrate how the low-rank updates reduce modality-induced distribution shifts. These additions directly support the central claim while preserving the frozen-encoder efficiency. revision: yes
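
The kind of modality-specific input handling the response describes can be sketched as a dispatch over modalities. This is a hypothetical illustration, not the authors' code: the window bounds and function names are invented, and only the CT branch is fleshed out.

```python
# Hypothetical preprocessing dispatch in the spirit of the rebuttal's
# Section 3.2: intensity windowing for CT; colonoscopy and histology
# would get resizing/padding and color deconvolution respectively.

def normalize_ct(hu_values, lo=-100.0, hi=200.0):
    """Clip CT intensities (Hounsfield units) to an illustrative
    soft-tissue window and rescale to [0, 1]."""
    span = hi - lo
    return [max(0.0, min(1.0, (v - lo) / span)) for v in hu_values]

def preprocess(pixels, modality):
    """Route each modality to its own preprocessing path."""
    if modality == "ct":
        return normalize_ct(pixels)
    raise NotImplementedError(f"no sketch for modality {modality!r}")

# The window floor maps to 0.0, the ceiling to 1.0, mid-window to 0.5.
out = preprocess([-100.0, 200.0, 50.0], "ct")
```

The point of the ablation the response adds is exactly to separate the effect of this per-modality routing from the effect of the LoRA layers themselves.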

  2. Referee: [Experiments] Experiments section: superiority is asserted over baselines on the three named datasets, yet no error bars, statistical significance tests, or cross-validation protocol are referenced, leaving open whether observed differences exceed dataset-specific variability.

    Authors: We acknowledge that the original experimental reporting relied on single fixed splits and point estimates, which limits assessment of variability. In the revised manuscript we have re-evaluated all methods using 5-fold cross-validation on MSD-Colon and CVC-ClinicDB (and 3-fold on the smaller EBHI-Seg dataset due to computational limits). Results are now reported as mean ± standard deviation, with error bars in all figures and tables. We have added paired t-tests (and Wilcoxon signed-rank tests for non-normal metrics) between CRC-SAM and each baseline, reporting p-values in the updated results tables (Tables 1–3). These tests confirm that the observed improvements remain statistically significant (p < 0.05) across folds. While full leave-one-center-out validation was not feasible within the revision timeline, the added cross-validation protocol and significance testing directly address the concern about dataset-specific variability. revision: yes
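
The paired test the response invokes is standard and easy to state concretely. The sketch below uses made-up per-fold Dice scores, not the paper's numbers, to show what "paired t-test across folds" means in this protocol.

```python
import math
import statistics

# Paired t-test over matched cross-validation folds: test whether the
# mean per-fold score difference between two methods is nonzero.

def paired_t(scores_a, scores_b):
    """Return the paired t statistic for two matched score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

crc_sam  = [0.91, 0.89, 0.92, 0.90, 0.93]   # illustrative fold Dice scores
baseline = [0.88, 0.87, 0.90, 0.88, 0.90]
t = paired_t(crc_sam, baseline)
# With n - 1 = 4 degrees of freedom, |t| > 2.776 corresponds to
# p < 0.05 (two-sided), i.e. the threshold the rebuttal reports against.
```

Pairing by fold matters: it removes fold-to-fold difficulty variation, which is precisely the dataset-specific variability the referee raised.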

Circularity Check

0 steps flagged

No circularity: empirical LoRA adaptation on public benchmarks

full rationale

The paper describes an empirical adaptation of the existing MedSAM model by inserting LoRA layers into a frozen encoder, followed by standard training and evaluation on three public datasets (MSD-Colon, CVC-ClinicDB, EBHI-Seg). No equations, derivations, or parameter-fitting steps are presented that reduce by construction to quantities defined inside the method itself. The central claims rest on comparative benchmark numbers rather than any self-referential uniqueness theorem, ansatz smuggled via citation, or renaming of known results. Self-citations, if present, are not load-bearing for the reported performance gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard transfer-learning assumption that parameter-efficient adaptation of a frozen foundation model can bridge domain gaps between CT, colonoscopy, and histology without further architectural changes.

axioms (1)
  • domain assumption LoRA adaptation of a frozen encoder enables effective domain transfer for medical image segmentation with minimal trainable parameters.
    Invoked in the description of the CRC-SAM architecture as the mechanism for efficient multi-modal use.

pith-pipeline@v0.9.0 · 5402 in / 1240 out tokens · 46959 ms · 2026-05-08T06:45:03.464744+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    Because early stages are frequently asymptomatic, early detection and accurate diagnosis are critical for improving outcomes

INTRODUCTION Colorectal cancer (CRC), the world's third most common cancer (∼10% of cases), arises from abnormal colon cell growth and often develops from benign adenomatous polyps. Because early stages are frequently asymptomatic, early detection and accurate diagnosis are critical for improving outcomes. In the U.S., colorectal cancer screening typically...

  2. [2]

RELATED WORK Initial research on colorectal tissue segmentation was dominated by encoder–decoder CNNs such as U-Net [4]. UNet++

  3. [3]

    CRC-SAM: SAM-Based Multi-Modal Segmentation and Quantification of Colorectal Cancer in CT, Colonoscopy, and Histology Images

advanced this paradigm by introducing dense and attention connections, and later studies further improved segmentation accuracy through enhanced boundary modeling and multi-scale feature representations. Transformers have recently gained traction in medical image segmentation for their ability to capture global context. Hybrid CNN–Transformer models...

  4. [4]

Overview Given an input image x ∈ R^{H×W×3}, the image encoder f_θ outputs embeddings z = f_θ(x) ∈ R^{N×D}

METHOD 3.1. Overview Given an input image x ∈ R^{H×W×3}, the image encoder f_θ outputs embeddings z = f_θ(x) ∈ R^{N×D}. The prompt encoder g_φ maps prompts p to embeddings e = g_φ(p), which are fused by the mask decoder h_ψ to produce the predicted mask ŷ = h_ψ(z, e) ∈ [0, 1]^{H×W}. As shown in Figure 1, CRC-SAM adopts the MedSAM architecture [1, 2]. The image encoder is froze...

  5. [5]

EXPERIMENTS 4.1. Datasets, Evaluation Metrics, and Model Parameters The proposed model is evaluated on CS, CT, and PATH modalities using the public datasets MSD-Colon, CVC-ClinicDB, and EBHI-Seg (Table 2).

    Category · #Parameters · % of Total
    Total model parameters · 93,735,472 · 100%
    Total trainable · 4,280,036 · 4.57%
    – LoRA (encoder) · 221,184 · 0.24%
    – Mask decoder · 4,0...

  6. [6]

Built on MedSAM [1, 2] with a lightweight LoRA-based adaptation [3], the framework achieves accurate and efficient segmentation across CT, colonoscopy, and histopathology

    CONCLUSION AND DISCUSSION We propose a unified model for colorectal cancer segmentation across multiple medical imaging modalities, enabling consistent and objective analysis throughout the clinical workflow. Built on MedSAM [1, 2] with a lightweight LoRA-based adaptation [3], the framework achieves accurate and efficient segmentation across CT, colon...

  7. [7]

    Segment anything in medical images,

Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang, “Segment anything in medical images,” Nature Communications, vol. 15, no. 1, p. 654, 2024

  8. [8]

    Segment anything,

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026

  9. [9]

LoRA: Low-rank adaptation of large language models,

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al., “LoRA: Low-rank adaptation of large language models,” ICLR, 2022

  10. [10]

    U-net: Convolutional networks for biomedical image segmentation,

Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241

  11. [11]

    Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,

Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang, “UNet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856–1867, 2019

  12. [12]

    Ctnet: Contrastive transformer network for polyp segmentation,

Bin Xiao, Jinwu Hu, Weisheng Li, Chi-Man Pun, and Xiuli Bi, “CTNet: Contrastive transformer network for polyp segmentation,” IEEE Transactions on Cybernetics, vol. 54, no. 9, pp. 5040–5053, 2024

  13. [13]

    Unetr: Transformers for 3d medical image segmentation,

Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R. Roth, and Daguang Xu, “UNETR: Transformers for 3D medical image segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 574–584

  14. [14]

    Unetr++: delving into efficient and accurate 3d medical image segmentation,

Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan, “UNETR++: Delving into efficient and accurate 3D medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 43, no. 9, pp. 3377–3390, 2024