Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation
Pith reviewed 2026-06-27 07:20 UTC · model grok-4.3
The pith
A cascade of binary triage followed by three-class differentiation allows tunable sensitivity for skin neoplasm images, which standard single-stage (argmax) classification does not offer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By evaluating ViT-B/16, Swin-S, ConvNeXt-S, and EfficientNetV2-S across binary, single-stage four-class, and cascade schemes on aggregated ISIC data, the paper shows that the cascade raises macro F1 over single-stage four-class classification for most architectures, with a statistically significant gain only for ViT-B/16. The binary triage stage attains ROC-AUC 0.952-0.966 internally but drops to 0.797-0.893 on Sechenov University data, with sensitivity falling to 0.53-0.67 and ECE rising from 0.02 to 0.27-0.39. No architecture shows a proven advantage at the differentiation stage on clinical data, and direct 11-class classification on ISIC MILK10k yields mean-class sensitivity of 0.525.
What carries the argument
Two-stage cascade: binary malignant/benign triage with adjustable threshold, followed by three-class differentiation among malignant types (MEL, SCC, BCC).
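The two-stage decision rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `cascade_predict`, the 0.5 default threshold, and the assumption that stage two simply takes the most probable malignant subtype are all hypothetical.

```python
from typing import Sequence

# Class order for the second stage (assumed for illustration).
MALIGNANT_CLASSES = ("MEL", "SCC", "BCC")


def cascade_predict(p_malignant: float,
                    subtype_probs: Sequence[float],
                    threshold: float = 0.5) -> str:
    """Stage 1: binary triage on p_malignant with an adjustable threshold.
    Stage 2: for lesions flagged as malignant, pick the most probable subtype."""
    if p_malignant < threshold:
        return "benign"
    best = max(range(len(MALIGNANT_CLASSES)), key=lambda i: subtype_probs[i])
    return MALIGNANT_CLASSES[best]
```

Lowering `threshold` sends more lesions to the second stage, which trades specificity for sensitivity without retraining either model.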
If this is right
- Cascade raises macro F1 over single-stage four-class classification for most architectures by recovering malignant lesions assigned to the benign class.
- Tunable triage threshold supplies sensitivity control unattainable with standard single-stage argmax classification.
- Binary stage ROC-AUC falls from 0.952-0.966 internally to 0.797-0.893 on external clinical data, with sensitivity declining to 0.53-0.67.
- Calibration error rises sharply on external data, with malignancy underestimation quantified by ECE increasing to 0.27-0.39.
- No architecture shows a proven advantage at the malignant differentiation stage on clinical data.
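For readers unfamiliar with ECE, the following sketch computes one common binned variant for a binary malignancy score. The paper's exact ECE definition is not given here, so the equal-width binning, the default of 10 bins, and the name `binary_ece` are assumptions.

```python
def binary_ece(p_malignant, labels, n_bins=10):
    """Binned expected calibration error for binary scores.

    Each bin contributes |observed malignant fraction - mean predicted
    probability|, weighted by the share of samples falling in that bin."""
    n = len(p_malignant)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_malignant, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(frac_pos - mean_p)
    return ece
```

In this toy setting, four lesions all scored 0.1 of which two are actually malignant give an ECE of about 0.4, the kind of malignancy underestimation the external results describe.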
Where Pith is reading between the lines
- The persistent gap between internal and external performance implies that domain adaptation or target-population data collection may be required before reliable clinical use.
- The cascade structure could be tested on other imbalanced medical imaging tasks where rare positive cases must be separated from a large negative background.
- Incorporating additional patient metadata or multi-modal inputs might narrow the observed generalization gap between open international and local clinical datasets.
- Regulatory pathways for similar diagnostic tools would likely need to require independent external validation on representative populations.
Load-bearing premise
Aggregated open ISIC Archive data with ImageNet-pretrained weights provides a sufficient basis for models that transfer meaningfully to independent Russian clinical datasets without domain adaptation.
What would settle it
Showing that adjusting the triage threshold on the Sechenov University or Melanoscope AI datasets produces no improvement in macro F1 or sensitivity control compared with single-stage argmax classification would falsify the claimed advantage of the cascade.
Original abstract
Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
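The two multi-class summary metrics named in the abstract, macro F1 and mean-class sensitivity, can be computed as below. This is a sketch under assumptions: the helper names are hypothetical, and scoring a class with no support or no predictions as 0.0 is one convention among several.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_class_sensitivity(y_true, y_pred, classes):
    """Unweighted mean of per-class recall (sensitivity)."""
    recalls = []
    for c in classes:
        support = sum(t == c for t in y_true)
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(tp / support if support else 0.0)
    return sum(recalls) / len(recalls)
```

Because every class counts equally, a few malignant lesions absorbed by the dominant benign class lower both metrics sharply, which is why recovering them in a cascade can raise macro F1.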
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares four deep learning architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) for dermoscopic skin neoplasm classification under three schemes: binary (malignant/benign), single-stage four-class (benign/MEL/SCC/BCC), and a two-stage cascade (binary triage then three-class differentiation). All models use ImageNet-pretrained weights and are trained on aggregated ISIC Archive data; evaluation occurs on an internal held-out sample plus two external Russian clinical datasets. Reported results include internal binary AUC of 0.952-0.966 dropping to 0.797-0.893 externally with sensitivity 0.53-0.67 and rising ECE, macro F1 gains for cascade over single-stage (significant only for ViT-B/16), and statistical tests confirming limited inter-architecture differences on clinical data. The conclusion states that a tunable triage threshold enables sensitivity control unattainable with standard single-stage argmax classification and better matches clinical logic, while the generalization gap requires external validation and recalibration.
Significance. If the central claims hold, the work supplies concrete empirical support for cascade schemes in medical image triage by quantifying sensitivity control and domain-shift effects via external validation on independent clinical data. Credit is due for reporting specific AUC/sensitivity/F1/ECE values, paired statistical tests, and the explicit quantification of the generalization gap (AUC drop and ECE rise from 0.02 to 0.27-0.39). These elements provide a falsifiable basis for the triage-threshold advantage and the call for recalibration.
major comments (1)
- [Results] Results: the central claim that 'a tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification' rests solely on comparison to argmax single-stage four-class models. No results are shown for single-stage four-class models whose output probabilities are thresholded (e.g., malignancy probability or per-class operating points) to achieve the same external sensitivity range (0.53-0.67); this comparison is required to substantiate that the reported control is unavailable in any single-stage formulation.
minor comments (2)
- [Abstract] Abstract: the specific data subset (internal vs. external) on which the macro F1 improvement reaches statistical significance for ViT-B/16 is not stated.
- [Methods] The manuscript does not detail whether the single-stage models were also evaluated under any form of probability thresholding, leaving the scope of the 'standard single-stage' baseline ambiguous.
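The baseline the referee asks for can be sketched as follows: a single-stage four-class model whose aggregated malignant probability is thresholded instead of taking the argmax. The class order, function names, and tie-breaking are assumptions made for illustration; none of this comes from the manuscript.

```python
# Assumed output order of the four-class softmax.
CLASSES = ("benign", "MEL", "SCC", "BCC")


def argmax_predict(probs):
    """Standard single-stage decision: most probable of the four classes."""
    return CLASSES[max(range(len(CLASSES)), key=lambda i: probs[i])]


def thresholded_single_stage(probs, threshold=0.5):
    """Flag a lesion as malignant when the summed malignant probability
    (MEL + SCC + BCC) reaches the threshold, then pick the most probable
    malignant subtype."""
    p_malignant = sum(probs[1:])
    if p_malignant < threshold:
        return "benign"
    best = max(range(1, len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best]
```

For a softmax output of `[0.55, 0.25, 0.10, 0.10]`, argmax returns benign, while a threshold of 0.4 on the aggregated malignant probability returns MEL. The open question raised in the report is whether this post-hoc rule matches the cascade's operating points on external data.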
Simulated Author's Rebuttal
We thank the referee for the constructive comment on our results section. We address the point below and agree that additional comparisons will strengthen the manuscript.
Point-by-point responses
-
Referee: [Results] Results: the central claim that 'a tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification' rests solely on comparison to argmax single-stage four-class models. No results are shown for single-stage four-class models whose output probabilities are thresholded (e.g., malignancy probability or per-class operating points) to achieve the same external sensitivity range (0.53-0.67); this comparison is required to substantiate that the reported control is unavailable in any single-stage formulation.
Authors: We agree that the referee's point is valid for fully substantiating the advantage of the cascade. While the manuscript explicitly frames its claim against the standard argmax single-stage four-class output (as stated in the conclusion), a comparison to single-stage models operated with probability thresholding is a natural extension. In the revised manuscript we will add results for single-stage four-class models where decision thresholds are adjusted on the output probabilities (both on the aggregated malignant probability and per-class operating points) to target the same external sensitivity range of 0.53-0.67. We will report the resulting macro F1, specificity, and calibration metrics alongside the cascade results. This will clarify whether the cascade provides sensitivity control that cannot be replicated by post-hoc thresholding in a single-stage formulation.
Revision: yes
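Matching operating points, as proposed in the response, requires choosing a threshold that hits a target sensitivity. A minimal sketch, assuming a decision rule of `score >= threshold` and hypothetical helper names; in practice the threshold would normally be picked on a validation set rather than on the data being reported.

```python
import math


def threshold_for_sensitivity(p_malignant, labels, target):
    """Largest threshold whose sensitivity (rule: score >= threshold)
    is at least `target` on the given labelled scores."""
    pos = sorted((p for p, y in zip(p_malignant, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("no malignant cases to compute sensitivity on")
    # Number of malignant cases that must be flagged to reach the target.
    k = max(math.ceil(target * len(pos)), 1)
    return pos[k - 1]


def sensitivity_specificity(p_malignant, labels, threshold):
    """Sensitivity and specificity of the rule score >= threshold."""
    tp = sum(p >= threshold and y == 1 for p, y in zip(p_malignant, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(p_malignant, labels))
    tn = sum(p < threshold and y == 0 for p, y in zip(p_malignant, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(p_malignant, labels))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec
```

Applying the same target to the cascade's triage score and to a single-stage aggregated malignant probability would let specificity and macro F1 be compared at equal sensitivity.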
Circularity Check
No circularity: purely empirical comparisons on held-out and external datasets
Full rationale
The paper reports training and evaluation of four architectures under three classification schemes (binary, single-stage four-class, cascade) using ImageNet-pretrained weights on aggregated ISIC data, with metrics on internal held-out and two external clinical datasets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear; the central claim about tunable triage thresholds is an empirical observation from direct comparisons to argmax baselines, not a reduction to inputs by construction. The generalization gap is quantified via explicit AUC/ECE drops rather than assumed away.
Axiom & Free-Parameter Ledger
free parameters (1)
- triage threshold
axioms (2)
- domain assumption: ImageNet-pretrained weights and a single augmentation protocol are sufficient for fair comparison across architectures
- domain assumption: the clinical datasets from Melanoscope AI and Sechenov University are independent and representative of Russian practice