CNNs, Transformers, Hybrid, and Vision Language Models for Skin Cancer Detection

Durjoy Dey; Hassan Hajjdiab; Yuhong Yan

arxiv: 2605.26294 · v1 · pith:5MC7GYTUnew · submitted 2026-05-25 · 💻 cs.CV

Pith reviewed 2026-06-29 22:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords: skin cancer detection · deep learning · CNN · vision transformers · hybrid models · vision language models · PAD-UFES-20 · binary classification

The pith

Transformer hybrids and SigLIP VLMs deliver the strongest accuracy-specificity trade-offs for skin cancer detection on PAD-UFES-20.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a single unified evaluation of twelve models drawn from CNN, vision transformer, hybrid, and vision-language families on the PAD-UFES-20 dataset for binary skin cancer classification. It shows that CNNs already give competitive results once tuned, yet every transformer-inclusive family improves AUC, maximum F1, and sensitivity at 80 percent specificity. Within those families, MaxViT Tiny, CoAtNet0, and the SigLIP-based VLM reach the best overall balance, while the CLIP-based VLM records the highest precision. A reader cares because the comparison supplies concrete guidance on which architectures to deploy when both ranking quality and clinically relevant operating points matter for triage. The released codebase turns the reported numbers into a reproducible reference point.

Core claim

On the PAD-UFES-20 dataset for binary skin cancer detection, well-tuned CNNs already provide strong baselines, but transformer-based families consistently improve discrimination. Hybrid models (MaxViT Tiny, CoAtNet0) and a SigLIP-based VLM achieve the best overall trade-off between ranking performance and clinically relevant operating points, while a CLIP-based model offers high precision.

What carries the argument

Side-by-side training and evaluation of twelve models from four families on the same PAD-UFES-20 split, scored by AUC, maximum F1, and sensitivity at 80 percent specificity.
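
The three scores named above can be illustrated with a minimal sketch. It is not taken from the paper's released codebase: the function names, the "score >= threshold means positive" convention, and the toy labels and scores are all assumptions made for illustration, and a real evaluation would more likely use an established metrics library.

```python
from typing import List, Tuple


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def max_f1(y_true: List[int], scores: List[float]) -> Tuple[float, float, float, float]:
    """Return (F1max, precision, recall, threshold) over all observed score thresholds."""
    best = (0.0, 0.0, 0.0, float("inf"))
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < t)
        if tp == 0:
            continue  # F1 is zero (or undefined) without any true positive
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, precision, recall, t)
    return best


def sensitivity_at_specificity(y_true: List[int], scores: List[float],
                               target_spec: float = 0.80) -> float:
    """Highest sensitivity among thresholds whose specificity is at least target_spec."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    # Candidate thresholds: every observed score, plus +inf (nothing predicted positive).
    for t in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        if tn / n_neg >= target_spec:
            best = max(best, tp / n_pos)
    return best


# Toy data (illustrative only, not from PAD-UFES-20).
labels = [0, 0, 0, 0, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.9, 0.35, 0.8, 0.95]
print(roc_auc(labels, probs), max_f1(labels, probs),
      sensitivity_at_specificity(labels, probs))
```

The sketch makes one property of the comparison visible: AUC is threshold-free, whereas maximum F1 and sensitivity at 80 percent specificity each pick a different operating point on the same score distribution, so a model can lead on one and trail on another.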

If this is right

Hybrid convolution-transformer models supply the most favorable balance between overall ranking and screening-oriented sensitivity.
SigLIP-based VLMs match or exceed hybrid performance across the reported clinical thresholds.
CLIP-based VLMs deliver the highest precision, useful when false-positive cost is high.
Pure CNNs remain viable baselines but are outperformed once transformer components are introduced under matched conditions.
Public release of the full training and evaluation code creates a fixed reference for future PAD-UFES-20 experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same families may generalize to other dermoscopic collections if the advantage is truly architectural rather than dataset-specific.
The 80-percent-specificity sensitivity focus implies these models are intended for settings where missing a positive case carries higher clinical cost than extra referrals.
Mobile or edge deployment of the top hybrids could reduce specialist workload in primary-care screening programs.
Precision-oriented use of the CLIP model might be combined with a hybrid model in a cascaded system to control both false negatives and false positives.
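
The cascaded system suggested in the last point could look roughly like the sketch below. Everything in it is hypothetical: the paper does not describe a cascade, and the `CascadeThresholds` class, the `triage` function, the threshold values, and the three output labels are invented here purely to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CascadeThresholds:
    screen: float   # hypothetical stage-1 cutoff, set low to favor sensitivity
    confirm: float  # hypothetical stage-2 cutoff, set high to favor precision


def triage(screen_score: float, confirm_score: float, th: CascadeThresholds) -> str:
    """Two-stage decision: a sensitive screener first, a precise confirmer second.

    screen_score  -- cancer probability from the screening model (e.g. a hybrid)
    confirm_score -- cancer probability from the precision-oriented model (e.g. CLIP-based)
    """
    if screen_score < th.screen:
        return "no referral"        # screener is confident the lesion is benign
    if confirm_score >= th.confirm:
        return "priority referral"  # both stages agree the lesion is suspicious
    return "routine referral"       # flagged by the screener only


th = CascadeThresholds(screen=0.3, confirm=0.7)
print(triage(0.5, 0.9, th))
```

In this arrangement the first threshold bounds the false-negative rate, since only lesions rejected by the screener leave the pathway, while the second threshold only reorders the referral queue. Whether that matches any clinical workflow is an open question the paper does not address.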

Load-bearing premise

Any observed performance gaps between the four model families are caused by architecture rather than differences in training procedure, hyperparameter search, or test-set leakage on PAD-UFES-20.

What would settle it

Retrain all twelve models from scratch with identical hyperparameters, augmentations, and train-test splits on PAD-UFES-20; if the transformer and hybrid families no longer show consistent gains in AUC or sensitivity at 80 percent specificity, the architecture advantage claim is falsified.

Figures

Figures reproduced from arXiv: 2605.26294 by Durjoy Dey, Hassan Hajjdiab, Yuhong Yan.

**Figure 1.** Experimental pipeline. (1) PAD-UFES-20 is split into patient-level train, validation, and test sets. (2) Twelve models from four families (CNN, ViT, hybrid, VLM) are trained on the training set. (3) The best checkpoint is evaluated on the test set. (4) Models are compared using AUC, sensitivity at 80% specificity, and F1max with the associated precision, recall, and threshold.

**Figure 2.** Test AUC comparison of CNN, ViT, hybrid, and vision language models on PAD-UFES-20. Higher values indicate better discrimination.

From Section 4.1 (CNN Models Results Analysis): The CNN baselines (DenseNet121, EfficientNetB3, InceptionV3, ResNet50) provide reference results for other model families on PAD-UFES-20. All are trained with the same patient-level split for binary cancer versus non-cancer classification.

**Figure 3.** Model sensitivity at 80% specificity on PAD-UFES-20.

Original abstract

Skin cancer is a common and fast-rising malignancy worldwide. Early detection is critical for improving outcomes. Deep learning models trained on dermoscopic and clinical images can support automated and fast triage. However, many studies evaluate only a limited set of architectures. Experimental setups also vary across studies. In this paper, we present a unified evaluation of twelve deep learning models for binary skin cancer detection on the PAD-UFES-20 dataset. The models span four families: convolutional neural networks (CNN), vision transformers (ViT), hybrid convolution-transformer backbones, and vision language models (VLM). Performance is assessed using AUC, the maximum F1 score with its precision and recall, and sensitivity at 80% specificity, reflecting screening-oriented requirements. Our results show that well-tuned CNNs already provide strong baselines, but transformer-based families consistently improve discrimination. Hybrid models (MaxViT Tiny, CoAtNet0) and a SigLIP-based VLM achieve the best overall trade-off between ranking performance and clinically relevant operating points, while a CLIP-based model offers high precision. The full codebase for all experiments is publicly released. Together, these findings offer practical guidance on which model families are most suitable for real-world deployment in skin cancer screening and establish a reproducible reference point for future work on PAD-UFES-20.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Straight benchmark of 12 models on PAD-UFES-20 with code release, but the architecture ranking claim needs clearer evidence of equal training effort.

The paper runs a single-setup comparison of CNNs, ViTs, hybrids, and VLMs for binary skin cancer detection on PAD-UFES-20. It reports that hybrids (MaxViT Tiny, CoAtNet0) and a SigLIP VLM give the best AUC and operating-point trade-offs, while well-tuned CNNs are already strong and CLIP gives high precision. The full codebase is released.

What stands out is the coverage of four model families under the same metrics (AUC, max F1, sensitivity at 80% specificity) plus the public code. That combination supplies a usable reference point for anyone choosing models for this dataset or similar screening tasks.

The main soft spot is whether the reported gaps reflect architecture or differences in tuning. The abstract calls the evaluation unified, yet gives no numbers on hyperparameter trials, search budget, or validation rules per family. If the transformer and hybrid models received more optimization than the CNN baselines, the ranking is not guaranteed to be architectural. The stress-test note flags exactly this point, and it is load-bearing because the dataset and metrics are otherwise fixed. Releasing code helps, but the paper text itself does not demonstrate parity.

This work is for readers who need current empirical baselines on PAD-UFES-20 or practical model-family guidance for dermatology triage. It is coherent on its own terms and shows honest engagement with the task. A serious editor should send it to peer review so the methods section can be tightened on training details; the code release makes that feasible.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a unified evaluation of twelve deep learning models spanning CNN, vision transformer, hybrid convolution-transformer, and vision-language model families for binary skin cancer detection on the PAD-UFES-20 dataset. Performance is measured via AUC, maximum F1 score (with precision/recall), and sensitivity at 80% specificity. The central claim is that well-tuned CNNs provide strong baselines but transformer-based families improve discrimination, with MaxViT Tiny, CoAtNet0, and a SigLIP-based VLM achieving the best overall trade-off and a CLIP-based model offering high precision; the full codebase is released.

Significance. If the experimental conditions prove comparable, the work supplies a reproducible reference benchmark on PAD-UFES-20 and practical guidance for model-family selection in screening-oriented skin-cancer triage. The public code release is a concrete strength that enables verification and extension by the community.

major comments (1)

[Abstract] The claim of a 'unified evaluation' across the twelve models is load-bearing for attributing performance differences to architecture families, yet the manuscript provides no explicit accounting of per-family hyperparameter search budgets, number of trials, augmentation pipelines, optimizer settings, early-stopping rules, or validation protocol on the fixed PAD-UFES-20 split. Without this information the observed ranking (e.g., hybrids and SigLIP over CNNs) cannot be confidently ascribed to model family rather than differential tuning effort.

minor comments (2)

[Abstract] The abstract states metric values and rankings but does not report statistical significance tests or confidence intervals; adding these would strengthen the empirical claims without altering the central narrative.
[Methods] Consider clarifying in the methods whether a single fixed train/validation/test split was used for all models or whether any model-specific data handling occurred.
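
The confidence intervals requested in the first minor comment could be obtained with a percentile bootstrap over the test set, as in the sketch below. This is one common choice, not a method the paper reports. The function names, the default of 1000 resamples, and the image-level resampling are assumptions; because PAD-UFES-20 can contain several images per patient, resampling whole patients instead may be more appropriate.

```python
import random
from typing import List, Tuple


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true: List[int], scores: List[float], n_boot: int = 1000,
                     alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap (1 - alpha) confidence interval for AUC."""
    n = len(y_true)
    if not 0 < sum(y_true) < n:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data (illustrative only, not from PAD-UFES-20).
labels = [0, 0, 0, 0, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.9, 0.35, 0.8, 0.95]
print(bootstrap_auc_ci(labels, probs, n_boot=200))
```

If the intervals of two model families overlap heavily, the reported ranking between them would deserve more cautious wording; a paired test on the same resamples would be the natural next step.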

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the recommendation for major revision. The single major comment raises a valid point about the need for explicit documentation of experimental controls to support the 'unified evaluation' claim. We address this below and commit to revisions that strengthen the manuscript without altering its core findings.

Referee: [Abstract] The claim of a 'unified evaluation' across the twelve models is load-bearing for attributing performance differences to architecture families, yet the manuscript provides no explicit accounting of per-family hyperparameter search budgets, number of trials, augmentation pipelines, optimizer settings, early-stopping rules, or validation protocol on the fixed PAD-UFES-20 split. Without this information the observed ranking (e.g., hybrids and SigLIP over CNNs) cannot be confidently ascribed to model family rather than differential tuning effort.

Authors: We agree that the manuscript would be strengthened by an explicit, consolidated description of the experimental protocol to make the attribution to model families more robust. The released codebase already encodes all training details (including per-model hyperparameter grids, augmentation pipelines, optimizers, and the fixed 70/15/15 split with early stopping on validation AUC), but we accept that readers should not need to inspect the code to verify fairness. In the revision we will add a dedicated 'Experimental Protocol' subsection (and an accompanying table) that reports, for each family: (i) hyperparameter search budget and number of trials, (ii) shared vs. family-specific augmentation and optimizer choices, (iii) early-stopping criterion, and (iv) confirmation that the same validation protocol was applied to all twelve models. This will allow the ranking to be confidently linked to architectural differences rather than tuning disparity. revision: yes
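
A patient-level split of the kind described in Figure 1, using the 70/15/15 proportions quoted in the simulated rebuttal, can be sketched as follows. This is an illustration under assumptions rather than the released code: the function name, the seed, and the unstratified shuffle are hypothetical, and the fractions are applied to patients, so image counts per split will only approximate 70/15/15.

```python
import random
from typing import Dict, List, Sequence, Tuple


def patient_level_split(patient_ids: Sequence[str],
                        fractions: Tuple[float, float, float] = (0.70, 0.15, 0.15),
                        seed: int = 0) -> Dict[str, List[int]]:
    """Assign image indices to train/val/test so that no patient spans two splits.

    patient_ids[i] is the patient who owns image i.
    """
    patients = sorted(set(patient_ids))       # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))

    assignment = {}
    for k, patient in enumerate(patients):
        if k < n_train:
            assignment[patient] = "train"
        elif k < n_train + n_val:
            assignment[patient] = "val"
        else:
            assignment[patient] = "test"      # remainder goes to the test set

    split: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for i, patient in enumerate(patient_ids):
        split[assignment[patient]].append(i)
    return split


# Toy example: 20 hypothetical patients with 3 images each.
ids = [f"P{i % 20}" for i in range(60)]
parts = patient_level_split(ids)
print({name: len(indices) for name, indices in parts.items()})
```

Splitting by patient rather than by image is what guards against the test-set leakage named in the load-bearing premise: images of the same lesion or patient never appear on both sides of the train/test boundary.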

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on fixed dataset

The paper conducts a unified empirical evaluation of 12 models across CNN, ViT, hybrid, and VLM families on the PAD-UFES-20 dataset, reporting AUC, max F1, and sensitivity at 80% specificity. All outcomes are direct measurements on held-out data with no equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to inputs by construction. The codebase release further supports reproducibility without introducing definitional loops. The assumption of comparable training conditions is an empirical claim open to verification via the released code, not a circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the PAD-UFES-20 dataset and the assumption that performance differences can be attributed to model architecture after standard training. No new entities are postulated.

free parameters (1)

model-specific hyperparameters
Learning rates, batch sizes, and other training settings were presumably tuned separately for each of the twelve models to produce the reported metrics.

axioms (1)

domain assumption: The PAD-UFES-20 dataset constitutes a suitable and unbiased benchmark for comparing skin cancer detection models
All comparisons and conclusions are drawn directly from performance on this single dataset without additional external validation.

[1] [1]

EAI Endorsed Transactions on Pervasive Health and Technology9, 1–8 (2023)

Agarwal, R., Godavarthi, D.: Skin disease classification using CNN algorithms. EAI Endorsed Transactions on Pervasive Health and Technology9, 1–8 (2023). https://doi.org/10.4108/eetpht.9.4039

work page doi:10.4108/eetpht.9.4039 2023

[2] [2]

Remote Sensing15(7), 1860 (2023).https://doi.org/10.3390/ rs15071860

Aleissaee, A.A., Kumar, A., Anwer, R.M., et al.: Transformers in remote sens- ing: A survey. Remote Sensing15(7), 1860 (2023).https://doi.org/10.3390/ rs15071860

2023

[3] [3]

Sinkron: Jur- nal dan Penelitian Teknik Informatika8, 2168–2178 (2023).https://doi.org/10

Anggriandi, D., Utami, E., Ariatmanto, D.: Comparative analysis of CNN and CNN-SVM methods for classification types of human skin disease. Sinkron: Jur- nal dan Penelitian Teknik Informatika8, 2168–2178 (2023).https://doi.org/10. 33395/sinkron.v8i4.12831

2023

[4] [4]

European Journal of Cancer111, 148–154 (2019).https://doi

Brinker, T.J., Hekler, A., Enk, A.H., et al.: A convolutional neural network trained with dermoscopic images performs on par with 145 dermatologists in melanoma classification. European Journal of Cancer111, 148–154 (2019).https://doi. org/10.1016/j.ejca.2019.01.011 12 D. Dey et al

work page doi:10.1016/j.ejca.2019.01.011 2019

[5] [5]

Zhang, Y

Chen, C.F.R., Fan, Q., Panda, R.: Crossvit: Cross-attention multi-scale vision transformer for image classification. In: 2021 IEEE/CVF International Confer- enceonComputerVision(ICCV).pp.347–356(2021).https://doi.org/10.1109/ ICCV48922.2021.00041

work page arXiv 2021

[6] [6]

NIPS ’21, Curran Associates Inc., Red Hook, NY, USA (2021)

Dai, Z., Liu, H., Le, Q.V., et al.: Coatnet: marrying convolution and attention for all data sizes. NIPS ’21, Curran Associates Inc., Red Hook, NY, USA (2021)

2021

[7] [7]

International Journal of Engineering Applied Sciences and Technology8, 186–191 (2023).https://doi.org/10.33564/ijeast.2023.v08i02.027

Dhankar, U., Jain, S., Zaidi, S., et al.: Skin disease detection using python and deep learning. International Journal of Engineering Applied Sciences and Technology8, 186–191 (2023).https://doi.org/10.33564/ijeast.2023.v08i02.027

work page doi:10.33564/ijeast.2023.v08i02.027 2023

[8] [8]

Nature542(7639), 115–118 (2017).https: //doi.org/10.1038/nature21056

Esteva, A., Kuprel, B., Novoa, R.A., et al.: Dermatologist-level classification of skin cancer with deep neural networks. Nature542(7639), 115–118 (2017).https: //doi.org/10.1038/nature21056

work page doi:10.1038/nature21056 2017

[9] [9]

International Journal of Cancer149(4), 778–789 (2021)

Ferlay, J., Colombet, M., Soerjomataram, I., et al.: Cancer statistics for the year 2020: An overview. International Journal of Cancer149(4), 778–789 (2021)

2020

[10] [10]

Nature Communications12(1), 160 (2021)

Fontanillas, P., Alipanahi, B., Furlotte, N.A., et al.: Disease risk scores for skin cancers. Nature Communications12(1), 160 (2021)

2021

[11] [11]

Frontiers in Medicine10, 1305954 (2024)

Furriel, B.C.R.S., Oliveira, B.D., Proã, R., et al.: Artificial intelligence for skin cancer detection and classification for clinical environment: A systematic review. Frontiers in Medicine10, 1305954 (2024)

2024

[12] [12]

Skin Lesion Segmentation and Classification for ISIC 2018 Using Traditional Classifiers with Hand-Crafted Features

Hardie, R.C., Ali, R., De Silva, M.S., et al.: Skin lesion segmentation and classi- fication for isic 2018 using traditional classifiers with hand-crafted features. arXiv preprint arXiv:1807.07001 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[13] [13]

In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016).https://doi.org/10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016

[14] [14]

Computational and Mathematical Methods in Medicine2024, 3022192 (2024).https://doi.org/10.1155/2024/3022192

Himel, G.M.S., Islam, M.M., Al-Aff, K.A., et al.: Skin cancer segmentation and classification using vision transformer. Computational and Mathematical Methods in Medicine2024, 3022192 (2024).https://doi.org/10.1155/2024/3022192

work page doi:10.1155/2024/3022192 2024

[15] [15]

In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Huang, G., Liu, Z., Van Der Maaten, L., et al.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2261–2269 (2017).https://doi.org/10.1109/CVPR.2017.243

work page doi:10.1109/cvpr.2017.243 2017

[16] [16]

Journal of Cuta- neous Pathology37(9), e1–e20 (2010)

Hussein, M.R.A.: Skin metastasis: A pathologist’s perspective. Journal of Cuta- neous Pathology37(9), e1–e20 (2010)

2010

[17] [17]

American Family Physician62(2), 357–368 (2000)

Jerant, A.F., Johnson, J.T., Sheridan, C.D., et al.: Early detection and treatment of skin cancer. American Family Physician62(2), 357–368 (2000)

2000

[18] [18]

Cancer75(S2), 684–690 (1995)

Kopf, A.W., Salopek, T.G., Slade, J., et al.: Techniques of cutaneous examination for the detection of skin cancer. Cancer75(S2), 684–690 (1995)

1995

[19] [19]

IEEE Access10, 123212–123224 (2022).https://doi.org/ 10.1109/ACCESS.2022.3224044

Lee, S., Lee, S., Song, B.C.: Improving vision transformers to learn small-size dataset from scratch. IEEE Access10, 123212–123224 (2022).https://doi.org/ 10.1109/ACCESS.2022.3224044

work page doi:10.1109/access.2022.3224044 2022

[20] [20]

In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)

Liu, Z., Lin, Y., Cao, Y., et al.: Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9992–10002 (2021).https://doi.org/10.1109/ICCV48922. 2021.00986

work page doi:10.1109/iccv48922 2021

[21] Lungu-Stan, V.C., Cercel, D.C., Pop, F.: SkinDistilViT: Lightweight vision transformer for skin lesion classification. In: Advances in Intelligent Systems and Computing, Lecture Notes in Computer Science, vol. 14254. Springer Nature Switzerland (2023). https://doi.org/10.1007/978-3-031-44207-0_23

[22] Pacal, I., Alaftekin, M., Zengul, F.: Enhancing skin cancer diagnosis using swin transformer with hybrid shifted window-based multi-head self-attention and swiglu-based mlp. Journal of Imaging Informatics in Medicine 37, 3174–3192 (2024). https://doi.org/10.1007/s10278-024-01140-8

[23] Pacheco, A.G.C., Lima, G.R., da Silva Salomão, A., et al.: Pad-ufes-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones. Data in Brief 32 (2020). https://doi.org/10.1016/j.dib.2020.106221

[24] Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)

[25] Szegedy, C., Vanhoucke, V., Ioffe, S., et al.: Rethinking the inception architecture for computer vision (06 2016). https://doi.org/10.1109/CVPR.2016.308

[26] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR (09–15 Jun 2019)

[27] Touvron, H., Cord, M., Douze, M., et al.: Training data-efficient image transformers & distillation through attention. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 10347–10357. PMLR (18–24 Jul 2021)

[28] Tschandl, P., Rinner, C., Apalla, Z., et al.: Human–computer collaboration for skin cancer recognition. Nature Medicine 26(8), 1229–1234 (2020). https://doi.org/10.1038/s41591-020-0942-0

[29] Tu, Z., Talebi, H., Zhang, H., et al.: Maxvit: Multi-axis vision transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13684, pp. 459–479. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20053-3_27

[30] Walters, K.A., Roberts, M.S.: The structure and function of skin. In: Walters, K.A. (ed.) Dermatological and Transdermal Formulations, pp. 19–58. CRC Press (2002)

[31] Wu, F., Papiez, B.W.: Rethinking foundation models for medical image classification through a benchmark study on medmnist. arXiv preprint arXiv:2501.14685 (2025), under review for MIDL 2025

[32] Ye, C., Li, J., Shuai, Q.: Evaluating the performance and clinical applications of multiclass deep learning models for skin cancer pathology diagnosis (isic): A comparative analysis of cnn, vit, and vlm. In: Proceedings of the 2025 10th International Conference on Intelligent Information Technology (ICIIT 2025). pp. 92–103. Association for Computing Mac... https://doi.org/10.1145/3731763.3731793

[33] Zhai, X., Mustafa, B., Kolesnikov, A., et al.: Sigmoid loss for language image pre-training. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11941–11952 (2023). https://doi.org/10.1109/ICCV51070.2023.01100

[34] Zhang, J., Liu, X., Wang, Y., et al.: Medclip: Contrastive learning from unpaired images and text for medical visual representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2022)

[35] Zhang, P., Li, X., Hu, X., Yang, J., et al.: Vinvl: Revisiting visual representations in vision–language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5579–5588 (2021)