Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models
Pith reviewed 2026-06-26 21:19 UTC · model grok-4.3
The pith
Frozen 2D vision transformers detect anomalies in 3D medical volumes by scoring tokens that lack matches across a batch of subjects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing 3D volumes along anatomical axes and encoding slices with a frozen 2D vision transformer, the method creates localized volumetric tokens whose anomaly scores come from their lack of close analogues across other subjects in the batch. A coarse-to-fine tokenization strategy is added to preserve signals from focal lesions that would otherwise be diluted by pooling. This enables training-free zero-shot anomaly detection and localization in 3D medical images using only 2D foundation models, as demonstrated on brain MRI datasets for metastases, glioma, and stroke, and validated on lung CT.
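A minimal sketch of the tokenization step described above, under stated assumptions. `encode_slice` is a hypothetical stand-in that returns per-patch mean and standard deviation; it is not a real frozen ViT. The patch size, pooling width, and all function names are illustrative and not taken from the paper.

```python
import numpy as np

def encode_slice(slice_2d: np.ndarray, patch: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a frozen 2D ViT: one feature vector per patch."""
    h, w = slice_2d.shape
    gh, gw = h // patch, w // patch
    # Cut the slice into a (gh, gw) grid of patch x patch tiles.
    tiles = slice_2d[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(gh, gw, patch * patch)
    # Two toy features per patch: mean and standard deviation of intensity.
    return np.stack([tiles.mean(-1), tiles.std(-1)], axis=-1)  # (gh, gw, 2)

def volumetric_tokens(volume: np.ndarray, axis: int,
                      depth_pool: int = 4, patch: int = 4) -> np.ndarray:
    """Slice a 3D volume along `axis`, encode each slice, pool neighboring slices."""
    slices = np.moveaxis(volume, axis, 0)                       # (D, H, W)
    feats = np.stack([encode_slice(s, patch) for s in slices])  # (D, gh, gw, C)
    n_blocks = feats.shape[0] // depth_pool
    feats = feats[:n_blocks * depth_pool]                       # drop leftover slices
    blocks = feats.reshape(n_blocks, depth_pool, *feats.shape[1:])
    return blocks.mean(axis=1)                                  # (n_blocks, gh, gw, C)

# Decompose one toy volume along each of the three axes.
volume = np.random.default_rng(0).normal(size=(16, 16, 16))
tokens_per_axis = {axis: volumetric_tokens(volume, axis) for axis in range(3)}
```

Each entry of `tokens_per_axis` is a grid of localized tokens for one slicing direction. How the paper combines the three directions is not stated in this summary, so the sketch keeps them separate.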
What carries the argument
Cross-subject mutual similarity scoring on volumetric tokens created from multi-axis slice encoding by a 2D ViT with optional coarse-to-fine pooling.
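One way to read this scoring rule is sketched below, as an assumption rather than the paper's exact formula. Each token takes its best cosine match in every other subject, and its anomaly score is the mean of one minus that best match. The function names and the toy batch are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cross_subject_scores(tokens: np.ndarray) -> np.ndarray:
    """tokens: (n_subjects, n_tokens, dim) -> anomaly scores (n_subjects, n_tokens)."""
    unit = l2_normalize(tokens)
    n_subjects = unit.shape[0]
    scores = np.zeros(unit.shape[:2])
    for i in range(n_subjects):
        distances = []
        for j in range(n_subjects):
            if j == i:
                continue  # a subject never serves as its own reference
            sim = unit[i] @ unit[j].T                # (n_tokens, n_tokens) cosine similarities
            distances.append(1.0 - sim.max(axis=1))  # distance to the closest analogue in subject j
        scores[i] = np.mean(distances, axis=0)       # average over the other subjects
    return scores

# Toy batch (assumption): 6 subjects share 10 orthogonal "normal" prototypes;
# subject 0 has one token replaced by a direction no other subject contains.
rng = np.random.default_rng(0)
batch = np.eye(10, 16)[None] + 0.01 * rng.normal(size=(6, 10, 16))
batch[0, 3] = np.eye(16)[15]
scores = cross_subject_scores(batch)
```

In this toy batch the replaced token in subject 0 has no close analogue anywhere else, so it receives the highest score in that subject. The corresponding token in the other subjects is penalized only mildly, because it still finds matches in all but one reference subject.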
If this is right
- Anomaly localization in 3D volumes is possible without volumetric foundation models or supervised training.
- The benefit of fine-resolution tokenization varies with lesion contrast and imaging modality.
- The approach generalizes from atlas-aligned brain MRI to lung CT.
- Focal lesion signals are better preserved by switching to coarse-to-fine tokenization instead of uniform depth pooling.
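The dilution that motivates the last point can be illustrated with a toy calculation. This is a deliberately simplified model, not the paper's analysis: the lesion's feature deviation is treated as a scalar that is nonzero on a few consecutive slices, and pooling blocks are assumed to align with the start of the lesion. The function name is hypothetical.

```python
import numpy as np

def pooled_deviation(n_slices: int, lesion_slices: int,
                     lesion_delta: float, pool: int) -> float:
    """Largest per-token deviation left after mean-pooling `pool` neighboring slices."""
    signal = np.zeros(n_slices)
    signal[:lesion_slices] = lesion_delta  # lesion occupies the first few slices
    n_blocks = n_slices // pool
    pooled = signal[:n_blocks * pool].reshape(n_blocks, pool).mean(axis=1)
    return float(pooled.max())

# A two-slice lesion under progressively coarser depth pooling.
for pool in (1, 2, 4, 8):
    print(pool, pooled_deviation(16, 2, 1.0, pool))
```

Under these assumptions a two-slice lesion keeps its full deviation while the pooling width is at most two slices, then falls to 2/4 = 0.5 and 2/8 = 0.25 as the width grows. This is the attenuation that fine-resolution tokens are meant to avoid.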
Where Pith is reading between the lines
- The batch-matching principle could transfer to other 3D data types processed by 2D models, such as video sequences.
- Practical use would require reliable ways to form batches of mostly normal scans from clinical archives.
- Combining the token comparison with simple preprocessing steps tuned to each modality might reduce sensitivity to acquisition differences.
Load-bearing premise
The batch consists primarily of normal cases with comparable acquisition conditions so that unmatched tokens indicate anomalies rather than normal variation.
What would settle it
Apply the method to batches that contain many abnormal cases and check whether known lesions still receive distinctly higher anomaly scores than normal tissue.
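A synthetic version of this test can be sketched as follows. Everything here is an assumption made for illustration: the scoring rule is the nearest-analogue reading of cross-subject similarity, and the data is a toy batch in which every abnormal subject carries the same lesion feature at the same token. That is a worst case for batch-based scoring, not a model of real clinical data.

```python
import numpy as np

def cross_subject_scores(tokens: np.ndarray) -> np.ndarray:
    """Mean (1 - best cosine match) of each token against every other subject."""
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    n = unit.shape[0]
    scores = np.zeros(unit.shape[:2])
    for i in range(n):
        others = [1.0 - (unit[i] @ unit[j].T).max(axis=1) for j in range(n) if j != i]
        scores[i] = np.mean(others, axis=0)
    return scores

def lesion_vs_normal(n_subjects: int, n_abnormal: int,
                     noise: float = 0.01, seed: int = 0):
    """Return (mean lesion-token score, mean normal-token score) for a toy batch."""
    rng = np.random.default_rng(seed)
    lesion = np.zeros(16)
    lesion[15] = 1.0  # a direction absent from all normal prototypes
    tokens = np.eye(8, 16)[None] + noise * rng.normal(size=(n_subjects, 8, 16))
    is_lesion = np.zeros((n_subjects, 8), dtype=bool)
    # The first n_abnormal subjects all carry the same lesion at token 0.
    tokens[:n_abnormal, 0] = lesion + noise * rng.normal(size=(n_abnormal, 16))
    is_lesion[:n_abnormal, 0] = True
    scores = cross_subject_scores(tokens)
    return scores[is_lesion].mean(), scores[~is_lesion].mean()

# Sweep the number of abnormal subjects in a batch of 10.
for m in (1, 3, 5, 7, 9):
    lesion_score, normal_score = lesion_vs_normal(10, m)
    print(m, round(lesion_score, 3), round(normal_score, 3))
```

In this idealized setting the lesion score falls roughly as (n - m) / (n - 1) for m abnormal subjects out of n, because each lesion token finds a close analogue in every other abnormal subject. Real lesions vary in location and appearance, so the decline on clinical data could be much gentler. Measuring that decline is what the proposed experiment would do.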
Original abstract
Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing zero-shot anomaly detection methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. These are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, as well as validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.
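The abstract does not say how fine-resolution scoring avoids exhaustive matching. The sketch below shows one possible reading, offered purely as an assumption: match pooled (coarse) tokens across subjects first, then compare each fine token only against the fine tokens inside the matched coarse token. The function names, the one-dimensional depth layout, and the toy data are all hypothetical.

```python
import numpy as np

def unit(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def coarse_to_fine_scores(fine: np.ndarray, pool: int) -> np.ndarray:
    """fine: (n_subjects, n_fine, dim), n_fine divisible by pool -> (n_subjects, n_fine)."""
    n, n_fine, dim = fine.shape
    n_coarse = n_fine // pool
    grouped = fine.reshape(n, n_coarse, pool, dim)
    coarse_u = unit(grouped.mean(axis=2))  # pooled tokens, (n, n_coarse, dim)
    fine_u = unit(grouped)                 # fine tokens, (n, n_coarse, pool, dim)
    scores = np.zeros((n, n_coarse, pool))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Coarse stage: best-matching coarse token in subject j.
            best = (coarse_u[i] @ coarse_u[j].T).argmax(axis=1)  # (n_coarse,)
            # Fine stage: compare only with fine tokens inside that coarse token,
            # i.e. `pool` candidates per fine token instead of all n_fine.
            candidates = fine_u[j][best]                         # (n_coarse, pool, dim)
            sim = np.einsum("cpd,cqd->cpq", fine_u[i], candidates)
            scores[i] += 1.0 - sim.max(axis=2)
    return (scores / (n - 1)).reshape(n, n_fine)

# Toy batch (assumption): 8 fine tokens per subject, one replaced in subject 0.
rng = np.random.default_rng(0)
fine = np.eye(8, 16)[None] + 0.01 * rng.normal(size=(6, 8, 16))
fine[0, 2] = np.eye(16)[15]
fine_scores = coarse_to_fine_scores(fine, pool=4)
```

In this toy batch the pooled token that contains the replaced fine token still has a cosine similarity of about 0.75 to its counterpart in other subjects, so the coarse stage still finds the right region. The fine stage then scores the replaced token close to 1.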
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CS3F, a training-free batch-based framework for zero-shot anomaly detection (ZSAD) in 3D medical images that uses frozen 2D vision transformers. Volumes are decomposed along anatomical axes, encoded slice-wise, converted to localized volumetric tokens via pooling, and scored for anomalies using cross-subject mutual similarity (tokens without close analogues receive higher scores). A coarse-to-fine tokenization strategy is introduced to mitigate signal attenuation from depth pooling. The method is evaluated on brain MRI (metastases, glioma, stroke) and lung CT, with the central claims being that 2D foundation models can support 3D anomaly localization and that the benefit of fine tokenization depends on lesion contrast and imaging modality.
Significance. If the central claims hold under the stated conditions, the work would be significant for enabling practical ZSAD in 3D clinical volumes without task-specific training data or volumetric foundation models, addressing real-world heterogeneity in acquisition protocols. The training-free design, multi-axis decomposition, and batch-based similarity mechanism are notable strengths, as is the empirical observation that fine tokenization benefits are modality- and contrast-dependent. These elements could inform efficient adaptations of existing 2D models to 3D tasks.
major comments (1)
- [Evaluation] Evaluation on brain MRI and lung CT: the paper does not report controlled experiments that vary the anomalous fraction within batches or introduce intra-batch acquisition heterogeneity (e.g., differing protocols or scanner parameters). This is load-bearing for the central claim because anomaly scores derive from cross-subject mutual similarity, which presupposes batches dominated by normal cases acquired under comparable conditions; without such tests the robustness of the scoring mechanism remains unverified even though the abstract acknowledges heterogeneous clinical protocols.
minor comments (1)
- [Abstract] Abstract: quantitative metrics, baselines, and statistical details supporting the claims about fine tokenization benefits and overall performance are not provided, making it difficult to assess the strength of the reported results.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the single major comment below.
- Referee: [Evaluation] Evaluation on brain MRI and lung CT: the paper does not report controlled experiments that vary the anomalous fraction within batches or introduce intra-batch acquisition heterogeneity (e.g., differing protocols or scanner parameters). This is load-bearing for the central claim because anomaly scores derive from cross-subject mutual similarity, which presupposes batches dominated by normal cases acquired under comparable conditions; without such tests the robustness of the scoring mechanism remains unverified even though the abstract acknowledges heterogeneous clinical protocols.
Authors: We agree that the robustness of the cross-subject mutual similarity mechanism under controlled variations in anomalous fraction and intra-batch acquisition heterogeneity is central to validating the approach, particularly given the abstract's reference to heterogeneous protocols. Our evaluations on brain MRI (metastases, glioma, stroke) and lung CT already incorporate real-world clinical data with natural variations across subjects, scanners, and pathologies, and the batch-based design is intended to leverage predominantly normal cases. However, we did not include explicit ablation studies that systematically vary the anomalous fraction within batches or introduce synthetic intra-batch protocol differences. In the revised manuscript we will add these controlled experiments to directly test the scoring mechanism's sensitivity to these factors.
Revision: yes
Circularity Check
No circularity; method is explicitly assumption-driven and self-contained
The paper presents CS3F as a training-free heuristic that assigns anomaly scores to tokens lacking batch analogues, with the batch-normality precondition stated outright in the method description rather than derived. No equations, fitted parameters, or self-citations appear in the abstract or method outline that would reduce any claimed result to its own inputs by construction. The approach is evaluated on external datasets (brain MRI, lung CT) without invoking prior author work as a uniqueness theorem or smuggling an ansatz. This is the common case of an independent empirical proposal whose validity rests on the untested batch assumption rather than on logical circularity.