arxiv: 2606.18749 · v1 · pith:L5LNWVNQnew · submitted 2026-06-17 · 💻 cs.CV

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

Pith reviewed 2026-06-26 21:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot anomaly detection · 3D medical imaging · 2D foundation models · batch-based detection · training-free · volumetric tokenization · brain MRI · lung CT

The pith

Frozen 2D vision transformers detect anomalies in 3D medical volumes by scoring tokens that lack matches across a batch of subjects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CS3F, a training-free framework that decomposes 3D medical volumes into slices along multiple axes, encodes them with a frozen 2D vision transformer, and pools neighboring features into localized volumetric tokens. Anomaly scores are assigned based on cross-subject mutual similarity, where tokens without close analogues in other subjects receive higher scores. A coarse-to-fine tokenization step is added to limit signal loss from pooling when lesions are small. This matters for medical imaging because it sidesteps the lack of large 3D foundation models and the scarcity of annotated training data for rare or protocol-varying pathologies. Experiments cover brain MRI cases of metastases, glioma, and stroke plus lung CT to check broader applicability.

Core claim

By decomposing 3D volumes along anatomical axes and encoding slices with a frozen 2D vision transformer, the method creates localized volumetric tokens whose anomaly scores come from their lack of close analogues across other subjects in the batch. A coarse-to-fine tokenization strategy is added to preserve signals from focal lesions that would otherwise be diluted by pooling. This enables training-free zero-shot anomaly detection and localization in 3D medical images using only 2D foundation models, as demonstrated on brain MRI datasets for metastases, glioma, and stroke, and validated on lung CT.
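The tokenization step can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `volumetric_tokens`, the window size `depth`, the projection size `proj_dim`, and the second normalization after projection are all hypothetical choices, and the per-slice features are random stand-ins for the output of a frozen 2D vision transformer.

```python
import numpy as np

def volumetric_tokens(slice_feats, depth=4, proj_dim=64, seed=0):
    """Pool per-slice patch features into localized volumetric tokens.

    slice_feats: array (S, H, W, C) of patch features for S slices along one
                 axis (stand-in for frozen 2D ViT output, not computed here).
    depth:       neighboring slices averaged into one token (assumed value).
    proj_dim:    random projection output size (assumed value).
    Returns an array (S // depth, H, W, proj_dim) of unit-norm tokens.
    """
    S, H, W, C = slice_feats.shape
    n = S // depth  # trailing slices that do not fill a window are dropped
    # Average the same patch location over `depth` neighboring slices.
    x = slice_feats[: n * depth].reshape(n, depth, H, W, C).mean(axis=1)
    x = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)
    # A fixed seed gives every subject the same Gaussian projection matrix.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((C, proj_dim)) / np.sqrt(proj_dim)
    z = x @ proj
    # Re-normalizing after projection is an assumption of this sketch.
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-12)

# Toy input: 10 slices, a 3x3 patch grid, 32-dimensional features.
feats = np.random.default_rng(1).standard_normal((10, 3, 3, 32))
tokens = volumetric_tokens(feats, depth=4, proj_dim=8)
```

In a multi-axis setup the same routine would be applied once per slicing axis, with the projection matrix shared across subjects so that tokens remain comparable.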

What carries the argument

Cross-subject mutual similarity scoring on volumetric tokens created from multi-axis slice encoding by a 2D ViT with optional coarse-to-fine pooling.
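A simplified stand-in for the scoring idea is sketched below. It assumes a one-directional nearest-neighbor rule: each token's score is one minus its best cosine match in every other subject, averaged over subjects. The paper's exact mutual similarity formulation is not given on this page, so the function `cross_subject_scores` and the toy data are hypothetical.

```python
import numpy as np

def cross_subject_scores(tokens):
    """Score tokens by how poorly they match tokens of other subjects.

    tokens: (B, N, D) unit-norm tokens for B subjects with N tokens each.
    Returns (B, N) scores; higher means fewer close analogues in the batch.
    """
    B, N, _ = tokens.shape
    scores = np.zeros((B, N))
    for i in range(B):
        best = []
        for j in range(B):
            if j == i:
                continue
            sim = tokens[i] @ tokens[j].T   # (N, N) cosine similarities
            best.append(sim.max(axis=1))    # closest analogue in subject j
        scores[i] = 1.0 - np.mean(best, axis=0)
    return scores

# Toy batch: 4 subjects share 5 near-identical tokens; one token is replaced.
rng = np.random.default_rng(0)
base = np.eye(8)[:5]
batch = base[None] + 0.01 * rng.standard_normal((4, 5, 8))
batch[0, 2] = np.eye(8)[7]                  # synthetic "lesion" token
batch /= np.linalg.norm(batch, axis=-1, keepdims=True)
scores = cross_subject_scores(batch)
```

In this toy batch the replaced token of subject 0 has no analogue elsewhere and receives the highest score. The matching token position in the other subjects is also scored somewhat higher, because one of their three references no longer contains a match.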

If this is right

  • Anomaly localization in 3D volumes is possible without volumetric foundation models or supervised training.
  • The benefit of fine-resolution tokenization varies with lesion contrast and imaging modality.
  • The approach generalizes from atlas-aligned brain MRI to lung CT.
  • Focal lesion signals are better preserved by switching to coarse-to-fine tokenization instead of uniform depth pooling.
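The coarse-to-fine routing described in the Figure 3 caption can be sketched as follows: each query coarse token selects its top-L nearest coarse tokens in a reference volume, and fine matching is restricted to the fine tokens inside those regions. The function name `routed_fine_similarity`, the layout of M fine tokens per coarse region, and building coarse tokens as the mean of their fine tokens are assumptions made for illustration.

```python
import numpy as np

def routed_fine_similarity(q_coarse, q_fine, r_coarse, r_fine, top_l=2):
    """Best fine-token match against one reference volume, with routing.

    q_coarse, r_coarse: (K, D) unit-norm coarse tokens.
    q_fine, r_fine:     (K, M, D) unit-norm fine tokens, M per coarse region.
    Returns (K, M): max cosine similarity found inside the routed regions.
    """
    coarse_sim = q_coarse @ r_coarse.T                    # (K, K)
    routes = np.argsort(-coarse_sim, axis=1)[:, :top_l]   # top-L regions per query region
    K, M, D = q_fine.shape
    out = np.empty((K, M))
    for k in range(K):
        cand = r_fine[routes[k]].reshape(-1, D)           # candidate fine tokens only
        out[k] = (q_fine[k] @ cand.T).max(axis=1)
    return out

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
K, M, D = 6, 4, 16
q_fine = unit(rng.standard_normal((K, M, D)))
r_fine = unit(rng.standard_normal((K, M, D)))
q_coarse = unit(q_fine.mean(axis=1))   # assumed: coarse token = mean of its fine tokens
r_coarse = unit(r_fine.mean(axis=1))

routed = routed_fine_similarity(q_coarse, q_fine, r_coarse, r_fine, top_l=2)
exhaustive = routed_fine_similarity(q_coarse, q_fine, r_coarse, r_fine, top_l=K)
```

With `top_l=K` the routed search reduces to exhaustive matching, so the routed result can never exceed the exhaustive one. Each query fine token is compared with `top_l * M` candidates instead of `K * M`.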

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The batch-matching principle could transfer to other 3D data types processed by 2D models, such as video sequences.
  • Practical use would require reliable ways to form batches of mostly normal scans from clinical archives.
  • Combining the token comparison with simple preprocessing steps tuned to each modality might reduce sensitivity to acquisition differences.

Load-bearing premise

The batch consists primarily of normal cases with comparable acquisition conditions so that unmatched tokens indicate anomalies rather than normal variation.

What would settle it

Apply the method to batches that contain many abnormal cases and check whether known lesions still receive distinctly higher anomaly scores than normal tissue.
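A synthetic version of this test is sketched below. It assumes the simplified nearest-neighbor score (one minus the mean best cosine match in other subjects) and a worst case in which every abnormal subject carries the same lesion token at the same location. The functions `nn_scores` and `lesion_gap` and all parameter values are hypothetical, and the data are simulated rather than taken from the paper.

```python
import numpy as np

def nn_scores(tokens):
    """(B, N, D) unit-norm tokens -> (B, N) scores in [0, 2]."""
    B = tokens.shape[0]
    # sims[i, j, n] = best cosine match of token n of subject i within subject j.
    sims = np.einsum("ind,jmd->ijnm", tokens, tokens).max(axis=3)
    mask = ~np.eye(B, dtype=bool)          # exclude self-matches
    return np.stack([1.0 - sims[i, mask[i]].mean(axis=0) for i in range(B)])

def lesion_gap(frac_abnormal, batch=10, n_tokens=6, dim=16, seed=0):
    """Mean lesion score minus mean normal score at a given abnormal fraction."""
    rng = np.random.default_rng(seed)
    base = np.eye(dim)[:n_tokens]          # shared "normal anatomy" tokens
    lesion = np.eye(dim)[dim - 1]          # identical lesion token (worst case)
    x = base[None] + 0.01 * rng.standard_normal((batch, n_tokens, dim))
    n_abn = int(round(frac_abnormal * batch))
    x[:n_abn, 0] = lesion + 0.01 * rng.standard_normal((n_abn, dim))
    x /= np.linalg.norm(x, axis=-1, keepdims=True)
    s = nn_scores(x)
    return s[:n_abn, 0].mean() - s[n_abn:, 1:].mean()

# Requires 0 < frac_abnormal < 1 so that both groups are non-empty.
gaps = [lesion_gap(f) for f in (0.1, 0.5, 0.9)]
```

In this simulation the gap between lesion and normal scores shrinks as the abnormal fraction grows, because lesion tokens start finding analogues in other abnormal subjects. Real lesions vary in location and appearance, so the decline on clinical data may be slower than in this worst case.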

Figures

Figures reproduced from arXiv: 2606.18749 by Tai Le-Gia.

Figure 1
Figure 1: Overview of CS3F. Each volume is processed independently along the sagittal, coronal, and axial axes. Slices are encoded by a frozen 2D foundation model and aggregated into volumetric tokens via depth pooling and random projection. The tokens are scored using cross-subject mutual similarity scoring. Axis-specific anomaly maps are fused to obtain the final voxel-level anomaly map.
Figure 2
Figure 2: Illustration of the axial volumetric tokenization pipeline. A 3D volume is decomposed into axial slices and encoded by a frozen vision transformer (DINOv2) to extract patch-level features. Patch features corresponding to the same spatial location across neighboring slices are aggregated via average pooling and ℓ2-normalized to form volumetric tokens. These tokens are subsequently projected into a compact …
Figure 3
Figure 3: Illustration of the coarse-to-fine routing mechanism. A coarse token from the query volume first identifies its top-𝐿 nearest coarse tokens in each reference volume. Fine-scale matching is then performed only within the fine tokens belonging to these selected coarse regions, avoiding exhaustive search over the entire reference volume.
Figure 4
Figure 4: Effect of the random projection dimension 𝑑. Shaded regions denote sample standard deviation across 5 random seeds. Le Gia: Preprint submitted to Elsevier.
Figure 5
Figure 5: Qualitative anomaly segmentation results on BraTS-METS using T1w and T2w inputs. The same subjects and axial slices are shown for both modalities to enable direct comparison of modality-dependent segmentation behavior. Red contours indicate the boundaries of the predicted anomaly masks at maximum Dice operating point, overlaid on the original images.
Figure 6
Figure 6: Qualitative anomaly segmentation results on T1w inputs from ATLAS. Red contours indicate the boundaries of the predicted anomaly masks at maximum Dice operating point, overlaid on the original images. The examples illustrate stroke lesion localization across subjects with different lesion extent and appearance. Panel labels: GT, DAE, pDDPM, APRIL-GAN, AnomalyCLIP, CS3F-F, CS3F-C, CS3F-MS; Subject 1, Subject 2, Subject 3.
Figure 7
Figure 7: Qualitative anomaly segmentation results on T2w inputs from BraTS-GLI. Red contours indicate the boundaries of the predicted anomaly masks at maximum Dice operating point, overlaid on the original images. The examples illustrate glioma localization across subjects with heterogeneous tumor appearance and spatial extent.
Figure 8
Figure 8: Examples of 3D anomaly segmentation maps generated by CS3F-C.

Original abstract

Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing zero-shot anomaly detection methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. These are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, as well as validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes CS3F, a training-free batch-based framework for zero-shot anomaly detection (ZSAD) in 3D medical images that uses frozen 2D vision transformers. Volumes are decomposed along anatomical axes, encoded slice-wise, converted to localized volumetric tokens via pooling, and scored for anomalies using cross-subject mutual similarity (tokens without close analogues receive higher scores). A coarse-to-fine tokenization strategy is introduced to mitigate signal attenuation from depth pooling. The method is evaluated on brain MRI (metastases, glioma, stroke) and lung CT, with the central claims being that 2D foundation models can support 3D anomaly localization and that the benefit of fine tokenization depends on lesion contrast and imaging modality.

Significance. If the central claims hold under the stated conditions, the work would be significant for enabling practical ZSAD in 3D clinical volumes without task-specific training data or volumetric foundation models, addressing real-world heterogeneity in acquisition protocols. The training-free design, multi-axis decomposition, and batch-based similarity mechanism are notable strengths, as is the empirical observation that fine tokenization benefits are modality- and contrast-dependent. These elements could inform efficient adaptations of existing 2D models to 3D tasks.

major comments (1)
  1. [Evaluation] Evaluation on brain MRI and lung CT: the paper does not report controlled experiments that vary the anomalous fraction within batches or introduce intra-batch acquisition heterogeneity (e.g., differing protocols or scanner parameters). This is load-bearing for the central claim because anomaly scores derive from cross-subject mutual similarity, which presupposes batches dominated by normal cases acquired under comparable conditions; without such tests the robustness of the scoring mechanism remains unverified even though the abstract acknowledges heterogeneous clinical protocols.
minor comments (1)
  1. [Abstract] Abstract: quantitative metrics, baselines, and statistical details supporting the claims about fine tokenization benefits and overall performance are not provided, making it difficult to assess the strength of the reported results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment below.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation on brain MRI and lung CT: the paper does not report controlled experiments that vary the anomalous fraction within batches or introduce intra-batch acquisition heterogeneity (e.g., differing protocols or scanner parameters). This is load-bearing for the central claim because anomaly scores derive from cross-subject mutual similarity, which presupposes batches dominated by normal cases acquired under comparable conditions; without such tests the robustness of the scoring mechanism remains unverified even though the abstract acknowledges heterogeneous clinical protocols.

    Authors: We agree that the robustness of the cross-subject mutual similarity mechanism under controlled variations in anomalous fraction and intra-batch acquisition heterogeneity is central to validating the approach, particularly given the abstract's reference to heterogeneous protocols. Our evaluations on brain MRI (metastases, glioma, stroke) and lung CT already incorporate real-world clinical data with natural variations across subjects, scanners, and pathologies, and the batch-based design is intended to leverage predominantly normal cases. However, we did not include explicit ablation studies that systematically vary the anomalous fraction within batches or introduce synthetic intra-batch protocol differences. In the revised manuscript we will add these controlled experiments to directly test the scoring mechanism's sensitivity to these factors.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; method is explicitly assumption-driven and self-contained

Full rationale

The paper presents CS3F as a training-free heuristic that assigns anomaly scores to tokens lacking batch analogues, with the batch-normality precondition stated outright in the method description rather than derived. No equations, fitted parameters, or self-citations appear in the abstract or method outline that would reduce any claimed result to its own inputs by construction. The approach is evaluated on external datasets (brain MRI, lung CT) without invoking prior author work as a uniqueness theorem or smuggling an ansatz. This is the common case of an independent empirical proposal whose validity rests on the untested batch assumption rather than on logical circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; all components appear drawn from existing 2D foundation models and standard pooling techniques.

pith-pipeline@v0.9.1-grok · 5784 in / 1030 out tokens · 21126 ms · 2026-06-26T21:19:42.848242+00:00 · methodology

