When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

Anju Chhetri; Binod Bhattarai; Prashnna Gyawali; Pratik Shrestha; Ramesh Rana; Sam Philip

arxiv: 2606.16196 · v2 · pith:5IULS3CCnew · submitted 2026-06-15 · 💻 cs.LG · cs.CV

Pith reviewed 2026-06-27 04:10 UTC · model grok-4.3

keywords OOD detection · sparse autoencoders · concept vectors · representation perturbations · medical imaging · interpretability · distributional shift

The pith

Out-of-distribution samples in medical imaging are detected by measuring how much class logits shift when representations are perturbed along sparse autoencoder concept vectors tied to the predicted class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an OOD detection method that learns class-specific concept vectors from in-distribution medical images using sparse autoencoders. At test time it applies perturbations to deeper-layer representations along the vectors for the model's predicted class and tracks stability in the output logits. The central hypothesis is that in-distribution inputs produce little change because their features align with the learned semantic directions, while OOD inputs produce larger shifts due to misalignment. This supplies both a detection score and an interpretable view of the internal reasons for uncertainty.

Core claim

Leveraging sparse autoencoders to extract class-specific concept vectors from in-distribution data, the framework perturbs intermediate representations using the vectors of the predicted class and quantifies the resulting change in class logits. In-distribution samples are expected to remain stable under these perturbations because their representations align with the semantic directions, whereas OOD samples exhibit amplified logit deviations from representational misalignment. The approach therefore treats OOD detection as a concept-conditioned stability analysis.

What carries the argument

Class-conditioned semantic perturbations of deeper-layer representations along sparse-autoencoder-derived concept vectors, followed by measurement of class-logit stability.
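
The mechanics of this stability test can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the `head` function is a toy stand-in for the layers above the perturbed representation, the L2 norm of the logit difference is an assumed choice of score, and the names `stability_score`, `concept_vectors`, and `beta` are hypothetical (the figure captions mention a scaling parameter β, but its exact role is not spelled out on this page).

```python
import math

def head(h, W, b):
    """Toy stand-in for the layers above the perturbed representation."""
    squashed = [math.tanh(x) for x in h]  # nonlinearity, so the shift depends on h
    return [sum(w * s for w, s in zip(row, squashed)) + b_i for row, b_i in zip(W, b)]

def stability_score(h, concept_vectors, W, b, beta=1.0):
    """Return (predicted class, L2 norm of the logit shift under the perturbation)."""
    logits = head(h, W, b)
    pred = max(range(len(logits)), key=lambda c: logits[c])
    v_c = concept_vectors[pred]  # concept vector of the predicted class
    h_pert = [h_i + beta * v_i for h_i, v_i in zip(h, v_c)]
    shifted = head(h_pert, W, b)
    shift = math.sqrt(sum((a - o) ** 2 for a, o in zip(shifted, logits)))
    return pred, shift

# Hand-built toy setup (an assumption for illustration only):
# two classes, identity readout, one axis-aligned concept per class.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
concept_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}

h_aligned = [3.0, 0.0]     # lies far along the class-0 concept direction
h_misaligned = [0.2, 0.1]  # weakly expressed, ambiguous representation

print(stability_score(h_aligned, concept_vectors, W, b))
print(stability_score(h_misaligned, concept_vectors, W, b))
```

In this toy the aligned input sits in the saturated regime of the nonlinearity, so the same nudge moves its logits far less than it moves those of the weakly expressed input. The example is constructed to show that behaviour; it illustrates the procedure and is not evidence that real networks behave this way.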

If this is right

The method yields an explicit semantic explanation for each detection decision by identifying which concept directions cause instability.
It supplies a stability-based signal that can be inspected layer by layer rather than relying on opaque internal statistics.
The same perturbation procedure can be applied at multiple depths to localize where misalignment first appears.
Because the concept vectors are learned per class, the detector naturally adapts to the model's own prediction without requiring separate OOD training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the alignment hypothesis holds, the same SAE concept vectors might serve as a diagnostic tool for identifying which semantic features a model has failed to learn on particular inputs.
The approach could be tested on non-medical vision tasks by swapping the SAE training corpus, revealing whether the stability property is domain-specific or general.
Combining the stability score with existing logit-based OOD methods might produce a hybrid detector whose failures are easier to debug.

Load-bearing premise

In-distribution samples exhibit low sensitivity to perturbations along class-specific concept vectors because their representations align with those semantic directions, whereas OOD samples show amplified deviations due to representational misalignment.

What would settle it

Finding a dataset where in-distribution medical images produce large logit changes under the class-specific concept perturbations, or where known OOD images produce small changes, would falsify the core hypothesis.
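
One way to make this falsification test concrete is to compute the AUROC of the stability score, treating OOD as the positive class. The sketch below assumes per-sample scores have already been computed; the function name `auroc` and the example numbers are hypothetical. A value near 0.5 (or below) on a held-out dataset would count against the hypothesis.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score (ties count half)."""
    wins = 0.0
    for s_ood in ood_scores:
        for s_id in id_scores:
            if s_ood > s_id:
                wins += 1.0
            elif s_ood == s_id:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Made-up scores for illustration: larger means less stable under perturbation.
print(auroc([0.05, 0.10, 0.20], [0.15, 0.60, 0.90]))
```

The pairwise form is quadratic in the number of samples, which is fine for a sanity check; a rank-based implementation would be preferable for large evaluation sets.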

Figures

Figures reproduced from arXiv: 2606.16196 by Anju Chhetri, Binod Bhattarai, Prashnna Gyawali, Pratik Shrestha, Ramesh Rana, Sam Philip.

**Figure 1.** (a) SAE Training: Intermediate ViT activations are used to train an SAE; sparse coefficients are aggregated per class to select top-k decoder columns, forming a class concept dictionary. (b) Inference: For a test sample x̄, we first obtain the predicted class from the base model and select the corresponding concept set. These concepts are aggregated into a class-specific concept vector v_c, which is injecte…

**Figure 2.** (a) Cosine similarity between the class concept vector v_c and sample activations for ID and OOD data in the histopathology dataset (similar trends are observed for Kvasir and OCT). (b) Robustness analysis with respect to the number of concept vectors k and the scaling parameter β across Histopathology, OCT, and Kvasir datasets.

**Figure 3.** Clinical Grounding and OOD Contrast of Sparse Concept Latents. Each row pairs a base-model predicted class with an SAE latent (concept). Left (green): ID clusters showing the four images with the highest latent activations. Right (red): An OOD sample with low activation. While all images in a row share the same predicted class, green and red boxes provide expert clinical descriptions for the ID cluster and…
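
The captions for Figures 1(a) and 2(a) describe two steps that are easy to sketch: selecting the top-k decoder columns for a class, and comparing the resulting concept vector with a sample's activations by cosine similarity. The code below is a hedged illustration only. Averaging the sparse coefficients per class and averaging the selected decoder columns are assumed aggregation rules (the captions say "aggregated" without giving the operator), and `class_concept_vector` and `cosine` are hypothetical names.

```python
import math

def class_concept_vector(codes, labels, decoder, target_class, k=2):
    """Average the k decoder directions that are most active, on average, for one class.

    codes:   per-sample sparse SAE coefficients, shape [n_samples][n_latents]
    labels:  class label per sample
    decoder: one direction per latent, shape [n_latents][d] (decoder columns stored as rows)
    """
    rows = [z for z, y in zip(codes, labels) if y == target_class]
    n_latents = len(decoder)
    mean_act = [sum(z[j] for z in rows) / len(rows) for j in range(n_latents)]
    top_k = sorted(range(n_latents), key=lambda j: mean_act[j], reverse=True)[:k]
    d = len(decoder[0])
    v_c = [sum(decoder[j][i] for j in top_k) / k for i in range(d)]
    return top_k, v_c

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Tiny made-up example: three latents with axis-aligned decoder directions.
decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
codes = [[2.0, 1.0, 0.0], [3.0, 0.5, 0.0], [0.0, 0.0, 4.0]]
labels = [0, 0, 1]

top_k, v_c = class_concept_vector(codes, labels, decoder, target_class=0, k=2)
print(top_k, v_c)
print(cosine(v_c, [1.0, 1.0, 0.0]), cosine(v_c, [0.0, 0.0, 1.0]))
```

Under these assumptions, an activation that shares the class's dominant latents has high cosine similarity with v_c, while one that lives in unrelated latents has similarity near zero, which is the contrast Figure 2(a) reports for ID versus OOD data.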

Abstract

Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distributional shifts poses a major obstacle to safe clinical deployment. Out-of-Distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals with poorly understood semantic meaning, limiting trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model's predicted class and measure the stability of the class logits. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept-conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high-stakes medical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes using class-specific SAE perturbations for interpretable OOD detection in medical imaging but states the stability hypothesis without derivation or any results to back it.

The main takeaway is that this work frames OOD detection as a stability test under perturbations along SAE-learned concept vectors tied to the predicted class, but the abstract supplies no experiments, ablations, or geometric argument for why those directions should produce the claimed ID/OOD difference.

The new element is the specific combination of class-conditioned SAE vectors with targeted representation perturbation to generate both a detection signal and a claimed semantic explanation. That is distinct from standard logit- or feature-based OOD baselines that do not explicitly tie the test to disentangled concepts.

The intent is reasonable for medical imaging, where opaque uncertainty signals are a known barrier. Linking the perturbation to interpretable directions is a logical step if the underlying assumption holds.

The soft spot is exactly the one flagged in the stress-test: the hypothesis that ID samples stay stable because their representations align with the SAE directions, while OOD samples deviate, is asserted without any supporting analysis. There is no derivation showing why these particular sparse vectors (as opposed to random directions of similar scale or other bases) must expose misalignment. The circularity burden is real here: the detection rule is defined in terms of the untested sensitivity difference. With no quantitative validation supplied, it is impossible to assess whether the signal is discriminative or merely another opaque perturbation test.

This paper would interest a reading group working on interpretable safety mechanisms for clinical models, but only as an idea to discuss rather than a finished method. A reader looking for validated techniques would get little usable content.

I would not send it for peer review in this form. It needs at least a clear justification of the perturbation choice and some empirical checks before it merits referee time.

Referee Report

2 major / 1 minor

Summary. The paper proposes an interpretable OOD detection framework for deep neural networks in medical imaging tasks. It learns class-specific concept vectors via sparse autoencoders (SAEs) trained on in-distribution data, then perturbs deeper-layer activations along the vectors corresponding to the model's predicted class and measures resulting changes in class logits. The central hypothesis is that in-distribution samples exhibit low sensitivity to these perturbations due to representational alignment with the semantic directions, whereas OOD samples exhibit amplified deviations due to misalignment; this is positioned as supplying both a discriminative detection signal and an interpretable view of uncertainty.

Significance. If the hypothesis were independently validated with supporting experiments and justification, the method could provide a meaningful advance in semantically grounded OOD detection, offering interpretability that is especially relevant for safety-critical medical applications where existing black-box signals limit trust.

major comments (2)

[Abstract] The OOD detection rule is defined directly in terms of the hypothesized difference in perturbation sensitivity between ID and OOD samples along class-specific SAE concept vectors, without any independent grounding, derivation, or geometric analysis showing why these directions (as opposed to random directions or other sparse bases of equal magnitude) must expose misalignment. This renders the claimed discriminative signal and interpretability circular with respect to the central assumption.
[Abstract] The manuscript states the method and hypothesis but supplies no experimental results, ablation studies, quantitative OOD detection metrics, or comparisons against baselines, so there is no evidence to assess whether the proposed stability measure actually separates ID from OOD inputs or yields interpretable insights.

minor comments (1)

[Abstract] The description of the inference procedure is high-level; concrete details on how perturbation magnitude is selected relative to the SAE sparsity level and how the class-specific vectors are extracted would aid reproducibility.
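
The random-direction control raised in the first major comment could be set up roughly as follows. This is a sketch under stated assumptions: `score_fn` is a hypothetical callable that takes a representation and a perturbation direction and returns a stability score, and drawing an isotropic Gaussian direction rescaled to the norm of the concept vector is one reasonable way, among others, to match magnitude.

```python
import math
import random

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def random_direction_like(v_c, rng):
    """Draw an isotropic Gaussian direction and rescale it to the norm of v_c."""
    g = [rng.gauss(0.0, 1.0) for _ in v_c]
    scale = norm(v_c) / norm(g)
    return [scale * x for x in g]

def control_gap(score_fn, h, v_c, n_draws=100, seed=0):
    """Concept-direction score minus the mean score over matched random directions."""
    rng = random.Random(seed)
    random_scores = [score_fn(h, random_direction_like(v_c, rng)) for _ in range(n_draws)]
    return score_fn(h, v_c) - sum(random_scores) / n_draws

# Sanity check with a score that only depends on perturbation size:
# the gap should be (numerically) zero, since magnitudes are matched.
print(control_gap(lambda h, v: norm(v), h=[0.0, 0.0], v_c=[3.0, 4.0]))
```

If the concept directions carry real signal, the distribution of `control_gap` over ID inputs should differ from its distribution over OOD inputs; if the two are indistinguishable, the detector is responding to perturbation magnitude rather than to the learned concepts.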

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and for highlighting key issues in the presentation of our work. We address each major comment below and commit to revisions that strengthen the manuscript.

Referee: [Abstract] The OOD detection rule is defined directly in terms of the hypothesized difference in perturbation sensitivity between ID and OOD samples along class-specific SAE concept vectors, without any independent grounding, derivation, or geometric analysis showing why these directions (as opposed to random directions or other sparse bases of equal magnitude) must expose misalignment. This renders the claimed discriminative signal and interpretability circular with respect to the central assumption.

Authors: We agree that the abstract, owing to length constraints, states the hypothesis without a self-contained geometric derivation. The choice of class-conditioned SAE vectors is motivated by their training objective on ID data, which yields sparse directions aligned with class semantics; perturbations along these directions are intended to probe representational alignment. We will add a new subsection in the revised manuscript that supplies geometric motivation, including why random directions of comparable magnitude are not expected to produce the same differential sensitivity, thereby reducing any appearance of circularity. revision: yes

Referee: [Abstract] The manuscript states the method and hypothesis but supplies no experimental results, ablation studies, quantitative OOD detection metrics, or comparisons against baselines, so there is no evidence to assess whether the proposed stability measure actually separates ID from OOD inputs or yields interpretable insights.

Authors: The referee correctly notes that the submitted manuscript presents the framework and hypothesis but does not yet contain empirical results. We will revise the paper to include a full experimental section with quantitative OOD detection metrics (AUROC, AUPR), ablation studies on SAE hyperparameters and perturbation strength, and comparisons to standard baselines, all evaluated on medical imaging datasets. This will directly address the need for evidence supporting the discriminative power and interpretability claims. revision: yes
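
Of the metrics promised in the rebuttal, AUPR is the one most sensitive to class imbalance between ID and OOD samples. A minimal sketch of the step-wise average-precision form is given below, with OOD treated as the positive class; the function name `average_precision` and the example scores are hypothetical, and tied scores are ordered arbitrarily (here, OOD first), which a careful evaluation would need to handle explicitly.

```python
def average_precision(id_scores, ood_scores):
    """Step-wise area under the precision-recall curve, with OOD as the positive class."""
    ranked = sorted(
        [(s, 1) for s in ood_scores] + [(s, 0) for s in id_scores],
        key=lambda pair: pair[0],
        reverse=True,  # highest instability score first
    )
    hits, total = 0, 0.0
    for rank, (_, is_ood) in enumerate(ranked, start=1):
        if is_ood:
            hits += 1
            total += hits / rank  # precision at the rank where this OOD sample is recalled
    return total / len(ood_scores)

# Made-up scores for illustration: one ID sample outranks one OOD sample.
print(average_precision([0.9], [0.1, 0.95]))
```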

Circularity Check

1 step flagged

Differential sensitivity hypothesis (ID stable, OOD unstable under class-SAE perturbations) lacks derivation or geometric justification

specific steps

self-definitional [Abstract]
"We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept conditioned stability analysis, our approach provides both a discriminative OOD signal..."

The detection rule measures exactly the sensitivity difference asserted in the hypothesis; the method therefore reduces to testing the assumption by construction rather than deriving an independent signal from first principles or external validation.

full rationale

The paper's OOD detection procedure is constructed directly around the stated hypothesis that ID samples are stable and OOD samples are unstable under perturbations along class-specific SAE concept vectors. No independent geometric or derivation step is supplied to justify why these directions (as opposed to other bases) produce the claimed differential effect; the discriminative signal is therefore defined in terms of the assumption it is intended to test.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The proposal rests on the domain assumption that SAEs produce semantically meaningful class-specific directions and that perturbation sensitivity directly tracks distributional shift; no free parameters are numerically specified in the abstract, and the concept vectors are the main invented entity without external validation.

free parameters (2)

SAE sparsity level
Controls how many active dimensions per concept vector; value not stated in abstract but required to produce the disentangled directions used for perturbation.
perturbation magnitude
Determines the size of the nudge applied along each concept vector; value not stated but directly affects the stability measurement.

axioms (1)

domain assumption: Intermediate representations of a trained DNN can be linearly decomposed into sparse, class-specific semantic directions via SAE training on in-distribution data.
Invoked when the method learns concept vectors and uses them to perturb deeper-layer activations.
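
For readers who want the axiom in symbols, one common way to write a sparse autoencoder and the perturbation built on it is given below. This is an assumed formulation: the page does not state which SAE variant the authors use, so the ReLU encoder, the L1 penalty, and the symbols W_enc, W_dec, b_enc, b_dec, and λ are illustrative. Only v_c, k, and β are named in the figure captions, and the set T_c of top-k latent indices for class c is hypothetical notation.

```latex
\begin{aligned}
z &= \mathrm{ReLU}\!\left(W_{\mathrm{enc}}\, h + b_{\mathrm{enc}}\right), \\
\hat{h} &= W_{\mathrm{dec}}\, z + b_{\mathrm{dec}}, \\
\mathcal{L}(h) &= \lVert h - \hat{h} \rVert_2^2 + \lambda \lVert z \rVert_1, \\
v_c &= \frac{1}{k} \sum_{j \in T_c} W_{\mathrm{dec}}[:, j], \qquad
\tilde{h} = h + \beta\, v_c .
\end{aligned}
```

The axiom amounts to the claim that the columns of W_dec selected for class c are semantically meaningful directions, so that the perturbed representation h̃ probes alignment with class semantics rather than an arbitrary direction.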

invented entities (1)

class-specific concept vectors (no independent evidence)
purpose: Provide semantic directions for class-conditioned representation perturbation.
New entities extracted by SAEs; no independent evidence of their semantic validity is supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5758 in / 1568 out tokens · 64744 ms · 2026-06-27T04:10:39.550855+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 2 canonical work pages

[1]

arXiv preprint arXiv:2310.06823 (2023)

Ammar, M.B., Belkhir, N., Popescu, S., Manzanera, A., Franchi, G.: Neco: Neu- ral collapse based out-of-distribution detection. arXiv preprint arXiv:2310.06823 (2023)

arXiv 2023
[2]

arXiv preprint arXiv:2505.20063 (2025)

Arad, D., Mueller, A., Belinkov, Y.: Saes are good for steering–if you select the right features. arXiv preprint arXiv:2505.20063 (2025)

arXiv 2025
[3]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quan- tifying interpretability of deep visual representations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6541–6549 (2017)

2017
[4]

arXiv preprint arXiv:2404.14082 (2024)

Bereska, L., Gavves, E.: Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082 (2024)

Pith/arXiv arXiv 2024
[5]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Chhetri, A., Korhonen, J., Gyawali, P., Bhattarai, B.: Nero: Explainable out-of- distribution detection with neuron-level relevance in gastrointestinal imaging. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 349–359. Springer (2025)

2025
[6]

In: International conference on ma- chine learning

Choi, J., Raghuram, J., Feng, R., Chen, J., Jha, S., Prakash, A.: Concept-based explanations for out-of-distribution detectors. In: International conference on ma- chine learning. pp. 5817–5837. PMLR (2023)

2023
[7]

arXiv preprint arXiv:2010.11929 (2020)

Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recogni- tion at scale. arXiv preprint arXiv:2010.11929 (2020)

Pith/arXiv arXiv 2010
[8]

arXiv preprint arXiv:2209.10652 (2022)

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield- Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al.: Toy models of superposition. arXiv preprint arXiv:2209.10652 (2022)

Pith/arXiv arXiv 2022
[9]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

2016
[10]

arXiv preprint arXiv:1911.11132 (2019)

Hendrycks, D., Basart, S., Mazeika, M., Zou, A., Kwon, J., Mostajabi, M., Stein- hardt, J., Song, D.: Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132 (2019)

arXiv 1911
[11]

arXiv preprint arXiv:1610.02136 (2016)

Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of- distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016)

Pith/arXiv arXiv 2016
[12]

Medical Image Analysis97, 103289 (2024).https://doi.org/https: //doi.org/10.1016/j.media.2024.103289,https://www.sciencedirect

Hua, S., Yan, F., Shen, T., Ma, L., Zhang, X.: Pathoduet: Founda- tion models for pathological slide analysis of h and e and ihc stains. Medical Image Analysis97, 103289 (2024).https://doi.org/https: //doi.org/10.1016/j.media.2024.103289,https://www.sciencedirect. com/science/article/pii/S1361841524002147

work page doi:10.1016/j.media.2024.103289 2024
[13]

Advances in Neural Information Processing Systems34, 677–689 (2021) 10 A

Huang, R., Geng, A., Li, Y.: On the importance of gradients for detecting distribu- tional shifts in the wild. Advances in Neural Information Processing Systems34, 677–689 (2021) 10 A. Chhetri et al

2021
[14]

Gastroenterology (2025)

Jong, M.R., Boers, T.G., Fockens, K.N., Jukema, J.B., Kusters, C.H., Jaspers, T.J., van Heslinga, R.v.E., Slooter, F.C., Struyvenberg, M.R., Bisschops, R., et al.: Gastronet-5m: A multicenter dataset for developing foundation models in gastroin- testinal endoscopy. Gastroenterology (2025)

2025
[15]

arXiv preprint arXiv:2504.19475 (2025)

Joseph, S., Suresh, P., Hufe, L., Stevinson, E., Graham, R., Vadi, Y., Bzdok, D., Lapuschkin, S., Sharkey, L., Richards, B.A.: Prisma: An open source toolkit for mechanistic interpretability in vision and video. arXiv preprint arXiv:2504.19475 (2025)

arXiv 2025
[16]

(No Title) (2018)

Kather, J.N., Halama, N., Marx, A.: 100,000 histological images of human colorec- tal cancer and healthy tissue. (No Title) (2018)

2018
[17]

cell172(5), 1122–1131 (2018)

Kermany, D.S., Goldbaum, M., Cai, W., Valentim, C.C., Liang, H., Baxter, S.L., McKeown, A., Yang, G., Wu, X., Yan, F., et al.: Identifying medical diagnoses and treatable diseases by image-based deep learning. cell172(5), 1122–1131 (2018)

2018
[18]

Advances in neural information processing systems25 (2012)

Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con- volutional neural networks. Advances in neural information processing systems25 (2012)

2012
[19]

Advances in neural information processing systems31(2018)

Lee, K., Lee, K., Lee, H., Shin, J.: A simple unified framework for detecting out- of-distribution samples and adversarial attacks. Advances in neural information processing systems31(2018)

2018
[20]

arXiv preprint arXiv:1706.02690 (2017)

Liang, S., Li, Y., Srikant, R.: Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690 (2017)

arXiv 2017
[21]

Advances in neural information processing systems33, 21464–21475 (2020)

Liu, W., Wang, X., Owens, J., Li, Y.: Energy-based out-of-distribution detection. Advances in neural information processing systems33, 21464–21475 (2020)

2020
[22]

In: Interna- tional Conference on Artificial Intelligence and Statistics

Morningstar, W., Ham, C., Gallagher, A., Lakshminarayanan, B., Alemi, A., Dil- lon, J.: Density of states estimation for out of distribution detection. In: Interna- tional Conference on Artificial Intelligence and Statistics. pp. 3232–3240. PMLR (2021)

2021
[23]

arXiv preprint arXiv:2311.03658 (2023)

Park, K., Choe, Y.J., Veitch, V.: The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658 (2023)

Pith/arXiv arXiv 2023
[24]

Proceedings of the 8th ACM on Multimedia Systems Conference , pages =

Pogorelov, K., Randel, K.R., Griwodz, C., Eskeland, S.L., de Lange, T., Johansen, D., Spampinato, C., Dang-Nguyen, D.T., Lux, M., Schmidt, P.T., Riegler, M., Halvorsen, P.: Kvasir: A multi-class image dataset for computer aided gastroin- testinal disease detection. In: Proceedings of the 8th ACM on Multimedia Sys- tems Conference. pp. 164–169. MMSys’17, A...

work page doi:10.1145/3083187.3083212 2017
[25]

Ad- vances in neural information processing systems30(2017)

Raghu, M., Gilmer, J., Yosinski, J., Sohl-Dickstein, J.: Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Ad- vances in neural information processing systems30(2017)

2017
[26]

arXiv preprint arXiv:2411.10794 (2024)

Regmi, S.: Image-based outlier synthesis with training data. arXiv preprint arXiv:2411.10794 (2024)

arXiv 2024
[27]

IEEE transactions on neural networks and learning systems 36(4), 5858–5878 (2024)

Ren, Y., Pu, J., Yang, Z., Xu, J., Li, G., Pu, X., Yu, P.S., He, L.: Deep clustering: A comprehensive survey. IEEE transactions on neural networks and learning systems 36(4), 5858–5878 (2024)

2024
[28]

In: Proceedings of the IEEE international conference on computer vision

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad- cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)

2017
[29]

Stevens, S., Chao, W.L., Berger-Wolf, T., Su, Y.: Interpretable and testable vision features via sparse autoencoders (2025),https://arxiv.org/abs/2502.06755

arXiv 2025
[30]

Advances in neural information processing systems34, 144–157 (2021) 11

Sun, Y., Guo, C., Li, Y.: React: Out-of-distribution detection with rectified acti- vations. Advances in neural information processing systems34, 144–157 (2021) 11

2021
[31]

IEEE Transactions on Knowledge and Data Engineering (2025)

Tamang, L., Bouadjenek, M.R., Dazeley, R., Aryal, S.: Handling out-of-distribution data: A survey. IEEE Transactions on Knowledge and Data Engineering (2025)

2025
[32]

Advances in Neural Information Processing Systems33, 18583–18599 (2020)

Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., Schmidt, L.: Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems33, 18583–18599 (2020)

2020
[33]

In: International Data Science Conference

Tschuchnig, M.E., Gadermayr, M.: Anomaly detection in medical imaging-a mini review. In: International Data Science Conference. pp. 33–38. Springer (2021)

2021
[34]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, H., Li, Z., Feng, L., Zhang, W.: Vim: Out-of-distribution with virtual-logit matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4921–4930 (2022)

2022
[35]

In: International conference on machine learning

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., Liu, T.: On layer normalization in the transformer architecture. In: International conference on machine learning. pp. 10524–10533. PMLR (2020)

2020

[1] [1]

arXiv preprint arXiv:2310.06823 (2023)

Ammar, M.B., Belkhir, N., Popescu, S., Manzanera, A., Franchi, G.: Neco: Neu- ral collapse based out-of-distribution detection. arXiv preprint arXiv:2310.06823 (2023)

arXiv 2023

[2] [2]

arXiv preprint arXiv:2505.20063 (2025)

Arad, D., Mueller, A., Belinkov, Y.: Saes are good for steering–if you select the right features. arXiv preprint arXiv:2505.20063 (2025)

arXiv 2025

[3] [3]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quan- tifying interpretability of deep visual representations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6541–6549 (2017)

2017

[4] [4]

arXiv preprint arXiv:2404.14082 (2024)

Bereska, L., Gavves, E.: Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082 (2024)

Pith/arXiv arXiv 2024

[5] [5]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Chhetri, A., Korhonen, J., Gyawali, P., Bhattarai, B.: Nero: Explainable out-of- distribution detection with neuron-level relevance in gastrointestinal imaging. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 349–359. Springer (2025)

2025

[6] [6]

In: International conference on ma- chine learning

Choi, J., Raghuram, J., Feng, R., Chen, J., Jha, S., Prakash, A.: Concept-based explanations for out-of-distribution detectors. In: International conference on ma- chine learning. pp. 5817–5837. PMLR (2023)

2023

[7] [7]

arXiv preprint arXiv:2010.11929 (2020)

Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recogni- tion at scale. arXiv preprint arXiv:2010.11929 (2020)

Pith/arXiv arXiv 2010

[8] [8]

arXiv preprint arXiv:2209.10652 (2022)

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield- Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al.: Toy models of superposition. arXiv preprint arXiv:2209.10652 (2022)

Pith/arXiv arXiv 2022

[9] [9]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

2016

[10] [10]

arXiv preprint arXiv:1911.11132 (2019)

Hendrycks, D., Basart, S., Mazeika, M., Zou, A., Kwon, J., Mostajabi, M., Stein- hardt, J., Song, D.: Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132 (2019)

arXiv 1911

[11] [11]

arXiv preprint arXiv:1610.02136 (2016)

Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of- distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016)

Pith/arXiv arXiv 2016

[12] [12]

Medical Image Analysis97, 103289 (2024).https://doi.org/https: //doi.org/10.1016/j.media.2024.103289,https://www.sciencedirect

Hua, S., Yan, F., Shen, T., Ma, L., Zhang, X.: Pathoduet: Founda- tion models for pathological slide analysis of h and e and ihc stains. Medical Image Analysis97, 103289 (2024).https://doi.org/https: //doi.org/10.1016/j.media.2024.103289,https://www.sciencedirect. com/science/article/pii/S1361841524002147

work page doi:10.1016/j.media.2024.103289 2024

[13] [13]

Advances in Neural Information Processing Systems34, 677–689 (2021) 10 A

Huang, R., Geng, A., Li, Y.: On the importance of gradients for detecting distribu- tional shifts in the wild. Advances in Neural Information Processing Systems34, 677–689 (2021) 10 A. Chhetri et al

2021

[14] [14]

Gastroenterology (2025)

Jong, M.R., Boers, T.G., Fockens, K.N., Jukema, J.B., Kusters, C.H., Jaspers, T.J., van Heslinga, R.v.E., Slooter, F.C., Struyvenberg, M.R., Bisschops, R., et al.: Gastronet-5m: A multicenter dataset for developing foundation models in gastroin- testinal endoscopy. Gastroenterology (2025)

2025

[15] [15]

arXiv preprint arXiv:2504.19475 (2025)

Joseph, S., Suresh, P., Hufe, L., Stevinson, E., Graham, R., Vadi, Y., Bzdok, D., Lapuschkin, S., Sharkey, L., Richards, B.A.: Prisma: An open source toolkit for mechanistic interpretability in vision and video. arXiv preprint arXiv:2504.19475 (2025)

arXiv 2025

[16] [16]

(No Title) (2018)

Kather, J.N., Halama, N., Marx, A.: 100,000 histological images of human colorec- tal cancer and healthy tissue. (No Title) (2018)

2018

[17] [17]

cell172(5), 1122–1131 (2018)

Kermany, D.S., Goldbaum, M., Cai, W., Valentim, C.C., Liang, H., Baxter, S.L., McKeown, A., Yang, G., Wu, X., Yan, F., et al.: Identifying medical diagnoses and treatable diseases by image-based deep learning. cell172(5), 1122–1131 (2018)

2018

[18] [18]

Advances in neural information processing systems25 (2012)

Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con- volutional neural networks. Advances in neural information processing systems25 (2012)

2012

[19] [19]

Advances in neural information processing systems31(2018)

Lee, K., Lee, K., Lee, H., Shin, J.: A simple unified framework for detecting out- of-distribution samples and adversarial attacks. Advances in neural information processing systems31(2018)

2018

[20] [20]

arXiv preprint arXiv:1706.02690 (2017)

Liang, S., Li, Y., Srikant, R.: Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690 (2017)

arXiv 2017

[21] Liu, W., Wang, X., Owens, J., Li, Y.: Energy-based out-of-distribution detection. Advances in neural information processing systems 33, 21464–21475 (2020)

[22] Morningstar, W., Ham, C., Gallagher, A., Lakshminarayanan, B., Alemi, A., Dillon, J.: Density of states estimation for out of distribution detection. In: International Conference on Artificial Intelligence and Statistics. pp. 3232–3240. PMLR (2021)

[23] Park, K., Choe, Y.J., Veitch, V.: The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658 (2023)

[24] Pogorelov, K., Randel, K.R., Griwodz, C., Eskeland, S.L., de Lange, T., Johansen, D., Spampinato, C., Dang-Nguyen, D.T., Lux, M., Schmidt, P.T., Riegler, M., Halvorsen, P.: Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In: Proceedings of the 8th ACM on Multimedia Systems Conference. pp. 164–169. MMSys’17, A... (2017). doi:10.1145/3083187.3083212

[25] Raghu, M., Gilmer, J., Yosinski, J., Sohl-Dickstein, J.: Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems 30 (2017)

[26] Regmi, S.: Image-based outlier synthesis with training data. arXiv preprint arXiv:2411.10794 (2024)

[27] Ren, Y., Pu, J., Yang, Z., Xu, J., Li, G., Pu, X., Yu, P.S., He, L.: Deep clustering: A comprehensive survey. IEEE transactions on neural networks and learning systems 36(4), 5858–5878 (2024)

[28] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)

[29] Stevens, S., Chao, W.L., Berger-Wolf, T., Su, Y.: Interpretable and testable vision features via sparse autoencoders (2025), https://arxiv.org/abs/2502.06755

[30] Sun, Y., Guo, C., Li, Y.: React: Out-of-distribution detection with rectified activations. Advances in neural information processing systems 34, 144–157 (2021)

[31] Tamang, L., Bouadjenek, M.R., Dazeley, R., Aryal, S.: Handling out-of-distribution data: A survey. IEEE Transactions on Knowledge and Data Engineering (2025)

[32] Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., Schmidt, L.: Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems 33, 18583–18599 (2020)

[33] Tschuchnig, M.E., Gadermayr, M.: Anomaly detection in medical imaging-a mini review. In: International Data Science Conference. pp. 33–38. Springer (2021)

[34] Wang, H., Li, Z., Feng, L., Zhang, W.: Vim: Out-of-distribution with virtual-logit matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4921–4930 (2022)

[35] Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., Liu, T.: On layer normalization in the transformer architecture. In: International conference on machine learning. pp. 10524–10533. PMLR (2020)