When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations
Pith reviewed 2026-06-27 04:10 UTC · model grok-4.3
The pith
Out-of-distribution samples in medical imaging are detected by measuring how much class logits shift when representations are perturbed along sparse autoencoder concept vectors tied to the predicted class.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Leveraging sparse autoencoders to extract class-specific concept vectors from in-distribution data, the framework perturbs intermediate representations using the vectors of the predicted class and quantifies the resulting change in class logits. In-distribution samples are expected to remain stable under these perturbations because their representations align with the semantic directions, whereas OOD samples exhibit amplified logit deviations from representational misalignment. The approach therefore treats OOD detection as a concept-conditioned stability analysis.
What carries the argument
Class-conditioned semantic perturbations of deeper-layer representations along sparse-autoencoder-derived concept vectors, followed by measurement of class-logit stability.
If this is right
- The method yields an explicit semantic explanation for each detection decision by identifying which concept directions cause instability.
- It supplies a stability-based signal that can be inspected layer by layer rather than relying on opaque internal statistics.
- The same perturbation procedure can be applied at multiple depths to localize where misalignment first appears.
- Because the concept vectors are learned per class, the detector naturally adapts to the model's own prediction without requiring separate OOD training data.
Where Pith is reading between the lines
- If the alignment hypothesis holds, the same SAE concept vectors might serve as a diagnostic tool for identifying which semantic features a model has failed to learn on particular inputs.
- The approach could be tested on non-medical vision tasks by swapping the SAE training corpus, revealing whether the stability property is domain-specific or general.
- Combining the stability score with existing logit-based OOD methods might produce a hybrid detector whose failures are easier to debug.
Load-bearing premise
In-distribution samples exhibit low sensitivity to perturbations along class-specific concept vectors because their representations align with those semantic directions, whereas OOD samples show amplified deviations due to representational misalignment.
What would settle it
Finding a dataset where in-distribution medical images produce large logit changes under the class-specific concept perturbations, or where known OOD images produce small changes, would falsify the core hypothesis.
Figures
read the original abstract
Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distributional shifts poses a major obstacle to safe clinical deployment. Out-of-Distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals with poorly understood semantic meaning, limiting trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model's predicted class and measure the class logits stability. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high stakes medical applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an interpretable OOD detection framework for deep neural networks in medical imaging tasks. It learns class-specific concept vectors via sparse autoencoders (SAEs) trained on in-distribution data, then perturbs deeper-layer activations along the vectors corresponding to the model's predicted class and measures resulting changes in class logits. The central hypothesis is that in-distribution samples exhibit low sensitivity to these perturbations due to representational alignment with the semantic directions, whereas OOD samples exhibit amplified deviations due to misalignment; this is positioned as supplying both a discriminative detection signal and an interpretable view of uncertainty.
Significance. If the hypothesis were independently validated with supporting experiments and justification, the method could provide a meaningful advance in semantically grounded OOD detection, offering interpretability that is especially relevant for safety-critical medical applications where existing black-box signals limit trust.
major comments (2)
- [Abstract] Abstract: The OOD detection rule is defined directly in terms of the hypothesized difference in perturbation sensitivity between ID and OOD samples along class-specific SAE concept vectors, without any independent grounding, derivation, or geometric analysis showing why these directions (as opposed to random directions or other sparse bases of equal magnitude) must expose misalignment. This renders the claimed discriminative signal and interpretability circular with respect to the central assumption.
- [Abstract] Abstract: The manuscript states the method and hypothesis but supplies no experimental results, ablation studies, quantitative OOD detection metrics, or comparisons against baselines, so there is no evidence to assess whether the proposed stability measure actually separates ID from OOD inputs or yields interpretable insights.
minor comments (1)
- [Abstract] Abstract: The description of the inference procedure is high-level; concrete details on how perturbation magnitude is selected relative to the SAE sparsity level and how the class-specific vectors are extracted would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting key issues in the presentation of our work. We address each major comment below and commit to revisions that strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The OOD detection rule is defined directly in terms of the hypothesized difference in perturbation sensitivity between ID and OOD samples along class-specific SAE concept vectors, without any independent grounding, derivation, or geometric analysis showing why these directions (as opposed to random directions or other sparse bases of equal magnitude) must expose misalignment. This renders the claimed discriminative signal and interpretability circular with respect to the central assumption.
Authors: We agree that the abstract, owing to length constraints, states the hypothesis without a self-contained geometric derivation. The choice of class-conditioned SAE vectors is motivated by their training objective on ID data, which yields sparse directions aligned with class semantics; perturbations along these directions are intended to probe representational alignment. We will add a new subsection in the revised manuscript that supplies geometric motivation, including why random directions of comparable magnitude are not expected to produce the same differential sensitivity, thereby reducing any appearance of circularity. revision: yes
-
Referee: [Abstract] Abstract: The manuscript states the method and hypothesis but supplies no experimental results, ablation studies, quantitative OOD detection metrics, or comparisons against baselines, so there is no evidence to assess whether the proposed stability measure actually separates ID from OOD inputs or yields interpretable insights.
Authors: The referee correctly notes that the submitted manuscript presents the framework and hypothesis but does not yet contain empirical results. We will revise the paper to include a full experimental section with quantitative OOD detection metrics (AUROC, AUPR), ablation studies on SAE hyperparameters and perturbation strength, and comparisons to standard baselines, all evaluated on medical imaging datasets. This will directly address the need for evidence supporting the discriminative power and interpretability claims. revision: yes
Circularity Check
Differential sensitivity hypothesis (ID stable, OOD unstable under class-SAE perturbations) lacks derivation or geometric justification
specific steps
-
self definitional
[Abstract]
"We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept conditioned stability analysis, our approach provides both a discriminative OOD signal..."
The detection rule measures exactly the sensitivity difference asserted in the hypothesis; the method therefore reduces to testing the assumption by construction rather than deriving an independent signal from first principles or external validation.
full rationale
The paper's OOD detection procedure is constructed directly around the stated hypothesis that ID samples are stable and OOD samples are unstable under perturbations along class-specific SAE concept vectors. No independent geometric or derivation step is supplied to justify why these directions (as opposed to other bases) produce the claimed differential effect; the discriminative signal is therefore defined in terms of the assumption it is intended to test.
Axiom & Free-Parameter Ledger
free parameters (2)
- SAE sparsity level
- perturbation magnitude
axioms (1)
- domain assumption Intermediate representations of a trained DNN can be linearly decomposed into sparse, class-specific semantic directions via SAE training on in-distribution data.
invented entities (1)
-
class-specific concept vectors
no independent evidence
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2310.06823 (2023)
Ammar, M.B., Belkhir, N., Popescu, S., Manzanera, A., Franchi, G.: Neco: Neu- ral collapse based out-of-distribution detection. arXiv preprint arXiv:2310.06823 (2023)
arXiv 2023
-
[2]
arXiv preprint arXiv:2505.20063 (2025)
Arad, D., Mueller, A., Belinkov, Y.: Saes are good for steering–if you select the right features. arXiv preprint arXiv:2505.20063 (2025)
arXiv 2025
-
[3]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quan- tifying interpretability of deep visual representations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6541–6549 (2017)
2017
-
[4]
arXiv preprint arXiv:2404.14082 (2024)
Bereska, L., Gavves, E.: Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082 (2024)
Pith/arXiv arXiv 2024
-
[5]
In: International Conference on Medical Image Computing and Computer-Assisted Intervention
Chhetri, A., Korhonen, J., Gyawali, P., Bhattarai, B.: Nero: Explainable out-of- distribution detection with neuron-level relevance in gastrointestinal imaging. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 349–359. Springer (2025)
2025
-
[6]
In: International conference on ma- chine learning
Choi, J., Raghuram, J., Feng, R., Chen, J., Jha, S., Prakash, A.: Concept-based explanations for out-of-distribution detectors. In: International conference on ma- chine learning. pp. 5817–5837. PMLR (2023)
2023
-
[7]
arXiv preprint arXiv:2010.11929 (2020)
Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recogni- tion at scale. arXiv preprint arXiv:2010.11929 (2020)
Pith/arXiv arXiv 2010
-
[8]
arXiv preprint arXiv:2209.10652 (2022)
Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield- Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al.: Toy models of superposition. arXiv preprint arXiv:2209.10652 (2022)
Pith/arXiv arXiv 2022
-
[9]
He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
2016
-
[10]
arXiv preprint arXiv:1911.11132 (2019)
Hendrycks, D., Basart, S., Mazeika, M., Zou, A., Kwon, J., Mostajabi, M., Stein- hardt, J., Song, D.: Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132 (2019)
arXiv 1911
-
[11]
arXiv preprint arXiv:1610.02136 (2016)
Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of- distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016)
Pith/arXiv arXiv 2016
-
[12]
Hua, S., Yan, F., Shen, T., Ma, L., Zhang, X.: Pathoduet: Founda- tion models for pathological slide analysis of h and e and ihc stains. Medical Image Analysis97, 103289 (2024).https://doi.org/https: //doi.org/10.1016/j.media.2024.103289,https://www.sciencedirect. com/science/article/pii/S1361841524002147
-
[13]
Advances in Neural Information Processing Systems34, 677–689 (2021) 10 A
Huang, R., Geng, A., Li, Y.: On the importance of gradients for detecting distribu- tional shifts in the wild. Advances in Neural Information Processing Systems34, 677–689 (2021) 10 A. Chhetri et al
2021
-
[14]
Gastroenterology (2025)
Jong, M.R., Boers, T.G., Fockens, K.N., Jukema, J.B., Kusters, C.H., Jaspers, T.J., van Heslinga, R.v.E., Slooter, F.C., Struyvenberg, M.R., Bisschops, R., et al.: Gastronet-5m: A multicenter dataset for developing foundation models in gastroin- testinal endoscopy. Gastroenterology (2025)
2025
-
[15]
arXiv preprint arXiv:2504.19475 (2025)
Joseph, S., Suresh, P., Hufe, L., Stevinson, E., Graham, R., Vadi, Y., Bzdok, D., Lapuschkin, S., Sharkey, L., Richards, B.A.: Prisma: An open source toolkit for mechanistic interpretability in vision and video. arXiv preprint arXiv:2504.19475 (2025)
arXiv 2025
-
[16]
(No Title) (2018)
Kather, J.N., Halama, N., Marx, A.: 100,000 histological images of human colorec- tal cancer and healthy tissue. (No Title) (2018)
2018
-
[17]
cell172(5), 1122–1131 (2018)
Kermany, D.S., Goldbaum, M., Cai, W., Valentim, C.C., Liang, H., Baxter, S.L., McKeown, A., Yang, G., Wu, X., Yan, F., et al.: Identifying medical diagnoses and treatable diseases by image-based deep learning. cell172(5), 1122–1131 (2018)
2018
-
[18]
Advances in neural information processing systems25 (2012)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con- volutional neural networks. Advances in neural information processing systems25 (2012)
2012
-
[19]
Advances in neural information processing systems31(2018)
Lee, K., Lee, K., Lee, H., Shin, J.: A simple unified framework for detecting out- of-distribution samples and adversarial attacks. Advances in neural information processing systems31(2018)
2018
-
[20]
arXiv preprint arXiv:1706.02690 (2017)
Liang, S., Li, Y., Srikant, R.: Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690 (2017)
arXiv 2017
-
[21]
Advances in neural information processing systems33, 21464–21475 (2020)
Liu, W., Wang, X., Owens, J., Li, Y.: Energy-based out-of-distribution detection. Advances in neural information processing systems33, 21464–21475 (2020)
2020
-
[22]
In: Interna- tional Conference on Artificial Intelligence and Statistics
Morningstar, W., Ham, C., Gallagher, A., Lakshminarayanan, B., Alemi, A., Dil- lon, J.: Density of states estimation for out of distribution detection. In: Interna- tional Conference on Artificial Intelligence and Statistics. pp. 3232–3240. PMLR (2021)
2021
-
[23]
arXiv preprint arXiv:2311.03658 (2023)
Park, K., Choe, Y.J., Veitch, V.: The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658 (2023)
Pith/arXiv arXiv 2023
-
[24]
Proceedings of the 8th ACM on Multimedia Systems Conference , pages =
Pogorelov, K., Randel, K.R., Griwodz, C., Eskeland, S.L., de Lange, T., Johansen, D., Spampinato, C., Dang-Nguyen, D.T., Lux, M., Schmidt, P.T., Riegler, M., Halvorsen, P.: Kvasir: A multi-class image dataset for computer aided gastroin- testinal disease detection. In: Proceedings of the 8th ACM on Multimedia Sys- tems Conference. pp. 164–169. MMSys’17, A...
-
[25]
Ad- vances in neural information processing systems30(2017)
Raghu, M., Gilmer, J., Yosinski, J., Sohl-Dickstein, J.: Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Ad- vances in neural information processing systems30(2017)
2017
-
[26]
arXiv preprint arXiv:2411.10794 (2024)
Regmi, S.: Image-based outlier synthesis with training data. arXiv preprint arXiv:2411.10794 (2024)
arXiv 2024
-
[27]
IEEE transactions on neural networks and learning systems 36(4), 5858–5878 (2024)
Ren, Y., Pu, J., Yang, Z., Xu, J., Li, G., Pu, X., Yu, P.S., He, L.: Deep clustering: A comprehensive survey. IEEE transactions on neural networks and learning systems 36(4), 5858–5878 (2024)
2024
-
[28]
In: Proceedings of the IEEE international conference on computer vision
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad- cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
2017
-
[29]
Stevens, S., Chao, W.L., Berger-Wolf, T., Su, Y.: Interpretable and testable vision features via sparse autoencoders (2025),https://arxiv.org/abs/2502.06755
arXiv 2025
-
[30]
Advances in neural information processing systems34, 144–157 (2021) 11
Sun, Y., Guo, C., Li, Y.: React: Out-of-distribution detection with rectified acti- vations. Advances in neural information processing systems34, 144–157 (2021) 11
2021
-
[31]
IEEE Transactions on Knowledge and Data Engineering (2025)
Tamang, L., Bouadjenek, M.R., Dazeley, R., Aryal, S.: Handling out-of-distribution data: A survey. IEEE Transactions on Knowledge and Data Engineering (2025)
2025
-
[32]
Advances in Neural Information Processing Systems33, 18583–18599 (2020)
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., Schmidt, L.: Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems33, 18583–18599 (2020)
2020
-
[33]
In: International Data Science Conference
Tschuchnig, M.E., Gadermayr, M.: Anomaly detection in medical imaging-a mini review. In: International Data Science Conference. pp. 33–38. Springer (2021)
2021
-
[34]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, H., Li, Z., Feng, L., Zhang, W.: Vim: Out-of-distribution with virtual-logit matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4921–4930 (2022)
2022
-
[35]
In: International conference on machine learning
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., Liu, T.: On layer normalization in the transformer architecture. In: International conference on machine learning. pp. 10524–10533. PMLR (2020)
2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.