Recognition: 2 theorem links
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Pith reviewed 2026-05-10 18:21 UTC · model grok-4.3
The pith
Extracting causal circuits from vision transformers yields stronger label-free metrics for predicting generalization under distribution shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using circuit discovery on vision transformers, the work extracts circuits, i.e., causal interactions between internal representations, and derives two label-free metrics: Dependency Depth Bias for pre-deployment model selection and Circuit Shift Score for post-deployment monitoring. Both correlate more strongly with true generalization performance than existing output-based proxies.
What carries the argument
Circuits, extracted via circuit discovery as causal interactions between the vision transformer's internal representations, from which Dependency Depth Bias and Circuit Shift Score are computed.
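The circuit definition quoted later on this page (Definition 1 in the Lean-theorems section) scores each edge by the KL divergence between the intervened and the full model over unlabeled data. Below is a minimal sketch of that scoring loop; `ablate_edge` (a context manager applying the intervention) and the `(image, label)` batch format are assumptions, not the paper's actual discovery machinery.

```python
import torch
import torch.nn.functional as F

def edge_weight(model, edge, loader, ablate_edge, device="cpu"):
    """Estimate c_M(e) = E_x[ KL(M_e(x) || M(x)) ] over unlabeled data.

    `ablate_edge(model, edge)` is a hypothetical context manager that
    intervenes on one edge of the computation graph (e.g., zero- or
    mean-ablates it); M_e denotes the intervened model.
    """
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in loader:                          # labels are never used
            x = x.to(device)
            log_p = F.log_softmax(model(x), dim=-1)  # full model M(x)
            with ablate_edge(model, edge):
                q = F.softmax(model(x), dim=-1)      # intervened model M_e(x)
            # Pointwise q * (log q - log p), summed = KL(M_e || M) per batch
            total += F.kl_div(log_p, q, reduction="sum").item()
            count += x.size(0)
    return total / count
```

Ranking edges by this weight and keeping the strongest ones is one common way a circuit is assembled; whether the paper thresholds, prunes, or patches differently is not stated in this summary.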
If this is right
- Dependency Depth Bias ranks candidate models by expected generalization capability on unlabeled target data before deployment.
- Circuit Shift Score quantifies how distribution shifts alter the model's internal causal interactions to forecast performance drops.
- Both metrics achieve higher average correlations with true generalization than prior proxies across the evaluated tasks.
- The internal focus enables label-free model selection and ongoing monitoring in applications where target labels are scarce (sketched below).
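As a hedged illustration of the last point, the two scenarios reduce to a ranking step and a thresholded monitoring loop. The helpers `ddb` and `css`, the sign convention (higher DDB taken to predict better generalization), and the alert `threshold` are all assumptions for this sketch.

```python
def select_model(candidates, target_data, ddb):
    """Pre-deployment: rank candidate models on unlabeled target data by
    Dependency Depth Bias and pick the best. `ddb` is a hypothetical
    callable; we assume higher DDB predicts better generalization."""
    return max(candidates, key=lambda m: ddb(m, target_data))

def monitor(model, stream_batches, css, threshold):
    """Post-deployment: flag batches whose Circuit Shift Score (distance of
    the current circuit from the in-distribution circuit) exceeds a
    calibrated threshold. `css` and `threshold` are assumptions."""
    for t, batch in enumerate(stream_batches):
        score = css(model, batch)
        if score > threshold:
            yield t, score  # likely distribution shift; expect an accuracy drop
```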
Where Pith is reading between the lines
- Circuit-based evaluation could be extended to other architectures to check whether similar internal patterns predict generalization in non-vision settings.
- The scores might help isolate which internal dependencies cause poor generalization, suggesting targeted fixes to model layers.
- If the correlations hold, the metrics could be incorporated into training objectives to encourage circuits that support robust generalization.
Load-bearing premise
The circuits identified by the discovery method must accurately reflect the internal mechanisms that control generalization behavior under distribution shifts.
What would settle it
A test on additional vision tasks and models comparing the new metrics' correlation with held-out generalization accuracy against output-based proxies such as model confidence: if the circuit metrics correlate no more strongly than the baselines, the central claim fails. A sketch of this comparison follows.
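One concrete form of that test: compute the rank correlation of each label-free proxy with held-out OOD accuracy across a pool of candidate models. All model names and numbers below are invented for illustration only.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical pool of candidate models with held-out OOD accuracy
# (unknown at selection time) and two label-free proxy scores.
ood_acc    = np.array([0.61, 0.68, 0.72, 0.65, 0.59])
ddb        = np.array([0.41, 0.55, 0.63, 0.50, 0.38])  # assumed: higher = better
confidence = np.array([0.88, 0.90, 0.91, 0.93, 0.87])  # mean max-softmax baseline

for name, proxy in [("DDB", ddb), ("confidence", confidence)]:
    rho, _ = spearmanr(proxy, ood_acc)
    print(f"{name:10s} Spearman rho = {rho:+.2f}")

# The central claim fails if |rho| for the circuit metrics is not
# consistently higher than for output-based proxies across tasks.
```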
Original abstract
Reliable generalization metrics are fundamental to the evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluation of models' generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable and label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model output while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using the inner workings of a model, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models' generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model's generalization under different distribution shifts. Across various tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 13.4% and 34.1%, respectively. Our code is available at https://github.com/deep-real/GenCircuit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dependency Depth Bias and Circuit Shift Score, two metrics derived from circuits extracted via circuit discovery in Vision Transformers. These are positioned as label-free proxies for generalization under distribution shift: the former for pre-deployment model selection on unlabeled target data, the latter for post-deployment monitoring. The central empirical claim is that both metrics achieve substantially higher correlation with true generalization performance than existing output-based proxies, with average gains of 13.4% and 34.1% across tasks.
Significance. If the central claim holds after proper validation, the work would offer a useful inner-workings perspective on generalization proxies, addressing a practical need in high-stakes settings where target labels are unavailable. The public code release is a clear strength that supports reproducibility and further scrutiny.
major comments (2)
- [Abstract, §3 (Methods)] The claim that the extracted circuits capture causal interactions relevant to generalization is load-bearing for both metrics, yet the manuscript provides no intervention-based validation (e.g., activation patching or ablation on the discovered circuits) to rule out spurious correlations. Without such checks, the reported correlation improvements cannot be confidently attributed to the inner-workings perspective rather than to model capacity or other confounders. (A minimal version of such a check is sketched after these comments.)
- [§4 (Experiments)] The average improvements of 13.4% and 34.1% are presented without accompanying details on the number of tasks/models, variance across random seeds, statistical significance, or explicit comparison to strong baselines that also use internal representations. This makes it impossible to assess whether the gains are robust or merely task-specific.
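A minimal version of the intervention check requested in the first major comment: ablate the discovered circuit and an equally sized random edge set, and compare the accuracy drops. `ablate` and `eval_acc` are hypothetical helpers standing in for the paper's patching machinery, which may differ.

```python
import random

def ablation_gap(model, circuit_edges, all_edges, ablate, eval_acc, seed=0):
    """If the discovered circuit is causally load-bearing, ablating it should
    degrade accuracy far more than ablating a random edge set of equal size.

    `ablate(model, edges)` (a context manager) and `eval_acc(model)` are
    hypothetical helpers; the paper's actual procedure is not specified here.
    """
    rng = random.Random(seed)
    random_edges = rng.sample(all_edges, k=len(circuit_edges))
    base = eval_acc(model)
    with ablate(model, circuit_edges):
        circuit_drop = base - eval_acc(model)
    with ablate(model, random_edges):
        random_drop = base - eval_acc(model)
    return circuit_drop, random_drop  # causal relevance: circuit_drop >> random_drop
```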
minor comments (2)
- [§3] Notation for Dependency Depth Bias and Circuit Shift Score should be introduced with explicit formulas (including any hyperparameters) rather than descriptive text only, to allow direct reproduction. (A hedged reconstruction from the descriptions elsewhere on this page follows this list.)
- [§4] Figure captions and axis labels in the results section would benefit from clearer indication of which metric corresponds to which practical scenario (pre- vs. post-deployment).
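For concreteness, here is a hedged reconstruction of the two metrics from the descriptions elsewhere on this page: DDB as a log-ratio of deep vs. shallow edge weights, CSS as a distance from the in-distribution circuit. The layer split into $E_{\mathrm{deep}}$ and $E_{\mathrm{shallow}}$, the distance $d(\cdot,\cdot)$, and the reference distribution $D_{\mathrm{ID}}$ are assumptions; the paper's exact definitions are not given in this summary.

```latex
% Edge weights c_M^{D_X}(e) as in Definition 1 quoted below.
% E_deep / E_shallow: edges in deep vs. shallow layers (split point assumed);
% d(.,.): some distance on edge-weight vectors (choice assumed).
\mathrm{DDB}(M; D_X) = \log
  \frac{\sum_{e \in E_{\mathrm{deep}}} c_M^{D_X}(e)}
       {\sum_{e \in E_{\mathrm{shallow}}} c_M^{D_X}(e)},
\qquad
\mathrm{CSS}(M; D_X) = d\!\left(c_M^{D_X},\, c_M^{D_{\mathrm{ID}}}\right).
```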
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below and will incorporate revisions to strengthen the causal validation of the circuits and the reporting of experimental results.
Point-by-point responses
- Referee: [Abstract, §3 (Methods)] The claim that the extracted circuits capture causal interactions relevant to generalization is load-bearing for both metrics, yet the manuscript provides no intervention-based validation (e.g., activation patching or ablation on the discovered circuits) to rule out spurious correlations. Without such checks, the reported correlation improvements cannot be confidently attributed to the inner-workings perspective rather than to model capacity or other confounders.
Authors: We agree that explicit intervention-based validation would provide stronger support for attributing the improved correlations to the causal structure of the circuits rather than confounders. Our circuit discovery procedure follows established mechanistic interpretability methods that rely on interventions (such as activation patching during discovery), but the current manuscript does not include additional post-discovery experiments that directly ablate or patch the extracted circuits and measure the resulting impact on generalization performance under distribution shift. We will add such analyses in the revised version, for example by performing targeted ablations on circuit components and reporting changes in the correlation of Dependency Depth Bias and Circuit Shift Score with true generalization error. This will help substantiate the causal relevance claim. Revision: yes.
- Referee: [§4 (Experiments)] The average improvements of 13.4% and 34.1% are presented without accompanying details on the number of tasks/models, variance across random seeds, statistical significance, or explicit comparison to strong baselines that also use internal representations. This makes it impossible to assess whether the gains are robust or merely task-specific.
Authors: We appreciate this observation and will substantially expand the experimental reporting in §4. The revised manuscript will specify the exact number of tasks and models evaluated, include variance or standard deviations across random seeds, report statistical significance (e.g., p-values from appropriate tests on the correlation differences; one generic recipe is sketched below), and add explicit comparisons against strong internal-representation baselines such as layer-wise activation statistics, attention-pattern metrics, and other circuit-agnostic internal probes. These additions will allow a clearer assessment of robustness across tasks. Revision: yes.
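One generic way to produce the promised p-values is a paired bootstrap over the model pool, comparing the two proxies' correlations on the same resamples. This is offered as an assumption, not necessarily the test the authors will use.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_corr_diff(proxy_a, proxy_b, target, n_boot=10_000, seed=0):
    """Paired bootstrap p-value for 'proxy A correlates with true
    generalization more strongly than proxy B'. Resamples the model
    pool with replacement; degenerate (constant) resamples can yield
    NaNs and would need filtering in a careful analysis."""
    rng = np.random.default_rng(seed)
    n = len(target)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)
        ra, _ = spearmanr(proxy_a[idx], target[idx])
        rb, _ = spearmanr(proxy_b[idx], target[idx])
        diffs[i] = ra - rb
    # One-sided p-value: fraction of resamples where A fails to beat B.
    return float(np.mean(diffs <= 0.0))
```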
Circularity Check
No significant circularity detected
full rationale
The paper proposes Dependency Depth Bias and Circuit Shift Score by applying an external circuit discovery technique to extract causal interactions in ViTs, then evaluates these metrics empirically via correlation with observed generalization under distribution shift. No equations, definitions, or self-citations are presented that reduce either metric or the reported performance gains (13.4% and 34.1%) to the input data or target variable by construction. The central claims rest on experimental comparisons against existing proxies rather than tautological derivations or fitted parameters renamed as predictions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: circuit discovery can extract meaningful causal interactions from model internals.
invented entities (2)
- Dependency Depth Bias: no independent evidence
- Circuit Shift Score: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear. Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: Definition 1 (circuit as edge-weight mapping): $c_M^{D_X}(e) := \mathbb{E}_{x \sim D_X}\!\left[\mathrm{KL}\!\left(M_e(x) \,\|\, M(x)\right)\right]$
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear. Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: Dependency Depth Bias (DDB) as the log-ratio of deep vs. shallow layer weights; Circuit Shift Score as the distance from the in-distribution (ID) circuit.
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.