pith. machine review for the scientific record.

arxiv: 2605.13148 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.CV

Recognition: 2 theorem links · Lean Theorem

Understanding Generalization through Decision Pattern Shift

Hongbin Pei, Huiqi Deng, Peng Zhang, Quanshi Zhang, Xia Hu, Yibo Li

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 19:15 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords decision pattern shift · generalization gap · deep neural networks · GradCAM · internal decision mechanism · failure mode diagnosis · channel contribution

The pith

Decision pattern shifts in neural networks correlate linearly with generalization gaps, framing failure as internal mechanism drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that generalization in deep networks can be understood as the stability of internal decision patterns rather than as changes in data or outputs. It defines each sample's decision pattern as a channel-contribution vector derived from GradCAM and introduces the DPS metric to measure how far this vector deviates from the class-average pattern learned on training data. Empirical results across datasets and architectures show that DPS magnitude tracks the generalization gap with strong linear correlation in nearly all cases. This same metric places diverse failure scenarios, from ideal generalization to shortcut learning, along a single continuous spectrum. The approach therefore offers a unified internal view of why models succeed or fail on unseen samples.

Core claim

Decision patterns form a highly structured, class-consistent space. The magnitude of a sample's deviation from its class-average pattern, quantified by the DPS metric, correlates linearly with the generalization gap. This drift organizes multiple degradation regimes into one trajectory, revealing generalization as systematic change in the model's internal decision mechanism.

What carries the argument

The DPS metric, which computes the discrepancy between a sample's GradCAM-derived channel-contribution vector and the class-average vector from training.
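
As a minimal sketch of that computation, assuming the standard GradCAM recipe (the paper's exact contribution formula and discrepancy measure in §3 may differ, and `model` and `layer` are placeholders for a trained network and its target convolutional layer):

```python
import torch
import torch.nn.functional as F

def channel_contribution(model, layer, x, target_class):
    """GradCAM-style channel-contribution vector for one sample (a sketch).

    Standard GradCAM recipe: each channel's contribution is its spatially
    pooled gradient (the GradCAM weight) times its spatially pooled
    activation. The paper may normalize differently.
    """
    feats = {}
    handle = layer.register_forward_hook(lambda mod, inp, out: feats.update(a=out))
    logits = model(x.unsqueeze(0))                  # x: one unbatched input image
    handle.remove()
    a = feats["a"]                                  # (1, C, H, W) feature maps
    score = logits[0, target_class]                 # class score for this sample
    grad = torch.autograd.grad(score, a)[0]         # d(score) / d(feature maps)
    alpha = grad.mean(dim=(2, 3))                   # (1, C) GradCAM channel weights
    return (alpha * a.mean(dim=(2, 3))).squeeze(0)  # (C,) contribution vector

def dps(pattern, class_mean):
    """Deviation from the class-average pattern learned on training data.
    Cosine distance is one plausible discrepancy; the paper's metric may differ."""
    return 1.0 - F.cosine_similarity(pattern, class_mean, dim=0)
```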

If this is right

  • Generalization risk can be diagnosed by tracking internal pattern drift instead of relying solely on output accuracy.
  • Diverse failure modes including domain shift, OOD inputs, and shortcut learning lie on a single ordered spectrum defined by DPS magnitude.
  • Channel-level contributions become directly usable for localizing defects that drive poor generalization.
  • Early warning systems for generalization failure become feasible by computing DPS on held-out validation samples, as sketched below.
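
A minimal sketch of that last point, reusing `channel_contribution` and `dps` from the sketch above; `val_loader` and `train_means` are placeholder assumptions, and the paper itself prescribes no monitoring protocol:

```python
def mean_validation_dps(model, layer, val_loader, train_means):
    """Hypothetical early-warning monitor: average DPS of held-out samples
    against class-average patterns estimated on the training set.
    `train_means` maps each class index to its (C,) average pattern."""
    total, count = 0.0, 0
    for batch_x, batch_y in val_loader:
        for x, y in zip(batch_x, batch_y):
            v = channel_contribution(model, layer, x, y.item())
            total += dps(v, train_means[y.item()]).item()
            count += 1
    return total / max(count, 1)

# Usage (all names are placeholders): a mean DPS that rises across training
# checkpoints would flag generalization risk before test accuracy is known.
# drift = mean_validation_dps(model, layer, val_loader, train_means)
```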

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern-stability lens could be applied to non-CNN architectures such as transformers to test whether decision-vector drift remains predictive.
  • Regularization methods that penalize deviation from class-average patterns during training might directly improve generalization (see the sketch after this list).
  • If DPS captures mechanistic drift, combining it with circuit-level interpretability could link high-level generalization to specific subnetworks.
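
To make the second extension concrete, a hypothetical DPS-penalty objective might look like the following; nothing here is proposed in the paper, and `lam` is an untuned trade-off weight:

```python
import torch.nn.functional as F

def dps_regularized_loss(logits, targets, patterns, class_means, lam=0.1):
    """Illustrative DPS-penalty objective (not from the paper).

    `patterns` holds per-sample channel-contribution vectors (B, C) and
    `class_means` running class-average patterns (K, C).
    """
    ce = F.cross_entropy(logits, targets)
    ref = class_means[targets]                             # (B, C) per-class reference
    drift = 1.0 - F.cosine_similarity(patterns, ref, dim=1)
    return ce + lam * drift.mean()                         # penalize mean pattern drift
```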

Load-bearing premise

That a GradCAM channel-contribution vector accurately represents the model's internal decision pattern and that its deviation from the class average correctly quantifies harmful shift.

What would settle it

A model and dataset pair where the Pearson correlation between DPS values and measured generalization gaps falls substantially below 0.8 across multiple runs would falsify the central linear relationship.
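
The check itself is mechanical once per-configuration DPS values and measured gaps are in hand; a sketch with placeholder numbers (real values would come from rerunning the paper's protocol):

```python
from scipy.stats import pearsonr

# Placeholder values, one (DPS, gap) pair per model/dataset configuration.
dps_values = [0.12, 0.18, 0.25, 0.31, 0.40]
gen_gaps = [0.03, 0.05, 0.08, 0.10, 0.14]

r, p = pearsonr(dps_values, gen_gaps)
print(f"Pearson r = {r:.3f} (p = {p:.3g}); sustained r well below 0.8 falsifies the claim")
```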

Figures

Figures reproduced from arXiv:2605.13148 by Hongbin Pei, Huiqi Deng, Peng Zhang, Quanshi Zhang, Xia Hu, Yibo Li.

Figure 1: Overview. (a) Decision pattern shift (DPS): each sample's decision pattern is defined as …
Figure 2: Decision pattern (right) captures channel-wise …
Figure 3: Comparison between decision patterns (left) …
Figure 4: Visualization of decision patterns (left) and the activation-pattern baseline (right) through …
Figure 5: Illustration of Decision Pattern Shift (DPS) across multiple levels. (a) Sample-level DPS: …
Figure 6: Linear correlation between Decision Pattern Shift (DPS) and the generalization gap on …
Figure 7: (a) Dataset-level correlation between the average DPS and the overall generalization gap.
Figure 8: Visualization of decision patterns across different …
Figure 9: Box-plots of intra-class and inter-class similarities for decision patterns and activation …
Figure 10: PCA projections of decision pattern vectors and activation pattern vectors across two …
Figure 11: Correlation between Decision Pattern Shift (DPS) and generalization gap on CIFAR100 …
Figure 12: Linear correlation with the generalization gap: comparison between sample-level activation …
Original abstract

Understanding why deep neural networks (DNNs) fail to generalize to unseen samples remains a long-standing challenge. Existing studies mainly examine changes in externally observable factors such as data, representations, or outputs, yet offer limited insight into how a model's internal decision mechanism evolves from training to test. To address this gap, we introduce Decision Pattern Shift (DPS), a new perspective that defines generalization through the stability of internal decision patterns and quantifies failure as their deviation from those learned during training. Specifically, we represent each sample's decision pattern as a GradCAM-based channel-contribution vector, which captures how feature channels collectively support a prediction, and we propose the DPS metric to measure its discrepancy from the class-average pattern. Empirical analyses across multiple datasets and architectures show that, (i) decision patterns form a highly structured, class-consistent space with strong intra-class cohesion and low inter-class confusion, enabling direct analysis of a model's decision logic; (ii) the DPS magnitude correlates linearly with the generalization gap (nearly all Pearson r > 0.8), revealing generalization as a systematic drift in the model's internal decision mechanism; (iii) the DPS spectrum organizes diverse generalization degradation scenarios (covering ideal generalization, in-distribution degradation, domain shift, out-of-distribution, and shortcut learning) into a continuous trajectory, providing a unified explanation of their failure modes. These findings open up new possibilities for early generalization-risk detection, failure-mode diagnosis, and channel-level defect localization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Decision Pattern Shift (DPS), a metric that represents each sample's decision pattern via a GradCAM-derived channel-contribution vector and quantifies its deviation from the class-average pattern. It claims that DPS magnitude correlates linearly with the generalization gap (nearly all Pearson r > 0.8 across datasets and architectures), that decision patterns form a structured class-consistent space, and that the DPS spectrum unifies diverse failure modes (ideal generalization, in-distribution degradation, domain shift, OOD, shortcut learning) into a continuous trajectory.

Significance. If the correlation is robust and the GradCAM proxy is shown to be faithful, the work would offer a concrete internal-mechanism account of generalization failure that could support early risk detection and channel-level diagnosis. The unification of failure modes into a single spectrum is a potentially useful organizing idea, but its value depends on the metric's validity.

major comments (3)
  1. [Abstract / §4] Abstract and experimental sections: the headline claim that DPS magnitude correlates linearly with the generalization gap (Pearson r > 0.8) is presented without reporting data-split protocols, whether the linear model was pre-specified or selected post-hoc, or any multiple-testing / statistical-control procedures. These omissions make it impossible to assess whether the reported correlations are reliable or inflated.
  2. [§3] §3 (DPS definition): the channel-contribution vector is obtained exclusively via GradCAM, yet no ablation compares it to alternative attribution methods (e.g., Integrated Gradients, occlusion, or causal interventions) or to ground-truth decision-relevant channels. Because GradCAM is known to be sensitive to gradient saturation and non-causal features, the measured deviation from the class-average pattern may not correspond to genuine internal decision-mechanism drift.
  3. [§4] §4 (failure-mode trajectory): the claim that DPS organizes ideal generalization, in-distribution degradation, domain shift, OOD, and shortcut learning into a single continuous spectrum rests on the same unvalidated GradCAM proxy; without an independent verification that the ordering reflects causal decision changes rather than attribution artifacts, the unification interpretation is not yet supported.
minor comments (2)
  1. [§3] Notation for the DPS metric (Eq. 3 or equivalent) should explicitly state whether the class-average pattern is computed on training or test samples and whether it is updated during training.
  2. [§4] Figure captions and axis labels for the DPS-vs-gap scatter plots should report the exact number of models, seeds, and samples per point so readers can judge the stability of the reported r values.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on statistical reporting and metric validation. We address each major point below and commit to revisions that improve clarity and rigor without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract / §4] Abstract and experimental sections: the headline claim that DPS magnitude correlates linearly with the generalization gap (Pearson r > 0.8) is presented without reporting data-split protocols, whether the linear model was pre-specified or selected post-hoc, or any multiple-testing / statistical-control procedures. These omissions make it impossible to assess whether the reported correlations are reliable or inflated.

    Authors: We agree these details are essential. In the revised version we will add an explicit subsection describing the data-split protocols (standard train/test partitions with 5 random seeds per setting), confirm that the linear model was pre-specified from initial pilot runs on two datasets before full evaluation, and report Pearson r with associated p-values together with Bonferroni-adjusted significance thresholds across all reported correlations. This will allow readers to evaluate reliability directly (a sketch of this protocol follows the responses). revision: yes

  2. Referee: [§3] §3 (DPS definition): the channel-contribution vector is obtained exclusively via GradCAM, yet no ablation compares it to alternative attribution methods (e.g., Integrated Gradients, occlusion, or causal interventions) or to ground-truth decision-relevant channels. Because GradCAM is known to be sensitive to gradient saturation and non-causal features, the measured deviation from the class-average pattern may not correspond to genuine internal decision-mechanism drift.

    Authors: GradCAM was chosen for its channel-level resolution and low computational cost, which enabled the large-scale experiments across architectures and datasets. We acknowledge its known limitations. The revision will include a new paragraph in §3 discussing these sensitivities and a limited ablation on CIFAR-10 and ImageNet subsets comparing DPS values obtained with GradCAM versus Integrated Gradients; the ordering of samples by DPS magnitude remains largely consistent, supporting robustness (a rank-correlation sketch follows the responses). We do not claim GradCAM is causal, only that it yields a useful proxy for decision-pattern deviation. revision: partial

  3. Referee: [§4] §4 (failure-mode trajectory): the claim that DPS organizes ideal generalization, in-distribution degradation, domain shift, OOD, and shortcut learning into a single continuous spectrum rests on the same unvalidated GradCAM proxy; without an independent verification that the ordering reflects causal decision changes rather than attribution artifacts, the unification interpretation is not yet supported.

    Authors: The spectrum is presented as an empirical organizing observation rather than a causal proof. We will revise §4 to (i) explicitly label the GradCAM proxy limitation, (ii) add a controlled channel-masking experiment on a subset of models showing that increasing DPS via targeted masking reproduces the predicted degradation trajectory, and (iii) frame the unification as a descriptive hypothesis that invites future causal validation with more faithful attribution techniques. revision: partial
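
Each of the promised checks is mechanical enough to sketch in code. Everything below is hypothetical: none of it comes from the paper or the rebuttal, and every name, array, and threshold is a placeholder.

For response 1, Bonferroni-controlled reporting of the per-setting correlations might look like:

```python
from scipy.stats import pearsonr

def corrected_correlations(settings, alpha=0.05):
    """Per-setting Pearson r and p, judged against a Bonferroni-adjusted
    threshold. `settings` maps a setting name to a (dps_values, gen_gaps)
    pair of arrays, one entry per model/dataset configuration."""
    threshold = alpha / len(settings)          # adjust for multiple comparisons
    results = {}
    for name, (dps_vals, gaps) in settings.items():
        r, p = pearsonr(dps_vals, gaps)
        results[name] = (r, p, p < threshold)  # r, raw p, significant?
    return results
```

For response 2, the claim that "the ordering of samples by DPS magnitude remains largely consistent" reduces to a rank correlation between DPS values computed under the two attribution backends:

```python
from scipy.stats import spearmanr

# Placeholder per-sample DPS values; a real check would recompute the
# channel-contribution vectors with each attribution method.
dps_gradcam = [0.10, 0.22, 0.35, 0.41, 0.55]
dps_intgrad = [0.12, 0.20, 0.37, 0.45, 0.52]

rho, _ = spearmanr(dps_gradcam, dps_intgrad)
print(f"Spearman rho = {rho:.2f}; rho near 1 supports an attribution-robust ordering")
```

For response 3, the channel-masking probe in (ii) could be as simple as ablating targeted channels at the GradCAM layer and re-scoring:

```python
def mask_and_rescore(head, feature_maps, channels):
    """Zero the selected feature channels and re-run the classifier head,
    where `head` stands in for the network's layers after the GradCAM
    feature layer. If DPS tracks genuine mechanism drift, ablating
    high-contribution channels should move samples along the predicted
    degradation trajectory."""
    masked = feature_maps.clone()
    masked[:, channels] = 0.0      # ablate the targeted channels
    return head(masked)            # logits under the intervention
```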

Circularity Check

0 steps flagged

No significant circularity; DPS is an empirical metric with observed correlation

Full rationale

The paper defines DPS directly from GradCAM channel-contribution vectors as the discrepancy from class-average patterns, then reports an empirical linear correlation (Pearson r > 0.8) with the generalization gap across datasets. No equation or claim reduces a 'prediction' to a fitted input by construction, nor does the central result rely on self-citation chains or imported uniqueness theorems. The derivation chain is observational and is checked against external benchmarks rather than against its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that GradCAM vectors capture decision logic and on the choice of class-average as reference; no free parameters or invented entities are described.

axioms (1)
  • domain assumption GradCAM channel-contribution vectors accurately represent a sample's internal decision pattern
    Used to define the decision pattern whose shift is measured.

pith-pipeline@v0.9.0 · 5567 in / 1123 out tokens · 31851 ms · 2026-05-14T19:15:47.440458+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 2 internal anchors

  1. [1]

    A closer look at memorization in deep networks

    Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233--242. PMLR, 2017

  2. [2]

    Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks

    Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63): 1--17, 2019

  3. [3]

    Network dissection: Quantifying interpretability of deep visual representations

    David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6541--6549, 2017

  4. [4]

    Analysis of representations for domain adaptation

    Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19, 2006

  5. [5]

    Estimating generalization under distribution shifts via domain-invariant representations

    Ching-Yao Chuang, Antonio Torralba, and Stefanie Jegelka. Estimating generalization under distribution shifts via domain-invariant representations. In International Conference on Machine Learning. PMLR, 2020

  6. [6]

    Discovering and explaining the representation bottleneck of dnns

    Huiqi Deng, Qihan Ren, Hao Zhang, and Quanshi Zhang. Discovering and explaining the representation bottleneck of dnns. In International Conference on Learning Representations, 2022

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248--255. IEEE, 2009

  8. [8]

    Sharp minima can generalize for deep nets

    Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019--1028. PMLR, 2017

  9. [9]

    Stiffness: A new perspective on generalization in neural networks

    Stanislav Fort, Paweł Krzysztof Nowak, Stanisław Jastrzębski, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks, 2020

  10. [10]

    Shortcut learning in deep neural networks

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11): 665--673, 2020

  11. [11]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016

  12. [12]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340--8349, 2021

  13. [13]

    Flat minima

    Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural computation, 9(1): 1--42, 1997

  14. [14]

    On feature learning in the presence of spurious correlations

    Pavel Izmailov, Polina Kirichenko, Nate Gruver, and Andrew G Wilson. On feature learning in the presence of spurious correlations. Advances in Neural Information Processing Systems, 35: 38516--38532, 2022

  15. [15]

    Catastrophic fisher explosion: Early phase fisher matrix impacts generalization

    Stanislaw Jastrzebski, Devansh Arpit, Oliver Astrand, Giancarlo B Kerg, Huan Wang, Caiming Xiong, Richard Socher, Kyunghyun Cho, and Krzysztof J Geras. Catastrophic fisher explosion: Early phase fisher matrix impacts generalization. In International Conference on Machine Learning, pages 4772--4784. PMLR, 2021

  16. [16]

    On large-batch training for deep learning: Generalization gap and sharp minima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017

  17. [17]

    Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)

    Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668--2677. PMLR, 2018

  18. [18]

    Concept bottleneck models

    Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International conference on machine learning, pages 5338--5348. PMLR, 2020

  19. [19]

    Wilds: A benchmark of in-the-wild distribution shifts

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International conference on machine learning, pages 5637--5664. PMLR, 2021

  20. [20]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  21. [21]

    Tiny imagenet visual recognition challenge

    Ya Le and Xuan S. Yang. Tiny imagenet visual recognition challenge. 2015

  22. [22]

    Visualizing the loss landscape of neural nets

    Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018

  23. [23]

    A unified approach to interpreting model predictions

    Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017

  24. [24]

    Representation learning via invariant causal mechanisms

    Jovana Mitrovic, Brian McWilliams, Jacob C Walker, Lars Holger Buesing, and Charles Blundell. Representation learning via invariant causal mechanisms. In International Conference on Learning Representations, 2021

  25. [25]

    Sensitivity and generalization in neural networks: an empirical study

    Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018

  26. [26]

    Optimal representations for covariate shift

    Yangjun Ruan, Yann Dubois, and Chris J. Maddison. Optimal representations for covariate shift. In International Conference on Learning Representations, 2022

  27. [27]

    Grad-cam: Visual explanations from deep networks via gradient-based localization

    Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618--626, 2017

  28. [28]

    Predictive pac learning and process decompositions

    Cosma Shalizi and Aryeh Kontorovich. Predictive pac learning and process decompositions. Advances in neural information processing systems, 26, 2013

  29. [30]

    A bayesian perspective on generalization and stochastic gradient descent

    Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018

  30. [31]

    On the geometry of generalization and memorization in deep neural networks

    Cory Stephenson, Suchismita Padhy, Abhinav Ganesh, Yue Hui, Hanlin Tang, and SueYeon Chung. On the geometry of generalization and memorization in deep neural networks. In International Conference on Learning Representations, 2021

  31. [32]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1--9, 2015

  32. [33]

    Nearly optimal vc-dimension and pseudo-dimension bounds for deep neural network derivatives

    Yahong Yang, Haizhao Yang, and Yang Xiang. Nearly optimal vc-dimension and pseudo-dimension bounds for deep neural network derivatives. Advances in Neural Information Processing Systems, 36: 21721--21756, 2023

  33. [34]

    Can data diversity enhance learning generalization?

    Yu Yu, Shahram Khadivi, and Jia Xu. Can data diversity enhance learning generalization? In Proceedings of the 29th international conference on computational linguistics, pages 4933--4945, 2022

  34. [35]

    Generalization bounds for domain adaptation

    Chao Zhang, Lei Zhang, and Jieping Ye. Generalization bounds for domain adaptation. Advances in neural information processing systems, 25, 2012

  35. [36]

    Delving deep into the generalization of vision transformers under distribution shifts

    Chongzhi Zhang, Mingyuan Zhang, Shanghang Zhang, Daisheng Jin, Qiang Zhou, Zhongang Cai, Haiyu Zhao, Xianglong Liu, and Ziwei Liu. Delving deep into the generalization of vision transformers under distribution shifts. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 7277--7286, 2022

  36. [37]

    Interpreting and boosting dropout from a game-theoretic view

    Hao Zhang, Sen Li, YinChao Ma, Mingjie Li, Yichen Xie, and Quanshi Zhang. Interpreting and boosting dropout from a game-theoretic view. In International Conference on Learning Representations, 2021

  37. [38]

    Interpreting cnns via decision trees

    Quanshi Zhang, Yu Yang, Haotian Ma, and Ying Nian Wu. Interpreting cnns via decision trees. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6261--6270, 2019

  38. [39]

    On learning invariant representations for domain adaptation

    Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International conference on machine learning, pages 7523--7532. PMLR, 2019

  39. [44]

    Rademacher complexity bounds for non-iid processes

    Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-iid processes. Advances in neural information processing systems, 21, 2008

  40. [63]

    Invariant risk minimization

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

  41. [81]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014