
arxiv: 2605.17180 · v1 · pith:JVWPMLWUnew · submitted 2026-05-16 · 💻 cs.LG · math.OC · stat.ML

The Geometry of Projection Heads: Conditioning, Invariance, and Collapse

Pith reviewed 2026-05-20 14:09 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords projection heads · self-supervised learning · dimensional collapse · Riemannian metric · Hessian eigenvalues · optimization geometry · contrastive objectives

The pith

Nonlinear projection heads make collapsed states unstable by inducing negative eigenvalues in the Hessian.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a geometric theory of projection heads in self-supervised learning by treating the head as a trainable Riemannian metric on the backbone representation manifold. Linear heads perform implicit subspace whitening while nonlinear heads adapt local metrics to the loss topology, with depth controlling that adaptability. Smooth nonlinear activations create negative curvature at collapsed equilibria, destabilizing them under continuous gradient flow, whereas linear and ReLU heads cannot and must depend on discrete steps or BatchNorm. This metric perspective also shows how degeneracy in the induced metric governs the trade-off between information preservation and invariance, explaining the standard practice of discarding the head after pretraining. The result frames the head as a geometric buffer that shields semantic features from the pretraining objective's rigid demands.

Core claim

By modeling the projection head as a trainable Riemannian metric on the backbone representation manifold, the analysis establishes that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, rendering those states unstable. Linear and ReLU heads lack this native negative curvature under continuous-time gradient flow and instead rely on discrete-time dynamics or BatchNorm to escape. The same metric view characterizes how degeneracy controls the information-invariance trade-off and directly accounts for why the head must be removed after training.

What carries the argument

The trainable Riemannian metric induced by the projection head on the backbone representation manifold, which adapts local geometry to loss constraints and generates curvature at collapse points.
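
One concrete reading of this metric, following the Jacobian-pullback description given in the simulated rebuttal further down, is g(z) = J(z)ᵀ J(z), where J is the Jacobian of the head at z. The NumPy sketch below evaluates it for a hypothetical two-layer Swish head; the layer sizes, random weights, and function names are illustrative assumptions, not the paper's setup. Because the output dimension is smaller than the input dimension here, g has at least one zero eigenvalue, which is one simple way a metric becomes degenerate.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

def swish_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return s + x * s * (1.0 - s)

def head(z, W1, W2):
    """Hypothetical two-layer head h(z) = W2 @ swish(W1 @ z)."""
    return W2 @ swish(W1 @ z)

def head_jacobian(z, W1, W2):
    """Analytic Jacobian dh/dz = W2 @ diag(swish'(W1 z)) @ W1."""
    return W2 @ (swish_grad(W1 @ z)[:, None] * W1)

def pullback_metric(z, W1, W2):
    """Pullback of the Euclidean metric on the output space: g = J^T J."""
    J = head_jacobian(z, W1, W2)
    return J.T @ J

# Illustrative sizes and weights (assumptions).
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.standard_normal((d_hidden, d_in))
W2 = rng.standard_normal((d_out, d_hidden))
z = rng.standard_normal(d_in)

g = pullback_metric(z, W1, W2)
eigvals = np.linalg.eigvalsh(g)           # ascending order
rank = int(np.linalg.matrix_rank(g))      # at most d_out, so g is degenerate here

# Local condition number restricted to the numerically nonzero eigenvalues.
nonzero = eigvals[eigvals > 1e-10 * eigvals[-1]]
local_cond = float(nonzero[-1] / nonzero[0])
```

The ratio of the largest to the smallest nonzero eigenvalue of g gives a local condition number; whether this matches the condition numbers reported in the paper's figures is not stated in this summary.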

If this is right

  • Linear heads implicitly perform subspace whitening.
  • Nonlinear head depth increases the capacity to adapt local metrics to the loss's topological constraints.
  • Smooth activations such as Swish generate explicit negative curvature that enables escape from collapse.
  • Metric degeneracy directly governs the information-invariance trade-off and necessitates discarding the head.
  • The head functions as a universal geometric buffer that decouples the semantic backbone from pretraining constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could select activations to tune the sign of curvature and reduce reliance on BatchNorm for stability.
  • Continuous tracking of Hessian eigenvalues during training offers a practical diagnostic for early collapse risk.
  • The Riemannian-metric framing may extend to other representation-learning components to reveal analogous conditioning effects.
  • Keeping a lightweight nonlinear head at inference time could preserve some invariance benefits without harming downstream tasks.
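
The eigenvalue-tracking diagnostic suggested above could be prototyped without forming the full Hessian. The sketch below is one possible approach, not the paper's procedure: it approximates Hessian-vector products by central differences of the gradient and uses shifted power iteration to estimate the smallest eigenvalue. The quadratic toy loss and all function names are assumptions made for illustration.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-4):
    """Hessian-vector product via central differences of the gradient."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def power_iteration(matvec, dim, iters=200, seed=0):
    """Rayleigh quotient of the dominant eigenvector of a symmetric operator."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = matvec(v)
        norm = np.linalg.norm(u)
        if norm == 0.0:          # operator is zero along this direction
            return 0.0
        v = u / norm
    return float(v @ matvec(v))

def min_hessian_eigenvalue(grad_fn, w, iters=200):
    dim = w.size
    H = lambda v: hvp(grad_fn, w, v)
    # Step 1: the eigenvalue of largest magnitude gives a shift >= lambda_max
    # (assuming the iteration has converged).
    shift = abs(power_iteration(H, dim, iters))
    # Step 2: shift*I - H has eigenvalues shift - lambda_i >= 0,
    # so its top eigenvalue equals shift - lambda_min.
    top = power_iteration(lambda v: shift * v - H(v), dim, iters)
    return shift - top

# Toy stand-in: L(w) = 0.5 * w^T A w, so grad = A w and the Hessian is A.
A = np.diag([3.0, 1.0, -0.5])
toy_grad = lambda w: A @ w

lam_min = min_hessian_eigenvalue(toy_grad, np.zeros(3))  # close to -0.5
```

In a real training loop the gradient function would be the minibatch gradient of the SSL loss, and a negative estimate near a low-variance state would indicate an available escape direction. Power iteration converges slowly when the leading eigenvalues are close in magnitude, so a Lanczos-based solver may be preferable at scale.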

Load-bearing premise

The projection head can be modeled as a trainable Riemannian metric on the backbone representation manifold.

What would settle it

A computation of the Hessian at a collapsed equilibrium under a smooth nonlinear head that shows all eigenvalues are nonnegative, or a continuous-time simulation in which such heads remain trapped in collapse without discrete steps or BatchNorm.
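
Both tests can be rehearsed on a toy problem before running them on a real head. The sketch below uses a stand-in loss L(w) = ½ wᵀAw + ¼‖w‖⁴ (an assumption, unrelated to any SSL objective) whose equilibrium at w = 0 is a saddle or a minimum depending on A. It computes the Hessian there by finite differences and then integrates gradient flow with small explicit Euler steps from a slightly perturbed start.

```python
import numpy as np

def make_grad(A):
    """Gradient of the toy loss L(w) = 0.5 w^T A w + 0.25 ||w||^4."""
    return lambda w: A @ w + (w @ w) * w

def dense_hessian(grad_fn, w, eps=1e-4):
    """Central-difference Hessian, one column per coordinate, then symmetrized."""
    n = w.size
    H = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        H[:, j] = (grad_fn(w + e) - grad_fn(w - e)) / (2.0 * eps)
    return 0.5 * (H + H.T)

def euler_flow(grad_fn, w0, dt=0.01, steps=5000):
    """Small-step explicit Euler as a proxy for continuous-time gradient flow."""
    w = w0.copy()
    for _ in range(steps):
        w = w - dt * grad_fn(w)
    return w

w_eq = np.zeros(2)                    # equilibrium: the gradient vanishes here
w_start = w_eq + 1e-3 * np.ones(2)    # small perturbation

results = {}
cases = {"saddle": np.diag([1.0, -0.5]), "minimum": np.diag([1.0, 0.5])}
for name, A in cases.items():
    grad_fn = make_grad(A)
    lam_min = float(np.linalg.eigvalsh(dense_hessian(grad_fn, w_eq))[0])
    final_norm = float(np.linalg.norm(euler_flow(grad_fn, w_start)))
    results[name] = (lam_min, final_norm)
```

With the saddle, the smallest eigenvalue is about -0.5 and the trajectory leaves the equilibrium, settling at norm √0.5; with the minimum, the perturbation decays toward zero. For the paper's claim, the analogous check would replace the toy gradient with that of the SSL loss at a collapsed state.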

Figures

Figures reproduced from arXiv: 2605.17180 by Faris Chaudhry.

Figure 1: The SSL pipeline with a projection head. The backbone fθ maps augmented inputs tξ(x) to the representation manifold Z. The projection head hϕ acts as a Riemannian preconditioner, mapping z to the loss space where invariance is enforced. Supervised downstream tasks (consisting of labeled data (x, y) ∈ X × Y) operate directly on z. The fact that we bypass hϕ(z) to use z directly for the downstream task i…
Figure 3: Geometric preconditioning and collapse recovery (CIFAR-10, ResNet-18). Smooth heads (Swish) natively inject negative curvature (λmin < 0) to navigate the landscape and aggressively escape collapsed equilibria, whereas pure ReLU networks lack this intrinsic mechanism.
Figure 4: Collapse instability and the ReLU gap. Smooth nonlinearities (left) explicitly destabilize the collapsed equilibrium, triggering a mechanical escape that drives representation variance upward. Lacking intrinsic continuous-time curvature, ReLU networks (right) fail to generate escape directions and suffer irreversible dimensional collapse under continuous-like gradient flow (small LR, no BN). Only with BN o…
Figure 5: Visualizing metric singularity (orbit collapse). A PCA projection of augmentation orbits constructed by rotation. The star marker denotes the representation of the original, unaugmented image, serving as the anchor for each augmentation orbit. The representation space of the backbone (left) preserves the geometric variance of the rotation transformation, allowing orientation information to be linearly reco…
Figure 6: Residual gradients and Swish instability (CIFAR-10, ResNet-18). Most heads maintain nonvanishing residual gradients (top left), satisfying Assumption 6. Unlike ReLU, Swish can escape collapse across all learning rates and normalization settings (top right). A healthy projection head should have reasonably high condition number (bottom) in order to warp the geometry of the space. This is observed across mos…
Figure 7: Collapse instability (CIFAR-100). Evolution of representation variance during training of 3 seeds. Smooth nonlinearities (GELU, Swish) possess the necessary curvature to destabilize the equilibrium and escape the collapsed basin. While the linear head exhibits recovery within these 20 epochs, it only ends up reaching its initial representation variance. Further, the initial decrease in representation varia…
Figure 8: Optimization geometry (CIFAR-10, ViT-Tiny). Unlike ResNets, the ViT backbone is exceptionally stiff. For all activations, representation variance (top left) remains trapped near zero. It is observed that adding BatchNorm for ReLU (top right) allows for some attempt at escape. The condition numbers (bottom left) remain mostly flat. Finally, the residual gradients are still bounded away from zero for most ac…
Original abstract

We develop a geometric theory of projection heads in self-supervised learning by modeling the head as a trainable Riemannian metric on the backbone representation manifold. We show that linear heads perform implicit subspace whitening, while nonlinear heads adapt local metrics to satisfy the specific topological constraints of the loss, with head depth empirically dictating this capacity. Analyzing dimensional collapse, we prove that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, making them unstable. We empirically validate this by continuously tracking the optimization geometry during training, which reveals that smooth activations like Swish can generate explicit negative curvature to escape collapse, whereas linear and ReLU heads under continuous-time gradient flow cannot, relying instead on discrete-time optimization dynamics and BatchNorm. Finally, we geometrically characterize how metric degeneracy governs the information-invariance trade-off, explaining why the head must be discarded. Evaluated across contrastive and decorrelation-based objectives on foundation models, our results demonstrate that the projection head acts as a universal geometric buffer, decoupling the semantic backbone from the rigid, destructive constraints of the pretraining objective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a geometric theory of projection heads in self-supervised learning by modeling the head as a trainable Riemannian metric on the backbone representation manifold. It claims that linear heads perform implicit subspace whitening while nonlinear heads adapt local metrics to the loss's topological constraints (with depth controlling capacity), proves that smooth nonlinear heads induce negative Hessian eigenvalues at collapsed equilibria (rendering them unstable), and empirically tracks optimization geometry to show that activations like Swish generate explicit negative curvature to escape collapse under continuous-time flow (unlike linear/ReLU heads, which rely on discrete dynamics and BatchNorm). The work further characterizes metric degeneracy in the information-invariance trade-off and positions the head as a universal geometric buffer, with evaluations across contrastive and decorrelation objectives on foundation models.

Significance. If the geometric modeling and derivations hold, the paper supplies a coherent framework explaining the functional role of projection heads, their necessity for avoiding destructive constraints during pretraining, and the reason they are discarded afterward. The empirical tracking of curvature across activations and the cross-objective validation on foundation models are concrete strengths that could inform practical design choices for mitigating dimensional collapse.

major comments (2)
  1. [Geometric modeling (opening claim and Hessian analysis)] The central claim that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria rests on modeling the projection head as a trainable Riemannian metric. The explicit map from head weights to the metric tensor, the coordinate chart on the backbone manifold, and the precise dependence of the metric on head parameters are left implicit, so it is unclear whether the eigenvalue sign follows from the architecture or from an auxiliary choice in the geometric construction.
  2. [Proof of Hessian negativity and continuous-time analysis] The abstract asserts proofs of negative eigenvalues together with continuous-time analysis, yet the manuscript provides neither the full derivations nor the explicit connection between the continuous-time gradient flow and the discrete SGD dynamics actually used in training. This gap is load-bearing for the instability result and the claim that only smooth nonlinear heads can escape collapse via curvature.
minor comments (2)
  1. [Notation and definitions] Notation for the Riemannian metric tensor and its dependence on head depth should be introduced with an explicit equation early in the geometric-modeling section to improve readability.
  2. [Empirical validation] The empirical section would benefit from a table summarizing the tracked curvature values (or eigenvalue signs) for Swish, ReLU, and linear heads across the reported objectives.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of the geometric framework. We address the two major comments point by point below, clarifying the modeling and committing to explicit additions that strengthen the derivations without altering the core claims.

Point-by-point responses
  1. Referee: [Geometric modeling (opening claim and Hessian analysis)] The central claim that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria rests on modeling the projection head as a trainable Riemannian metric. The explicit map from head weights to the metric tensor, the coordinate chart on the backbone manifold, and the precise dependence of the metric on head parameters are left implicit, so it is unclear whether the eigenvalue sign follows from the architecture or from an auxiliary choice in the geometric construction.

    Authors: We agree that the geometric construction benefits from greater explicitness. Section 3 defines the head as inducing a trainable metric g_ϕ via the Jacobian pullback of the Euclidean metric on the output space, with the backbone manifold equipped with the standard coordinate chart induced by the representation embedding. The dependence on the head parameters ϕ enters through the Jacobian of the nonlinear head h_ϕ. In revision we will insert a new subsection that writes the map ϕ ↦ g_ϕ explicitly, shows that the sign of the Hessian eigenvalues at collapse is determined solely by the second derivative of the activation (yielding a negative eigenvalue for smooth nonlinearities such as Swish), and confirms that no auxiliary choices are required. This addition will make the architectural origin of the negativity unambiguous.
    revision: yes

  2. Referee: [Proof of Hessian negativity and continuous-time analysis] The abstract asserts proofs of negative eigenvalues together with continuous-time analysis, yet the manuscript provides neither the full derivations nor the explicit connection between the continuous-time gradient flow and the discrete SGD dynamics actually used in training. This gap is load-bearing for the instability result and the claim that only smooth nonlinear heads can escape collapse via curvature.

    Authors: The manuscript contains sketches of the Hessian computation and continuous-time flow in Sections 4–5 together with empirical curvature tracking, but we acknowledge that complete derivations and the discrete-to-continuous link are not expanded. In the revision we will add a self-contained appendix with the full Hessian derivation at collapsed equilibria, showing that smoothness of the activation produces at least one negative eigenvalue. We will also include a paragraph relating the continuous gradient flow to discrete SGD under small learning-rate regimes, noting that the instability induced by negative curvature persists in the discrete setting and is observed in our actual training runs. These additions directly address the load-bearing gap while preserving the original empirical findings.
    revision: yes
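
For orientation, the textbook linearization around an equilibrium shows why negative curvature destabilizes both regimes. This is a standard argument under the assumption that the loss is twice differentiable at the equilibrium w*, not the derivation promised for the appendix; the symbols H, δ, η, v_i, and c_i are introduced here for illustration only.

```latex
\begin{align*}
\nabla L(w^\ast) &= 0, \qquad H = \nabla^2 L(w^\ast), \qquad \delta = w - w^\ast,
  \qquad H v_i = \lambda_i v_i, \qquad c_i = v_i^\top \delta \\
\text{gradient flow:}\quad \dot{\delta} &= -H\delta + O(\|\delta\|^2)
  \;\Longrightarrow\; c_i(t) \approx e^{-\lambda_i t}\, c_i(0) \\
\text{gradient descent:}\quad \delta_{k+1} &= (I - \eta H)\,\delta_k + O(\|\delta_k\|^2)
  \;\Longrightarrow\; c_i^{(k)} \approx (1 - \eta \lambda_i)^k\, c_i^{(0)} \\
\lambda_i < 0,\ t > 0:\quad e^{-\lambda_i t} &= e^{|\lambda_i| t} > 1, \qquad
  1 - \eta\lambda_i = 1 + \eta|\lambda_i| > 1 \ \text{for every } \eta > 0 \\
\lim_{\eta \to 0,\; k\eta = t} (1 - \eta\lambda_i)^k &= e^{-\lambda_i t}
\end{align*}
```

The approximation holds only while δ stays small; once the iterate leaves the neighbourhood of w*, higher-order terms take over.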

Circularity Check

0 steps flagged

No significant circularity; geometric modeling is independent foundation

full rationale

The paper opens by adopting the modeling choice that the projection head is a trainable Riemannian metric on the backbone representation manifold. All subsequent claims—including implicit whitening for linear heads, adaptation of local metrics by nonlinear heads, and the proof of negative Hessian eigenvalues at collapsed equilibria for smooth nonlinear heads—are presented as consequences derived inside this framework. No equations, self-citations, or fitted parameters are shown in the supplied text that would reduce any of these results to the modeling assumption by construction. Empirical tracking of optimization geometry is described separately from the theoretical derivation. The chain therefore remains self-contained and does not exhibit the required explicit reduction for a circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The theory rests on the domain assumption that projection heads define trainable Riemannian metrics; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption The projection head can be modeled as a trainable Riemannian metric on the backbone representation manifold.
    Foundational modeling choice that enables all subsequent geometric analysis of conditioning, invariance, and collapse.



