Pith · machine review for the scientific record

arXiv:2604.20276 · v1 · submitted 2026-04-22 · 💻 cs.LG · stat.ML


Rethinking Intrinsic Dimension Estimation in Neural Representations

David Rügamer, Rickmer Schulte


Pith reviewed 2026-05-10 00:47 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: intrinsic dimension · neural representations · representation analysis · manifold estimation · deep learning · ID estimators · activation geometry

The pith

Common intrinsic dimension estimators do not track the true underlying ID of neural network representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that widely used methods for estimating intrinsic dimensions in neural representations diverge from the actual geometry of the learned manifold. This discrepancy arises because the estimators respond to training artifacts, sampling effects, and other incidental properties rather than the representation's core dimensionality. A sympathetic reader would care because a large body of work has interpreted rising or falling ID values as direct evidence of how networks compress data, disentangle features, or approach capacity limits. If those interpretations rest on estimators that miss the target quantity, many reported trends need re-examination. The authors therefore shift focus from the ID numbers themselves to the factors that actually drive the observed estimator outputs.

Core claim

We theoretically and empirically demonstrate that common ID estimators are not tracking the true underlying ID of the representation. We contrast this negative result with an investigation of the underlying factors that may drive commonly reported ID-related results on neural representation in the literature and offer a new perspective on ID estimation in neural representations.

What carries the argument

The mismatch between the theoretical intrinsic dimension of a neural activation manifold and the numerical output of standard local or global ID estimators applied to the same points.
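To make the quantity under dispute concrete, here is a minimal sketch (ours, not the paper's) of the Levina–Bickel maximum-likelihood estimator, one of the common estimators at issue, applied to data whose true ID is fixed by construction; the brute-force distance computation and the choice k=10 are illustrative assumptions.

```python
import numpy as np

def mle_id(X, k=10):
    """Levina-Bickel MLE estimate of intrinsic dimension.

    Per point, m_hat^{-1} = mean_j log(T_k / T_j) over the first k-1
    neighbour distances; the global estimate inverts the pooled inverse.
    """
    # Brute-force pairwise distances: fine for a sketch, use a KD-tree at scale.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)                                 # column 0 is the zero self-distance
    logs = np.log(D[:, k][:, None] / D[:, 1:k])    # log(T_k / T_j), j = 1..k-1
    return 1.0 / logs.mean()

# Points on a 2-D plane embedded linearly in 10-D ambient space: true ID = 2.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(1000, 2))
X = Z @ rng.normal(size=(2, 10))
print(round(mle_id(X), 1))   # ≈ 2
```

On a clean construction like this the estimator recovers the target; the paper's argument is that on trained activations the same machinery responds to density and curvature rather than to a ground-truth dimension.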

If this is right

  • Interpretations that link ID growth or shrinkage directly to learning dynamics or generalization require additional validation.
  • Reported ID trends in the literature are more likely driven by changes in local density, curvature, or optimization trajectory than by the manifold's intrinsic dimension.
  • Future analyses of neural representations should prioritize diagnostics that separate estimator bias from geometric properties of the data.
  • New ID estimation procedures may need to be developed that explicitly account for the non-uniform sampling and training-induced structure present in neural activations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result implies that ID-based arguments for capacity or compressibility in deep networks may need to be replaced by direct measurements of effective dimension via other means such as Hessian spectra or pruning sensitivity.
  • This opens the possibility that many phenomena previously attributed to ID changes are instead symptoms of how gradient descent shapes the distribution of activations.
  • Testable extension: apply the same estimator mismatch analysis to transformer attention heads or diffusion model latents to check whether the discrepancy generalizes beyond standard feed-forward layers.
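As one hypothetical instance of a "direct measurement of effective dimension" that avoids neighbour-distance machinery entirely, the spectral effective rank (exp of the entropy of the normalized singular values, after Roy & Vetterli) can be computed straight from an activation matrix; this sketch is ours, not a method from the paper.

```python
import numpy as np

def effective_rank(A):
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution of the (samples x features) matrix A."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]                   # drop numerically-zero directions
    return float(np.exp(-(p * np.log(p)).sum()))

# Activations that vary along only 3 directions of a 64-D ambient space.
rng = np.random.default_rng(1)
A = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 64))
print(round(effective_rank(A), 2))     # close to, and never above, 3
```

Unlike neighbour-based ID estimators, the effective rank is bounded above by the matrix rank, which makes its failure modes (linearity, scale sensitivity) easier to reason about.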

Load-bearing premise

A well-defined true intrinsic dimension exists for neural representations independently of the estimators being used and can be meaningfully contrasted with those estimators' outputs.

What would settle it

A controlled synthetic dataset whose activations lie on a known low-dimensional manifold where standard estimators such as MLE or correlation dimension recover the ground-truth dimension across varying sample sizes and noise levels.
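That acceptance test can be sketched directly. Below, a TwoNN-style estimator (maximum-likelihood form, d = N / Σ log(r2/r1)) is run on a known 2-D manifold while ambient Gaussian noise is swept; the embedding, sample size, and noise levels are illustrative assumptions on our part. A clean estimate near 2 that inflates under modest ambient noise reproduces in miniature the estimator sensitivity the paper describes.

```python
import numpy as np

def twonn_id(X):
    """TwoNN estimator (Facco et al.), ML form: d = N / sum_i log(r2_i / r1_i)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)
    mu = D[:, 2] / D[:, 1]            # ratio of 2nd to 1st NN distance
    return len(mu) / np.log(mu).sum()

rng = np.random.default_rng(2)
Z = rng.uniform(size=(800, 2))        # ground-truth ID = 2
X = Z @ rng.normal(size=(2, 20))      # linear embedding into 20-D
for sigma in (0.0, 0.01, 0.05):       # ambient noise sweep
    Xn = X + sigma * rng.normal(size=X.shape)
    print(f"sigma={sigma}: estimated ID ~ {twonn_id(Xn):.1f}")
```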

Figures

Figures reproduced from arXiv:2604.20276 by David Rügamer, Rickmer Schulte.

Figure 1: Layer-wise ID patterns for various architectures.
Figure 2: Estimated IDs using TwoNN and MLE vs. true ID.
Figure 3: Union of Manifolds vs. Single Manifold Hypothesis.
Figure 5: Visualization of word embeddings in LLMs.
Figure 6: Estimated IDs of the layer-wise representations.
Figure 7: NN distances of layer-wise LLM representations.
Figure 8: Average cosine similarity between layer-wise representations.
Figure 9: Average L2 norm of layer-wise representations for llama, mistral, and pythia (last layer excluded); the shaded band shows twice the standard error.
Figure 10: Estimated IDs (Gride) and entropy of layer-wise representations.
Figure 11: Estimated vs. true ID: estimated IDs using TwoNN and MLE.
Figure 12: Estimated ID vs. ambient space dimension: estimated IDs using TwoNN and MLE.
Figure 13: Layer-wise comparison of estimated intrinsic dimensions vs. ambient space dimensions.
Figure 14: Class-specific estimated IDs of layer-wise representations from various pre-trained convolutional architectures.
Figure 15: Average k-NN distances of representations over the layers of various pre-trained convolutional architectures.
Figure 16: Average cosine similarity between layer-wise representations of the LLM models (llama, mistral, and pythia).
Figure 17: Average cosine similarity between layer-wise representations for different pre-trained ViTs.
Figure 18: Average cosine similarity between layer-wise representations for different pre-trained convolutional architectures.
Figure 19: NN distances (left) and average length of representations.
Figure 20: NN distances (left) and average length of representations.
Figure 21: Average length (measured by L2 distance to the origin) of layer-wise representations for different CNNs; the shaded band shows twice the standard deviation.
Figure 22: Layer-wise comparison of estimated intrinsic dimensions vs. von Neumann entropy.
Figure 23: Layer-wise comparison of estimated intrinsic dimensions vs. von Neumann entropy.
Original abstract

The analysis of neural representation has become an integral part of research aiming to better understand the inner workings of neural networks. While there are many different approaches to investigate neural representations, an important line of research has focused on doing so through the lens of intrinsic dimensions (IDs). Although this perspective has provided valuable insights and stimulated substantial follow-up research, important limitations of this approach have remained largely unaddressed. In this paper, we highlight a crucial discrepancy between theory and practice of IDs in neural representations, theoretically and empirically showing that common ID estimators are, in fact, not tracking the true underlying ID of the representation. We contrast this negative result with an investigation of the underlying factors that may drive commonly reported ID-related results on neural representation in the literature. Building on these insights, we offer a new perspective on ID estimation in neural representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that common intrinsic dimension (ID) estimators do not track the true underlying ID of neural representations. It provides theoretical arguments and empirical evidence for a discrepancy between theory and practice, investigates factors that may drive commonly reported ID-related findings in the literature, and proposes a new perspective on ID estimation for neural representations.

Significance. If the central negative result holds, the work would be significant for the field of neural representation analysis. It challenges reliance on standard ID estimators in interpretability research and offers both a critique and constructive investigation of driving factors plus an alternative viewpoint. The combination of theoretical and empirical components, along with the focus on underlying factors rather than solely a negative claim, strengthens its potential impact if the independence of the 'true ID' notion is adequately established.

major comments (2)
  1. [Theoretical and empirical evidence sections (as referenced in abstract)] The central negative result—that common ID estimators fail to track the true underlying ID—requires an explicit, estimator-independent definition or construction of that 'true' ID for neural representations (which are induced by optimization rather than given as a priori manifolds). Without this, the discrepancy may reflect disagreement among procedures rather than failure to track an objective truth. This is load-bearing for the subsequent analysis of driving factors and the new perspective.
  2. [Empirical evaluation] The empirical demonstrations should include controls that isolate whether observed discrepancies arise from the estimators themselves or from properties of the trained representations (e.g., via synthetic manifolds with known ground-truth ID that match the geometry induced by neural training).
minor comments (2)
  1. Clarify notation for ID estimators and any new quantities introduced in the 'new perspective' to avoid ambiguity for readers familiar with prior ID literature.
  2. Ensure all figures comparing estimator outputs include statistical details such as variance across runs or seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our central claims. We address each major point below and will revise the manuscript accordingly where needed.

Point-by-point responses
  1. Referee: [Theoretical and empirical evidence sections (as referenced in abstract)] The central negative result—that common ID estimators fail to track the true underlying ID—requires an explicit, estimator-independent definition or construction of that 'true' ID for neural representations (which are induced by optimization rather than given as a priori manifolds). Without this, the discrepancy may reflect disagreement among procedures rather than failure to track an objective truth. This is load-bearing for the subsequent analysis of driving factors and the new perspective.

    Authors: We agree that an explicit, estimator-independent definition of the true ID is necessary to ground the negative result. In the manuscript, the true ID is defined as the minimal dimensionality of the data-generating process that explains the observed variability in the neural activations, derived from the optimization-induced distribution rather than from any ID estimator. This draws on standard manifold assumptions but accounts for the fact that the representation is the output of training. We will revise the theoretical section (and add a dedicated paragraph in the introduction) to state this definition formally and upfront, including why it is independent of the estimators under consideration. This should address the concern that the discrepancy could be merely procedural. revision: yes

  2. Referee: [Empirical evaluation] The empirical demonstrations should include controls that isolate whether observed discrepancies arise from the estimators themselves or from properties of the trained representations (e.g., via synthetic manifolds with known ground-truth ID that match the geometry induced by neural training).

    Authors: We acknowledge the value of such isolating controls. Our current experiments already include some synthetic settings with known IDs, but they do not fully replicate the precise geometry arising from neural optimization on real data. Constructing exact synthetic proxies for trained representations is non-trivial, yet we will add a new controlled experiment using low-dimensional synthetic manifolds (with ground-truth ID) on which we train simple networks to induce comparable local geometries. This will help separate estimator behavior from representation properties and will be included in the revised empirical section. revision: partial

Circularity Check

0 steps flagged

No circularity: negative result rests on independent ground-truth constructions rather than estimator-dependent definitions.

full rationale

The paper advances a negative result by contrasting common ID estimators against controlled settings (synthetic data and theoretical models) where the underlying manifold dimension is fixed by construction of the data-generating process, independent of any estimator. The subsequent analysis of driving factors (e.g., curvature, sampling effects) and the offered new perspective are derived from these discrepancies without redefining the target ID via the estimators under test or via self-citation chains. No load-bearing step reduces to a fitted parameter or ansatz imported from the authors' prior work; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the paper relies on the domain assumption that neural representations possess a true intrinsic dimension that can be defined separately from estimator behavior. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption — Neural representations possess a well-defined true intrinsic dimension independent of common estimators. Implicit in the contrast drawn between estimator outputs and the 'true underlying ID'.

pith-pipeline@v0.9.0 · 5432 in / 1148 out tokens · 43957 ms · 2026-05-10T00:47:02.052063+00:00 · methodology

