Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization

Jinning Yang; Mengnan Du; Shuai Zhang; Wenjie Sun

arxiv: 2605.27989 · v1 · pith:HUJHHNIInew · submitted 2026-05-27 · 💻 cs.LG


Pith reviewed 2026-06-29 14:11 UTC · model grok-4.3

keywords: neural interaction · depth-width ratio · generalization · scaling laws · superposition · gradient space · LLMs · model shape

The pith

Tuning depth-width ratio places neural networks in an efficient interaction interval that supports better generalization under fixed budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the idea of superposition from parameter space to gradient space using the Neural Feature Ansatz, defining neural interaction as a measure of how efficiently a model uses its resources. It shows that under a fixed compute budget, models with efficient neural interactions tend to generalize better. By adjusting the depth-to-width ratio, a model can be moved into an interval where interactions are efficient, and this interval stays roughly the same even as the overall budget grows larger. Comparisons with existing models suggest that those operating near this interval achieve stronger results on benchmarks.

Core claim

Under a fixed budget, good generalization is accompanied by efficient neural interactions defined in gradient space, and adjusting the depth-width ratio R_{D/W} can position the model in a stable efficient interaction interval.

What carries the argument

The Neural Feature Ansatz, which defines neural interaction efficiency in gradient space as an extension of superposition.
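As a rough illustration of the gradient-space object involved, the sketch below computes an average gradient outer product (AGOP) for a toy one-hidden-layer network and sets it beside the matrix W^T W, which the Neural Feature Ansatz relates to the AGOP in trained networks. Everything here is an assumption made for illustration: the toy network, the random weights, and in particular `offdiag_energy_ratio`, which is a hypothetical stand-in for an interaction measure and is not the paper's AOFE definition (that definition is not given on this page). Because the weights are random rather than trained, the example only exercises the computation and says nothing about whether the ansatz holds.

```python
import numpy as np

def agop(W, a, X):
    """Average gradient outer product of f(x) = a . tanh(W x) w.r.t. the input x."""
    H = np.tanh(X @ W.T)                  # hidden activations, shape (n, hidden)
    # df/dx = W^T (a * (1 - tanh^2(W x))), stored as one row per sample
    G = ((1.0 - H**2) * a) @ W            # input gradients, shape (n, d)
    return G.T @ G / X.shape[0]           # mean of outer products, shape (d, d)

def offdiag_energy_ratio(M):
    """Share of squared Frobenius norm lying off the diagonal.

    Hypothetical proxy for feature interference; NOT the paper's AOFE.
    """
    total = np.sum(M**2)
    diag = np.sum(np.diag(M)**2)
    return float((total - diag) / total) if total > 0 else 0.0

# Toy setup (assumed sizes, random untrained weights).
rng = np.random.default_rng(0)
d, hidden, n = 8, 16, 256
W = rng.normal(size=(hidden, d)) / np.sqrt(d)
a = rng.normal(size=hidden) / np.sqrt(hidden)
X = rng.normal(size=(n, d))

M = agop(W, a, X)
nfm = W.T @ W                             # matrix the ansatz compares with the AGOP
ratio = offdiag_energy_ratio(M)
print("AGOP shape:", M.shape, "off-diagonal energy ratio:", round(ratio, 3))
```

For a trained model one would replace the analytic gradient with automatic differentiation at the layer of interest and compare `M` against `nfm` for that layer.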

If this is right

Adjusting R_{D/W} can improve generalization by targeting the efficient interaction interval.
The efficient interaction interval remains stable as compute budget increases.
Models near the efficient interval perform better on MMLU-Pro.
Resource utilization efficiency depends on the depth-width shape.
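To make the fixed-budget shape trade-off concrete, the sketch below enumerates depth and width pairs under a parameter budget and flags which ones fall inside the interval quoted in the Figure 4 caption (0.023 ≤ L/d_model ≤ 0.047). It reads R_{D/W} as L/d_model, following that caption. The parameter estimate 12 · L · d_model² is the common non-embedding approximation for a standard transformer and is an assumption here, not something this page attributes to the paper; the budget, the depth grid, and the rounding of widths to multiples of 64 are likewise illustrative choices.

```python
import math

# Interval quoted in the Figure 4 caption: 0.023 <= L / d_model <= 0.047.
INTERVAL = (0.023, 0.047)

def approx_params(n_layers, d_model):
    """Rough non-embedding parameter count, assuming 12 * L * d_model**2."""
    return 12 * n_layers * d_model**2

def shapes_for_budget(budget, depths, multiple=64):
    """For each depth, pick the widest d_model (a multiple of `multiple`) within budget."""
    rows = []
    for n_layers in depths:
        d_max = math.isqrt(budget // (12 * n_layers))   # largest width that fits
        d_model = (d_max // multiple) * multiple        # round down to a tidy width
        if d_model == 0:
            continue
        ratio = n_layers / d_model
        rows.append({
            "n_layers": n_layers,
            "d_model": d_model,
            "params": approx_params(n_layers, d_model),
            "ratio": ratio,
            "in_interval": INTERVAL[0] <= ratio <= INTERVAL[1],
        })
    return rows

# Hypothetical budget of one billion non-embedding parameters.
rows = shapes_for_budget(budget=1_000_000_000, depths=range(8, 97, 8))
for row in rows:
    print(row)
```

Under these assumptions only a narrow band of depths lands inside the interval for a given budget, which is the sense in which shape becomes a lever once the budget is fixed.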

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Architecture design could prioritize depth-width ratios that target this interval for new models.
The stability of the interval might allow predicting good shapes for larger scales without extensive search.
Similar principles could extend to other architectures or tasks beyond dense LLMs.

Load-bearing premise

The Neural Feature Ansatz gives a definition of neural interaction whose efficiency directly determines generalization performance.

What would settle it

Finding a model with high interaction efficiency but poor generalization performance under the same fixed budget, or a model outside the interval with unexpectedly strong generalization.

Figures

Figures reproduced from arXiv: 2605.27989 by Jinning Yang, Mengnan Du, Shuai Zhang, Wenjie Sun.

**Figure 1.** L_test, AOFE, and AOFE-ratio across dataset sizes. (a) L_test and AOFE as functions of training set size. (b) AOFE-ratio as a function of training set size. (c) Representative AGOP heatmaps at selected training set sizes. Liu et al. [14] argue that loss can arise from interference between features induced by superposition. Through the NFA, such parameter space interference has a gradient space counterpart: t…

**Figure 2.** Cross-network fixed-budget shape sweeps. Top row: test loss versus …

**Figure 3.** Budget-wise best points in the Tiny Transformer shape sweep. Left: …

**Figure 4.** R_{D/W} distance and MMLU-Pro performance in small dense LLMs. (a) R_{D/W} of contemporary small dense LLMs. The shaded region denotes the interaction-efficient interval (0.023 ≤ L/d_model ≤ 0.047); color indicates MMLU-Pro. (b) MMLU-Pro versus distance to the interval, grouped by parameter scale, with fitted linear trends within each group. … in the 7–9B group (r = −0.18). The weakening at larger scales is expected…

Original abstract

The guidance of scaling laws has increased the resource demands of modern large language models (LLMs), yet it remains questionable whether these models utilize resources effectively under a fixed budget. Previous research has proved superposition as a key contributor to loss. By leveraging the Neural Feature Ansatz, we extend superposition from parameter space to gradient space and define it as neural interaction. We find that under a fixed budget, good generalization is usually accompanied by efficient neural interactions, and the model can be placed in an efficient interaction interval by adjusting its depth-width ratio ($R_{D/W}$). In addition, as the budget scales up, the efficient interaction interval of the model remains relatively stable. By comparing existing small scale dense LLMs, we observe that models operating near this interval tend to perform better on the MMLU-Pro benchmark. Our findings reveal that the $R_{D/W}$ influences resource utilization efficiency and thereby affects generalization, providing insights into model shape initialization and the understanding of model generalization mechanisms. Code for Neural Interaction Law is available at: https://anonymous.4open.science/r/Neural_Interaction_Law-D788

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports an empirical correlation between depth-width ratio and a gradient-space interaction measure that appears stable with scale and lines up with MMLU-Pro scores in small dense models.

The main point is that under fixed compute, models with depth-width ratios landing in a certain range show better generalization, and that range stays roughly the same as budgets grow. They reach this by extending superposition ideas into gradient space with the Neural Feature Ansatz, label the result neural interaction, and observe that efficient interaction tracks good performance while the shape ratio can steer models into that regime.

What stands out is the claim that the efficient interval remains stable across scales and that existing small LLMs sitting near it tend to score higher on MMLU-Pro. The code release lets others inspect the measurements directly, which is useful.

The weak parts are straightforward. The abstract supplies no explicit formula or threshold for interaction efficiency, no error bars, and no statistical checks. The interval itself is identified from the same runs used to demonstrate the correlation, so it risks being a post-hoc description rather than an independent finding. The ansatz is used without a derivation or ablation showing it isolates the relevant quantity from other depth-width effects such as optimization behavior or raw capacity.

This is aimed at people who design or initialize small-to-medium dense models under tight budgets and want a shape-based lever. It engages honestly with scaling questions even if the current evidence is observational. I would send it for peer review so the measurements, interval selection, and causal status of the ansatz can be examined in detail.

Referee Report

4 major / 2 minor

Summary. The paper claims that under a fixed computational budget, good generalization is accompanied by efficient neural interactions (defined by extending superposition to gradient space via the Neural Feature Ansatz), that the depth-width ratio R_{D/W} can place a model inside a stable 'efficient interaction interval', and that models operating near this interval perform better on MMLU-Pro; these observations are presented as a 'Law of Neural Interaction'.

Significance. If the Neural Feature Ansatz were shown to be a causally relevant and independently validated metric, the work could provide a new lens on depth-width trade-offs and resource utilization. As written, the correlational nature of the results and absence of validation for the core metric limit significance to an exploratory observation rather than a substantiated law.

major comments (4)

The entire central claim rests on the Neural Feature Ansatz supplying a valid definition of neural interaction whose efficiency governs generalization; the manuscript provides no derivation, independent validation, ablation isolating it from other depth-width effects (e.g., optimization dynamics), or counter-example test (Abstract; § on Neural Feature Ansatz).
The efficient interaction interval boundaries are identified from the same model runs used to demonstrate the generalization correlation, rendering the reported 'law' at least partly descriptive rather than predictive; no out-of-sample test or pre-defined boundaries are shown (Results on interval stability).
No quantitative definition of interaction efficiency, error bars, statistical tests, or details on how interval boundaries were determined are reported, so the claims of correlation, adjustability via R_{D/W}, and scale stability cannot be assessed for reliability (Experimental results and benchmark comparison).
The MMLU-Pro comparison with existing small-scale dense LLMs lacks controls for confounding factors such as training data volume or optimizer settings, weakening the inference that proximity to the interval drives performance (Benchmark comparison section).
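The kind of statistical check requested in these comments can be sketched in a few lines. The example below computes a model's distance to the interval and pairs a Pearson correlation with a permutation test. The distance definition (zero inside the interval, otherwise the gap to the nearest boundary) is an assumption, since this page does not state how the paper measures it, and the ratios passed in the demo line are arbitrary placeholders rather than measurements of any real model.

```python
import numpy as np

def distance_to_interval(ratio, low=0.023, high=0.047):
    """0 inside [low, high], otherwise the gap to the nearest boundary (assumed definition)."""
    ratio = np.asarray(ratio, dtype=float)
    return np.maximum(0.0, np.maximum(low - ratio, ratio - high))

def pearson_with_permutation_test(x, y, n_perm=10_000, seed=0):
    """Pearson r plus a two-sided permutation p-value for the null of no association."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any pairing between distance and score.
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs):
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)   # add-one correction keeps p > 0
    return float(r_obs), float(p_value)

# Arbitrary placeholder ratios: below, inside, and above the interval.
print(distance_to_interval([0.010, 0.030, 0.060]))
```

With the small per-group sample sizes typical of such comparisons, the permutation p-value gives a more honest sense of uncertainty than a fitted trend line alone. It does not address the confounders named in the fourth comment.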

minor comments (2)

Clarify the exact mathematical definition of the Neural Feature Ansatz extension to gradient space and how efficiency is quantified (e.g., a specific equation or algorithm).
The anonymous code link should be replaced with a permanent repository containing the exact scripts used to compute interactions and intervals.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional rigor and clarity can strengthen the presentation of our exploratory findings on neural interaction efficiency. We address each major comment below and indicate the revisions we will make.

Referee: The entire central claim rests on the Neural Feature Ansatz supplying a valid definition of neural interaction whose efficiency governs generalization; the manuscript provides no derivation, independent validation, ablation isolating it from other depth-width effects (e.g., optimization dynamics), or counter-example test (Abstract; § on Neural Feature Ansatz).

Authors: The Neural Feature Ansatz is presented as a direct extension of superposition into gradient space, with the manuscript providing the motivating connection and initial empirical support. We agree that an explicit derivation, independent validation experiments, and ablations isolating the metric from optimization dynamics would improve the work. We will add a dedicated subsection with the mathematical derivation, an ablation study, and discussion of potential counter-examples in the revised manuscript.

revision: yes

Referee: The efficient interaction interval boundaries are identified from the same model runs used to demonstrate the generalization correlation, rendering the reported 'law' at least partly descriptive rather than predictive; no out-of-sample test or pre-defined boundaries are shown (Results on interval stability).

Authors: We acknowledge that the interval was initially characterized from the primary experimental runs. To strengthen the predictive claim, we will conduct and report additional out-of-sample experiments on held-out model configurations and scales, using boundaries pre-defined from a subset of the data. This will be added to the Results section on interval stability.

revision: yes

Referee: No quantitative definition of interaction efficiency, error bars, statistical tests, or details on how interval boundaries were determined are reported, so the claims of correlation, adjustability via R_{D/W}, and scale stability cannot be assessed for reliability (Experimental results and benchmark comparison).

Authors: We agree these quantitative details are essential. The revision will include: (i) a precise mathematical definition of interaction efficiency, (ii) error bars on all relevant figures, (iii) statistical tests for reported correlations, and (iv) explicit methodology for boundary determination (e.g., threshold selection criteria). These additions will appear in the Experimental results and benchmark comparison sections.

revision: yes

Referee: The MMLU-Pro comparison with existing small-scale dense LLMs lacks controls for confounding factors such as training data volume or optimizer settings, weakening the inference that proximity to the interval drives performance (Benchmark comparison section).

Authors: This is a fair observation; the current comparison is observational. We will revise the Benchmark comparison section to explicitly discuss confounding factors, qualify the correlational nature of the inference, and add any feasible controls or sensitivity analyses using available model metadata. The language will be adjusted to reflect these limitations.

revision: partial

Circularity Check

2 steps flagged

Neural Feature Ansatz supplies the interaction-efficiency metric; efficient interval identified from same runs used to report the correlation

specific steps

ansatz smuggled in via citation [Abstract]
"By leveraging the Neural Feature Ansatz, we extend superposition from parameter space to gradient space and define it as neural interaction."

The paper adopts the Neural Feature Ansatz as the definition of the central quantity (neural interaction) without re-deriving or independently validating it inside this manuscript; the subsequent 'law' is then built on correlations measured with that ansatz-derived metric.

fitted input called prediction [Abstract]
"We find that under a fixed budget, good generalization is usually accompanied by efficient neural interactions, and the model can be placed in an efficient interaction interval by adjusting its depth-width ratio (R_D/W). In addition, as the budget scales up, the efficient interaction interval of the model remains relatively stable."

The efficient interaction interval and its stability are identified by inspecting the same model runs whose generalization performance is being correlated with the interaction-efficiency metric; the reported 'law' is therefore a statistical description of the observed data rather than an independent prediction.

full rationale

The manuscript defines neural interaction by extending superposition via the Neural Feature Ansatz into gradient space, then reports that good generalization occurs inside an 'efficient interaction interval' whose location is stable with scale. Both the metric and the interval boundaries are obtained from the identical set of depth-width experiments; no independent derivation, external validation, or ablation isolating the ansatz quantity from other depth-width effects is supplied. This reduces the claimed 'law' to a post-hoc description of the fitted data rather than a first-principles prediction. The central claim therefore exhibits partial circularity of the fitted-input-called-prediction and ansatz-smuggled-in varieties, warranting a score of 6.
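One way to separate fitting from prediction, in the spirit of the out-of-sample test discussed above, is to freeze the interval using only some budgets and then check it against budgets that played no part in setting it. The sketch below is a hypothetical procedure, not the paper's: the function names, the 1% relative loss tolerance, and the rule of intersecting per-budget near-optimal ranges are all assumptions. Each sweep is a list of (ratio, test_loss) pairs with positive losses.

```python
def near_optimal_interval(sweep, tolerance=0.01):
    """Range of ratios whose loss is within `tolerance` (relative) of the best loss."""
    best = min(loss for _, loss in sweep)
    kept = [ratio for ratio, loss in sweep if loss <= best * (1.0 + tolerance)]
    return min(kept), max(kept)

def predefined_interval(fit_sweeps, tolerance=0.01):
    """Intersect the per-budget near-optimal ranges, using the fitting budgets only."""
    ranges = [near_optimal_interval(s, tolerance) for s in fit_sweeps]
    low = max(lo for lo, _ in ranges)
    high = min(hi for _, hi in ranges)
    if low > high:
        raise ValueError("fitting budgets share no common near-optimal range")
    return low, high

def held_out_hit(interval, held_out_sweep):
    """True if the best ratio of an unseen budget falls inside the frozen interval."""
    best_ratio, _ = min(held_out_sweep, key=lambda pair: pair[1])
    return interval[0] <= best_ratio <= interval[1]
```

Reporting the fraction of held-out budgets for which `held_out_hit` returns True, with the boundaries fixed beforehand, would turn the stability claim into something that can fail.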

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the Neural Feature Ansatz as the bridge from superposition to gradient-space interaction and on the empirical identification of the efficient interval; no explicit free parameters are named but the interval boundaries are necessarily data-derived.

free parameters (1)

efficient interaction interval boundaries
The interval is located by observing model behavior, implying data-dependent thresholds rather than a parameter-free derivation.

axioms (1)

domain assumption: the Neural Feature Ansatz correctly extends superposition to gradient space for measuring interaction efficiency
Invoked to define neural interaction as the basis for the efficiency metric.

invented entities (1)

neural interaction (no independent evidence)
purpose: Quantity in gradient space whose efficiency is claimed to control generalization
Newly introduced term built on the Neural Feature Ansatz with no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5733 in / 1290 out tokens · 35515 ms · 2026-06-29T14:11:34.048327+00:00 · methodology

Reference graph

Works this paper leans on

40 extracted references · 19 canonical work pages · 9 internal anchors

[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361
[2] Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., & Zhou, Y. (2017) Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409
[3] Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., & Shavit, N. (2019) A constructive prediction of the generalization error across scales. arXiv preprint arXiv:1909.12673
[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & others (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33:1877–1901
[5] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D., Hendricks, L. A., Welbl, J., Clark, A., & others (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556
[6] Hernandez, D., Kaplan, J., Henighan, T., & McCandlish, S. (2021) Scaling laws for transfer. arXiv preprint arXiv:2102.01293
[7] Ghorbani, B., Firat, O., Freitag, M., Bapna, A., Krikun, M., Garcia, X., Chelba, C., & Cherry, C. (2021) Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740
[8] Alabdulmohsin, I. M., Neyshabur, B., & Zhai, X. (2022) Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems 35:22300–22312
[9] Bordelon, B., Atanasov, A., & Pehlevan, C. (2024) A dynamical model of neural scaling laws. arXiv preprint arXiv:2402.01092
[10] Bahri, Y., Dyer, E., Kaplan, J., Lee, J., & Sharma, U. (2024) Explaining neural scaling laws. Proceedings of the National Academy of Sciences 121(27):e2311878121
[11] Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., & Morcos, A. (2022) Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems 35:19523–19536
[12] Hutter, M. (2021) Learning curve theory. arXiv preprint arXiv:2102.04074
[13] Bordelon, B., Atanasov, A., & Pehlevan, C. (2025) How feature learning can improve neural scaling laws. Journal of Statistical Mechanics: Theory and Experiment 2025(8):084002
[14] Liu, Y., Liu, Z., & Gore, J. (2025) Superposition yields robust neural scaling. arXiv preprint arXiv:2505.10465
[15] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., & others (2022) Toy models of superposition. arXiv preprint arXiv:2209.10652
[16] Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600
[17] Henighan, T., Carter, S., Hume, T., Elhage, N., Lasenby, R., Fort, S., Schiefer, N., & Olah, C. (2023) Superposition, memorization, and double descent. Transformer Circuits Thread 6(24):1725–1744
[18] Elhage, N., Lasenby, R., & Olah, C. (2023) Privileged bases in the transformer residual stream. Transformer Circuits Thread 24
[19] Radhakrishnan, A., Beaglehole, D., Pandit, P., & Belkin, M. (2024) Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science 383(6690):1461–1467
[20] Tansley, E., Massart, E., & Cartis, C. (2025) On the neural feature ansatz for deep neural networks. arXiv preprint arXiv:2510.15563
[21] d’Angelo, F., Andriushchenko, M., Varre, A., & Flammarion, N. (2024) Why do we need weight decay in modern deep learning? Advances in Neural Information Processing Systems 37:23191–23223
[22] Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018) The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19(70):1–57
[23] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A. Y., & others (2011) Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning 2011, pp. 4. Granada
[24] Sharma, U. & Kaplan, J. (2022) Scaling laws from the data manifold dimension. Journal of Machine Learning Research 23(9):1–34
[25] Song, J., Liu, Z., Tegmark, M., & Gore, J. (2024) A resource model for neural scaling law. arXiv preprint arXiv:2402.05164
[26] Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., & Srebro, N. (2017) Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems 30
[27] Jacot, A., Gabriel, F., & Hongler, C. (2018) Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems 31
[28] Gao, L., Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024) Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093
[29] Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024) Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278–300
[30] Sun, W., Wang, D., & Hu, L. () The price of amortized inference in sparse autoencoders. In The Fourteenth International Conference on Learning Representations
[31] Lawson, T., Farnik, L., Houghton, C., & Aitchison, L. (2024) Residual stream analysis with multi-layer saes. arXiv preprint arXiv:2409.04185
[32] Balagansky, N., Maksimov, I., & Gavrilov, D. (2024) Mechanistic permutability: Match features across layers. arXiv preprint arXiv:2410.07656
[33] Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., & Olah, C. (2024) Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, pages 3982–3992
[34] Shi, W., Li, S., Liang, T., Wan, M., Ma, G., Wang, X., & He, X. (2025) Route sparse autoencoder to interpret large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6812–6826
[35] Hanna, M., Piotrowski, M., Lindsey, J., & Ameisen, E. (2025) Circuit-tracer: A new library for finding feature circuits. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 239–249
[36] Beaglehole, D., Súkeník, P., Mondelli, M., & Belkin, M. (2024) Average gradient outer product as a mechanism for deep neural collapse. Advances in Neural Information Processing Systems 37:130764–130796
[37] Radhakrishnan, A., Beaglehole, D., Pandit, P., & Belkin, M. (2022) Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features. arXiv preprint arXiv:2212.13881
[38] Frankle, J. & Carbin, M. (2018) The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635
[39] Mallinar, N., Beaglehole, D., Zhu, L., Radhakrishnan, A., Pandit, P., & Belkin, M. (2024) Emergence in non-neural models: grokking modular arithmetic via average gradient outer product. arXiv preprint arXiv:2407.20199

[1] [1]

Scaling Laws for Neural Language Models

Kaplan , J., McCandlish , S., Henighan , T., Brown , T. B., Chess , B., Child , R., Gray , S., Radford , A., Wu , J., & Amodei , D. (2020) Scaling laws for neural language models.arXiv preprint arXiv:2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020

[2] [2]

Hestness , J., Narang , S., Ardalani , N., Diamos , G., Jun , H., Kianinejad , H., Patwary , M. M. A., Yang , Y ., & Zhou , Y . (2017) Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

org/abs/1909.12673

Rosenfeld , J. S., Rosenfeld , A., Belinkov , Y ., & Shavit , N. (2019) A constructive prediction of the generalization error across scales.arXiv preprint arXiv:1909.12673

work page arXiv 2019

[4] [4]

Brown , T., Mann , B., Ryder , N., Subbiah , M., Kaplan , J. D., Dhariwal , P., Neelakantan , A., Shyam , P., Sastry , G., Askell , A., & others (2020) Language models are few-shot learners.Advances in neural information processing systems33:1877–1901

2020

[5] [5]

Training Compute-Optimal Large Language Models

Hoffmann , J., Borgeaud , S., Mensch , A., Buchatskaya , E., Cai , T., Rutherford , E., Casas , D., Hendricks , L. A., Welbl , J., Clark , A., & others (2022) Training compute-optimal large language models.arXiv preprint arXiv:2203.1555610

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Scaling Laws for Transfer

Hernandez , D., Kaplan , J., Henighan , T., & McCandlish , S. (2021) Scaling laws for transfer.arXiv preprint arXiv:2102.01293

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

(2021) Scaling laws for neural machine translation.arXiv preprint arXiv:2109.07740

Ghorbani , B., Firat , O., Freitag , M., Bapna , A., Krikun , M., Garcia , X., Chelba , C., & Cherry , C. (2021) Scaling laws for neural machine translation.arXiv preprint arXiv:2109.07740

work page arXiv 2021

[8] [8]

M., Neyshabur , B., & Zhai , X

Alabdulmohsin , I. M., Neyshabur , B., & Zhai , X. (2022) Revisiting neural scaling laws in language and vision.Advances in Neural Information Processing Systems35:22300–22312

2022

[9] [9]

(2024) A dynamical model of neural scaling laws.arXiv preprint arXiv:2402.01092

Bordelon , B., Atanasov , A., & Pehlevan , C. (2024) A dynamical model of neural scaling laws.arXiv preprint arXiv:2402.01092

work page arXiv 2024

[10] [10]

(2024) Explaining neural scaling laws.Proceedings of the National Academy of Sciences121(27):e2311878121

Bahri , Y ., Dyer , E., Kaplan , J., Lee , J., & Sharma , U. (2024) Explaining neural scaling laws.Proceedings of the National Academy of Sciences121(27):e2311878121

2024

[11] [11]

(2022) Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems35: 19523–19536

Sorscher , B., Geirhos , R., Shekhar , S., Ganguli , S., & Morcos , A. (2022) Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems35: 19523–19536

2022

[12] [12]

Hutter, Learning curve theory, arXiv preprint arXiv:2102.04074 (2021)

Hutter , M. (2021) Learning curve theory.arXiv preprint arXiv:2102.04074

work page arXiv 2021

[13] Bordelon, B., Atanasov, A., & Pehlevan, C. (2025) How feature learning can improve neural scaling laws. Journal of Statistical Mechanics: Theory and Experiment 2025(8):084002

[14] Liu, Y., Liu, Z., & Gore, J. (2025) Superposition yields robust neural scaling. arXiv preprint arXiv:2505.10465

[15] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., & others (2022) Toy models of superposition. arXiv preprint arXiv:2209.10652

[16] Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600

[17] Henighan, T., Carter, S., Hume, T., Elhage, N., Lasenby, R., Fort, S., Schiefer, N., & Olah, C. (2023) Superposition, memorization, and double descent. Transformer Circuits Thread 6(24):1725–1744

[18] Elhage, N., Lasenby, R., & Olah, C. (2023) Privileged bases in the transformer residual stream. Transformer Circuits Thread 24

[19] Radhakrishnan, A., Beaglehole, D., Pandit, P., & Belkin, M. (2024) Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science 383(6690):1461–1467

[20] Tansley, E., Massart, E., & Cartis, C. (2025) On the neural feature ansatz for deep neural networks. arXiv preprint arXiv:2510.15563

[21] d’Angelo, F., Andriushchenko, M., Varre, A., & Flammarion, N. (2024) Why do we need weight decay in modern deep learning? Advances in Neural Information Processing Systems 37:23191–23223

[22] Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018) The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19(70):1–57

[23] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A. Y., & others (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, pp. 4. Granada

[24] Sharma, U. & Kaplan, J. (2022) Scaling laws from the data manifold dimension. Journal of Machine Learning Research 23(9):1–34

[25] Song, J., Liu, Z., Tegmark, M., & Gore, J. (2024) A resource model for neural scaling law. arXiv preprint arXiv:2402.05164

[26] Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., & Srebro, N. (2017) Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems 30

[27] Jacot, A., Gabriel, F., & Hongler, C. (2018) Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems 31

[28] Gao, L., Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024) Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093

[29] Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024) Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278–300

[30] Sun, W., Wang, D., & Hu, L. () The price of amortized inference in sparse autoencoders. In The Fourteenth International Conference on Learning Representations

[31] Lawson, T., Farnik, L., Houghton, C., & Aitchison, L. (2024) Residual stream analysis with multi-layer saes. arXiv preprint arXiv:2409.04185

[32] Balagansky, N., Maksimov, I., & Gavrilov, D. (2024) Mechanistic permutability: Match features across layers. arXiv preprint arXiv:2410.07656

[33] Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., & Olah, C. (2024) Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, pages 3982–3992

[34] Shi, W., Li, S., Liang, T., Wan, M., Ma, G., Wang, X., & He, X. (2025) Route sparse autoencoder to interpret large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6812–6826

[35] Hanna, M., Piotrowski, M., Lindsey, J., & Ameisen, E. (2025) Circuit-tracer: A new library for finding feature circuits. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 239–249

[36] Beaglehole, D., Súkeník, P., Mondelli, M., & Belkin, M. (2024) Average gradient outer product as a mechanism for deep neural collapse. Advances in Neural Information Processing Systems 37:130764–130796

[37] Radhakrishnan, A., Beaglehole, D., Pandit, P., & Belkin, M. (2022) Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features. arXiv preprint arXiv:2212.13881

[38] Frankle, J. & Carbin, M. (2018) The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635

[39] Mallinar, N., Beaglehole, D., Zhu, L., Radhakrishnan, A., Pandit, P., & Belkin, M. (2024) Emergence in non-neural models: grokking modular arithmetic via average gradient outer product. arXiv preprint arXiv:2407.20199

A More Related Work

A.1 Superposition Hypothesis

The superposition hypothesis was originally introduced to explain the phenomenon o...

Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

Justification: The paper does not involve crowdsourcing nor research with human subjects