
arxiv: 2605.27989 · v1 · pith:HUJHHNIInew · submitted 2026-05-27 · 💻 cs.LG

Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization

Pith reviewed 2026-06-29 14:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords neural interaction · depth-width ratio · generalization · scaling laws · superposition · gradient space · LLMs · model shape

The pith

Tuning depth-width ratio places neural networks in an efficient interaction interval that supports better generalization under fixed budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the idea of superposition from parameter space to gradient space using the Neural Feature Ansatz, defining neural interaction as a measure of how efficiently a model uses its resources. It shows that under a fixed compute budget, models with efficient neural interactions tend to generalize better. By adjusting the depth-to-width ratio, a model can be moved into an interval where interactions are efficient, and this interval stays roughly the same even as the overall budget grows larger. Comparisons with existing models suggest that those operating near this interval achieve stronger results on benchmarks.

Core claim

Under a fixed budget, good generalization is accompanied by efficient neural interactions defined in gradient space, and adjusting the depth-width ratio R_{D/W} can position the model in a stable efficient interaction interval.

What carries the argument

The Neural Feature Ansatz, which defines neural interaction efficiency in gradient space as an extension of superposition.
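
For orientation, the Neural Feature Ansatz as stated by Radhakrishnan et al. [19] says, in its basic form, that a trained layer's Gram matrix W^T W is proportional to the average gradient outer product (AGOP) of the network with respect to that layer's input. The sketch below computes both quantities for a toy one-hidden-layer network and reports their cosine similarity. The network, its sizes, and the `offdiag_fraction` helper are illustrative assumptions; they are not the paper's AOFE metric or its released code, and because the weights are random rather than trained, the printed alignment only demonstrates the computation.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gram(W):
    """Return W^T W for a (hidden x input) weight matrix W."""
    d = len(W[0])
    return [[sum(row[i] * row[j] for row in W) for j in range(d)] for i in range(d)]


def input_gradient(W, a, x):
    """Gradient of f(x) = a . tanh(W x) with respect to the input x."""
    pre = matvec(W, x)
    delta = [a_k * (1.0 - math.tanh(p) ** 2) for a_k, p in zip(a, pre)]
    return [sum(delta[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]


def agop(W, a, xs):
    """Average gradient outer product of f over the inputs xs."""
    d = len(xs[0])
    G = [[0.0] * d for _ in range(d)]
    for x in xs:
        g = input_gradient(W, a, x)
        for i in range(d):
            for j in range(d):
                G[i][j] += g[i] * g[j] / len(xs)
    return G


def cosine(A, B):
    """Cosine similarity between two matrices viewed as flat vectors."""
    fa = [v for row in A for v in row]
    fb = [v for row in B for v in row]
    dot = sum(p * q for p, q in zip(fa, fb))
    return dot / (math.sqrt(sum(p * p for p in fa)) * math.sqrt(sum(q * q for q in fb)))


def offdiag_fraction(M):
    """Hypothetical proxy: share of squared Frobenius mass off the diagonal."""
    total = sum(v * v for row in M for v in row)
    diag = sum(M[i][i] ** 2 for i in range(len(M)))
    return (total - diag) / total


# Toy setup (all sizes and the random seed are arbitrary assumptions).
rng = random.Random(0)
d_in, d_hidden, n = 4, 8, 200
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
a = [rng.gauss(0, 1) for _ in range(d_hidden)]
xs = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n)]

G = agop(W, a, xs)
nfa_alignment = cosine(gram(W), G)
print("cosine(W^T W, AGOP):", round(nfa_alignment, 3))
print("off-diagonal fraction of AGOP:", round(offdiag_fraction(G), 3))
```

In a real check, `W` would come from a trained model and the gradient would be obtained by automatic differentiation rather than the closed form used here.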

If this is right

  • Adjusting R_{D/W} can improve generalization by targeting the efficient interaction interval.
  • The efficient interaction interval remains stable as compute budget increases.
  • Among the small dense LLMs compared, models near the efficient interval tend to perform better on MMLU-Pro.
  • Resource utilization efficiency depends on the depth-width shape.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architecture design could prioritize depth-width ratios that target this interval for new models.
  • The stability of the interval might allow predicting good shapes for larger scales without extensive search.
  • Similar principles could extend to other architectures or tasks beyond dense LLMs.

Load-bearing premise

The Neural Feature Ansatz gives a definition of neural interaction whose efficiency directly determines generalization performance.

What would settle it

Finding a model with high interaction efficiency but poor generalization performance under the same fixed budget, or a model outside the interval with unexpectedly strong generalization.

Figures

Figures reproduced from arXiv: 2605.27989 by Jinning Yang, Mengnan Du, Shuai Zhang, Wenjie Sun.

Figure 1: L_test, AOFE, and AOFE-ratio across dataset sizes. (a) L_test and AOFE as functions of training set size. (b) AOFE-ratio as a function of training set size. (c) Representative AGOP heatmaps at selected training set sizes. Liu et al. [14] argue that loss can arise from interference between features induced by superposition. Through the NFA, such parameter space interference has a gradient space counterpart: t…
Figure 2: Cross-Network fixed budget shape sweeps. Top row: test loss versus …
Figure 3: Budget-wise best points in the Tiny Transformer shape sweep. Left: …
Figure 4: R_{D/W} distance and MMLU-Pro performance in small dense LLMs. (a) R_{D/W} of contemporary small dense LLMs. The shaded region denotes the interaction-efficient interval (0.023 ≤ L/d_model ≤ 0.047), color indicates MMLU-Pro. (b) MMLU-Pro versus distance to the interval, grouped by parameter scale, with fitted linear trends within each group. … in the 7–9B group (r = −0.18). The weakening at larger scales is expected…
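
The Figure 4 caption expresses the depth-width ratio as L/d_model and quotes an interaction-efficient interval of 0.023 to 0.047. A minimal sketch of how a model's shape could be compared against that interval follows. The distance rule (zero inside the interval, otherwise the absolute gap to the nearest bound) is an assumption, since the excerpt does not say whether the paper measures distance on a linear or logarithmic scale, and the three configurations are hypothetical shapes rather than models from the paper.

```python
# Interval bounds as quoted in the Figure 4 caption.
LOWER, UPPER = 0.023, 0.047


def depth_width_ratio(num_layers, d_model):
    """R_{D/W} = L / d_model for a model with L layers and hidden size d_model."""
    if num_layers <= 0 or d_model <= 0:
        raise ValueError("num_layers and d_model must be positive")
    return num_layers / d_model


def distance_to_interval(ratio, lower=LOWER, upper=UPPER):
    """Assumed distance: 0 inside [lower, upper], else gap to the nearest bound."""
    if ratio < lower:
        return lower - ratio
    if ratio > upper:
        return ratio - upper
    return 0.0


# Hypothetical (num_layers, d_model) shapes, for illustration only.
configs = {
    "wide-example": (16, 2048),
    "balanced-example": (32, 1024),
    "deep-example": (64, 512),
}

for name, (num_layers, d_model) in configs.items():
    ratio = depth_width_ratio(num_layers, d_model)
    print(name, round(ratio, 5), round(distance_to_interval(ratio), 5))
```

Under these assumptions only the 32-layer, 1024-wide shape falls inside the interval; the other two sit below and above it respectively.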
Original abstract

The guidance of scaling laws has increased the resource demands of modern large language models (LLMs), yet it remains questionable whether these models utilize resources effectively under a fixed budget. Previous research has proved superposition as a key contributor to loss. By leveraging the Neural Feature Ansatz, we extend superposition from parameter space to gradient space and define it as neural interaction. We find that under a fixed budget, good generalization is usually accompanied by efficient neural interactions, and the model can be placed in an efficient interaction interval by adjusting its depth-width ratio ($R_{D/W}$). In addition, as the budget scales up, the efficient interaction interval of the model remains relatively stable. By comparing existing small scale dense LLMs, we observe that models operating near this interval tend to perform better on the MMLU-Pro benchmark. Our findings reveal that the $R_{D/W}$ influences resource utilization efficiency and thereby affects generalization, providing insights into model shape initialization and the understanding of model generalization mechanisms. Code for Neural Interaction Law is available at: https://anonymous.4open.science/r/Neural_Interaction_Law-D788

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper claims that under a fixed computational budget, good generalization is accompanied by efficient neural interactions (defined by extending superposition to gradient space via the Neural Feature Ansatz), that the depth-width ratio R_{D/W} can place a model inside a stable 'efficient interaction interval', and that models operating near this interval perform better on MMLU-Pro; these observations are presented as a 'Law of Neural Interaction'.

Significance. If the interaction-efficiency metric derived from the Neural Feature Ansatz were shown to be causally relevant and independently validated, the work could provide a new lens on depth-width trade-offs and resource utilization. As written, the correlational nature of the results and the absence of validation for the core metric limit its significance to an exploratory observation rather than a substantiated law.

major comments (4)
  1. The entire central claim rests on the Neural Feature Ansatz supplying a valid definition of neural interaction whose efficiency governs generalization; the manuscript provides no derivation, independent validation, ablation isolating it from other depth-width effects (e.g., optimization dynamics), or counter-example test (Abstract; § on Neural Feature Ansatz).
  2. The efficient interaction interval boundaries are identified from the same model runs used to demonstrate the generalization correlation, rendering the reported 'law' at least partly descriptive rather than predictive; no out-of-sample test or pre-defined boundaries are shown (Results on interval stability).
  3. No quantitative definition of interaction efficiency, error bars, statistical tests, or details on how interval boundaries were determined are reported, so the claims of correlation, adjustability via R_{D/W}, and scale stability cannot be assessed for reliability (Experimental results and benchmark comparison).
  4. The MMLU-Pro comparison with existing small-scale dense LLMs lacks controls for confounding factors such as training data volume or optimizer settings, weakening the inference that proximity to the interval drives performance (Benchmark comparison section).
minor comments (2)
  1. Clarify the exact mathematical definition of the Neural Feature Ansatz extension to gradient space and how efficiency is quantified (e.g., a specific equation or algorithm).
  2. The anonymous code link should be replaced with a permanent repository containing the exact scripts used to compute interactions and intervals.
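
Major comment 3 asks for statistical tests behind the reported correlations. One simple option, sketched here, is a Pearson correlation between distance to the interval and benchmark score, paired with a permutation test for its p-value. The data points are synthetic placeholders and not results from the paper; the choice of a two-sided permutation test is likewise an assumption about what a suitable test might look like.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Synthetic placeholder data: distance to the interval vs. benchmark score.
distances = [0.000, 0.004, 0.010, 0.015, 0.021, 0.030]
scores = [41.0, 39.5, 36.0, 37.2, 33.1, 30.4]

r = pearson_r(distances, scores)
p = permutation_p_value(distances, scores)
print("r =", round(r, 3), "permutation p =", round(p, 4))
```

With only a handful of models per parameter group, as in Figure 4, a permutation test of this kind avoids leaning on normality assumptions, though it does nothing about the confounders raised in major comment 4.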

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional rigor and clarity can strengthen the presentation of our exploratory findings on neural interaction efficiency. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: The entire central claim rests on the Neural Feature Ansatz supplying a valid definition of neural interaction whose efficiency governs generalization; the manuscript provides no derivation, independent validation, ablation isolating it from other depth-width effects (e.g., optimization dynamics), or counter-example test (Abstract; § on Neural Feature Ansatz).

    Authors: The Neural Feature Ansatz is presented as a direct extension of superposition into gradient space, with the manuscript providing the motivating connection and initial empirical support. We agree that an explicit derivation, independent validation experiments, and ablations isolating the metric from optimization dynamics would improve the work. We will add a dedicated subsection with the mathematical derivation, an ablation study, and discussion of potential counter-examples in the revised manuscript. revision: yes

  2. Referee: The efficient interaction interval boundaries are identified from the same model runs used to demonstrate the generalization correlation, rendering the reported 'law' at least partly descriptive rather than predictive; no out-of-sample test or pre-defined boundaries are shown (Results on interval stability).

    Authors: We acknowledge that the interval was initially characterized from the primary experimental runs. To strengthen the predictive claim, we will conduct and report additional out-of-sample experiments on held-out model configurations and scales, using boundaries pre-defined from a subset of the data. This will be added to the Results section on interval stability. revision: yes

  3. Referee: No quantitative definition of interaction efficiency, error bars, statistical tests, or details on how interval boundaries were determined are reported, so the claims of correlation, adjustability via R_{D/W}, and scale stability cannot be assessed for reliability (Experimental results and benchmark comparison).

    Authors: We agree these quantitative details are essential. The revision will include: (i) a precise mathematical definition of interaction efficiency, (ii) error bars on all relevant figures, (iii) statistical tests for reported correlations, and (iv) explicit methodology for boundary determination (e.g., threshold selection criteria). These additions will appear in the Experimental results and benchmark comparison sections. revision: yes

  4. Referee: The MMLU-Pro comparison with existing small-scale dense LLMs lacks controls for confounding factors such as training data volume or optimizer settings, weakening the inference that proximity to the interval drives performance (Benchmark comparison section).

    Authors: This is a fair observation; the current comparison is observational. We will revise the Benchmark comparison section to explicitly discuss confounding factors, qualify the correlational nature of the inference, and add any feasible controls or sensitivity analyses using available model metadata. The language will be adjusted to reflect these limitations. revision: partial
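
The out-of-sample check proposed in response 2 can be sketched as follows: fix the interval using sweeps from some budgets only, then ask how often the best shape at held-out budgets lands inside it. The boundary rule used here (all ratios whose loss is within a relative tolerance of the best loss at each fitting budget) is an assumption, not the paper's procedure, and every number is synthetic.

```python
def near_optimal_ratios(sweep, tol=0.02):
    """Ratios whose loss is within a relative tolerance of the sweep's best loss."""
    best = min(loss for _, loss in sweep)
    return [ratio for ratio, loss in sweep if loss <= best * (1.0 + tol)]


def fit_interval(sweeps, tol=0.02):
    """Interval spanned by near-optimal ratios across the fitting budgets."""
    ratios = [r for sweep in sweeps for r in near_optimal_ratios(sweep, tol)]
    return min(ratios), max(ratios)


def best_ratio(sweep):
    """Ratio with the lowest test loss in one budget's sweep."""
    return min(sweep, key=lambda pair: pair[1])[0]


def held_out_hit_rate(interval, sweeps):
    """Fraction of held-out budgets whose best ratio lies inside the interval."""
    lower, upper = interval
    hits = sum(lower <= best_ratio(sweep) <= upper for sweep in sweeps)
    return hits / len(sweeps)


# Synthetic sweeps: each is a list of (depth-width ratio, test loss) pairs.
fit_sweeps = [
    [(0.010, 3.40), (0.025, 3.21), (0.040, 3.20), (0.080, 3.35)],
    [(0.010, 3.10), (0.025, 2.95), (0.040, 2.97), (0.080, 3.12)],
]
held_out_sweeps = [
    [(0.010, 2.90), (0.030, 2.74), (0.045, 2.76), (0.090, 2.95)],
    [(0.010, 2.70), (0.030, 2.61), (0.060, 2.58), (0.090, 2.75)],
]

interval = fit_interval(fit_sweeps)
rate = held_out_hit_rate(interval, held_out_sweeps)
print("pre-defined interval:", interval, "held-out hit rate:", rate)
```

The key design point is that `fit_interval` never sees the held-out sweeps, so a high hit rate would count as a prediction rather than a description of the same runs.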

Circularity Check

2 steps flagged

Neural Feature Ansatz supplies the interaction-efficiency metric; efficient interval identified from same runs used to report the correlation

specific steps
  1. ansatz smuggled in via citation [Abstract]
    "By leveraging the Neural Feature Ansatz, we extend superposition from parameter space to gradient space and define it as neural interaction."

    The paper adopts the Neural Feature Ansatz as the definition of the central quantity (neural interaction) without re-deriving or independently validating it inside this manuscript; the subsequent 'law' is then built on correlations measured with that ansatz-derived metric.

  2. fitted input called prediction [Abstract]
    "We find that under a fixed budget, good generalization is usually accompanied by efficient neural interactions, and the model can be placed in an efficient interaction interval by adjusting its depth-width ratio (R_D/W). In addition, as the budget scales up, the efficient interaction interval of the model remains relatively stable."

    The efficient interaction interval and its stability are identified by inspecting the same model runs whose generalization performance is being correlated with the interaction-efficiency metric; the reported 'law' is therefore a statistical description of the observed data rather than an independent prediction.

full rationale

The manuscript defines neural interaction by extending superposition via the Neural Feature Ansatz into gradient space, then reports that good generalization occurs inside an 'efficient interaction interval' whose location is stable with scale. Both the metric and the interval boundaries are obtained from the identical set of depth-width experiments; no independent derivation, external validation, or ablation isolating the ansatz quantity from other depth-width effects is supplied. This reduces the claimed 'law' to a post-hoc description of the fitted data rather than a first-principles prediction. The central claim therefore exhibits partial circularity of the fitted-input-called-prediction and ansatz-smuggled-in varieties, warranting a score of 6.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the Neural Feature Ansatz as the bridge from superposition to gradient-space interaction and on the empirical identification of the efficient interval; no explicit free parameters are named but the interval boundaries are necessarily data-derived.

free parameters (1)
  • efficient interaction interval boundaries
    The interval is located by observing model behavior, implying data-dependent thresholds rather than a parameter-free derivation.
axioms (1)
  • domain assumption: Neural Feature Ansatz correctly extends superposition to gradient space for measuring interaction efficiency
    Invoked to define neural interaction as the basis for the efficiency metric.
invented entities (1)
  • neural interaction (no independent evidence)
    purpose: Quantity in gradient space whose efficiency is claimed to control generalization
    Newly introduced term built on the Neural Feature Ansatz with no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5733 in / 1290 out tokens · 35515 ms · 2026-06-29T14:11:34.048327+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361
  2. [2] Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., & Zhou, Y. (2017) Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409
  3. [3] Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., & Shavit, N. (2019) A constructive prediction of the generalization error across scales. arXiv preprint arXiv:1909.12673
  4. [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & others (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33:1877–1901
  5. [5] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D., Hendricks, L. A., Welbl, J., Clark, A., & others (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556
  6. [6] Hernandez, D., Kaplan, J., Henighan, T., & McCandlish, S. (2021) Scaling laws for transfer. arXiv preprint arXiv:2102.01293
  7. [7] Ghorbani, B., Firat, O., Freitag, M., Bapna, A., Krikun, M., Garcia, X., Chelba, C., & Cherry, C. (2021) Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740
  8. [8] Alabdulmohsin, I. M., Neyshabur, B., & Zhai, X. (2022) Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems 35:22300–22312
  9. [9] Bordelon, B., Atanasov, A., & Pehlevan, C. (2024) A dynamical model of neural scaling laws. arXiv preprint arXiv:2402.01092
  10. [10] Bahri, Y., Dyer, E., Kaplan, J., Lee, J., & Sharma, U. (2024) Explaining neural scaling laws. Proceedings of the National Academy of Sciences 121(27):e2311878121
  11. [11] Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., & Morcos, A. (2022) Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems 35:19523–19536
  12. [12] Hutter, M. (2021) Learning curve theory. arXiv preprint arXiv:2102.04074
  13. [13] Bordelon, B., Atanasov, A., & Pehlevan, C. (2025) How feature learning can improve neural scaling laws. Journal of Statistical Mechanics: Theory and Experiment 2025(8):084002
  14. [14] Liu, Y., Liu, Z., & Gore, J. (2025) Superposition yields robust neural scaling. arXiv preprint arXiv:2505.10465
  15. [15] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., & others (2022) Toy models of superposition. arXiv preprint arXiv:2209.10652
  16. [16] Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600
  17. [17] Henighan, T., Carter, S., Hume, T., Elhage, N., Lasenby, R., Fort, S., Schiefer, N., & Olah, C. (2023) Superposition, memorization, and double descent. Transformer Circuits Thread 6(24):1725–1744
  18. [18] Elhage, N., Lasenby, R., & Olah, C. (2023) Privileged bases in the transformer residual stream. Transformer Circuits Thread 24
  19. [19] Radhakrishnan, A., Beaglehole, D., Pandit, P., & Belkin, M. (2024) Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science 383(6690):1461–1467
  20. [20] Tansley, E., Massart, E., & Cartis, C. (2025) On the neural feature ansatz for deep neural networks. arXiv preprint arXiv:2510.15563
  21. [21] d’Angelo, F., Andriushchenko, M., Varre, A., & Flammarion, N. (2024) Why do we need weight decay in modern deep learning? Advances in Neural Information Processing Systems 37:23191–23223
  22. [22] Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018) The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19(70):1–57
  23. [23] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A. Y., & others (2011) Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning 2011, pp. 4. Granada
  24. [24] Sharma, U. & Kaplan, J. (2022) Scaling laws from the data manifold dimension. Journal of Machine Learning Research 23(9):1–34
  25. [25] Song, J., Liu, Z., Tegmark, M., & Gore, J. (2024) A resource model for neural scaling law. arXiv preprint arXiv:2402.05164
  26. [26] Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., & Srebro, N. (2017) Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems 30
  27. [27] Jacot, A., Gabriel, F., & Hongler, C. (2018) Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems 31
  28. [28] Gao, L., Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024) Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093
  29. [29] Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024) Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278–300
  30. [30] Sun, W., Wang, D., & Hu, L. The price of amortized inference in sparse autoencoders. In The Fourteenth International Conference on Learning Representations
  31. [31] Lawson, T., Farnik, L., Houghton, C., & Aitchison, L. (2024) Residual stream analysis with multi-layer saes. arXiv preprint arXiv:2409.04185
  32. [32] Balagansky, N., Maksimov, I., & Gavrilov, D. (2024) Mechanistic permutability: Match features across layers. arXiv preprint arXiv:2410.07656
  33. [33] Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., & Olah, C. (2024) Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, pages 3982–3992
  34. [34] Shi, W., Li, S., Liang, T., Wan, M., Ma, G., Wang, X., & He, X. (2025) Route sparse autoencoder to interpret large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6812–6826
  35. [35] Hanna, M., Piotrowski, M., Lindsey, J., & Ameisen, E. (2025) Circuit-tracer: A new library for finding feature circuits. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 239–249
  36. [36] Beaglehole, D., Súkeník, P., Mondelli, M., & Belkin, M. (2024) Average gradient outer product as a mechanism for deep neural collapse. Advances in Neural Information Processing Systems 37:130764–130796
  37. [37] Radhakrishnan, A., Beaglehole, D., Pandit, P., & Belkin, M. (2022) Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features. arXiv preprint arXiv:2212.13881
  38. [38] Frankle, J. & Carbin, M. (2018) The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635
  39. [39] Mallinar, N., Beaglehole, D., Zhu, L., Radhakrishnan, A., Pandit, P., & Belkin, M. (2024) Emergence in non-neural models: grokking modular arithmetic via average gradient outer product. arXiv preprint arXiv:2407.20199
