pith. machine review for the scientific record. sign in

arxiv: 2605.10164 · v1 · submitted 2026-05-11 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Hyperparameter Transfer for Dense Associative Memories

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:14 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords dense associative memoryhyperparameter transferenergy landscapeshared weightsscaling relationsneural network dynamicsactivation functions
0
0 comments X

The pith

Dense associative memories admit explicit scaling rules that transfer hyperparameters from small models to large ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives formulas that let experimenters tune hyperparameters such as learning rates on small dense associative memory networks and then apply them directly to much larger versions. These formulas arise from the way temporal dynamics unfold on the model's energy landscape when weights are shared both across and inside layers. A sympathetic reader cares because exhaustive searches at large scale are expensive, and these rules promise to cut that cost while still using the sharp, rapidly peaking activations that distinguish DenseAMs from ordinary feed-forward nets. The work demonstrates close agreement between the predicted values and actual training runs across different model sizes.

Core claim

The authors derive explicit prescriptions for transferring hyperparameters tuned on small DenseAM models to larger ones. The prescriptions rest on scaling relations that capture how the temporal dynamics on the energy landscape and the effects of shared weights change with model size. These relations remain accurate enough to produce excellent agreement with empirical results even when the activation functions become sharply peaked.

What carries the argument

Scaling relations obtained from the temporal dynamics on the energy landscape together with the constraints of weight sharing.

If this is right

  • Hyperparameters found on small models can be used at large scale without additional search.
  • Training runs at large scale become cheaper because the search phase stays on small models.
  • The same scaling approach applies to any architecture whose weights are shared across layers and whose dynamics follow an energy landscape.
  • Rapidly peaking activation functions can be retained at scale once the appropriate hyperparameter adjustment is made.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same style of derivation could be attempted for other energy-based or recurrent architectures that reuse weights.
  • If the scaling relations prove robust, they might guide the design of new activation functions whose sharpness can be compensated by the transfer rule.
  • Empirical checks on datasets or tasks outside the paper's experiments would show whether the relations remain accurate under different data statistics.

Load-bearing premise

Simple scaling relations between small and large models hold without important corrections from finite-size effects or changes in activation sharpness.

What would settle it

Train a sequence of DenseAM models at steadily increasing sizes using the prescribed hyperparameter values and measure whether performance deviates systematically from the small-model baseline after the scaling is applied.

Figures

Figures reproduced from arXiv: 2605.10164 by Boris Hanin, Dmitry Krotov, Roi Holtzman.

Figure 1
Figure 1. Figure 1: Learning rate transfer and training loss dynamics for a linear DenseAM trained on a denoising objective (1.2) using mini-batch SGD in the proportional regime (1.3) with scale factors κ = 3, ρ = 10, β = 0.1 and 256 epochs. Here η0 is the effective learning rate (see [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Learning rate transfer for a ReLU DenseAM trained by minimizing the denoising objective (1.2) using mini-batch SGD in the proportional regime (1.3) with scale factors κ = 3, ρ = 10, β = 0.1 and 256 epochs. Here η0 is the effective learning rate from (4.11) and [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Learning rate transfer for SGD and Adam in the proportional regime (1.3) (top row), with κ = 2, ρ = 5, β = 0.1, and the width-only scaling (bottom row), with N = 128, P = 256, β = 0.1 for ReLU DenseAMs. Models are trained to minimize denoising error (1.2) for anisotropic inputs xα ∼ N (0, D), where D is a diagonal matrix with i-th entry proportional to i −2/5 and trace N. • Learning rate transfer. We obser… view at source ↗
Figure 4
Figure 4. Figure 4: HP transfer for DenseAMs trained on MNIST using ReLU or softmax activations and SGD or Adam in the proportional regime (1.3) with κ = 2, ρ = 5, β = 0.1 trained for 256 epochs. We vary input dimension by a plaquette of size j coarse-graining as described in Appendix J. bounded function to keep the energy function (3.1) bounded. Such DenseAMs have favorable theoretical properties that allow them to memorize … view at source ↗
Figure 5
Figure 5. Figure 5: Dynamical consistency across scales for MSE (left), ∆ (1) W Z, ∆ (2) W Z for a ReLU DenseAM trained with SGD in the proportional regime (1.3) with κ = 5, ρ = 2, β = 0.1, and η0 = 0.005. The two terms in ∂W L in (4.6) come from the change of the outer layer (first term) and the change of the hidden features (second term), which are summed since first and second layer weights are tied. For SGD2 , the changes… view at source ↗
Figure 6
Figure 6. Figure 6: Training dynamics collapse for ReLUp DenseAMs trained with SGD in the proportional regime (1.3) with κ = 2, ρ = 5, β = 0.1. We observe that the weight updates ∆(1)W, ∆(2)W in Eq. (B.3), the preactivations updates ∆ (1) W Z, ∆ (2) W Z in Eq. (B.6), and the output updates ∆ (1,1) W F, ∆ (1,2) W F, ∆ (2,1) W F, ∆ (2,2) W F in Eqs. (B.9) all behave the same as N increases. 25 [PITH_FULL_IMAGE:figures/full_fig… view at source ↗
Figure 7
Figure 7. Figure 7: Spike and bulk behaviors for centered and non-centered ReLU DenseAM trained mini-batch SGD in the proportional regime (1.3) with scale factors κ = 2, ρ = 5, β = 0.1 and 256 epochs. The left panels show the maximal eigenvalues λmax(S ⊤S/K) and λmax(S˜⊤S/K ˜ ), corresponding to the non-centered network and the centered network, respectively. These reflect the spike discussed in App. E.1. The right panels sho… view at source ↗
Figure 8
Figure 8. Figure 8: HP transfer for centered and non-centered ReLU DenseAM trained mini-batch SGD in the proportional regime (1.3) with scale factors κ = 2, ρ = 5, β = 0.1 and 256 epochs. The left panels show the MSE loss after 256 training epochs as function of η0 following the prescription in [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Learning rate transfer for a DenseAM with centered ReLUp non-linearity trained by minimizing the denoising objective (1.2) using mini-batch SGD or Adam in the proportional regime (1.3) with scale factors κ = 2, ρ = 5, β = 0.1 and 256 epochs. As the nonlinearity p increases, SGD often exhibits instability at larger learning rates, while Adam remains stable. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Spike and bulk behaviors for centered and non-centered ReLU DenseAM trained with mini-batch Adam in the proportional regime (1.3) with scale factors κ = 2, ρ = 5, β = 0.1 and 256 epochs. In contrast to SGD, the Gram matrix S ⊤S does not enter directly into the updates. We show these diagnostics to provide a comparison with the SGD case in [PITH_FULL_IMAGE:figures/full_fig_p033_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: HP transfer for centered and non-centered ReLU DenseAM trained mini-batch Adam in the proportional regime (1.3) with scale factors κ = 2, ρ = 5, β = 0.1 and 256 epochs. The left panels show the MSE loss after 256 training epochs as function of η0 following the prescription in [PITH_FULL_IMAGE:figures/full_fig_p034_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: The Adam update (F.5) for non-centered and centered ReLU DenseAM in the proportional regime, with κ = 2, ρ = 5, β = 0.1 and η0 = 0.005. Centering (b) provides good collapse of the update dynamics with respect to the non-centered DenseAM in (a). This corresponds to the better HP transfer in [PITH_FULL_IMAGE:figures/full_fig_p035_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: LR transfer for DenseAM with centered vs. non-centered softmax activations trained with Adam in the proportional regime (1.3) with κ = 2, ρ = 5, β = 0.1 trained for 256 epochs and s1 = 1/ √ N, s2 = √ K, ηW = η0. Panels (a,c) show the final MSE loss as a function of η0. We observe improved HP transfer by centering. Panels (b,d) show the training dynamics of the loss for η0 = 0.005. Centering exhibits bette… view at source ↗
Figure 14
Figure 14. Figure 14: Learning rate transfer for softmax DenseAM trained with SGD and Adam in the proportional regime (1.3) with κ = 2, ρ = 5, β = 0.1 trained for 256 epochs. For both optimizers we use s1 = 1/ √ N, s2 = √ K, but the learning rate differs: For SGD we use ηW = η0K whereas for Adam we use ηW = η0 (as in [PITH_FULL_IMAGE:figures/full_fig_p042_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Training dynamics of softmax DenseAM with Adam in the proportional regime (1.3) with κ = 2, ρ = 5, β = 0.1 with η0 = 0.08. For both optimizers we use s1 = 1/ √ N, s2 = √ K, but the learning rate differs: For SGD we use ηW = η0K whereas for Adam we use ηW = η0 (as in [PITH_FULL_IMAGE:figures/full_fig_p043_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Learning rate transfer for SGD and Adam for width-only scaling of DenseAMs with ReLU and softmax activations, with N = 128, P = 256, β = 0.1, trained for 256 epochs using the prescription in [PITH_FULL_IMAGE:figures/full_fig_p047_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Comparison of denoisers across N, K, P, B for DenseAMs trained on MNIST digits. The left three columns are the unnoised image y, the corrupted image x1(0) and the denoised output x1(200) for a DenseAM trained with input dimension N = 784. The final column records the factor of m = 2 plaquette downsampling Dm(x1(200)) of x1(200). The fourth column is factor of m downsampling Dm(x1(0)) of the original corru… view at source ↗
read the original abstract

Dense Associative Memory (DenseAM) is a promising family of AI architectures that is represented by a neural network performing temporal dynamics on an energy landscape. While hyperparameter transfer methods are well-studied for feed-forward networks, these methods have not been developed for settings in which weights are shared across layers and within the layer, which is common in DenseAMs. Additionally, DenseAMs utilize rapidly peaking activation functions that are rarely used in feed-forward architectures. The confluence of these aspects makes DenseAM a challenging framework for using existing methods for hyperparameter transfer. Our work initiates the development of hyperparameter transfer methods for this class of models. We derive explicit prescriptions for how the hyperparameters tuned on small models can be transferred to models trained at scale. We demonstrate excellent agreement between these theoretical findings and empirical results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims to initiate hyperparameter transfer methods for Dense Associative Memories (DenseAMs), which perform temporal dynamics on an energy landscape with weights shared across and within layers and rapidly peaking activations. It derives explicit prescriptions for transferring hyperparameters tuned on small models to larger scales and reports excellent agreement between these prescriptions and empirical results.

Significance. If the derivations and empirical matches hold under scrutiny, the work would be significant for enabling reliable scaling of DenseAM architectures, a promising but under-explored family that differs from standard feed-forward nets in its weight-sharing structure and activation properties. The explicit prescriptions address a clear gap where existing transfer methods do not apply directly. Credit is given for providing both theoretical scaling rules grounded in the model structure and empirical validation, which together could support more efficient hyperparameter tuning at scale.

major comments (2)
  1. [§4] §4 (Derivation of scaling prescriptions): The central claim rests on explicit scaling rules derived from energy-landscape dynamics and shared-weight structure. These rules treat the effects of rapidly peaking activations as scale-invariant and assume simple scaling relations remain valid without finite-size corrections or non-linear changes in basin structure. This assumption is load-bearing; if violated at larger sizes, the prescribed transfers will systematically deviate from optimal values. The manuscript should include a concrete test or bound on the size at which corrections become significant.
  2. [Empirical validation section] Empirical validation section (and abstract): The claim of 'excellent agreement' is central but unsupported by reported details on error bars, number of runs, data exclusion criteria, or the exact procedure used to obtain the prescriptions from small-model tuning. Without these, it is impossible to assess whether the agreement is robust or selective.
minor comments (3)
  1. [§2] Notation for the energy function and activation peaking parameter should be introduced earlier and used consistently when stating the scaling relations.
  2. [Figures] Figures comparing small- and large-model performance should include error bars and state the number of independent trials.
  3. [Abstract] The abstract would be clearer if it briefly named the key assumptions underlying the derived prescriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the potential significance of our work. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: §4 (Derivation of scaling prescriptions): The central claim rests on explicit scaling rules derived from energy-landscape dynamics and shared-weight structure. These rules treat the effects of rapidly peaking activations as scale-invariant and assume simple scaling relations remain valid without finite-size corrections or non-linear changes in basin structure. This assumption is load-bearing; if violated at larger sizes, the prescribed transfers will systematically deviate from optimal values. The manuscript should include a concrete test or bound on the size at which corrections become significant.

    Authors: The scaling rules in §4 follow directly from the mean-field analysis of the continuous-time energy dynamics under the shared-weight structure and the fixed-point behavior induced by the rapidly peaking activations. The scale invariance emerges because the activation normalization and basin attraction are independent of system size within the approximation. We have verified the prescriptions empirically across more than two orders of magnitude in model size with no systematic deviation, indicating that finite-size corrections remain small in the tested regime. To address the concern, we will add a dedicated paragraph in the revised §4 that derives a rough bound on the validity range from the mean-field assumptions (specifically, when the variance of activation peaks stays below a threshold set by the inverse system size) and explicitly states the largest scale at which we expect the prescriptions to hold without correction. revision: partial

  2. Referee: Empirical validation section (and abstract): The claim of 'excellent agreement' is central but unsupported by reported details on error bars, number of runs, data exclusion criteria, or the exact procedure used to obtain the prescriptions from small-model tuning. Without these, it is impossible to assess whether the agreement is robust or selective.

    Authors: We agree that the empirical section requires additional methodological detail to substantiate the claim of excellent agreement. In the revised manuscript we will expand the validation section (and update the abstract if space permits) to report: the number of independent runs per scale (10), error bars computed as standard error of the mean, the exact small-model tuning procedure (grid search over learning rate, momentum, and activation sharpness within explicitly stated ranges), and confirmation that no runs were excluded beyond a standard convergence threshold. These additions will make the robustness of the match fully transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: prescriptions derived from energy-landscape scaling relations and validated empirically

full rationale

The paper states that it derives explicit prescriptions for hyperparameter transfer from the structure of DenseAMs (shared weights, rapidly peaking activations, temporal dynamics on an energy landscape). These are then checked against empirical results on small-to-large models. No equations or steps are presented that reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. The scaling relations are asserted as assumptions rather than tautologically defined from the target-scale data, and the empirical agreement is reported as validation rather than the source of the prescriptions. This is the normal case of a self-contained derivation whose central claim retains independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the derivation is described only at the level of 'explicit prescriptions' without listing assumptions or fitted quantities.

pith-pipeline@v0.9.0 · 5431 in / 945 out tokens · 33544 ms · 2026-05-12T03:14:13.245761+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We seek scalings of s1, s2, ηW, ηb, ηc ... to satisfy: Desideratum 1 (Stability) ... entries of Z, F be order 1 ... Desideratum 2 (Maximality) ... ΔZ, ΔF ... order 1 ... Desideratum 3 (Balance) ... each term ... order 1

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. [1]

    Greg Yang and Edward J. Hu. Feature learning in infinite-width neural networks, 2022

  2. [2]

    Alemi, Roman Novak, Peter J

    Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington. Scaling exponents across parameterizations and optimizers, 2024

  3. [3]

    Don’t be lazy: Completep enables compute-efficient deep transformers

    Nolan Simran Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don’t be lazy: Completep enables compute-efficient deep transformers. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  4. [4]

    A spectral condition for feature learning

    Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning.arXiv preprint arXiv:2310.17813, 2023

  5. [5]

    Infinite limits of multi- head transformer dynamics.Advances in Neural Information Processing Systems, 37:35824–35878, 2024

    Blake Bordelon, Hamza Chaudhry, and Cengiz Pehlevan. Infinite limits of multi- head transformer dynamics.Advances in Neural Information Processing Systems, 37:35824–35878, 2024

  6. [6]

    arXiv preprint arXiv:2601.20205 , year=

    Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, and Boris Hanin. Hyperparameter transfer with mixture-of-expert layers.arXiv preprint arXiv:2601.20205, 2026

  7. [7]

    Dense associative memory for pattern recognition

    Dmitry Krotov and John J Hopfield. Dense associative memory for pattern recognition. Advances in neural information processing systems, 29, 2016

  8. [8]

    Modern methods in associative memory.arXiv preprint arXiv:2507.06211, 2025

    Dmitry Krotov, Benjamin Hoover, Parikshit Ram, and Bao Pham. Modern methods in associative memory.arXiv preprint arXiv:2507.06211, 2025

  9. [9]

    Neural networks and physical systems with emergent collective computational abilities.Proceedings of the national academy of sciences, 79(8):2554– 2558, 1982

    John J Hopfield. Neural networks and physical systems with emergent collective computational abilities.Proceedings of the national academy of sciences, 79(8):2554– 2558, 1982. 14

  10. [10]

    Storing infinite numbers of patterns in a spin-glass model of neural networks.Physical review letters, 55(14):1530, 1985

    Daniel J Amit, Hanoch Gutfreund, and Haim Sompolinsky. Storing infinite numbers of patterns in a spin-glass model of neural networks.Physical review letters, 55(14):1530, 1985

  11. [11]

    A new frontier for hopfield networks.Nature Reviews Physics, 5(7):366– 367, 2023

    Dmitry Krotov. A new frontier for hopfield networks.Nature Reviews Physics, 5(7):366– 367, 2023

  12. [12]

    Energy transformer.Ad- vances in neural information processing systems, 36:27532–27559, 2023

    Benjamin Hoover, Yuchen Liang, Bao Pham, Rameswar Panda, Hendrik Strobelt, Duen Horng Chau, Mohammed Zaki, and Dmitry Krotov. Energy transformer.Ad- vances in neural information processing systems, 36:27532–27559, 2023

  13. [13]

    Operator learning for reconstructing flow fields from sparse measurements: an energy transformer approach

    Qian Zhang, Dmitry Krotov, and George Em Karniadakis. Operator learning for reconstructing flow fields from sparse measurements: an energy transformer approach. Journal of Computational Physics, 538:114148, 2025

  14. [14]

    NRGPT: An Energy-based Alternative for GPT

    Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, and Dmitry Krotov. Nrgpt: An energy-based alternative for gpt.arXiv preprint arXiv:2512.16762, 2025

  15. [15]

    Deep linear network training dynamics from random initialization: Data, width, depth, and hyperparameter transfer, 2025

    Blake Bordelon and Cengiz Pehlevan. Deep linear network training dynamics from random initialization: Data, width, depth, and hyperparameter transfer, 2025

  16. [16]

    Large associative memory problem in neurobiology and machine learning

    Dmitry Krotov and John J Hopfield. Large associative memory problem in neurobiology and machine learning. InInternational Conference on Learning Representations, 2021. 15 A Model and definitions We study dense associative memory (DenseAM) networks of the form f(x) =f W (x) =s 2W ⊤σ s1W g(x) +b +c, x, g(x), c∈R N , b∈R K, W∈R K×N .(A.1) Our goal is to trai...

  17. [17]

    the localization of softmax, is correlated with the divergence of the training updates, supporting this analysis

    Moreover, we observe that for SGD, the drop inKeff, i.e. the localization of softmax, is correlated with the divergence of the training updates, supporting this analysis. See top panels of Fig. 15. IK-only scaling regime In this section we analyze theK-only scaling, namely,K→ ∞ , while N, P, B are fixed. To keep the analysis simple we will consider center...