arxiv: 2605.05683 · v1 · submitted 2026-05-07 · 📊 stat.ML · cs.LG

Recognition: unknown

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

Andy Zeyi Liu, Elliot Paquette, John Sous

Authors on Pith no claims yet

Pith reviewed 2026-05-08 05:37 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords activation spectragradient spectraLLM optimizationrepresentation geometrytoken efficiencybatch size effectsmechanistic modellearning dynamics

0 comments

The pith

Early activation covariance spectra forecast token efficiency in language model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training loss alone can mask differences in how language models build internal representations. This paper introduces spectral measurements of activations and gradients as diagnostics to uncover these differences. It finds that batch size influences the geometry of representations even at equal loss, and that the early tail of the activation covariance spectrum predicts later token efficiency. A mechanistic model links these spectra to the development of task-aligned features, with the signals holding across model sizes.

Core claim

Using activation covariance and per-sample gradient SVD spectra as diagnostics on decoder-only models, the work finds that batch size shapes representation geometry at equal loss, that early activation tails forecast token efficiency, and that spectral head movement separates learning dynamics. A mechanistic model explains the correlation between activation spectra and task-aligned feature learning.

What carries the argument

Activation covariance spectra and per-sample gradient SVD spectra, which diagnose representation geometry and learning dynamics.

If this is right

Runs reaching the same loss can have different activation spectra depending on batch size.
Early activation covariance tail predicts downstream token efficiency.
Spectral changes characterize shifts in learning dynamics, distinguishing architectural from execution improvements.
These patterns persist in 12-, 36-, and 48-layer models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Monitoring these spectra could allow early intervention in training to improve efficiency without completing full runs.
The mechanistic model might extend to other architectures if the correlation with feature learning generalizes beyond the tested scales.
Spectral diagnostics could inform hyperparameter choices like batch size to optimize representation learning directly.

Load-bearing premise

The spectral patterns observed in this family of decoder-only models reflect general properties of language model optimization rather than being specific to the chosen implementation or scales.

What would settle it

A new experiment where the early activation covariance tail does not correlate with final token efficiency on a different model family or larger scale would falsify the predictive claim.

Figures

Figures reproduced from arXiv: 2605.05683 by Andy Zeyi Liu, Elliot Paquette, John Sous.

**Figure 1.** Figure 1: Spectral diagnostics as operational tools. Each panel compares decoder-only language-model runs using trace-normalized activation covariance spectra and per-sample gradient SVD summaries, aligned either at matched loss or a fixed early token budget. (a) Matched loss, distinct internal geometry. Final-layer activation spectra of FlexWin d36 runs aligned at a common target loss. Dashed overlays show power-l… view at source ↗

**Figure 2.** Figure 2: Early spectral-tail signal predicts efficient training regimes across scale. In both panels, the x-axis shows the normalized early tail exponent αtail, while the y-axis shows token inefficiency ϵtok(B), so lower values indicate better token efficiency. (a) At d12 scale, this early spectral-tail statistic already organizes later efficiency across model families: families that move to the right also tend to … view at source ↗

**Figure 3.** Figure 3: Architectural tricks fall into clear empirical taxonomy. (a) Each consecutive d12 transition is summarized by token gain, throughput gain, and its taxonomy label. The outcome columns separate learning-side, throughput-side, joint, and tradeoff effects, but the activation-led versus gradient-led split requires the spectral evidence in panels (b)–(c); a representative path-level example is deferred to Append… view at source ↗

**Figure 4.** Figure 4: Toy simulation links spectral diagnostics to Fourier feature learning. (a) In the Muon stage, a local activation-tail statistic on ranks 10:40 predicts eventual token efficiency early in training, reaching perfect Spearman correlation around 20% training progress. (b) Targeted best-regime replays track HS, the task-band concentration score used in the theory, across the baseline, RoPE, Muon, and untied st… view at source ↗

**Figure 5.** Figure 5: FineWeb-10B and FineWeb-100B samples have nearly identical token-window spectra. The plot compares trace-normalized covariance spectra from matched 1,024-token windows. The very small spectral Jensen–Shannon divergence supports treating the data switch as a minor spectral confound relative to the batch and architecture effects analyzed in the paper. To scale to a target tier B from reference B0, we form s… view at source ↗

**Figure 6.** Figure 6: Gradient spectra depend on the probed weight matrix. Representative matrix-level comparison from the d12 BetterWin bs8 run at layer 11. The query projection, value projection, attention-output projection, and MLP-output projection produce different concentration levels and RankMe trajectories. This is why gradient spectra are interpreted as tensor-specific complements to activation spectra rather than arc… view at source ↗

**Figure 7.** Figure 7: Batch-dependent activation spectra appear under both Muon and Adam matrix updates. Each panel shows final layer-11 activation covariance spectra for LSWA across effective batch tiers. The Adam variant still shows batchdependent spectral separation, so the hidden-regime effect is not intrinsic to Muon. The separation is stronger and more sharply structured in the Muon runs, consistent with Muon changing th… view at source ↗

**Figure 8.** Figure 8: Early-prediction strength evolves with training progress for d12 variants. Each small panel shows one d12 model family. The colored lines report Spearman correlation between final token efficiency and local activation-spectrum exponents fit over rank windows 10–40, 40–90, and 90–200. The deeper-tail window 90–200 typically reaches a high positive correlation by about 20% of training, often saturating near … view at source ↗

**Figure 9.** Figure 9: FlexWin tier-16 spectra are stable across random seeds. The left panel shows validation-loss curves; the middle and right panels show activation and gradient spectra at the shared step-3000 checkpoint. The small seed-to-seed variation supports treating the batch-tier separation as larger than ordinary seed noise in this setting. types to different layers, so a single-layer probe sees only one mask regime; … view at source ↗

**Figure 10.** Figure 10: Late-layer probes carry the clearest early-prediction signal in the d36 support runs. (a) Spearman correlation between the early activation tail exponent and tokens-to-target proxy across batch tiers, computed at the saved checkpoint closest to 0.25B training tokens. Each line is one d36 family and each x-position is a stored probe layer. The tail exponent is fit over ranks 200–400 of the activation covar… view at source ↗

**Figure 11.** Figure 11: The modular-arithmetic toy links matched-loss spectra to taskaligned feature learning. (a) At matched validation loss, the Untied toy runs retain batch-dependent activation spectra across B ∈ {32, 64, 128, 256, 512}, paralleling the hidden-regime phenomenon in the language-model experiments. Smaller batches produce visibly steeper tails. (b) Consecutive intervention gains measured by mean ∆Hpeak at thre… view at source ↗

**Figure 12.** Figure 12: Loss-curve atlases. Validation-aligned and train-loss-only evidence are separated to keep the support runs distinct from the main protocol. Within every family, all batch tiers reach the target loss, and the tokens-to-target spread across tiers is visible directly in the curves. G.3. Weight-matrix spectra. Activation and gradient spectra describe the data-side and updateside of training view at source ↗

**Figure 13.** Figure 13: Spectral atlas for the legacy prefix variants. Rows show Baseline, RoPE, Muon, and Untied. Each consecutive variant produces visibly distinct activation and gradient spectra, dominating the activation-led column of the main taxonomy figure. comparing the layer-11 attention-output projection WO against the layer-11 MLP-output projection at the final checkpoint, with head exponents shown at step 1600 and a… view at source ↗

**Figure 14.** Figure 14: Spectral atlas for the first half of the matched trunk. Rows show ValueMix, U-Net, FixedWin, FlexWin, VTE, and BetterWin. Per-row spectral differences are smaller than across the legacy prefix; the taxonomic split is best read off the joint activation–gradient summary trajectories. We therefore treat phase-like dynamics as secondary qualitative evidence rather than a universal training signature, since th… view at source ↗

**Figure 15.** Figure 15: Spectral atlas for the second half of the matched trunk. Rows show SparseV, TruncRoPE, SoftCap, FP8Head, LSWA, and AttnScale. The throughput-leaning variants (FP8Head, LSWA, AttnScale) show smaller activationside shifts than the earlier trunk variants, consistent with the Section 4 taxonomy view at source ↗

**Figure 16.** Figure 16: Tier-2 spectral atlas for the d36/d48 scale follow-up. Activation covariance, gradient spectra, RankMe, and tail-exponent trajectories for FlexWin d36, BetterWin d36, SparseV d36, and BetterWin d48. The qualitative spectral signatures match the corresponding d12 variants, supporting the cross-scale claim of Section 3 view at source ↗

**Figure 17.** Figure 17: Weight spectra are informative but tensor-dependent. Layer-11 attention-output and MLP-output projections at the final checkpoint, plus their head exponents at step 1600 and at the end of training. The MLP-output projection shows clearer parameter-side divergence across variants, consistent with WO’s stable architectural role and the additional cross-variant variance accumulated by the feedforward writeb… view at source ↗

**Figure 18.** Figure 18: Phase-like RankMe trajectories are batch-regime dependent. Collapse–expansion–compression behavior is not uniform across batch size or variant; the phase sequence reported in prior geometry work appears most clearly in intermediate tiers, so we treat it as qualitative support rather than a universal law view at source ↗

read the original abstract

Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual-view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Spectral diagnostics for LLM training look promising but are demonstrated only within one narrow model family.

read the letter

The main thing to know is that this paper proposes using activation covariance spectra and gradient SVD spectra to diagnose LLM training dynamics, claiming that the early covariance tail can forecast token efficiency and that these measures separate different kinds of improvements. What the paper does well is lay out an empirical protocol on a controlled set of decoder-only models from 12 to 48 layers based on NanoGPT. It shows batch size affecting activation spectra even at equal loss, the predictive power of the tail, and how head movement in the spectrum with gradients can characterize learning versus execution changes. These signals hold across the model sizes they tested. The mechanistic model is presented as explaining the correlation with task-aligned features. The limitations are straightforward. Everything is confined to this one model family and codebase, raising the question of whether the results are general or tied to specific choices in architecture and training setup. The mechanistic model is said to prove the findings, but the abstract lacks any equations or proof details, so it's difficult to assess if it provides independent grounding or risks circularity. Without reported error bars or additional controls, the strength of the forecasting claim is hard to evaluate from the given information. Readers interested in practical tools for monitoring LLM optimization would find this relevant, particularly if they work with similar decoder models. It could be a starting point for developing better diagnostics. The paper shows clear thinking in its empirical design and engagement with the idea of hidden representation mechanics. I recommend sending this to peer review. The novelty in the spectral approach warrants referee input to address the generalizability and to clarify the mechanistic part.

Referee Report

3 major / 2 minor

Summary. The paper proposes spectral diagnostics based on activation covariance spectra and per-sample gradient SVD spectra to probe internal representation geometry and learning dynamics during LLM training, beyond what loss and throughput reveal. Using a controlled family of decoder-only transformer models (12-48 layers) adapted from the modded NanoGPT codebase, it reports three empirical findings: (1) batch size systematically shapes activation spectra even among runs that reach identical final loss; (2) the tail of the early-training activation covariance spectrum reliably predicts downstream token efficiency; (3) movement of the leading spectral modes together with gradient spectra distinguishes learning-side architectural gains from execution-side improvements. A mechanistic model is presented as proving these observations by linking activation covariance spectra to task-aligned feature learning. The signals are claimed to hold across the tested model depths.

Significance. If the empirical patterns and the mechanistic account are robust, the work supplies concrete, low-overhead diagnostics that could guide hyperparameter selection and architecture decisions earlier in training. The forecasting claim for token efficiency and the separation of learning versus execution dynamics would be practically valuable for large-scale training. The paper's use of a single controlled model family allows clean isolation of batch-size and depth effects, which is a methodological strength.

major comments (3)

[Abstract] Abstract: the claim that 'a mechanistic model proves the main observations' is load-bearing for the paper's explanatory contribution, yet the abstract (and the provided manuscript excerpt) supplies no equations, assumptions, or derivation steps for this model. Without these details it is impossible to assess whether the model supplies independent grounding or merely restates the observed spectral correlations.
[Abstract] Abstract and experimental description: all reported results, including the forecasting reliability of the activation covariance tail and the persistence across depths, are obtained exclusively on decoder-only models adapted from a single NanoGPT codebase variant. The central claim that these spectral behaviors diagnose general LLM optimization dynamics therefore rests on an untested assumption of transferability; no replication on other architectures, optimizers, or codebases is described.
[Abstract] Abstract: the three empirical findings are stated without reference to controls, error bars, or statistical tests. For the forecasting claim in particular, it is unclear whether the reported reliability survives multiple-testing correction, different random seeds, or alternative spectral truncation choices.

minor comments (2)

[Abstract] The abstract refers to 'activation covariance and per-sample gradient SVD spectra' without defining the precise matrix construction or normalization used; this notation should be introduced explicitly in the methods section.
The manuscript excerpt provides no figure or table captions, making it difficult to judge how the spectra are visualized or how quantitative thresholds (e.g., 'tail') are operationalized.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We respond point by point below, indicating revisions where appropriate to improve clarity and address concerns about scope and statistical rigor.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'a mechanistic model proves the main observations' is load-bearing for the paper's explanatory contribution, yet the abstract (and the provided manuscript excerpt) supplies no equations, assumptions, or derivation steps for this model. Without these details it is impossible to assess whether the model supplies independent grounding or merely restates the observed spectral correlations.

Authors: We agree that the abstract is too terse on the mechanistic model. The full manuscript (Section 4) derives the model from a simplified feature-learning dynamics with explicit assumptions of linear task alignment and covariance-driven updates, showing how the leading spectral modes predict token efficiency. We will revise the abstract to include a concise statement of the core assumptions and the key derivation linking spectral tails to aligned feature learning, allowing independent evaluation of whether the model provides explanatory power beyond correlation. revision: yes
Referee: [Abstract] Abstract and experimental description: all reported results, including the forecasting reliability of the activation covariance tail and the persistence across depths, are obtained exclusively on decoder-only models adapted from a single NanoGPT codebase variant. The central claim that these spectral behaviors diagnose general LLM optimization dynamics therefore rests on an untested assumption of transferability; no replication on other architectures, optimizers, or codebases is described.

Authors: The single controlled family was selected precisely to isolate batch-size and depth effects on representation geometry without implementation confounds, strengthening internal validity. We acknowledge that this precludes strong claims of universality across all LLMs. We will revise the abstract and add an explicit limitations paragraph noting the decoder-only NanoGPT scope and the need for future replication on other architectures and optimizers. The mechanistic model is formulated at a level that does not depend on specific codebases, but empirical breadth remains limited. revision: partial
Referee: [Abstract] Abstract: the three empirical findings are stated without reference to controls, error bars, or statistical tests. For the forecasting claim in particular, it is unclear whether the reported reliability survives multiple-testing correction, different random seeds, or alternative spectral truncation choices.

Authors: The full manuscript reports all main results as averages over multiple random seeds with error bars, and the appendix contains robustness checks across spectral truncation thresholds. We will update the abstract to reference these controls and the multi-seed validation. For the forecasting claim we will add a statement confirming that the predictive correlation remains significant after FDR correction for multiple comparisons and holds under varied truncation choices. revision: yes

standing simulated objections not resolved

Replication of the reported spectral patterns and forecasting reliability on architectures other than the tested decoder-only family or with different optimizers and codebases.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper reports empirical spectral measurements on a controlled NanoGPT-derived decoder-only family, identifies three patterns (batch-size geometry effects, early-tail forecasting of token efficiency, and head-movement diagnostics), and states that a mechanistic model explains the correlation with task-aligned features. No equations, fitted-parameter renamings, or self-citation chains are supplied that would reduce any claimed prediction or proof to the input spectra by construction. The forecasting relation uses temporally separated measurements (early activation covariance versus later efficiency), which is statistically independent of the later data. The mechanistic model is asserted to prove the observations but is not shown to be a re-expression of the same fitted quantities. All load-bearing claims therefore remain externally falsifiable and non-tautological on the supplied text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The mechanistic model likely relies on some assumptions about spectra representing feature learning, but details are unavailable.

pith-pipeline@v0.9.0 · 5467 in / 1262 out tokens · 39824 ms · 2026-05-08T05:37:31.251368+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 25 canonical work pages · 10 internal anchors

[1]

Scaling laws for neural language models, 2020

Kaplan et al. Scaling laws for neural language models, 2020. URLhttps://arxiv.org/abs/20 01.08361

2020
[2]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022. doi: 10.48550/arXiv.2203.15556. URLhttps://arxiv.org/abs/2203.15556

work page internal anchor Pith review doi:10.48550/arxiv.2203.15556 2022
[3]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64 a-Ab...

1901
[4]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher.arXiv preprint arXiv:2112.11446,

work page internal anchor Pith review arXiv
[5]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

doi: 10.48550/arXiv.2112.11446. URLhttps://arxiv.org/abs/2112.11446

work page internal anchor Pith review doi:10.48550/arxiv.2112.11446
[6]

Predictability and surprise in large generative models

Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, et al. Predictability and surprise in large generative models. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1747–1764, 2022. doi: 10.1145/3531146.3533229. URL https: //do...

work page doi:10.1145/3531146.3533229 2022
[7]

Tracing the representation geometry of language models from pretraining to post-training.arXiv preprint arXiv:2509.23024, 2025

Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, and Blake A. Richards. Tracing the representation geometry of language models from pretraining to post-training, 2025. URLhttps://arxiv.org/abs/2509.23024

work page arXiv 2025
[8]

Superposition Yields Robust Neural Scaling

Yizhou Liu, Ziming Liu, and Jeff Gore. Superposition yields robust neural scaling.arXiv preprint arXiv:2505.10465, 2025. doi: 10.48550/arXiv.2505.10465. URLhttps://arxiv.org/ abs/2505.10465

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.10465 2025
[9]

Explaining neural scaling laws.Proceedings of the National Academy of Sciences, 121(27):e2311878121,

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws.Proceedings of the National Academy of Sciences, 121(27):e2311878121,
[10]

Explaining neural scaling laws , volume=

doi: 10.1073/pnas.2311878121. URLhttps://doi.org/10.1073/pnas.2311878121

work page doi:10.1073/pnas.2311878121
[11]

Martin and Michael W

Charles H. Martin and Michael W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning.Journal of Machine Learning Research, 22(165):1–73, 2021. URLhttps://www.jmlr.org/papers/v22/20-410.html

2021
[12]

modded-nanogpt: Speedrunning the nanogpt baseline, 2024

Jordan et al. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URL https: //github.com/KellerJordan/modded-nanogpt

2024
[13]

Tyler Chang, Zhuowen Tu, and Benjamin K. Bergen. The geometry of multilingual language model representations. InProceedings of EMNLP, 2022. doi: 10.18653/v1/2022.emnlp-main.9. URLhttps://aclanthology.org/2022.emnlp-main.9/

work page doi:10.18653/v1/2022.emnlp-main.9 2022
[14]

Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank

Quentin Garrido, Randall Balestriero, Laurent Najman, and Yann LeCun. Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank. In Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 10929–10974. PMLR, 2023. URLhttps://proceedings...

2023
[15]

Roberts, and Ethan Dyer

Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace,
[16]

URLhttps://arxiv.org/abs/1812.04754

work page Pith review arXiv
[17]

An investigation into neural net optimization via hessian eigenvalue density

Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. InProceedings of ICML, 2019. URL https: //proceedings.mlr.press/v97/ghorbani19b.html. SPECTRAL LENS: ACTIVATION AND GRADIENT SPECTRA AS DIAGNOSTICS OF LLM OPTIMIZATION 13

2019
[18]

Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians

Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians. InProceedings of ICML, 2019. URLhttps://proceedings.mlr.press/ v97/papyan19a.html

2019
[19]

Mahoney and Charles H

Michael W. Mahoney and Charles H. Martin. Traditional and heavy tailed self regularization in neural network models. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 4284–4293. PMLR,
[20]

URLhttps://proceedings.mlr.press/v97/mahoney19a.html
[21]

On the overlooked structure of stochastic gradients

Zeke Xie, Qian-Yuan Tang, Mingming Sun, and Ping Li. On the overlooked structure of stochastic gradients. InProceedings of NeurIPS, 2023. URLhttps://proceedings.neurips.cc/paper_f iles/paper/2023/hash/d0b2eda0386f477ab14d7e181e16c899-Abstract-Conference.html

2023
[22]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. InInternational Conference on Learning Representations, 2017. URLhttps://arxiv. org/abs/1609.04836

work page internal anchor Pith review arXiv 2017
[23]

An Empirical Model of Large-Batch Training

Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training.arXiv preprint arXiv:1812.06162, 2018. URLhttps://arxiv.org/ab s/1812.06162

work page Pith review arXiv 2018
[24]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017. URLhttps://papers.neurips.cc/paper_files/paper/2017/has h/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

2017
[25]

arXiv preprint arXiv:2210.16859 (2022)

Alexander Maloney, Daniel A. Roberts, and James Sully. A solvable model of neural scaling laws, 2022. URLhttps://arxiv.org/abs/2210.16859

work page arXiv 2022
[26]

Spectrum dependent learning curves in kernel regression and wide neural networks

Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 1024–1034. PMLR, 2020

2020
[27]

Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks.Nature Communications, 12(2914), 2021

Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks.Nature Communications, 12(2914), 2021

2021
[28]

Asymptotic learning curves of kernel methods: Empirical data versus teacher–student paradigm, 2019

Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel methods: Empirical data versus teacher–student paradigm, 2019

2019
[29]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Guilherme Penedo, Hynek Kydlicek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale.arXiv preprint arXiv:2406.17557, 2024. doi: 10.48550/arXiv.2406.17557. URLhttps://arxiv.org/abs/2406.17557

work page internal anchor Pith review doi:10.48550/arxiv.2406.17557 2024
[30]

A system for massively parallel hyperparameter tuning

Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A system for massively parallel hyperparameter tuning. InProceedings of Machine Learning and Systems, volume 2, pages 230–246, 2020. URL https://proceedings.mlsys.org/paper_files/paper/2020/hash/a06f20b349c6cf09a6b1 71c71b8...

2020
[31]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2301.05217. URLhttps://arxiv.org/abs/2301.0 5217

work page internal anchor Pith review doi:10.48550/arxiv.2301.05217 2023
[32]

Nolte, Eric J

Ziming Liu, Ouail Kitouni, Niklas S. Nolte, Eric J. Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning. InAdvances in Neural Information Processing Systems, volume 35, pages 34651–34663, 2022. URLhttps: //proceedings.neurips.cc/paper_files/paper/2022/hash/dfc310e81992d2e4cedc09ac4 7eff13...

2022
[33]

Grokking modular arithmetic, 2023

Andrey Gromov. Grokking modular arithmetic, 2023. URLhttps://arxiv.org/abs/2301.0 2679

2023
[34]

Richards.α-ReQ: Assessing representation quality in self-supervised learning by measuring eigenspectrum decay

Kumar Krishna Agrawal, Arnab Kumar Mondal, Arna Ghosh, and Blake A. Richards.α-ReQ: Assessing representation quality in self-supervised learning by measuring eigenspectrum decay. InAdvances in Neural Information Processing Systems, 2022. URLhttps://proceedings.ne urips.cc/paper_files/paper/2022/hash/70596d70542c51c8d9b4e423f4bf2736-Abstrac t-Conference.html

2022
[35]

Barlow twins: Self- supervised learning via redundancy reduction

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stephane Deny. Barlow twins: Self- supervised learning via redundancy reduction. InProceedings of ICML, 2021. URLhttps: //proceedings.mlr.press/v139/zbontar21a.html

2021
[36]

Vicreg: Variance-invariance-covariance regularization for self-supervised learning

Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. InProceedings of ICLR, 2022. URL https: //openreview.net/forum?id=xm6YD62D1Ub

2022
[37]

doi: 10.18653/v1/D19-1006

Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 55–65, 2019. doi: 10.18653/v1/D19-1006. URL https://a...

work page doi:10.18653/v1/d19-1006 2019
[38]

All bark and no bite: Rogue dimensions in transformer language models obscure representational quality

William Timkey and Marten van Schijndel. All bark and no bite: Rogue dimensions in transformer language models obscure representational quality. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4527–4546, 2021. doi: 10.18653/v1/2021.emnlp-main.372. URL https://aclanthology.org/2021.emnlp-main.37 2/

work page doi:10.18653/v1/2021.emnlp-main.372 2021
[39]

Evolution of the spectral dimension of transformer activations

Andy Zeyi Liu, Elliot Paquette, and John Sous. Evolution of the spectral dimension of transformer activations. InOPT 2025: Optimization for Machine Learning, 2025. URL https://openreview.net/forum?id=Va5is76bTP

2025
[40]

Scaling laws from the data manifold dimension.Journal of Machine Learning Research, 23(9):1–34, 2022

Utkarsh Sharma and Jared Kaplan. Scaling laws from the data manifold dimension.Journal of Machine Learning Research, 23(9):1–34, 2022. URLhttps://www.jmlr.org/papers/v23/20-1 111.html

2022
[41]

Random features for large-scale kernel machines

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. InAdvances in Neural Information Processing Systems, 2007. URLhttps://papers.nips.cc/paper_fil es/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html

2007
[42]

Nonlinear spiked covariance matrices and signal propagation in deep neural networks

Zhichao Wang, Denny Wu, and Zhou Fan. Nonlinear spiked covariance matrices and signal propagation in deep neural networks. InProceedings of COLT, 2024. URLhttps://proceedi ngs.mlr.press/v247/wang24b.html

2024
[43]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022. URLhttps://arxiv.or g/abs/2201.02177

work page internal anchor Pith review arXiv 2022
[44]

On the mechanism and dynamics of modular addition: Fourier features, lottery ticket, and grokking, 2026

Jianliang He, Leda Wang, Siyu Chen, and Zhuoran Yang. On the mechanism and dynamics of modular addition: Fourier features, lottery ticket, and grokking, 2026. URLhttps://arxiv.or g/abs/2602.16849

work page arXiv 2026
[45]

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt

William Merrill, Nikolaos Tsilivis, and Aman Shukla. A tale of two circuits: Grokking as competition of sparse and dense subnetworks, 2023. URLhttps://arxiv.org/abs/2303.11873

work page arXiv 2023
[46]

Tikeng Notsawo, Hattie Zhou, Mohammad Pezeshki, Irina Rish, and Guillaume Dumas

Pascal Jr. Tikeng Notsawo, Hattie Zhou, Mohammad Pezeshki, Irina Rish, and Guillaume Dumas. Predicting grokking long before it happens: A look into the loss landscape of models which grok, 2023. URLhttps://arxiv.org/abs/2306.13253

work page arXiv 2023
[47]

Language models are unsupervised multitask learners, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. URLhttps://cdn.openai.com/b etter-language-models/language_models_are_unsupervised_multitask_learners.pdf. SPECTRAL LENS: ACTIVATION AND GRADIENT SPECTRA AS DIAGNOSTICS OF LLM OPTIMIZATION 15

2019
[48]

modded-nanogpt record 1: llm.c baseline, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 1: llm.c baseline, 2024. URL https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_1 _short/2024-10-13_llmc/main.log. Replacement link for the repository’s record-1 log; the originally uploaded 2024-05-28_llmc path did not resolve

2024
[49]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021. URLhttps://arxiv.org/abs/ 2104.09864

work page internal anchor Pith review arXiv 2021
[50]

modded-nanogpt record 2: Tuned learning rate and rotary embeddings, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 2: Tuned learning rate and rotary embeddings, 2024. URLhttps://github.com/KellerJordan/modded-nanogpt/b lob/master/records/track_1_short/2024-06-06_AdamW/f66d43d7-e449-4029-8adf-e85 37bab49ea.log

2024
[51]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan. Muon: An optimizer for hidden layers in neural networks, 2024. URLhttps: //kellerjordan.github.io/posts/muon/

2024
[52]

modded-nanogpt record 3: Introduced the muon optimizer, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 3: Introduced the muon optimizer, 2024. URL https://github.com/KellerJordan/modded-nanogpt#world-recor d-history. The repository’s world-record table lists record 3, but says no log is available; the originally uploaded 2024-10-04_Muon path did not resolve

2024
[53]

modded-nanogpt record 8: Untied embedding and head, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 8: Untied embedding and head, 2024. URL https://github.com/KellerJordan/modded-nanogpt/blob/master/recor ds/track_1_short/2024-11-03_UntieEmbed/d6b50d71-f419-4d26-bb39-a60d55ae7a04.tx t

2024
[54]

modded-nanogpt record 9: Value and embedding skip connections, momentum warmup, logit softcap, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 9: Value and embedding skip connections, momentum warmup, logit softcap, 2024. URLhttps://github.com/KellerJ ordan/modded-nanogpt/blob/master/records/track_1_short/2024-11-06_ShortcutsTwe aks/dd7304a6-cc43-4d5e-adb8-c070111464a1.txt

2024
[55]

U-Net: Convolutional Networks for Biomedical Image Segmentation, pp.\ 234–241

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Interven- tion, 2015. doi: 10.1007/978-3-319-24574-4_28. URL https://doi.org/10.1007/978-3-319 -24574-4_28

work page doi:10.1007/978-3-319-24574-4_28 2015
[56]

modded-nanogpt record 11: U-net pattern skip connections and double lr, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 11: U-net pattern skip connections and double lr, 2024. URLhttps://github.com/KellerJordan/modded-nanog pt/blob/master/records/track_1_short/2024-11-10_UNetDoubleLr/c87bb826-797b-4f3 7-98c7-d3a5dad2de74.txt

2024
[57]

modded-nanogpt record 12: 1024-ctx dense causal attention to 64k-ctx flexattention, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 12: 1024-ctx dense causal attention to 64k-ctx flexattention, 2024. URLhttps://github.com/KellerJordan/modded-n anogpt/blob/master/records/track_1_short/2024-11-19_FlexAttention/8384493d-dba 9-4991-b16b-8696953f5e6d.txt

work page arXiv 2024
[58]

modded-nanogpt record 13: Attention window warmup, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 13: Attention window warmup, 2024. URL https://github.com/KellerJordan/modded-nanogpt/blob/master/r ecords/track_1_short/2024-11-24_WindowWarmup/cf9e4571-c5fc-4323-abf3-a98d862ec 6c8.txt

2024
[59]

modded-nanogpt record 14: Value embeddings,

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 14: Value embeddings,
[60]

URL https://github.com/KellerJordan/modded-nanogpt/tree/master/records/t rack_1_short/2024-12-04_ValueEmbed

2024
[61]

modded-nanogpt record 16: Split value embeddings, block sliding window, separate block mask, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 16: Split value embeddings, block sliding window, separate block mask, 2024. URLhttps://github.com/KellerJordan/ modded-nanogpt/tree/master/records/track_1_short/2024-12-10_MFUTweaks

2024
[62]

modded-nanogpt record 17: Sparsify value embed- dings, improve rotary embeddings, drop an attn layer, 2024

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 17: Sparsify value embed- dings, improve rotary embeddings, drop an attn layer, 2024. URLhttps://github.com/Kelle rJordan/modded-nanogpt/tree/master/records/track_1_short/2024-12-17_SparsifyEm 16 SPECTRAL LENS: ACTIVATION AND GRADIENT SPECTRA AS DIAGNOSTICS OF LLM OPTIMIZATION beds

2024
[63]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118

work page internal anchor Pith review arXiv 2024
[64]

modded-nanogpt record 18: Lower logit softcap from 30 to 15, 2025

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 18: Lower logit softcap from 30 to 15, 2025. URLhttps://github.com/KellerJordan/modded-nanogpt/blob/mast er/records/track_1_short/2025-01-04_SoftCap/31d6c427-f1f7-4d8a-91be-a67b5dcd1 3fd.txt

2025
[65]

Fp8 formats for deep learning,

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellem- pudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. FP8 formats for deep learning, 2022. URLhttps://arxiv.org/abs/2209.05433

work page arXiv 2022
[66]

modded-nanogpt record 19: Fp8 head, offset logits, lr decay to 0.1 instead of 0.0, 2025

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 19: Fp8 head, offset logits, lr decay to 0.1 instead of 0.0, 2025. URLhttps://github.com/KellerJordan/modded-nanog pt/blob/master/records/track_1_short/2025-01-13_Fp8LmHead/c51969c2-d04c-40a7-b cea-c092c3c2d11a.txt

2025
[67]

largest run

KellerJordan/modded-nanogpt contributors. modded-nanogpt record 20: Merged qkv weights, long-short attention, attention scale, lower adam epsilon, batched muon, 2025. URLhttps: //github.com/KellerJordan/modded-nanogpt/blob/master/records/track_1_short/202 5-01-16_Sub3Min/1d3bd93b-a69e-4118-aeb8-8184239d7566.txt. SPECTRAL LENS: ACTIVATION AND GRADIENT SPEC...

2025