citation dossier

Progress measures for grokking via mechanistic interpretability

N · 2023 · arXiv 2301.05217

18Pith papers citing it

19reference links

cs.LGtop field · 12 papers

UNVERDICTEDtop verdict bucket · 18 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 18 reviewed papers. Its strongest current cluster is cs.LG (12 papers). The largest review-status bucket among citing papers is UNVERDICTED (18 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

cs.LG · 2026-05-09 · unverdicted · novelty 8.0

In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to generative phenomena including double descent and out-of-equilibrium biases.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.

ILDR: Geometric Early Detection of Grokking

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.

Grokking of Diffusion Models: Case Study on Modular Addition

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

Dimensional Criticality at Grokking Across MLPs and Transformers

cs.LG · 2026-04-06 · unverdicted · novelty 7.0

Effective cascade dimension D(t) crosses D=1 at the grokking transition in MLPs and Transformers, with opposite directions for modular addition versus XOR, consistent with attraction to a shared critical manifold.

Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A Random Matrix Theory method identifies growing Correlation Traps in neural network weight spectra during an 'anti-grokking' overfitting phase, and applies the same diagnostic to some foundation LLMs.

Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

cs.LG · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks.

LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

cs.CL · 2026-04-07 · unverdicted · novelty 6.0

LAG-XAI treats paraphrasing as affine flows in semantic manifolds using Lie-inspired approximations, achieving AUC 0.7713 on paraphrase detection and 95.3% hallucination detection on HaluEval.

Grokking as Dimensional Phase Transition in Neural Networks

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.

PhiNet: Speaker Verification with Phonetic Interpretability

eess.AS · 2026-04-02 · unverdicted · novelty 6.0

PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.

Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.

Emergent Semantic Role Understanding in Language Models

cs.AI · 2026-05-09 · unverdicted · novelty 5.0

Semantic role understanding partially emerges during language model pre-training, with linear probes on frozen representations achieving substantial performance that improves with scale but does not match fine-tuned models, and representations shifting toward more distributed forms at larger scales.

Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance

cs.AI · 2026-05-02 · unverdicted · novelty 4.0

AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

cs.LG · 2026-04-28 · unverdicted · novelty 4.0

Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

citing papers explorer

Showing 18 of 18 citing papers.

Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models cs.LG · 2026-05-09 · unverdicted · none · ref 80
In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to generative phenomena including double descent and out-of-equilibrium biases.
Interpreting Reinforcement Learning Agents with Susceptibilities cs.LG · 2026-05-08 · unverdicted · none · ref 120
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It cs.LG · 2026-05-05 · unverdicted · none · ref 7
Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.
ILDR: Geometric Early Detection of Grokking cs.LG · 2026-04-22 · unverdicted · none · ref 5
ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.
Grokking of Diffusion Models: Case Study on Modular Addition cs.LG · 2026-04-20 · unverdicted · none · ref 15
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
Dimensional Criticality at Grokking Across MLPs and Transformers cs.LG · 2026-04-06 · unverdicted · none · ref 4
Effective cascade dimension D(t) crosses D=1 at the grokking transition in MLPs and Transformers, with opposite directions for modular addition versus XOR, consistent with attraction to a shared critical manifold.
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory cs.LG · 2026-05-12 · unverdicted · none · ref 17
A Random Matrix Theory method identifies growing Correlation Traps in neural network weight spectra during an 'anti-grokking' overfitting phase, and applies the same diagnostic to some foundation LLMs.
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation cs.LG · 2026-05-12 · unverdicted · none · ref 65
Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization stat.ML · 2026-05-07 · unverdicted · none · ref 31
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams cs.LG · 2026-04-20 · unverdicted · none · ref 8 · 2 links
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks.
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces cs.CL · 2026-04-07 · unverdicted · none · ref 12
LAG-XAI treats paraphrasing as affine flows in semantic manifolds using Lie-inspired approximations, achieving AUC 0.7713 on paraphrase detection and 95.3% hallucination detection on HaluEval.
Grokking as Dimensional Phase Transition in Neural Networks cs.LG · 2026-04-06 · unverdicted · none · ref 3
Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.
PhiNet: Speaker Verification with Phonetic Interpretability eess.AS · 2026-04-02 · unverdicted · none · ref 58
PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds cs.LG · 2026-05-10 · unverdicted · none · ref 17
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
Emergent Semantic Role Understanding in Language Models cs.AI · 2026-05-09 · unverdicted · none · ref 46
Semantic role understanding partially emerges during language model pre-training, with linear probes on frozen representations achieving substantial performance that improves with scale but does not match fine-tuned models, and representations shifting toward more distributed forms at larger scales.
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance cs.AI · 2026-05-02 · unverdicted · none · ref 25
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking cs.LG · 2026-04-28 · unverdicted · none · ref 3
Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 170
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

Progress measures for grokking via mechanistic interpretability

why this work matters in Pith

fields

years

verdicts

representative citing papers

citing papers explorer