pith. machine review for the scientific record.

arxiv: 2604.04037 · v2 · submitted 2026-04-05 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

Authors on Pith no claims yet

Pith reviewed 2026-05-13 16:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords knowledge distillation · superposition · neural networks · feature capacity · loss floor · sparse autoencoders · model compression · minimum width

The pith

Neural network width imposes a geometric limit on retained features during knowledge distillation, creating a permanent loss floor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that knowledge distillation between neural networks encounters an unavoidable performance floor because of geometric constraints on representation capacity. Smaller student models can only encode a limited number of features from the teacher through superposition, specifically at most their width multiplied by a sparsity-dependent factor g(α). Features exceeding this capacity are lost permanently, leading to an importance-weighted error that cannot be trained away. This limit holds across different training methods and can be predicted ahead of time using measurements from sparse autoencoders on the teacher model. Understanding this helps explain why distillation stops improving at a certain point and connects representation geometry to practical compression limits.

Core claim

Neural networks represent more features than they have dimensions by exploiting superposition, but a student model of width d_S can encode at most d_S · g(α) features, where g(α) = 1/((1-α)ln(1/(1-α))). Any features beyond this budget are permanently lost, producing an importance-weighted loss floor in knowledge distillation that persists regardless of training method or objective.
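As a concreteness check, a minimal numeric sketch of the stated budget, using only the formula above and the teacher-side numbers reported elsewhere on this page (α ≈ 0.992, F ≈ 28,700); the five student widths are those used in the paper's distillation runs:

```python
import math

def g(alpha: float) -> float:
    """Sparsity-dependent capacity function g(α) = 1 / ((1-α) ln(1/(1-α)))."""
    p = 1.0 - alpha                      # probability a feature is active
    return 1.0 / (p * math.log(1.0 / p))

def feature_budget(d_s: int, alpha: float) -> float:
    """Maximum number of features a width-d_S student can encode under the claimed bound."""
    return d_s * g(alpha)

alpha = 0.992                            # teacher sparsity measured by SAEs in the paper
print(f"g({alpha}) ≈ {g(alpha):.1f} features per dimension")
for d_s in (128, 256, 512, 768, 1024):   # the five student widths used in the paper
    print(f"d_S = {d_s:4d}: budget ≈ {feature_budget(d_s, alpha):,.0f} features")
# With these rounded inputs the formula gives roughly 26 features/dim (the paper's
# Table 7 lists ~27 at α = 0.992), so only the widest students approach F ≈ 28,700.
```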

What carries the argument

The superposition capacity bound d_S · g(α), with g(α) as the sparsity-dependent capacity function that determines the maximum number of features encodable in a given width.
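The paper's §3.2 invokes compressed sensing for this bound ("a d-dimensional space can represent at most d · g(α) features at sparsity α"). A hedged sketch of one route to the same functional form, assuming the standard sparse-recovery sample-complexity bound d ≳ k ln(F/k); the paper's own argument may differ in constants and assumptions:

```latex
% Sketch under assumptions: among F features, each active with probability 1-α,
% the expected number of simultaneously active features is k = (1-α)F.
% The standard compressed-sensing guarantee needs d ≳ k ln(F/k) dimensions:
\[
  d \;\gtrsim\; k \ln\frac{F}{k} \;=\; (1-\alpha)\,F\,\ln\frac{1}{1-\alpha}.
\]
% Solving for F gives the capacity bound quoted in the core claim:
\[
  F \;\lesssim\; \frac{d}{(1-\alpha)\ln\frac{1}{1-\alpha}} \;=\; d \cdot g(\alpha),
  \qquad
  g(\alpha) \;=\; \frac{1}{(1-\alpha)\ln\frac{1}{1-\alpha}}.
\]
```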

If this is right

  • Distillation performance is bounded by the geometric floor set by student width and teacher feature sparsity α.
  • The observed loss floor can be decomposed into a geometric component and a width-independent architectural baseline with high accuracy.
  • Coarse concepts survive even after 88 percent of features are lost; the floor instead arises from the aggregate loss of fine-grained features in the long tail of the importance distribution.
  • Distillation floors can be predicted solely from sparse autoencoder measurements of the teacher's features, without performing the distillation (a minimal sketch follows this list).
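Read literally, the last bullet implies a floor estimator computable from teacher-side SAE statistics alone. A minimal sketch of what such an estimator might look like, assuming Zipf feature importances I_i ∝ 1/i (as in Figure 12) and assuming the student retains the most important features first; the paper's actual predictor and normalization may differ:

```python
import math

def g(alpha: float) -> float:
    """Capacity function g(α) = 1 / ((1-α) ln(1/(1-α)))."""
    p = 1.0 - alpha
    return 1.0 / (p * math.log(1.0 / p))

def geometric_floor(d_s: int, n_features: int, alpha: float) -> float:
    """Fraction of total feature importance that falls past the student's budget.

    Assumptions (not the paper's exact estimator): Zipf importances I_i ∝ 1/i,
    and the d_S · g(α) highest-importance features are the ones retained."""
    budget = int(d_s * g(alpha))
    importances = [1.0 / i for i in range(1, n_features + 1)]
    lost = sum(importances[budget:]) if budget < n_features else 0.0
    return lost / sum(importances)

# Teacher-side SAE measurements reported for Pythia-410M: F ≈ 28,700 at α ≈ 0.992.
for d_s in (128, 256, 512, 768, 1024):
    print(f"d_S = {d_s:4d}: geometric floor fraction ≈ "
          f"{geometric_floor(d_s, 28_700, 0.992):.3f}")
```

Figure 7 suggests any such geometric prediction still needs an affine calibration against observed loss (observed ≈ 8.97 × predicted + 0.623 in the paper's fit), so numbers like these give ordering and shape, not nats.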

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the bound holds, increasing student width should monotonically lower the floor in a predictable way.
  • The result may generalize to other model compression techniques that reduce effective width or capacity.
  • Practitioners could use SAE feature counts to decide minimum viable student sizes before training (see the sketch after this list).
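A sketch of that sizing rule, using the critical-width relation d_S* = F/g(α) from Figure 1 and the teacher-side numbers reported for Pythia-410M; whether the recipe transfers to other teachers and sparsity regimes is exactly the extrapolation flagged above:

```python
import math

def g(alpha: float) -> float:
    """Capacity function g(α) = 1 / ((1-α) ln(1/(1-α)))."""
    p = 1.0 - alpha
    return 1.0 / (p * math.log(1.0 / p))

def critical_width(n_features: int, alpha: float) -> int:
    """Smallest student width whose budget d_S · g(α) covers all F teacher features."""
    return math.ceil(n_features / g(alpha))

# SAE measurements on the Pythia-410M teacher, as reported in the paper.
print(critical_width(28_700, 0.992))
# ≈ 1,110 with these rounded inputs; the paper reports d_S* ≈ 1,065
# (1,065–1,186 across layers 8, 12, 16 per Figure 5).
```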

Load-bearing premise

The superposition capacity of any network is strictly bounded by its width times g(α), independent of training method or objective, so that excess features cause permanent unrecoverable loss.

What would settle it

Finding a student model that recovers more features than d_S · g(α) as measured by sparse autoencoders after distillation, or achieving a lower loss floor than predicted by the geometric bound.
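A minimal sketch of the first test as a pass/fail check, assuming one can count the distinct teacher features a trained student's SAE recovers (the counting rule and the example numbers below are hypothetical placeholders, not measurements from the paper):

```python
import math

def g(alpha: float) -> float:
    """Capacity function g(α) = 1 / ((1-α) ln(1/(1-α)))."""
    p = 1.0 - alpha
    return 1.0 / (p * math.log(1.0 / p))

def exceeds_budget(recovered_features: int, d_s: int, alpha: float) -> bool:
    """True if a student recovers more features than its claimed budget d_S · g(α)."""
    return recovered_features > d_s * g(alpha)

# Hypothetical example: a width-128 student whose SAE recovered 5,000 distinct
# teacher features would exceed its ~3,300-feature budget at α ≈ 0.992 and,
# as stated, falsify the bound.
print(exceeds_budget(recovered_features=5_000, d_s=128, alpha=0.992))  # True
```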

Figures

Figures reproduced from arXiv: 2604.04037 by Dawar Jyoti Deka, Nilesh Sarkar.

Figure 1. Capacity function and critical width. Left: g(α) grows exponentially with sparsity; colored dots mark the toy-model sparsity levels. Right: critical width d_S* = F/g(α) shrinks as sparsity increases, since sparser features need fewer dimensions.
Figure 2. Loss floor vs. student width across 48 configurations.
Figure 3. Predicted vs. actual floor (log-log, 140 points).
Figure 4. SAE training convergence. Layer 8 (blue) encodes a denser feature set; deeper layers 12 and 16 …
Figure 5. (a) Feature importance follows a power law: the top ~20 features dominate, with a cliff near rank ~3,000 where thousands reach ~10^-7. This heavy tail is why compression works. (b) Predicted floor vs. width at layers 8, 12, and 16. All layers agree (d_S* ∈ [1065, 1186]), converging near zero at d_S = 1024.
Figure 6. Distillation results. (a) Eval loss for all widths; narrower students plateau higher. (b) Floor decreases from 1.320 to 0.586 nats. Dotted line marks d_S* ≈ 1065.
Figure 7. Two-component decomposition. (a) Linear fit (R² = −1.982) fails. (b) Affine fit (R² = 0.993): observed = 8.97 × predicted + 0.623. Baseline B = 0.623 is the architectural floor; slope C = 8.97 is amplification through the transformer layers.
Figure 8. Linear probe results. (a) Heatmap: all concepts survive compression. (b) ±3 pp shifts reflect reallocation. (c) All above chance (50%).
Figure 9. Effect of α on the floor at d_T = 5 for n ∈ {10, 20, 40}. Higher sparsity (more features per dimension) yields lower floors. Solid = actual; dashed = predicted.
Figure 10. Prediction error by sparsity. Refined (colored): median …
Figure 11. Accuracy heatmap (refined). Rows: α; columns: d_T; panels: n. Green = >99%.
Figure 12. Zipf importance. Left: linear scale. Right: log-log confirms the power law.
Figure 13. Normalized floor vs. d_S/d_S*. All configurations collapse: the floor drops sharply at d_S = d_S* (dashed). This universal scaling confirms the phase transition.
Figure 14. Training curves at different widths for six configurations. Dashed = predicted floors.
Figure 15. Per-layer SAE curves: layers 8 (top), 12 (middle), 16 (bottom). Layer 8 has lower recon…
Figure 16. (a) Training loss for all widths. (b) Two seeds at d_S = 128: floors differ by Δ = 6.4 (0.24%), confirming the floor is deterministic.
Figure 17. Normalized predicted (SAE, dashed gray) vs. observed (distillation, solid red) floors.
Figure 18. Distillation summary. Top left: eval curves with floor estimates. Top right: per-token KL floor vs. width. Bottom left: normalized observed vs. predicted floors. Bottom right: seed variance at d_S = 128.
original abstract

Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width $d_S$ can encode at most $d_S \cdot g(\alpha)$ features, where $g(\alpha) = 1/((1-\alpha)\ln\frac{1}{1-\alpha})$ is a sparsity-dependent capacity function. Features beyond this budget are permanently lost, yielding an importance-weighted loss floor. We validate on a toy model (48 configurations, median accuracy >93%) and on Pythia-410M, where sparse autoencoders measure $F \approx 28{,}700$ features at $\alpha \approx 0.992$ (critical width $d_S^* \approx 1{,}065$). Distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component and a width-independent architectural baseline ($R^2 = 0.993$). Linear probing shows coarse concepts survive even 88% feature loss, revealing the floor arises from aggregate loss of fine-grained features in the importance distribution's long tail. Our results connect representation geometry to distillation limits and provide a practical tool for predicting distillation performance from SAE measurements alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that knowledge distillation exhibits a geometric loss floor arising from superposition: networks encode more features than dimensions, so a student of width d_S encodes at most d_S · g(α) features where g(α) = 1/((1-α) ln(1/(1-α))), with excess features permanently lost and producing an importance-weighted floor. The claim is supported by a toy-model validation (48 configurations, median accuracy >93%) and Pythia-410M experiments where SAEs measure F ≈ 28,700 features at α ≈ 0.992 (critical width d_S^* ≈ 1,065); distillation into five student widths confirms monotonic floor ordering, the observed floor decomposes into geometric and architectural components (R² = 0.993), and linear probing shows coarse concepts survive 88% feature loss.

Significance. If the minimum-width theorem holds and the capacity bound is objective-independent, the work supplies a practical, SAE-based predictor of distillation floors and links representation geometry to compression limits. The toy-model accuracy and Pythia monotonicity are concrete strengths; the decomposition into geometric and baseline components is a useful empirical contribution. However, the result's significance is tempered by the need to demonstrate that g(α) is not post-hoc or data-dependent.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (theorem statement): the explicit derivation of g(α) = 1/((1-α) ln(1/(1-α))) from superposition is not provided; without it the minimum-width claim that d_S · g(α) is a hard, objective-independent budget cannot be evaluated and risks being an ad-hoc fit to the observed floor.
  2. [Pythia experiments] Pythia experiments (α measurement and prediction): α ≈ 0.992 is obtained from SAE on the teacher itself and then used to predict the distillation floor on students of the same architecture; this introduces circularity that undermines the claim that the floor is a geometric prediction rather than a fitted quantity.
  3. [Distillation results / linear probing] Claim of permanent loss: the manuscript asserts features beyond d_S · g(α) are unrecoverable, yet no ablation compares distillation against direct training on the student or against objectives that reallocate capacity to high-importance features; without this the permanence and objective-independence of the bound remain unshown.
minor comments (2)
  1. [Decomposition paragraph] Clarify whether the importance-weighted loss used in the floor decomposition is the same as the distillation objective or a post-hoc diagnostic; this affects whether the geometric component is truly predictive.
  2. [Toy model validation] The toy-model section should report the exact distribution of the 48 configurations (widths, α values, feature importances) so readers can assess coverage of the claimed regime.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each of the major comments in detail below, clarifying the theoretical foundations, experimental design, and proposing specific revisions to enhance the manuscript's clarity and rigor.

point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (theorem statement): the explicit derivation of g(α) = 1/((1-α) ln(1/(1-α))) from superposition is not provided; without it the minimum-width claim that d_S · g(α) is a hard, objective-independent budget cannot be evaluated and risks being an ad-hoc fit to the observed floor.

    Authors: We agree that an explicit derivation is essential for evaluating the claim. The function g(α) is derived from the superposition theory by calculating the maximum number of features that can be represented in d dimensions with activation sparsity α, specifically using the formula arising from the expected overlap and the logarithmic term from the entropy or capacity calculation in sparse coding models. In the revised manuscript, we will add a dedicated subsection in §3 providing the full mathematical derivation step-by-step, starting from the feature activation model and arriving at g(α) = 1/((1-α) ln(1/(1-α))). This will establish that the bound is theoretically grounded rather than fitted. revision: yes

  2. Referee: [Pythia experiments] Pythia experiments (α measurement and prediction): α ≈ 0.992 is obtained from SAE on the teacher itself and then used to predict the distillation floor on students of the same architecture; this introduces circularity that undermines the claim that the floor is a geometric prediction rather than a fitted quantity.

    Authors: We appreciate the concern about potential circularity. The value of α is measured via SAE on the teacher to characterize its representation geometry, and the same α is used for prediction because the student models, being of the same family and trained via distillation to match the teacher's outputs, are hypothesized to operate under similar sparsity regimes. To mitigate this, we will revise the experiments section to include post-training SAE measurements of α for each student width, demonstrating consistency in α across models. Additionally, we will discuss the theoretical basis for why α is expected to be architecture-dependent but not task-specific in this context. revision: yes

  3. Referee: [Distillation results / linear probing] Claim of permanent loss: the manuscript asserts features beyond d_S · g(α) are unrecoverable, yet no ablation compares distillation against direct training on the student or against objectives that reallocate capacity to high-importance features; without this the permanence and objective-independence of the bound remain unshown.

    Authors: This comment highlights a key area for strengthening the evidence. Our current results, including the high R² decomposition and linear probing showing survival of coarse concepts, provide support for the geometric component being independent of specific objectives. However, to directly address permanence, we will include in the revision an additional experiment: training the student networks from scratch on the same downstream task without any distillation, and compare the achieved performance to the distillation floors. If the floors align closely, it would support that the bound is geometric and objective-independent. We believe this ablation is feasible and will add it to the manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent measurements and validation

full rationale

The paper presents g(α) as a derived capacity function from superposition theory, measures α via SAE on the teacher (separate from distillation runs), and validates the resulting floor predictions empirically on toy models (48 configs) and Pythia distillation experiments. The central claim does not reduce by construction to its inputs; the observed floor is decomposed into geometric and baseline components with reported R²=0.993, and linear probing provides additional independent evidence. No load-bearing self-citation or fitted-parameter renaming is present.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on standard superposition assumptions from prior work plus an empirically measured sparsity parameter α; no new entities are postulated.

free parameters (1)
  • α (sparsity level)
    Measured from SAE on Pythia-410M to compute numerical capacity and critical width; used to predict the loss floor.
axioms (1)
  • domain assumption: Superposition permits representing more features than dimensions, with sparsity controlled by α
    Invoked to derive the capacity bound g(α) = 1/((1-α)ln(1/(1-α)))

pith-pipeline@v0.9.0 · 5546 in / 1318 out tokens · 65020 ms · 2026-05-13T16:52:04.725897+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Linear-Readout Floors and Threshold Recovery in Computation in Superposition

    cs.LG 2026-05 unverdicted novelty 7.0

    Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

  2. [2]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

  3. [3]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2021.

  4. [4]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

  5. [5]

    Polysemanticity and Capacity in Neural Networks

    Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022.

  6. [6]

    Appendix B: Additional toy model results (sparsity effect)

    Higher sparsity yields lower floors at every width because g(α) packs more features per dimension (Figure 9). Appendix A tabulates reference values for g(α) (Table 7):

        α       1−α     g(α)    features/dim
        0.00    1.000     1.0      1.0×
        0.50    0.500     2.9      2.9×
        0.80    0.200     3.1      3.1×
        0.90    0.100     4.3      4.3×
        0.95    0.050     6.7      6.7×
        0.99    0.010    21.7     21.7×
        0.992   0.008    27.0     27.0×
        0.999   0.001   145      145×