
arXiv: 2606.19379 · v1 · pith:BZ7WVKG2new · submitted 2026-06-12 · 💻 cs.LG · cs.AI · cs.CL

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Pith reviewed 2026-06-27 04:40 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.CL
keywords: transformer, feed-forward network, linearity, recoverability, GPT-2, Pythia, compression

The pith

Transformer FFN blocks vary sharply in how linear they are, and that linearity is learned per block during training rather than imposed by the architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes each feed-forward block into its exact least-squares linear map from inputs to outputs plus a residual, then measures how much held-out output variance the linear map explains. This recoverability metric ranges from above 0.99 to below 0.3 across adjacent blocks inside the same model and differs markedly between models that share the same GELU activation. Because the pattern is neither uniform nor dictated by activation choice, the paper concludes that the degree of linearity is acquired during training by each individual block. The same decomposition also identifies which blocks can be replaced by low-parameter linear approximations at small perplexity cost and which cannot.

Core claim

Each FFN block is treated as a position-wise map; its optimal linear approximation is obtained in closed form, and the fraction of held-out variance this map accounts for (R^2_lin) is heterogeneous and non-monotone with depth, independent of the activation function, and therefore a learned attribute of the trained block.
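One way to write the quantity in this claim is sketched below. The symbols x_i, y_i (a block's position-wise input and output), the train/test index sets, and the held-out output mean \bar{y} are notation assumed here for illustration; W_lin and b_lin are the names used in the referee report. The summary does not state whether the paper centres the denominator in exactly this way.

```latex
\[
(W_{\mathrm{lin}}, b_{\mathrm{lin}})
  = \arg\min_{W,\,b} \sum_{i \in \mathrm{train}} \lVert y_i - W x_i - b \rVert_2^2,
\qquad
R^2_{\mathrm{lin}}
  = 1 - \frac{\sum_{i \in \mathrm{test}} \lVert y_i - W_{\mathrm{lin}} x_i - b_{\mathrm{lin}} \rVert_2^2}
             {\sum_{i \in \mathrm{test}} \lVert y_i - \bar{y} \rVert_2^2}
\]
```

Under this reading, R^2_lin = 1 means the held-out outputs are reproduced exactly by an affine map, and values near 0 mean the affine map does no better than predicting the mean output.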

What carries the argument

Closed-form least-squares linear map from a block's input activations to its outputs, with R^2_lin on held-out data quantifying the recoverable linear component.
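A minimal sketch of this measurement, assuming NumPy and synthetic data in place of captured FFN activations. The function names, the toy "blocks", and the 3000/1000 split are hypothetical choices made here; the paper's own capture pipeline is not described in this summary.

```python
import numpy as np

def fit_linear_map(x_train, y_train):
    """Least-squares affine map y ~ x @ W + b, solved without an optimiser."""
    # A column of ones lets the bias be fitted jointly with the weights.
    design = np.hstack([x_train, np.ones((x_train.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coef[:-1], coef[-1]

def r2_heldout(x_test, y_test, W, b):
    """Fraction of held-out output variance explained by the affine map."""
    resid = y_test - (x_test @ W + b)
    total = y_test - y_test.mean(axis=0, keepdims=True)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

def gelu(z):
    # Widely used tanh approximation of GELU.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

# Synthetic stand-ins (assumption): random inputs and two toy "blocks".
rng = np.random.default_rng(0)
d, n, split = 16, 4000, 3000
x = rng.normal(size=(n, d))
w_in = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
w_out = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)
targets = {
    "linear block": x @ rng.normal(size=(d, d)),
    "toy GELU block": gelu(x @ w_in) @ w_out,
}

scores = {}
for name, y in targets.items():
    W, b = fit_linear_map(x[:split], y[:split])
    scores[name] = r2_heldout(x[split:], y[split:], W, b)
    print(f"{name}: R^2_lin = {scores[name]:.3f}")
```

The purely linear target scores essentially 1, while the randomly initialised GELU block scores noticeably lower. The toy numbers say nothing about trained models; they only show how the score separates the two cases.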

If this is right

  • Blocks with high R^2_lin admit single-layer linear replacements at a large parameter saving; the reported example is GPT-2's early FFN, replaced at 8x fewer parameters for +0.77 perplexity.
  • Blocks with low R^2_lin concentrate the computation that cannot be recovered linearly and therefore mark where nonlinear capacity is required.
  • Low-rank bilinear probes recover only a few additional points of R^2 from the residual, showing that the remaining computation is not a simple position-wise product.
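The 8x figure in the first bullet can be checked with simple arithmetic, assuming the replacement is one dense affine map from the model width back to the model width. That form of replacement is an assumption made here, since the summary says only "single-layer"; the widths 768 and 3072 are the standard GPT-2 small dimensions.

```python
def ffn_params(d_model, d_hidden):
    """Weights and biases of a standard two-layer FFN."""
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model

def linear_params(d_model):
    """Weights and bias of one affine map d_model -> d_model (assumed replacement)."""
    return d_model * d_model + d_model

d_model, d_hidden = 768, 3072  # GPT-2 small
full = ffn_params(d_model, d_hidden)
single = linear_params(d_model)
ratio = full / single
print(full, single, f"{ratio:.2f}")
```

With a 4x hidden expansion, the two weight matrices together hold eight times the weights of one square matrix, so the ratio sits just under 8 once biases are counted.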

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training appears to allocate nonlinear capacity unevenly across layers rather than spreading it uniformly.
  • Targeted linear replacement guided by per-block R^2_lin could yield more efficient hybrid transformer variants.
  • The method supplies a diagnostic that could be applied during training to monitor how linearity evolves.

Load-bearing premise

That the position-wise input-to-output map of an FFN block can be usefully split into an optimal linear map plus a residual whose size measures nonlinearity.

What would settle it

Observing that every block in a trained model has nearly identical R^2_lin values close to 1, or that GPT-2 and Pythia-160m exhibit matching per-block recoverability profiles despite different training.

Figures

Figures reproduced from arXiv: 2606.19379 by Stuart Whipp.

Figure 1: Distillation as measurement. A frozen language model is run once over a corpus, and a …
Figure 2: Per-block linear recoverability R^2_lin (exact closed-form ceiling) and zero-shot swap ∆PPL across all twelve blocks of GPT-2, Pythia-160m, and llama-160m. Linearity is jagged and non-monotone, and the profiles differ sharply across models, including the two same-size GELU models (GPT-2, Pythia).
  • Corpus: WikiText-2-raw [Merity et al., 2017]. ∼15–30 k token rows captured per block (seq len 128); held-out 2…
Figure 3: Residual nonlinearity (1 − R^2_lin) vs. the gain a rank-16 bilinear probe adds over the linear ceiling (R^2_poly − R^2_lin), per block. The gain is small everywhere and uncorrelated with residual nonlinearity (Pearson r ≈ 0): the residual is not recovered by a low-rank degree-2 (bilinear) layer.
… in between (≈ 0.31). And on a near-linear target poly's multiplicative-recruitment gate de-recruits during fitting …
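To make the quantity R^2_poly − R^2_lin concrete, here is a simplified stand-in for a rank-16 bilinear probe, assuming NumPy. It is not the paper's probe: the projections U and V are fixed at random and only the readout is solved by least squares, whereas the paper's "poly" layer is trained. All names and the toy target are hypothetical.

```python
import numpy as np

def fit_affine(features, targets):
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef

def apply_affine(features, coef):
    return features @ coef[:-1] + coef[-1]

def r2(y, y_hat):
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(0)
d, rank, split = 8, 16, 4000
x = rng.normal(size=(6000, d))

# Toy target: a linear part plus ONE position-wise product of two projections.
a, b, c = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
y = x @ rng.normal(size=(d, d)) + np.outer((x @ a) * (x @ b), c)

# Step 1: closed-form linear ceiling.
lin = fit_affine(x[:split], y[:split])
y_lin_train = apply_affine(x[:split], lin)
y_lin_test = apply_affine(x[split:], lin)
r2_lin = r2(y[split:], y_lin_test)

# Step 2: bilinear features (U x) * (V x), fitted to the linear residual.
U, V = rng.normal(size=(d, rank)), rng.normal(size=(d, rank))
phi = (x @ U) * (x @ V)
probe = fit_affine(phi[:split], y[:split] - y_lin_train)
r2_poly = r2(y[split:], y_lin_test + apply_affine(phi[split:], probe))

gain = r2_poly - r2_lin
print(f"R2_lin={r2_lin:.3f}  R2_poly={r2_poly:.3f}  gain={gain:.3f}")
```

On this toy target the residual really is a single position-wise product, so even the crude probe adds a clear gain. The figure reports the opposite pattern for trained blocks: the gain stays small regardless of how large the residual is.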
Figure 4: Global R^2_lin vs. per-feature median R^2, point size ∝ effective rank. Points near the diagonal are broadly linear (high rank); points far below the diagonal at high R^2_lin are low-rank, outlier-concentrated (rank ≈ 1). A high R^2_lin hides two structurally opposite regimes.
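The gap between a global R^2 and a per-feature median can be reproduced on synthetic data, as in the sketch below (NumPy assumed). The entropy-based effective rank of the fitted map is one common definition and is an assumption here; the summary does not say which definition the paper uses. The toy output has one high-variance linear feature and seven low-variance features with no linear signal.

```python
import numpy as np

def fit_predict(x_train, y_train, x_test):
    add_ones = lambda m: np.hstack([m, np.ones((m.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(add_ones(x_train), y_train, rcond=None)
    return add_ones(x_test) @ coef, coef[:-1]

def global_r2(y, y_hat):
    # Variance pooled over all output features, so large features dominate.
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean(axis=0)) ** 2).sum()

def per_feature_r2(y, y_hat):
    ss_res = ((y - y_hat) ** 2).sum(axis=0)
    ss_tot = ((y - y.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def effective_rank(W):
    # Exponential of the entropy of normalised singular values (assumed definition).
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d, split = 8, 3500
x = rng.normal(size=(5000, d))
y = np.abs(x @ rng.normal(size=(d, d)))      # even functions: no linear signal
y[:, 0] = 50.0 * (x @ rng.normal(size=d))    # one outlier feature, exactly linear

y_hat, W = fit_predict(x[:split], y[:split], x[split:])
g = global_r2(y[split:], y_hat)
med = float(np.median(per_feature_r2(y[split:], y_hat)))
er = effective_rank(W)
print(f"global R2={g:.3f}  median per-feature R2={med:.3f}  effective rank={er:.2f}")
```

This is the outlier-concentrated regime described in the caption: the global score is near 1 because one feature carries almost all the variance, while the typical feature is not linearly recoverable at all.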
Figure 5: Per-block linear recoverability R^2_lin across three corpus domains (colour = model, linestyle = corpus). The three profiles overlay closely per model, so the recoverability profile is a property of the model, not the corpus. The downstream and multiplicative findings are corpus-robust too. Repeating the full depth sweep (closed-form ceiling + seeded-poly + zero-shot ∆PPL, with its own held-out train/test s…
Original abstract

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.
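The under-convergence pitfall named at the end of the abstract can be illustrated on a toy problem, assuming NumPy. Plain gradient descent stands in for "a trained linear baseline" here, which is an assumption: the summary does not say which optimiser the paper compared against. The target is exactly linear, so any shortfall in R^2 is purely an optimisation artefact.

```python
import numpy as np

def r2(y, y_hat):
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(0)
# Ill-conditioned inputs: per-dimension scales span two orders of magnitude.
scales = np.array([10.0, 1.0, 0.1])
x = rng.normal(size=(4000, 3)) * scales
# Weights chosen so every input dimension contributes equal output variance.
w_true = np.array([[0.1], [1.0], [10.0]])
y = x @ w_true  # exactly linear target

x_tr, x_te, y_tr, y_te = x[:3000], x[3000:], y[:3000], y[3000:]

# Closed-form least squares: the ceiling.
w_closed, *_ = np.linalg.lstsq(x_tr, y_tr, rcond=None)
r2_closed = r2(y_te, x_te @ w_closed)

# Gradient descent on mean squared error with the largest stable-looking step 1/L.
gram = x_tr.T @ x_tr / len(x_tr)
lr = 1.0 / np.linalg.eigvalsh(gram).max()
w_gd = np.zeros((3, 1))
for _ in range(200):
    grad = x_tr.T @ (x_tr @ w_gd - y_tr) / len(x_tr)
    w_gd -= lr * grad
r2_gd = r2(y_te, x_te @ w_gd)

print(f"closed form R2={r2_closed:.4f}  gradient descent R2={r2_gd:.4f}")
```

The low-variance input directions converge slowly, so after 200 steps the trained map still misses a large share of a target that is perfectly linear. Reading that shortfall as nonlinearity would be a mistake, which is the reason given for reporting the closed-form ceiling.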

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper defines R^2_lin as the coefficient of determination achieved by the closed-form least-squares linear map from FFN input activations to output activations, evaluated on held-out data drawn from each model's forward passes. It reports that R^2_lin varies sharply and non-monotonically across the 12 blocks of GPT-2, Pythia-160m and llama-160m (near 1.0 in some blocks, below 0.3 in others), that identical GELU activations produce dissimilar layer-wise profiles in GPT-2 versus Pythia-160m, and therefore that per-block linear recoverability is a learned property rather than an architectural one. A low-rank bilinear probe recovers only a few additional points of variance in the residual, and the measure is proposed as a compression signal that identifies blocks safe for large linear replacement.

Significance. If the per-block R^2_lin values faithfully isolate the degree of linearity in the learned weights rather than merely reflecting model-specific activation statistics, the work supplies an optimiser-free diagnostic that challenges the uniform treatment of FFNs as nonlinear stores and supplies a concrete, falsifiable signal for targeted compression. The closed-form normal-equation construction and explicit warning about under-convergence of trained linear baselines are concrete methodological contributions.

major comments (2)
  1. [Abstract and §4 (cross-model comparison)] The central claim that recoverability 'is a learned property of individual trained blocks, not an architectural one' rests on the cross-model comparison of R^2_lin profiles for GPT-2 and Pythia-160m (both GELU). Because the held-out activations are drawn from each model's own residual stream, differences in input covariance or dynamic range could produce the observed profile divergence even if the weight matrices were identical. No experiment that feeds activations from one model into the FFN of the other, or that normalises input statistics, is described; this leaves the inference from profile difference to 'learned' status load-bearing and untested.
  2. [§3 (definition of R^2_lin) and experimental setup] The weakest-assumption paragraph notes that R^2_lin is computed on held-out activations from the model's own forward passes. The manuscript provides no quantitative characterisation of how narrow or model-specific these distributions are (e.g., condition numbers of the input Gram matrix, effective support size), which directly affects whether the reported heterogeneity can be attributed to the learned map rather than to the test distribution.
minor comments (2)
  1. [Abstract and §4] The abstract provides no details on data splits, token sampling, or statistical significance; the full manuscript should supply these (number of tokens, train/test split per block, bootstrap or jackknife error bars on R^2_lin) so that the reported heterogeneity can be verified.
  2. [§3] Notation for the linear map (W_lin, b_lin) and the precise definition of the residual variance should be stated once in a single equation block rather than scattered across the text.
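The bootstrap error bars requested in minor comment 1 could look like the sketch below, assuming NumPy and a synthetic regression in place of captured activations. The function `bootstrap_r2` and its defaults are hypothetical. Resampling rows independently ignores correlation between tokens of the same sequence, so a real analysis might resample whole sequences instead.

```python
import numpy as np

def r2(y, y_hat):
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean(axis=0)) ** 2).sum()

def bootstrap_r2(y, y_hat, n_boot=500, seed=0):
    """95% percentile interval for held-out R^2, resampling rows with replacement."""
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[i] = r2(y[idx], y_hat[idx])
    return np.percentile(stats, [2.5, 97.5])

# Synthetic example: signal variance 4, noise variance 1, so R^2 is about 0.8.
rng = np.random.default_rng(0)
n, d, split = 4000, 4, 3000
x = rng.normal(size=(n, d))
y = x @ np.ones((d, 1)) + rng.normal(size=(n, 1))

design = np.hstack([x, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(design[:split], y[:split], rcond=None)
y_hat = design[split:] @ coef

point = r2(y[split:], y_hat)
lo, hi = bootstrap_r2(y[split:], y_hat)
print(f"R2 = {point:.3f}  (95% bootstrap interval {lo:.3f} to {hi:.3f})")
```

Only the held-out rows are resampled; the linear map is fitted once. Intervals of this kind would show whether a gap between adjacent blocks exceeds sampling noise.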

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important nuances in the interpretation of R^2_lin and the strength of evidence for our central claim. We address each major comment below and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee (major comment 1, Abstract and §4): the claim that recoverability is learned rather than architectural rests on a cross-model comparison in which each model's FFNs see only that model's own residual-stream activations, so differences in input covariance or dynamic range are not controlled and the inference is untested.

    Authors: We agree that the cross-model comparison relies on model-specific activation distributions and that an explicit activation-swapping or input-normalization experiment would provide stronger isolation of the effect of learned weights. The within-model heterogeneity across blocks (which share identical architecture and activation within each model) remains the primary evidence that recoverability is not uniformly determined by architecture. The GPT-2 vs. Pythia-160m difference with matched GELU and width is offered as supporting evidence that training produces distinct profiles, but we acknowledge the referee's point that input statistics are not controlled. In revision we will add a clarifying paragraph noting this limitation and the value of future cross-feeding experiments; we do not alter the central claim but qualify its evidential basis. revision: partial

  2. Referee (major comment 2, §3 and experimental setup): the manuscript gives no quantitative characterisation of how narrow or model-specific the held-out activation distributions are (e.g., condition numbers of the input Gram matrix, effective support size), which affects whether the heterogeneity can be attributed to the learned map.

    Authors: We accept that quantitative descriptors of the input distributions would help readers assess whether heterogeneity reflects the learned map. In the revised manuscript we will report, for each block and model, the condition number of the input Gram matrix (computed on the same held-out activations used for R^2_lin), the effective rank, and a simple measure of dynamic range. These additions will be placed in §3 and the appendix. revision: yes
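The three descriptors promised in this response can be computed as in the sketch below, assuming NumPy. The specific definitions are assumptions made here, since the rebuttal does not fix them: an uncentred Gram matrix, an entropy-based effective rank, and dynamic range as the ratio of largest to smallest per-dimension standard deviation.

```python
import numpy as np

def input_diagnostics(x):
    """Conditioning descriptors for a matrix of activations (rows = tokens)."""
    gram = x.T @ x / x.shape[0]
    eig = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    tiny = np.finfo(float).tiny  # guards against division by zero
    cond = eig.max() / max(eig.min(), tiny)
    p = eig / eig.sum()
    p = p[p > 0]
    eff_rank = float(np.exp(-(p * np.log(p)).sum()))
    std = x.std(axis=0)
    dyn_range = std.max() / max(std.min(), tiny)
    return {"cond": float(cond), "effective_rank": eff_rank, "dynamic_range": float(dyn_range)}

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(5000, 8))
outlier = isotropic.copy()
outlier[:, 0] *= 100.0  # one dimension with a much larger scale

diag_iso = input_diagnostics(isotropic)
diag_out = input_diagnostics(outlier)
print(diag_iso)
print(diag_out)
```

A single outlier dimension pushes the condition number up by roughly four orders of magnitude and collapses the effective rank toward 1, which is the kind of input structure the referee suggests could influence R^2_lin independently of the learned weights.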

Circularity Check

0 steps flagged

No circularity: R^2_lin is standard closed-form regression on held-out activations

Full rationale

The paper computes R^2_lin directly as the held-out variance explained by the exact least-squares linear map (closed-form normal equations) fitted to each FFN block's position-wise input-output pairs. This quantity is defined from the observed activations alone and does not reduce to any model parameter, self-citation, or prior result by construction. The inference that recoverability is learned (rather than architectural) rests on the empirical observation of heterogeneous, non-monotone profiles that differ across models with identical activations; these comparisons are external to the measurement itself and introduce no self-referential loop. No load-bearing step invokes fitted inputs renamed as predictions, uniqueness theorems, or ansatzes smuggled via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The measurement rests only on the standard mathematical fact that the least-squares solution is the unique L2-optimal linear map; no free parameters are introduced beyond the linear coefficients themselves, and no new entities are postulated.

axioms (1)
  • Standard math: the ordinary least-squares solution minimises the L2 residual for a linear map between finite-dimensional vector spaces. Invoked to obtain the exact linear approximation without optimisation.



Reference graph

Works this paper leans on

17 extracted references · 3 canonical work pages · 3 internal anchors

  1. Transformer Feed-Forward Layers Are Key-Value Memories. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
  2. Shazeer, Noam.
  3. Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke.
  4. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations.
  5. The Pi-Sigma Network: An Efficient Higher-Order Neural Network for Pattern Classification and Function Approximation. International Joint Conference on Neural Networks (IJCNN).
  6. Multiplicative Interactions and Where to Find Them. International Conference on Learning Representations (ICLR).
  7. Factorization Machines. IEEE International Conference on Data Mining (ICDM), 2010.
  8. Higher-Order Factorization Machines. Advances in Neural Information Processing Systems (NeurIPS).
  9. Language Models are Unsupervised Multitask Learners.
  10. Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and Skowron, Aviya and Sutawika, Lintang and van der Wal, Oskar.
  11. Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  12. Pointer Sentinel Mixture Models. International Conference on Learning Representations (ICLR).
  13. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
  14. Revisiting Model Stitching to Compare Neural Representations. Advances in Neural Information Processing Systems (NeurIPS).
  15. Understanding Intermediate Layers Using Linear Classifier Probes. arXiv preprint arXiv:1610.01644.
  16. Deep Polynomial Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  17. When Does the Pi Branch Fire? Multiplicative Hypernetworks for Few-Shot Weight Initialisation.