arXiv: 2604.02459 · v1 · submitted 2026-04-02 · cs.LG · cs.AI · cs.CL

On the Geometric Structure of Layer Updates in Deep Language Models

Pith reviewed 2026-05-13 21:20 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL
keywords geometric structure, layer updates, tokenwise component, residual, language models, Transformers, alignment, approximation error

The pith

Layer updates in language models decompose into a dominant tokenwise component aligned with the full update and a geometrically distinct residual tied to functional changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how representations evolve across layers in deep language models by decomposing each layer update into two parts. The dominant part is tokenwise, and the full update aligns with it almost perfectly. The leftover residual is geometrically distinct: it aligns more weakly with the tokenwise component, deviates from it by larger angles, and projects less onto the dominant tokenwise subspace. This separation matters because the error incurred by approximating a layer with only the tokenwise part is strongly associated with how much the model's output shifts, with Spearman correlations often above 0.7 and up to 0.95 in larger models. The approach applies across architectures, including Transformers and state-space models, and offers a way to probe what each layer actually does.

Core claim

Layerwise updates admit a decomposition into a dominant tokenwise component and a residual not captured by restricted tokenwise function classes. The full layer update aligns almost perfectly with the tokenwise component, while the residual shows weaker alignment, larger angular deviation, and lower projection onto the tokenwise subspace. Approximation error under the restricted tokenwise model correlates strongly with output perturbation, with Spearman values often above 0.7 and up to 0.95 in larger models. This indicates that significant computation concentrates in the geometrically distinct residual.
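The decomposition and alignment measurements in the core claim can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the restricted tokenwise class is a single affine map shared across tokens, fits it by ordinary least squares, and uses synthetic arrays in place of real hidden states. The names `fit_tokenwise_affine` and `cosine_rows` are hypothetical, and because the synthetic residual is random noise, its weak alignment here holds by construction.

```python
import numpy as np

def fit_tokenwise_affine(H, U):
    """Least-squares fit of U ≈ H @ W + b, with W and b shared across tokens."""
    X = np.hstack([H, np.ones((H.shape[0], 1))])    # append a bias column
    coef, *_ = np.linalg.lstsq(X, U, rcond=None)
    return coef[:-1], coef[-1]                       # W: (d, d), b: (d,)

def cosine_rows(A, B, eps=1e-12):
    """Row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1) + eps
    return num / den

# Synthetic stand-in for real activations (assumption: no model is loaded).
rng = np.random.default_rng(0)
n, d = 512, 16
H = rng.normal(size=(n, d))                          # layer inputs, one row per token
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
U = H @ W_true + 0.1 * rng.normal(size=(n, d))       # full update: output minus input

W, b = fit_tokenwise_affine(H, U)
U_tok = H @ W + b                                    # tokenwise component
R = U - U_tok                                        # residual

align_full = cosine_rows(U, U_tok)                   # full update vs tokenwise
align_resid = cosine_rows(R, U_tok)                  # residual vs tokenwise
angle_resid = np.degrees(np.arccos(np.clip(align_resid, -1.0, 1.0)))
rel_err = np.linalg.norm(R, axis=1) / (np.linalg.norm(U, axis=1) + 1e-12)

print(f"mean cos(full, tok)  = {align_full.mean():.3f}")
print(f"mean cos(resid, tok) = {align_resid.mean():.3f}")
print(f"mean residual angle  = {angle_resid.mean():.1f} deg")
print(f"mean relative error  = {rel_err.mean():.3f}")
```

With real activations, `H` and `U` would come from consecutive layer outputs; whether the residual then shows the reported weak alignment is the empirical question the paper addresses.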

What carries the argument

Decomposition of layer updates into tokenwise component and residual, with measurements of alignment, angular deviation, and projection onto dominant subspaces.
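The projection measurement named above can be sketched in the same spirit. The example below assumes the dominant subspace is spanned by the top-k right singular vectors of the tokenwise updates, and it reports the fraction of each vector's squared norm that falls inside that subspace. The data are synthetic, `projection_fraction` is a hypothetical helper, and the values of k mirror the top-1, top-4, and top-8 settings named in the Figure 4 caption.

```python
import numpy as np

def projection_fraction(V, basis):
    """Fraction of each row's squared norm inside span(basis).

    basis has shape (k, d) with orthonormal rows.
    """
    coords = V @ basis.T                              # (n, k) subspace coordinates
    return np.sum(coords**2, axis=1) / (np.sum(V**2, axis=1) + 1e-12)

rng = np.random.default_rng(1)
n, d, k_true = 400, 32, 4

# Synthetic tokenwise updates confined to a k_true-dimensional subspace (assumption).
Q, _ = np.linalg.qr(rng.normal(size=(d, k_true)))     # (d, k_true), orthonormal columns
U_tok = rng.normal(size=(n, k_true)) @ Q.T
R = 0.3 * rng.normal(size=(n, d))                     # isotropic stand-in residual
U = U_tok + R                                         # full update

# Right singular vectors of the tokenwise updates give the dominant directions.
_, _, Vt = np.linalg.svd(U_tok, full_matrices=False)

frac = {}
for k in (1, 4, 8):
    basis = Vt[:k]
    frac[k] = {
        "full": projection_fraction(U, basis).mean(),
        "tok": projection_fraction(U_tok, basis).mean(),
        "resid": projection_fraction(R, basis).mean(),
    }
    print(k, {name: round(float(v), 3) for name, v in frac[k].items()})
```

An isotropic residual in d dimensions projects roughly k/d of its energy onto any k-dimensional subspace, which gives a baseline against which a measured residual projection can be compared.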

If this is right

  • The residual is not a minor correction but a distinct part of the transformation.
  • Approximation error under the restricted tokenwise model tracks output perturbation, with Spearman correlations often above 0.7 and up to 0.95 in larger models.
  • The pattern holds across Transformers and state-space models.
  • Most layer updates act as structured reparameterizations along a dominant direction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This geometric split could help design simpler models by approximating the main part while handling the residual separately.
  • Targeting the residual might enable more precise interventions in model behavior without affecting the bulk of the update.
  • Future work could test whether this structure appears in other sequence models or emerges over the course of training.
  • The high correlations suggest the residual captures the non-trivial computation layers perform.

Load-bearing premise

The restricted tokenwise function classes meaningfully capture the dominant behavior of layer updates and the alignment metrics reflect genuine separation rather than artifacts.

What would settle it

Finding that the approximation error under the tokenwise model shows low or no correlation with output perturbation in the tested models, or that the residual aligns with the tokenwise component as well as the full update does.
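The first half of this test can be sketched as a rank correlation between per-token approximation error and a measure of output shift. The example assumes output perturbation is measured as the KL divergence between next-token distributions with and without the residual, which is one plausible reading of the KL columns in the Figure 2 table rather than a confirmed detail. The readout matrix `unembed` and all data are synthetic stand-ins.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_rows(logits_p, logits_q):
    """KL(p || q) per row, with p and q given as logits."""
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    return np.sum(np.exp(lp) * (lp - lq), axis=-1)

def average_ranks(x):
    """Ranks 1..n, with tied values assigned their average rank."""
    x = np.asarray(x, dtype=float)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)                          # last rank in each tie group
    starts = ends - counts + 1                        # first rank in each tie group
    return ((starts + ends) / 2.0)[inverse]

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of ranks."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

rng = np.random.default_rng(2)
n, d, vocab = 300, 16, 50
h = rng.normal(size=(n, d))                           # hidden states after the true update
scale = rng.uniform(0.0, 1.0, size=(n, 1))            # per-token residual size (synthetic)
resid = scale * rng.normal(size=(n, d))               # part the tokenwise fit misses
unembed = rng.normal(size=(d, vocab)) / np.sqrt(d)    # hypothetical readout matrix

approx_err = np.linalg.norm(resid, axis=1)            # per-token approximation error
kl = kl_rows(h @ unembed, (h - resid) @ unembed)      # output shift when residual is dropped
rho = spearman(approx_err, kl)
print(f"Spearman rho = {rho:.2f}")
```

A low value of `rho` on real model activations would count against the paper's functional claim; a high value on synthetic data like this only confirms that the measurement behaves as intended.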

Figures

Figures reproduced from arXiv: 2604.02459 by Jun-Sik Yoo.

Figure 1
Illustration of the decomposition into a local linear approximation and a residual term. The tokenwise prediction captures the dominant update direction, while the residual reflects deviations from this direction that arise from non-local structure.
Adjacent body text (truncated): "we will show empirically, this component is not simply a small correction, but is often weakly aligned or substantially deviated from the tokenwise update dire…"
Figure 2
Residual-output relationship across models and layers. (a) Across architectures, residual error strongly correlates with output deviation. (b) Representative token-level scatter (Pythia-1B). (c) Layer-wise variation of residual-output alignment. (d) Residual magnitude varies across layers, revealing structured regimes.
Adjacent table (truncated): columns Regime, RelErr (Lin) ↓, RelErr (MLP) ↓, KL (Lin) ↓, KL (MLP) ↓, ρ (Lin) ↑, ρ (MLP) ↑; the first row begins Low, 0.18, …
Figure 3
Geometric structure of layer updates. The full update is strongly aligned with the tokenwise approximation (left), while the residual exhibits large angular deviation (right), indicating a geometrically distinct component.
[Plot residue from adjacent panels: categories Full, Tok, Resid; y-axis "Projection fraction", 0.0 to 1.0.]
Figure 4
Projection onto dominant tokenwise subspace. The full and tokenwise updates lie almost entirely within a low-dimensional subspace, while the residual exhibits significantly lower projection, confirming its geometric separation. Results shown for top-1, top-4, and top-8 singular vectors.
Adjacent body text (truncated): "In high-error regimes, this alignment degrades for linear maps, while MLPs partially recover it due to increased express…"
Figure 5
Figure 5 shows that increasing k improves stability of the local fit, while increasing rank r reduces approximation error, indicating that the dominant tokenwise structure is low-dimensional but not strictly diagonal.
[Plot residue: one panel titled "EleutherAI_pythia-70m: locality sweep (L=5, r=64)" plots RelErr against k (local neighborhood size); a further panel has rank r on the x-axis and is truncated.]
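The rank dependence described for Figure 5 can be illustrated with a small sweep. This sketch assumes a linear tokenwise map fitted by least squares and then truncated to rank r via its singular value decomposition. The paper's actual rank-constrained fit and its local neighborhood parameter k are not specified in this text, so neither is reproduced here. Truncating a fitted map is also not the same as fitting under a rank constraint, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r_true = 600, 24, 6
H = rng.normal(size=(n, d))                            # layer inputs, one row per token

# Synthetic low-rank map plus small noise (assumption).
A = rng.normal(size=(d, r_true))
B = rng.normal(size=(r_true, d))
U = H @ (A @ B) / np.sqrt(d * r_true) + 0.05 * rng.normal(size=(n, d))

# Unconstrained linear fit (bias omitted for brevity), then its SVD.
W, *_ = np.linalg.lstsq(H, U, rcond=None)
P, s, Qt = np.linalg.svd(W)

def rel_err_at_rank(r):
    """Relative error of the update when W is truncated to its top-r singular directions."""
    W_r = (P[:, :r] * s[:r]) @ Qt[:r]                  # best rank-r approximation of W
    return np.linalg.norm(U - H @ W_r) / np.linalg.norm(U)

errors = {r: rel_err_at_rank(r) for r in (1, 2, 4, 6, 12, 24)}
for r, e in errors.items():
    print(f"rank {r:2d}: RelErr = {e:.3f}")
```

In this synthetic setup the error should flatten once r reaches the true rank, which is the kind of signature a low-dimensional tokenwise structure would leave.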
Original abstract

We study the geometric structure of layer updates in deep language models. Rather than analyzing what information is encoded in intermediate representations, we ask how representations change from one layer to the next. We show that layerwise updates admit a decomposition into a dominant tokenwise component and a residual that is not captured by restricted tokenwise function classes. Across multiple architectures, including Transformers and state-space models, we find that the full layer update is almost perfectly aligned with the tokenwise component, while the residual exhibits substantially weaker alignment, larger angular deviation, and significantly lower projection onto the dominant tokenwise subspace. This indicates that the residual is not merely a small correction, but a geometrically distinct component of the transformation. This geometric separation has functional consequences: approximation error under the restricted tokenwise model is strongly associated with output perturbation, with Spearman correlations often exceeding 0.7 and reaching up to 0.95 in larger models. Together, these results suggest that most layerwise updates behave like structured reparameterizations along a dominant direction, while functionally significant computation is concentrated in a geometrically distinct residual component. Our framework provides a simple, architecture-agnostic method for probing the geometric and functional structure of layer updates in modern language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that layer updates in deep language models decompose into a dominant tokenwise component (captured by restricted function classes) and a geometrically distinct residual. Across Transformers and state-space models, the full update aligns almost perfectly with the tokenwise part, while the residual shows weaker alignment, larger angular deviation, and lower projection onto the tokenwise subspace. Approximation error under the restricted tokenwise model correlates strongly with output perturbation (Spearman often >0.7, up to 0.95 in larger models), suggesting layer updates act as structured reparameterizations with key computation in the residual. The work offers an architecture-agnostic probing framework.

Significance. If the decomposition is robust to a priori fixed function classes, the framework offers a simple, architecture-agnostic tool for separating geometric and functional aspects of layer transformations. This could inform interpretability, analysis of reparameterization vs. novel computation, and techniques like pruning or editing in Transformers and SSMs by highlighting that most updates follow dominant directions while residuals carry functional weight.

major comments (1)
  1. Abstract: The reported Spearman correlations (exceeding 0.7, up to 0.95) between approximation error and output perturbation lack error bars, data-exclusion criteria, or verification that the restricted tokenwise function classes were not tuned or selected post-hoc. This is load-bearing because the geometric separation (full update vs. residual alignment and projection) and functional link depend directly on class definition; without a priori specification, the residual may be defined as the remainder after best fit, rendering the distinction partly mechanical rather than intrinsic.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comment below and will revise the paper to incorporate additional statistical details and clarifications.

Point-by-point responses
  1. Referee: Abstract: The reported Spearman correlations (exceeding 0.7, up to 0.95) between approximation error and output perturbation lack error bars, data-exclusion criteria, or verification that the restricted tokenwise function classes were not tuned or selected post-hoc. This is load-bearing because the geometric separation (full update vs. residual alignment and projection) and functional link depend directly on class definition; without a priori specification, the residual may be defined as the remainder after best fit, rendering the distinction partly mechanical rather than intrinsic.

    Authors: We thank the referee for highlighting this point. The restricted tokenwise function classes are defined a priori in Section 3 (Methods) as the class of functions f(h_i) = W h_i + b for each token i, where the parameters W and b are shared across all tokens in the sequence. This definition is fixed in advance based on the token-independent structure common to many language model layers and is not tuned or selected after observing the data. The same class is applied uniformly across all models, layers, and experiments. No post-hoc selection occurred. We will add bootstrap-derived error bars (95% confidence intervals) to the reported Spearman correlations in the revised abstract, figures, and text. No data points were excluded; the statistics include all layers from all evaluated models. These changes clarify that the geometric distinction arises from the pre-specified class rather than a mechanical remainder after arbitrary fitting. revision: yes
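The bootstrap error bars promised in the response could be computed along the following lines. This is a percentile-bootstrap sketch under the assumption that (error, perturbation) pairs are resampled independently with replacement; the authors' resampling unit (tokens, layers, or models) is not stated, and the data below are synthetic. `bootstrap_spearman_ci` is a hypothetical helper.

```python
import numpy as np

def average_ranks(x):
    """Ranks 1..n, with tied values assigned their average rank."""
    x = np.asarray(x, dtype=float)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts + 1
    return ((starts + ends) / 2.0)[inverse]

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of ranks."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

def bootstrap_spearman_ci(x, y, n_boot=1000, level=0.95, seed=0):
    """Point estimate and percentile-bootstrap interval for Spearman rho."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)               # resample pairs with replacement
        stats[i] = spearman(x[idx], y[idx])
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return spearman(x, y), lo, hi

# Synthetic stand-in for per-layer (approximation error, output perturbation) pairs.
rng = np.random.default_rng(4)
err = rng.normal(size=200)
perturb = err + 0.5 * rng.normal(size=200)

rho, lo, hi = bootstrap_spearman_ci(err, perturb)
print(f"rho = {rho:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

If observations within a layer or model are correlated, resampling whole layers or models rather than individual pairs would likely give more honest intervals.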

Circularity Check

0 steps flagged

No significant circularity; empirical measurements of alignment and correlations stand independently.

full rationale

The paper presents a decomposition of layer updates into a tokenwise component and residual based on applying restricted tokenwise function classes, followed by direct empirical measurements of alignment, angular deviation, projection onto subspaces, and Spearman correlations (0.7–0.95) between approximation error and output perturbation. These quantities are computed from observed data across models and are not shown to reduce by construction to any fitted parameter, self-defined quantity, or self-citation chain. No equations or steps in the provided text equate a prediction to its own input via definitional closure, post-hoc selection that forces the reported separation, or imported uniqueness theorems. The framework is described as architecture-agnostic probing, with results framed as observations rather than derivations that presuppose their own outcomes. This is the common case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that a restricted class of tokenwise functions can be defined to isolate the dominant update direction; this is a domain assumption introduced for the analysis rather than derived from prior results.

axioms (1)
  • domain assumption: Restricted tokenwise function classes adequately isolate the dominant component of layer updates
    Invoked to define the decomposition and to interpret alignment metrics.

pith-pipeline@v0.9.0 · 5510 in / 1189 out tokens · 56198 ms · 2026-05-13T21:20:04.995916+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Spectral phase transitions and trainability in neural network learning dynamics

    cond-mat.dis-nn · 2026-06 · unverdicted · novelty 6.0

    SGD on neural network weights induces a BBP phase transition that detaches signal eigenvalues from the random bulk, yielding an analytically solvable phase diagram for trainability in a linear teacher-student model.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    URL https://transformer-circuits.pub/2021/framework/index.html. Accessed: 2026-01-28. Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://arxiv.org/abs/2312.00752. Lipton, Z. C. The mythos of model interpretability, 2017. URL https://arxiv.org/abs/1606.03490. nostalgebraist. Interpreting GPT-2 with the...

  2. [2]

    naacl-main.401/

    URL https://aclanthology.org/2021.naacl-main.401/. A. Detailed Experimental Setup. This section provides complete experimental details for reproducibility. A.1. Models. We evaluate our method across the following pretrained sequence models: • DistilGPT2 (distilgpt2) • Pythia-70M (EleutherAI/...