Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams

James Henry

arxiv: 2605.25848 · v1 · pith:EDCUG5WDnew · submitted 2026-05-25 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-29 22:59 UTC · model grok-4.3

keywords concept probes · residual streams · directional rotation · handoff layer · concept allocation zone · geometric evolution maps · transformer interpretability

The pith

Concept probes from transformer residual streams are more reliable when extracted from the stable handoff layer after directional rotation ends in the concept allocation zone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that concept representations rotate substantially in direction while forming inside the primary Concept Allocation Zone and only settle into a usable direction at a later handoff layer. Geometric Evolution Maps follow the entire trajectory of a concept through successive residual-stream activations to locate that handoff point and pull the settled probe direction from it. Standard practice of picking a fixed late layer or the peak of a separation score therefore often selects an unstable direction. Across 23 models and 17 concepts the GEM-derived probes match or beat peak-layer probes in 68.5 percent of cases and strictly beat them in 66.2 percent, with a clearer advantage in multi-head attention architectures.

Core claim

GEMs track the full directional trajectory of a concept through residual stream activations, identify the handoff layer where rotation ceases after the primary Concept Allocation Zone, and extract the settled probe direction from that layer. The entry-to-exit cosine similarity within CAZs averages 0.233, showing that early directions are poor predictors of final ones. Ablation tests on 391 concept-model pairs confirm that handoff-layer probes are at least as precise as peak-layer probes in 268 cases and strictly better in 259.
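A minimal sketch of how an entry-to-exit cosine of this kind could be computed, assuming each layer's concept direction is estimated as a difference of class means. That estimator, and the names `concept_direction` and `entry_exit_cosine`, are illustrative assumptions and are not taken from the paper or from rosetta_tools.

```python
import numpy as np

def concept_direction(pos, neg):
    """Unit difference-of-means direction between two activation sets.

    pos, neg: arrays of shape (n_examples, d_model) holding residual-stream
    activations for concept-present and concept-absent inputs (assumed setup).
    """
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def entry_exit_cosine(dir_entry, dir_exit):
    """Cosine similarity between the directions at CAZ entry and CAZ exit."""
    denom = np.linalg.norm(dir_entry) * np.linalg.norm(dir_exit)
    return float(np.dot(dir_entry, dir_exit) / denom)

# Toy directions: identical vectors give 1.0 (no rotation),
# orthogonal vectors give 0.0 (the exit direction is unrelated to entry).
entry = np.array([1.0, 0.0, 0.0])
exit_same = entry.copy()
exit_orth = np.array([0.0, 1.0, 0.0])
eec_same = entry_exit_cosine(entry, exit_same)
eec_orth = entry_exit_cosine(entry, exit_orth)
```

Under this reading, the reported mean of 0.233 sits much closer to the orthogonal case than to the no-rotation case.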

What carries the argument

Geometric Evolution Maps (GEMs), which follow the directional trajectory of concept activations through residual streams to locate the post-CAZ handoff layer where rotation stabilizes.
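The trajectory-following idea can be sketched as below. This is a hedged illustration and not the rosetta_tools implementation: the function names, the assumption that the CAZ exit layer is already known, and the 5-degree threshold are all hypothetical choices made for the example.

```python
import numpy as np

def angular_steps(directions):
    """Rotation in degrees between concept directions at consecutive layers.

    directions: array of shape (n_layers, d_model), one direction per layer.
    Returns shape (n_layers - 1,); entry i is the rotation from layer i to i+1.
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos = np.clip(np.sum(unit[:-1] * unit[1:], axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def find_handoff_layer(directions, caz_exit, threshold_deg=5.0):
    """First layer at or after caz_exit whose incoming rotation is small.

    threshold_deg is a hypothetical stand-in for the stopping criterion;
    returns None when rotation never drops below it.
    """
    steps = angular_steps(directions)
    for layer in range(max(caz_exit, 1), len(directions)):
        if steps[layer - 1] < threshold_deg:
            return layer
    return None

# Toy 2-D trajectory: the direction turns 30 degrees per layer, then settles.
angles = np.radians([0.0, 30.0, 60.0, 90.0, 92.0, 93.0])
toy_directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
handoff = find_handoff_layer(toy_directions, caz_exit=3)
settled_probe = toy_directions[handoff]
```

In the toy trajectory the direction is still turning when it reaches the assumed CAZ exit (layer 3), so the probe is taken one layer later, once the step-to-step rotation has fallen under the threshold.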

If this is right

GEM probes match or exceed peak-layer performance in roughly two-thirds of tested concept-model pairs.
The advantage appears more consistently in MHA architectures than in GQA ones.
An adaptive ablation-width rule improves results in three-quarters of the near-final-layer cases it triggers.
Control ablations show the effect is specific to the learned concept direction rather than generic directional suppression.
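The direction-specificity control in the last point can be illustrated with a small synthetic example. This sketch assumes ablation means projecting a direction out of the activations and that separation is the gap between class means along the probe; both are assumptions for illustration, and the data below is random toy data, not results from the paper.

```python
import numpy as np

def ablate(acts, direction):
    """Project `direction` out of every row of `acts` (shape (n, d))."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)

def separation(pos, neg, probe):
    """Absolute gap between the two class means along the probe direction."""
    u = probe / np.linalg.norm(probe)
    return float(abs((pos.mean(axis=0) - neg.mean(axis=0)) @ u))

def separation_reduction(pos, neg, probe, ablated_dir):
    """Fraction of probe separation lost after ablating `ablated_dir`."""
    base = separation(pos, neg, probe)
    after = separation(ablate(pos, ablated_dir), ablate(neg, ablated_dir), probe)
    return (base - after) / base

# Synthetic activations: the concept lives along the first coordinate.
rng = np.random.default_rng(0)
d, n = 64, 200
concept = np.zeros(d)
concept[0] = 1.0
pos = rng.normal(size=(n, d)) + 2.0 * concept
neg = rng.normal(size=(n, d)) - 2.0 * concept

concept_red = separation_reduction(pos, neg, concept, concept)
random_reds = []
for seed in range(10):  # 10 random seeds, mirroring the control's design
    r = np.random.default_rng(seed + 1).normal(size=d)
    random_reds.append(separation_reduction(pos, neg, concept, r))
beats_all_seeds = all(concept_red > r for r in random_reds)
```

Ablating the concept direction removes essentially all of the separation, while a random direction in 64 dimensions overlaps only slightly with the concept and removes very little. The paper's specificity ratio compares the same two quantities on real activations.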

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interpretability pipelines could replace single-layer selection with trajectory-based stabilization checks as a default step.
The same rotation-tracking logic might reveal analogous stabilization points in non-transformer sequence models.
If handoff layers prove consistent across tasks, they could serve as a lightweight diagnostic for whether a model has finished assembling a given concept.

Load-bearing premise

The point where directional rotation stops after the primary Concept Allocation Zone supplies the single best stable direction for probing that concept.

What would settle it

A new collection of models and concepts in which handoff-layer probes show lower average precision than peak-layer probes would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.25848 by James Henry.

**Figure 1.** Summarizes the key motivating observation. Across 391 concept × model pairs (23 architectures, 17 concept types), we compute the cosine similarity between the dominant concept direction at CAZ entry and at CAZ exit: the entry-exit cosine (EEC). A value of 1.0 would mean no rotation: the concept direction is identical before and after the allocation zone. A value of 0.0 would mean the post-assembly directi…

**Figure 2.** Random-direction ablation control. Violin plots of random-direction separation reduction (10 seeds per pair) by architecture cohort, with concept-direction reduction overlaid as individual points. Concept directions (mean 45.6% reduction) are consistently outliers relative to the random distribution (mean 0.24%); median specificity ratio 377×.

5.4 Multi-Node Relay Structure. Motivation. The preceding validation treats each concept’s GEM as a single node: the handoff layer identified by the angular velocity threshold. In practice, many concepts have multiple GEM nodes: the CAZ Framework anal…

Original abstract

Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common practice of probing at a fixed late layer or at the peak of a separation score function ignores a fundamental structural feature: concept representations undergo substantial directional rotation during their assembly phase, and do not settle into a stable direction until a characteristic handoff layer after the primary Concept Allocation Zone (CAZ). We introduce Geometric Evolution Maps (GEMs), which track the full directional trajectory of a concept through residual stream activations, identify the handoff layer where rotation ceases, and extract the settled probe direction from that layer. Across 23 architectures spanning 70M to 14B parameters and 17 concept types, the entry-to-exit cosine similarity within CAZs has a mean of 0.233, showing that probe direction at CAZ entry does not reliably predict probe direction at exit. Ablation experiments across 391 concept x model pairs (23 models x 17 concepts) show that GEM-extracted probes are at least as precise as peak-layer probes in 268/391 trials (68.5%), and strictly outperform in 259/391 (66.2%). The architecture split is pronounced: MHA models favour the handoff in 173/221 trials (78.3%); GQA models favour the handoff in only 56/119 trials (47.1%). Model-level Wilcoxon: W=214, N=23, p=0.010 (one-sided). An adaptive ablation width rule targets the 79/391 near-final-layer cases: it improves probe quality in 60/79 triggered cases (75.9%), mean gain +7.44pp. A direction-specificity control confirms the ablation effect is concept-direction specific: median 377x suppression rate versus random-direction ablation (99.1% of concept directions beat all 10 random seeds). Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).
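The proportions quoted in the abstract can be checked with a few lines of arithmetic. The counts and percentages below are copied from the abstract; the dictionary labels and the `pct` helper are just names chosen for this check.

```python
# (successes, trials, percentage as stated in the abstract)
reported = {
    "GEM at least as precise as peak": (268, 391, 68.5),
    "GEM strictly better than peak": (259, 391, 66.2),
    "MHA favours handoff": (173, 221, 78.3),
    "GQA favours handoff": (56, 119, 47.1),
    "Adaptive rule improves triggered cases": (60, 79, 75.9),
}

def pct(successes, trials):
    """Percentage of successes among trials."""
    return 100.0 * successes / trials

# Absolute gap between the recomputed and the stated percentage.
gaps = {label: abs(pct(k, n) - stated) for label, (k, n, stated) in reported.items()}

total_pairs = 23 * 17          # models x concepts
cohort_pairs = 221 + 119       # MHA trials + GQA trials
unassigned_pairs = total_pairs - cohort_pairs
```

Every stated percentage agrees with its count to within rounding at one decimal place. The MHA and GQA cohorts together account for 340 of the 391 pairs, and the abstract does not say how the remaining 51 pairs are classified.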

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GEM tracks directional rotation in residual streams to pick a post-CAZ handoff layer for probes. The ablations show it matches or beats peak-layer baselines in most cases, especially for MHA models, but the selection rule itself is not tested against other stability criteria.

The core contribution is a geometric procedure that follows the full directional path of a concept through the residual stream, locates the Concept Allocation Zone, and pulls the probe from the first layer after rotation stops. Across 23 models and 17 concepts the entry-to-exit cosine inside those zones averages 0.233, which already shows why fixed late-layer or peak-separation choices can be unstable. The ablation on 391 pairs finds the GEM direction at least as good as the peak-layer baseline 68.5 percent of the time and strictly better 66.2 percent of the time, with a clear architecture split: the handoff is favoured in 78.3 percent of MHA trials versus 47.1 percent of GQA trials, backed by a model-level Wilcoxon test at p=0.010 (one-sided). The author also adds a direction-specificity control and an adaptive ablation width rule that improves 60 of the 79 near-final-layer cases it triggers on.

The main gap is that the paper never pits the rotation-cessation rule against other plausible ways to pick a stable direction, such as the layer that maximizes held-out probe accuracy or the point of minimal rotation inside a sliding window. The need for the adaptive rule on 79 cases already hints that simple cessation is incomplete. Without those head-to-head checks it is still possible that the reported gains come from picking a different layer rather than from having found the uniquely best stable one.
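One of the alternative criteria mentioned above, minimal rotation inside a sliding window, could look like the sketch below. It is a hypothetical baseline written for illustration: the function name, the window size, and the choice to return the last layer of the quietest window are assumptions, and nothing here comes from the paper.

```python
import numpy as np

def min_rotation_window_layer(directions, window=3):
    """Last layer of the `window`-layer span with the least total rotation.

    directions: array of shape (n_layers, d_model), one direction per layer.
    """
    n_layers = len(directions)
    if window < 2 or window > n_layers:
        raise ValueError("window must be between 2 and the number of layers")
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos = np.clip(np.sum(unit[:-1] * unit[1:], axis=1), -1.0, 1.0)
    steps = np.arccos(cos)                      # rotation between neighbours
    span = window - 1                           # steps covered by one window
    totals = np.convolve(steps, np.ones(span), mode="valid")
    start = int(np.argmin(totals))              # first layer of quietest window
    return start + span                         # last layer of that window

# Toy 2-D trajectory: 30-degree turns per layer, then near-constant direction.
angles = np.radians([0.0, 30.0, 60.0, 90.0, 92.0, 93.0])
toy_directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
quiet_layer = min_rotation_window_layer(toy_directions, window=3)
```

Unlike a first-crossing threshold rule, this criterion needs no threshold and always returns a layer, but it can select a later layer than the first point at which rotation stops. A head-to-head comparison would show whether that difference matters for probe precision.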

The work is aimed at people doing concept probing inside mechanistic interpretability. The scale of the experiment, the statistical tests, and the control for direction specificity are solid enough that a serious editor should send it out for review rather than desk-reject, even though the central selection rule would benefit from extra validation.

Referee Report

2 major / 2 minor

Summary. The paper introduces Geometric Evolution Maps (GEMs) to track directional trajectories of concepts through transformer residual streams. It identifies a handoff layer after the primary Concept Allocation Zone (CAZ) where directional rotation ceases and extracts probes from that settled direction. Across 23 architectures (70M–14B params) and 17 concept types, entry-to-exit cosine within CAZs averages 0.233. Ablations on 391 concept-model pairs show GEM probes match or exceed peak-layer (separation-score) probes in 268/391 cases (68.5%) and strictly outperform in 259/391 (66.2%), with architecture differences (MHA: 173/221 favor handoff; GQA: 56/119). Model-level Wilcoxon W=214, N=23, p=0.010 (one-sided). An adaptive ablation rule handles 79 near-final cases (improves 60/79). A direction-specificity control shows median 377× suppression vs. random directions.

Significance. If the central comparisons hold, the work supplies a geometrically motivated alternative to fixed-layer or peak-separation probing, with concrete scale (391 pairs), statistical reporting, and a direction-specificity control that strengthens the ablation interpretation. The reference implementation (rosetta_tools v1.3.1) is a positive contribution for reproducibility.

major comments (2)

[Abstract] The claim that the post-CAZ rotation-cessation handoff supplies the optimal stable probe direction rests on the reported superiority over peak-layer baselines, yet no head-to-head comparison is provided against plausible alternative stabilization criteria (e.g., layer of minimal rotation over a sliding window, layer maximizing held-out accuracy, or final-layer direction). This is load-bearing for the interpretation that the identified handoff is the right stable point rather than merely different from the separation-score peak.
[Abstract] The basic cessation criterion is supplemented by an “adaptive ablation width rule” that triggers on 79/391 near-final-layer cases and improves 60/79 (75.9%). The manuscript should specify how this rule is defined, whether its parameters were chosen post-hoc on the same data, and whether the 391-pair counts include or exclude these adjusted cases.

minor comments (2)

[Abstract] The abstract names a reference implementation and its DOI (rosetta_tools v1.3.1, doi:10.5281/zenodo.20361433) but provides no repository link or version-specific usage instructions in the visible text.
[Abstract] Clarify the precise operational definition of the “primary Concept Allocation Zone” and the exact stopping criterion for “rotation ceases” (e.g., cosine threshold, window size).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and add requested comparisons and clarifications.

Referee: [Abstract] The claim that the post-CAZ rotation-cessation handoff supplies the optimal stable probe direction rests on the reported superiority over peak-layer baselines, yet no head-to-head comparison is provided against plausible alternative stabilization criteria (e.g., layer of minimal rotation over a sliding window, layer maximizing held-out accuracy, or final-layer direction). This is load-bearing for the interpretation that the identified handoff is the right stable point rather than merely different from the separation-score peak.

Authors: We agree that additional head-to-head comparisons against alternative stabilization criteria would provide stronger support for interpreting the handoff layer as optimal rather than merely distinct. Our primary baseline was the peak separation-score layer because it is the most common practice in the literature; however, we will add direct comparisons to the final-layer direction and a sliding-window minimal-rotation criterion in the revised manuscript. These additions will clarify whether the handoff layer outperforms these alternatives on the 391-pair benchmark. revision: yes

Referee: [Abstract] The basic cessation criterion is supplemented by an “adaptive ablation width rule” that triggers on 79/391 near-final-layer cases and improves 60/79 (75.9%). The manuscript should specify how this rule is defined, whether its parameters were chosen post-hoc on the same data, and whether the 391-pair counts include or exclude these adjusted cases.

Authors: We will add a dedicated subsection in the Methods detailing the exact definition of the adaptive ablation width rule, the procedure used to set its parameters, and an explicit statement that the reported 391-pair statistics and improvement counts (60/79) incorporate the adjusted cases. We will also note that parameter selection involved preliminary analysis on the full dataset and discuss this as a limitation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons are direct and non-reductive

The paper defines GEMs via explicit geometric tracking of directional trajectories in residual streams, identifies the handoff layer by observed cessation of rotation after the primary CAZ, and validates via direct ablation counts (268/391 at-least-as-precise, 259/391 strictly better) against peak separation-score baselines across the 391 concept-model pairs. Entry-to-exit cosine (mean 0.233) and Wilcoxon results are raw measurements, not fitted parameters renamed as predictions. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the handoff rule; the adaptive ablation rule is an explicit post-hoc adjustment for 79 cases, not a definitional loop. All load-bearing claims reduce to observable data rather than construction from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach is built on empirical observation of directional change; the abstract introduces no explicit free parameters, background axioms, or new postulated entities.
