Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams
Pith reviewed 2026-06-29 22:59 UTC · model grok-4.3
The pith
Concept probes from transformer residual streams are more reliable when extracted at the handoff layer, the point after the primary Concept Allocation Zone where directional rotation stops.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GEMs track the full directional trajectory of a concept through residual stream activations, identify the handoff layer where rotation ceases after the primary Concept Allocation Zone, and extract the settled probe direction from that layer. The entry-to-exit cosine similarity within CAZs averages 0.233, showing that early directions are poor predictors of final ones. Ablation tests on 391 concept-model pairs confirm that handoff-layer probes are at least as precise as peak-layer probes in 268 cases and strictly better in 259.
What carries the argument
Geometric Evolution Maps (GEMs), which follow the directional trajectory of concept activations through residual streams to locate the post-CAZ handoff layer where rotation stabilizes.
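A minimal sketch of the trajectory-tracking idea may help fix the terms. It is not the rosetta_tools implementation. It assumes that a layer's concept direction is the difference of class means, and that "rotation ceases" means the cosine between consecutive layers' directions reaches a threshold. The names `concept_direction`, `find_handoff_layer`, `caz_exit`, and the 0.95 threshold are all hypothetical, since the text shown here does not give the paper's definitions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def concept_direction(pos_acts, neg_acts):
    """Difference of class means at one layer (an assumed estimator)."""
    mean_pos = [sum(col) / len(pos_acts) for col in zip(*pos_acts)]
    mean_neg = [sum(col) / len(neg_acts) for col in zip(*neg_acts)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def find_handoff_layer(directions, caz_exit, threshold=0.95):
    """First layer after caz_exit whose direction has cosine >= threshold
    with the previous layer's direction; falls back to the last layer.
    The threshold is an illustrative assumption, not the paper's value."""
    for layer in range(caz_exit + 1, len(directions)):
        if cosine(directions[layer - 1], directions[layer]) >= threshold:
            return layer
    return len(directions) - 1

# Toy per-layer directions: strong rotation in layers 0-2, then settling.
directions = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.05, 1.0], [0.04, 1.0]]
caz_entry, caz_exit = 0, 2  # hypothetical CAZ boundaries

entry_exit_cos = cosine(directions[caz_entry], directions[caz_exit])
handoff = find_handoff_layer(directions, caz_exit)
probe = directions[handoff]  # the settled direction used as the probe
```

In this toy trajectory the entry and exit directions are orthogonal, so a probe taken at CAZ entry would say nothing about the settled direction, which is the situation the 0.233 mean cosine points to.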
If this is right
- GEM probes match or exceed peak-layer performance in roughly two-thirds of tested concept-model pairs.
- The advantage appears more consistently in MHA architectures than in GQA ones.
- An adaptive ablation-width rule improves results in three-quarters of the near-final-layer cases it triggers.
- Control ablations show the effect is specific to the learned concept direction rather than generic directional suppression.
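The direction-specificity control in the last bullet can be illustrated with a small sketch. It assumes ablation means projecting a direction out of each activation, and it uses a toy "suppression" measure (the drop in class separation along the concept direction) in place of the paper's suppression rate, whose definition is not given here. The helper names and the toy data are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def ablate(x, direction):
    """Remove the component of x along `direction` (projection ablation)."""
    d = unit(direction)
    coeff = dot(x, d)
    return [a - coeff * b for a, b in zip(x, d)]

def separation(pos, neg, readout):
    """Absolute gap between class means along the readout direction."""
    d = unit(readout)
    mean_pos = sum(dot(x, d) for x in pos) / len(pos)
    mean_neg = sum(dot(x, d) for x in neg) / len(neg)
    return abs(mean_pos - mean_neg)

def suppression(pos, neg, readout, ablated_dir):
    """Drop in separation after ablating `ablated_dir` (toy metric)."""
    before = separation(pos, neg, readout)
    after = separation([ablate(x, ablated_dir) for x in pos],
                       [ablate(x, ablated_dir) for x in neg], readout)
    return before - after

def random_direction(dim, seed):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Toy activations: the classes differ only along the first coordinate.
dim = 8
concept = [1.0] + [0.0] * (dim - 1)
pos = [[1.0 + 0.1 * i] + [0.5] * (dim - 1) for i in range(4)]
neg = [[-1.0 - 0.1 * i] + [0.5] * (dim - 1) for i in range(4)]

concept_drop = suppression(pos, neg, concept, concept)
random_drops = [suppression(pos, neg, concept, random_direction(dim, s))
                for s in range(10)]
beats_all_seeds = all(concept_drop > r for r in random_drops)
```

Here ablating the concept direction removes the whole class gap, while a random direction removes only the fraction of the gap that happens to lie along it. Comparing against 10 seeds mirrors the "beat all 10 random seeds" check reported in the abstract.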
Where Pith is reading between the lines
- Interpretability pipelines could replace single-layer selection with trajectory-based stabilization checks as a default step.
- The same rotation-tracking logic might reveal analogous stabilization points in non-transformer sequence models.
- If handoff layers prove consistent across tasks, they could serve as a lightweight diagnostic for whether a model has finished assembling a given concept.
Load-bearing premise
The point where directional rotation stops after the primary Concept Allocation Zone supplies the single best stable direction for probing that concept.
What would settle it
A new collection of models and concepts in which handoff-layer probes show lower average precision than peak-layer probes would falsify the central claim.
Original abstract
Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common practice of probing at a fixed late layer or at the peak of a separation score function ignores a fundamental structural feature: concept representations undergo substantial directional rotation during their assembly phase, and do not settle into a stable direction until a characteristic handoff layer after the primary Concept Allocation Zone (CAZ). We introduce Geometric Evolution Maps (GEMs), which track the full directional trajectory of a concept through residual stream activations, identify the handoff layer where rotation ceases, and extract the settled probe direction from that layer. Across 23 architectures spanning 70M to 14B parameters and 17 concept types, the entry-to-exit cosine similarity within CAZs has a mean of 0.233, showing that probe direction at CAZ entry does not reliably predict probe direction at exit. Ablation experiments across 391 concept x model pairs (23 models x 17 concepts) show that GEM-extracted probes are at least as precise as peak-layer probes in 268/391 trials (68.5%), and strictly outperform in 259/391 (66.2%). The architecture split is pronounced: MHA models favour the handoff in 173/221 trials (78.3%); GQA models favour the handoff in only 56/119 trials (47.1%). Model-level Wilcoxon: W=214, N=23, p=0.010 (one-sided). An adaptive ablation width rule targets the 79/391 near-final-layer cases: it improves probe quality in 60/79 triggered cases (75.9%), mean gain +7.44pp. A direction-specificity control confirms the ablation effect is concept-direction specific: median 377x suppression rate versus random-direction ablation (99.1% of concept directions beat all 10 random seeds). Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).
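The counts in the abstract can be checked with simple arithmetic. The sketch below recomputes each reported percentage from its numerator and denominator and checks the 23 x 17 pair count. It uses only figures quoted above; the variable names are illustrative.

```python
# (numerator, denominator, percentage as reported in the abstract)
reported = {
    "GEM at least as precise": (268, 391, 68.5),
    "GEM strictly better": (259, 391, 66.2),
    "MHA favours handoff": (173, 221, 78.3),
    "GQA favours handoff": (56, 119, 47.1),
    "adaptive rule improves": (60, 79, 75.9),
}

def percent(num, den):
    """Percentage rounded to one decimal place."""
    return round(100.0 * num / den, 1)

# Any entry whose recomputed percentage differs from the reported one.
mismatches = {
    name: percent(num, den)
    for name, (num, den, pct) in reported.items()
    if abs(percent(num, den) - pct) > 1e-9
}

total_pairs = 23 * 17               # models x concepts
split_pairs = 221 + 119             # MHA + GQA denominators
unassigned = total_pairs - split_pairs
# Note: the MHA and GQA denominators cover 340 of the 391 pairs. The
# abstract does not say how the remaining pairs are classified.
```

All five percentages reproduce from their counts. The MHA and GQA denominators sum to 340, so 51 pairs (three models' worth at 17 concepts each) fall outside the architecture split as the abstract reports it.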
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Geometric Evolution Maps (GEMs) to track the directional trajectories of concepts through transformer residual streams. It identifies a handoff layer after the primary Concept Allocation Zone (CAZ) where directional rotation ceases, and extracts probes from that settled direction. Across 23 architectures (70M–14B parameters) and 17 concept types, the entry-to-exit cosine similarity within CAZs averages 0.233. Ablations on 391 concept-model pairs show GEM probes match or exceed peak-layer (separation-score) probes in 268/391 cases (68.5%) and strictly outperform them in 259/391 (66.2%), with a marked architecture split (MHA: 173/221 favor the handoff, 78.3%; GQA: 56/119, 47.1%). The model-level Wilcoxon test gives W=214, N=23, p=0.010 (one-sided). An adaptive ablation-width rule triggers on 79 near-final-layer cases and improves probe quality in 60 of them (75.9%). A direction-specificity control shows a median 377× suppression rate relative to random-direction ablation.
Significance. If the central comparisons hold, the work supplies a geometrically motivated alternative to fixed-layer or peak-separation probing, with concrete scale (391 pairs), statistical reporting, and a direction-specificity control that strengthens the ablation interpretation. The reference implementation (rosetta_tools v1.3.1) is a positive contribution for reproducibility.
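For readers unfamiliar with the model-level test, the following sketch computes a Wilcoxon signed-rank statistic from per-model paired differences. It assumes W is the sum of ranks of the positive differences, with zero differences dropped and tied magnitudes given average ranks. The abstract does not state which convention was used, and the example differences are made up for illustration.

```python
def signed_rank_statistic(diffs):
    """Return (W_plus, n): the sum of ranks of positive differences and
    the number of nonzero differences. Ties in |d| get average ranks."""
    nonzero = [d for d in diffs if d != 0]
    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute values starting at i.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2.0
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    return w_plus, len(ordered)

def max_statistic(n):
    """Largest possible W_plus: every difference positive."""
    return n * (n + 1) // 2

# Made-up per-model differences (handoff score minus peak-layer score).
example_diffs = [0.3, -0.1, 0.2, 0.0, -0.2, 0.5]
w_plus, n_used = signed_rank_statistic(example_diffs)
```

Under this convention the largest possible statistic for N=23 is 276, which gives a scale for the reported W=214. In practice a library routine such as `scipy.stats.wilcoxon` with a one-sided alternative would also supply the p-value.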
major comments (2)
- [Abstract] The claim that the post-CAZ rotation-cessation handoff supplies the optimal stable probe direction rests on the reported superiority over peak-layer baselines, yet no head-to-head comparison is provided against plausible alternative stabilization criteria (e.g., layer of minimal rotation over a sliding window, layer maximizing held-out accuracy, or final-layer direction). This is load-bearing for the interpretation that the identified handoff is the right stable point rather than merely different from the separation-score peak.
- [Abstract] The basic cessation criterion is supplemented by an “adaptive ablation width rule” that triggers on 79/391 near-final-layer cases and improves 60/79 (75.9%). The manuscript should specify how this rule is defined, whether its parameters were chosen post-hoc on the same data, and whether the 391-pair counts include or exclude these adjusted cases.
minor comments (2)
- [Abstract] The abstract cites a reference implementation by name, version, and DOI (rosetta_tools v1.3.1, doi:10.5281/zenodo.20361433) but gives no repository link or version-specific usage instructions in the visible text.
- [Abstract] Clarify the precise operational definition of the “primary Concept Allocation Zone” and the exact stopping criterion for “rotation ceases” (e.g., cosine threshold, window size).
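One of the alternative criteria named in the major comments, minimal rotation over a sliding window, could be operationalised roughly as below. This is a sketch of one possible reading, not a criterion taken from the paper. The window size, the use of angles between consecutive layers, and the function names are all assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def step_rotations(directions):
    """Angle in radians between each layer's direction and the next."""
    angles = []
    for current, following in zip(directions, directions[1:]):
        c = max(-1.0, min(1.0, cosine(current, following)))  # guard acos
        angles.append(math.acos(c))
    return angles

def min_rotation_layer(directions, window=2):
    """Start layer of the window of `window` consecutive steps with the
    smallest mean rotation, plus that mean rotation."""
    angles = step_rotations(directions)
    if len(angles) < window:
        raise ValueError("need at least window + 1 layers")
    best_start, best_mean = 0, math.inf
    for start in range(len(angles) - window + 1):
        mean_angle = sum(angles[start:start + window]) / window
        if mean_angle < best_mean:
            best_start, best_mean = start, mean_angle
    return best_start, best_mean

# Toy per-layer directions: rotation early, near-constant direction late.
toy_directions = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.02, 1.0], [0.03, 1.0]]
stable_layer, mean_rotation = min_rotation_layer(toy_directions, window=2)
```

A criterion of this kind makes the window size an explicit free parameter, which is the sort of detail the last minor comment asks the authors to state for their own stopping rule.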
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and add requested comparisons and clarifications.
- Referee: [Abstract] The claim that the post-CAZ rotation-cessation handoff supplies the optimal stable probe direction rests on the reported superiority over peak-layer baselines, yet no head-to-head comparison is provided against plausible alternative stabilization criteria (e.g., layer of minimal rotation over a sliding window, layer maximizing held-out accuracy, or final-layer direction). This is load-bearing for the interpretation that the identified handoff is the right stable point rather than merely different from the separation-score peak.
  Authors: We agree that additional head-to-head comparisons against alternative stabilization criteria would provide stronger support for interpreting the handoff layer as optimal rather than merely distinct. Our primary baseline was the peak separation-score layer because it is the most common practice in the literature; however, we will add direct comparisons to the final-layer direction and a sliding-window minimal-rotation criterion in the revised manuscript. These additions will clarify whether the handoff layer outperforms these alternatives on the 391-pair benchmark.
  Revision: yes
- Referee: [Abstract] The basic cessation criterion is supplemented by an “adaptive ablation width rule” that triggers on 79/391 near-final-layer cases and improves 60/79 (75.9%). The manuscript should specify how this rule is defined, whether its parameters were chosen post-hoc on the same data, and whether the 391-pair counts include or exclude these adjusted cases.
  Authors: We will add a dedicated subsection in the Methods detailing the exact definition of the adaptive ablation width rule, the procedure used to set its parameters, and an explicit statement that the reported 391-pair statistics and improvement counts (60/79) incorporate the adjusted cases. We will also note that parameter selection involved preliminary analysis on the full dataset and discuss this as a limitation.
  Revision: yes
Circularity Check
No circularity: empirical comparisons are direct and non-reductive
The paper defines GEMs via explicit geometric tracking of directional trajectories in residual streams, identifies the handoff layer by observed cessation of rotation after the primary CAZ, and validates via direct ablation counts (268/391 at least as precise, 259/391 strictly better) against peak separation-score baselines across the 391 concept-model pairs. Entry-to-exit cosine (mean 0.233) and Wilcoxon results are raw measurements, not fitted parameters renamed as predictions. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the handoff rule; the adaptive ablation rule is an explicit post-hoc adjustment for 79 cases, not a definitional loop. All load-bearing claims reduce to observable data rather than construction from inputs.