When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

Aydin Javadov

arxiv: 2606.13168 · v1 · pith:SEFREH5Znew · submitted 2026-06-11 · 💻 cs.LG

When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

Aydin Javadov This is my paper

Pith reviewed 2026-06-27 07:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords mechanistic interpretabilityroutingblock attention residualscausal probesdepth routingtransformer mechanismsattention residualsmodel interventions

0 comments

The pith

Architectural exposure of routing is necessary but not sufficient for mechanistic interpretation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether replacing fixed residuals with learned softmax over earlier layers in Block AttnRes makes cross-layer routing directly observable and therefore interpretable. It runs identical routing-ablation interventions on a baseline that forces a deterministic recency-bias schedule via inference wrapping and on a same-scale model trained from scratch with routing inside the optimization. Only the trained checkpoint develops three localized routing motifs across embedding, current-state, and older-history pathways. Even in that case, average routing mass fails to predict causal importance under intervention, with some high-mass sources showing no detectable effect. This matters because it means descriptive routing tensors cannot be read as mechanistic evidence without further causal checks.

Core claim

Block Attention Residuals replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing routing as an inspectable tensor. Under matched interventions the wrapped baseline reproduces its analytic content-independent schedule, while the trained checkpoint exhibits stratified routing motifs. In both sublayers the slice carrying the largest routing mass is not the slice with the largest causal contribution, and one source family carries appreciable mass yet produces no detectable causal role.

What carries the argument

Block Attention Residuals, which turn fixed residuals into a learned softmax over prior depth sources so that routing becomes an explicit tensor in the forward pass.

If this is right

Structured depth routing motifs appear only when routing participates in training rather than being imposed at inference.
Routing mass and causal importance dissociate, so the largest mass slice need not be the largest causal contributor.
Some source families carry routing mass yet show no measurable causal role when ablated.
Descriptive summaries of exposed routing tensors remain candidate hypotheses until confirmed by causal intervention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interpretability pipelines that expose internal flows should treat observational summaries as starting points for causal experiments rather than endpoints.
The same mass-versus-causality gap may appear in other exposed transformer components once they are made directly readable.
Training curricula could be modified to encourage alignment between routing mass and causal impact.
The same probe design could be applied to other model families to test whether training is required for any interpretable routing structure to emerge.

Load-bearing premise

The deterministic recency-bias schedule wrapped around the vanilla Qwen3 serves as a valid content-independent routing baseline that can be directly contrasted with the trained checkpoint under the same interventions.

What would settle it

Finding that the trained Block AttnRes checkpoint exhibits no localized motifs or that routing mass perfectly tracks causal importance under the same ablation interventions used on the baseline.

Figures

Figures reproduced from arXiv: 2606.13168 by Aydin Javadov.

**Figure 1.** Figure 1: Source-family routing ablations: baseline (recency-biaswrapped vanilla Qwen3) vs. Block AttnRes trained from scratch, on the same probe and the same data. Three families produce strong AttnRes-side effects with negligible baseline counterparts; a fourth carries routing mass but no causal weight. the recency-bias schedule, and an AttnRes model that is a Block AttnRes Qwen3 trained from scratch with the rou… view at source ↗

**Figure 2.** Figure 2: ). The baseline has only one non-trivial row; its split between attention and MLP for current reflects which sublayer the recency-bias-dominated routing is most fragile to perturbation in, not learned structure. The AttnRes checkpoint differentiates each row: embedding matters almost entirely through MLP (drop 6.38 vs. 0.13); current reverses the baseline order, with attention (2.44) exceeding MLP (1.25)… view at source ↗

**Figure 3.** Figure 3: Layer-binned sublayer ablations on the AttnRes checkpoint as a heatmap. A small number of localised cells carry the primary causal signal. AttnRes does not, by construction, render cross-layer information flow legible; it provides a substrate on which a model that has been trained to use depth diversity can express that usage in a directly intervenable form. Three localised motifs. Of the 24 (family, sub… view at source ↗

**Figure 4.** Figure 4: Routing-mass composition by sublayer. The baseline allocates ≈ 0.84 to current in both sublayers; the AttnRes attention sublayer is broadly distributed with the largest slice on nonlocal, while the AttnRes MLP sublayer is current-heavy with non-trivial embedding and prev completed slices. E. Full Layer-Binned Ablation Tables 8 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Block Attention Residuals (Block AttnRes) by replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow normally inferred indirectly is now directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale ($0.6$B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 inference-wrapped through a deterministic recency-bias schedule that the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline's routing weights are content-independent and reproduce the schedule's analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the largest mass slice is not the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows routing visibility alone doesn't yield mechanism without causal checks, and structured patterns appear only after training, but the baseline has a training-history confound that weakens the contrast.

read the letter

The central observation is worth attention: exposing routing as an inspectable tensor does not automatically deliver interpretable mechanisms. In the trained Block AttnRes model, interventions reveal three localized motifs—an embedding pathway, a current-state pathway, and an older-history pathway—while the wrapped baseline does not. They also report a clean dissociation where the slice with highest routing mass is not the one with largest causal effect under ablation, and one source family carries mass but shows no detectable causal role.

What the work does cleanly is apply identical routing ablations to both checkpoints and document the qualitative split. The claim that descriptive summaries need causal follow-up follows directly from the mass-versus-effect mismatch, and the training-dependent emergence of stratification is a concrete empirical result relative to the cited prior work.

The soft spot sits in the baseline construction. The comparison pits a pre-trained vanilla Qwen3 with an inference wrapper against a Block AttnRes model trained from scratch. These differ in initialization, full training history, and base representations, not only in whether routing parameters were optimized. The abstract states the wrapper reproduces the analytic schedule and stays content-independent, yet without a control that holds the underlying model fixed while toggling only the routing optimization, the motifs and dissociation cannot be attributed solely to training the routing tensor. That is a fixable design issue rather than a fatal one, but it sits at the center of the argument.

This is aimed at interpretability researchers who already work on routing or cross-layer information flow. The empirical contrast is sharp enough to merit referee time once the baseline controls are tightened; the interventions themselves look reproducible from the description. I would send it to review.

Referee Report

1 major / 2 minor

Summary. The paper introduces Block Attention Residuals (Block AttnRes) to replace fixed additive residuals with a learned softmax over earlier depth-source representations, exposing cross-layer routing as an inspectable tensor. It performs controlled routing-ablation interventions on two same-scale (0.6B) models: an inference-wrapped vanilla Qwen3 using a deterministic recency-bias schedule (content-independent baseline) and a Block AttnRes Qwen3 trained from scratch with routing optimized. The trained checkpoint exhibits three localized routing motifs (embedding-source via early MLP, current-state via early attention/MLP, older-history via late attention) and a dissociation between average routing mass and causal importance under intervention, while the baseline does not. The conclusion is that architectural exposure of routing is necessary but not sufficient for mechanistic interpretation; structured routing emerges only when optimized during training, and descriptive summaries require causal testing.

Significance. If the result holds after addressing the baseline confound, the work is significant for mechanistic interpretability research. It supplies concrete evidence that exposing routing tensors does not automatically yield interpretable mechanisms, that training is required for structured motifs to appear, and that mass-based summaries must be validated causally rather than taken as evidence. The emphasis on intervention-based testing over description is a useful methodological contribution that could temper over-interpretation of routing visualizations in transformer variants.

major comments (1)

[Abstract] Abstract and introduction: The central claim that 'structured depth routing emerges only when routing has been part of training' rests on the contrast between the inference-wrapped pre-trained vanilla Qwen3 and the from-scratch trained Block AttnRes checkpoint. These differ simultaneously in (a) whether routing parameters were optimized and (b) the full pre-training history and initialization of the base representations. No control is described that holds the base model fixed while varying only the routing optimization, so the observed stratification and mass-causal dissociation cannot be attributed solely to the presence of trained routing. A same-base-model ablation (e.g., fine-tuning the vanilla checkpoint with routing parameters) would be required to isolate the factor.

minor comments (2)

The abstract and methods description omit details on statistical tests used to establish qualitative differences between the two models, exact implementation of the routing-ablation interventions, and any exclusion criteria for source families.
Notation for the learned softmax in Block AttnRes and the precise form of the deterministic recency-bias schedule should be stated explicitly (e.g., as an equation) rather than described only in prose.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The identified confound between routing optimization and pre-training history is a valid concern that limits causal attribution. We will revise the abstract, introduction, and discussion to qualify the central claim, explicitly note the limitation, and frame the results as evidence that structured motifs appear under joint optimization of routing parameters rather than under a fixed schedule alone.

read point-by-point responses

Referee: [Abstract] Abstract and introduction: The central claim that 'structured depth routing emerges only when routing has been part of training' rests on the contrast between the inference-wrapped pre-trained vanilla Qwen3 and the from-scratch trained Block AttnRes checkpoint. These differ simultaneously in (a) whether routing parameters were optimized and (b) the full pre-training history and initialization of the base representations. No control is described that holds the base model fixed while varying only the routing optimization, so the observed stratification and mass-causal dissociation cannot be attributed solely to the presence of trained routing. A same-base-model ablation (e.g., fine-tuning the vanilla checkpoint with routing parameters) would be required to isolate the factor.

Authors: We agree that the current design confounds the presence of optimized routing parameters with differences in base-model pre-training and initialization. The baseline was selected because it supplies a deterministic, content-independent routing schedule that the Qwen3 codebase already supports as a loading path, thereby providing a non-learned reference on the original pre-trained representations. Nevertheless, this does not isolate the effect of routing optimization from the training history of the representations themselves. A same-base-model control (fine-tuning the vanilla checkpoint after inserting routing parameters) would be the cleanest isolation. We will therefore revise the abstract and introduction to replace the stronger phrasing with a qualified statement: structured routing motifs and the mass-causal dissociation appear when routing is optimized end-to-end, in contrast to a fixed schedule on pre-trained representations. We will also add an explicit limitations paragraph acknowledging that training history may contribute to the observed differences. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical interventions

full rationale

The paper presents no derivation chain involving equations, fitted parameters renamed as predictions, or self-citations that bear the central load. Its claims derive from direct comparisons of routing-ablation interventions on two checkpoints (wrapped vanilla Qwen3 vs. trained Block AttnRes), with routing tensors observed and causally tested in the forward pass. These contrasts are externally falsifiable via the reported interventions and do not reduce to self-definitional quantities or ansatzes imported from prior author work. The analysis is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the validity of the baseline equivalence and the assumption that the reported interventions isolate routing effects without confounding factors from training differences.

axioms (1)

domain assumption The wrapped baseline's routing weights are content-independent and reproduce the analytic schedule prediction.
This is invoked to establish that structured routing requires training rather than being an artifact of the architecture alone.

pith-pipeline@v0.9.1-grok · 5803 in / 1187 out tokens · 36393 ms · 2026-06-27T07:07:43.519896+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 3 canonical work pages · 1 internal anchor

[1]

On Layer Normalization in the Transformer Architecture , journal =

Ruibin Xiong and Yunchang Yang and Di He and Kai Zheng and Shuxin Zheng and Chen Xing and Huishuai Zhang and Yanyan Lan and Liwei Wang and Tie. On Layer Normalization in the Transformer Architecture , journal =. 2020 , url =. 2002.04745 , timestamp =

work page arXiv 2020
[2]

arXiv preprint arXiv:2407.02646 , year =

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models , author =. arXiv preprint arXiv:2407.02646 , year =

work page arXiv
[3]

Proceedings of the 37th International Conference on Machine Learning (ICML) , year =

On Layer Normalization in the Transformer Architecture , author =. Proceedings of the 37th International Conference on Machine Learning (ICML) , year =
[4]

Locating and Editing Factual Associations in

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle =. Locating and Editing Factual Associations in
[5]

2026 , url=

Attention Residuals , author=. 2026 , url=

2026
[6]

Deep Residual Learning for Image Recognition

Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun , title =. CoRR , volume =. 2015 , url =. 1512.03385 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2015
[7]

ArXiv , year=

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm , author=. ArXiv , year=
[8]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[9]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[10]

M. J. Kearns , title =
[11]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[12]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[13]

Suppressed for Anonymity , author=
[14]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[15]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959

[1] [1]

On Layer Normalization in the Transformer Architecture , journal =

Ruibin Xiong and Yunchang Yang and Di He and Kai Zheng and Shuxin Zheng and Chen Xing and Huishuai Zhang and Yanyan Lan and Liwei Wang and Tie. On Layer Normalization in the Transformer Architecture , journal =. 2020 , url =. 2002.04745 , timestamp =

work page arXiv 2020

[2] [2]

arXiv preprint arXiv:2407.02646 , year =

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models , author =. arXiv preprint arXiv:2407.02646 , year =

work page arXiv

[3] [3]

Proceedings of the 37th International Conference on Machine Learning (ICML) , year =

On Layer Normalization in the Transformer Architecture , author =. Proceedings of the 37th International Conference on Machine Learning (ICML) , year =

[4] [4]

Locating and Editing Factual Associations in

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle =. Locating and Editing Factual Associations in

[5] [5]

2026 , url=

Attention Residuals , author=. 2026 , url=

2026

[6] [6]

Deep Residual Learning for Image Recognition

Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun , title =. CoRR , volume =. 2015 , url =. 1512.03385 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2015

[7] [7]

ArXiv , year=

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm , author=. ArXiv , year=

[8] [8]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000

[9] [9]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980

[10] [10]

M. J. Kearns , title =

[11] [11]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983

[12] [12]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000

[13] [13]

Suppressed for Anonymity , author=

[14] [14]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981

[15] [15]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959