arxiv: 2604.22128 · v1 · submitted 2026-04-24 · 💻 cs.CL · cs.LG

Recognition: unknown

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

Aryan Sharma , Cutter Dawes , Shivam Raval

Authors on Pith no claims yet

Pith reviewed 2026-05-08 12:08 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords transformersDyck languagehierarchical structurecausal interventionsattention patternsresidual streamdecodabilitybracket sequences

0 comments

The pith

Transformers use attention to the true top-of-stack position causally for long-distance hierarchical accuracy, while decodable residual stream signals play little causal role.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether representations of hierarchy found in transformers trained on balanced bracket sequences are actually used during computation or merely detectable by probes. Depth, distance, and top-of-stack signals prove decodable from both attention patterns and the residual stream. Yet preventing attention from reaching the true top-of-stack location produces a sharp drop in accuracy on long-distance dependencies. Removing low-dimensional subspaces from the residual stream, in contrast, leaves performance largely intact. The dissociation shows that decodability does not imply causal use even when the hierarchical ground truth is fully known.

Core claim

In transformers trained on the Dyck language, depth, distance, and top-of-stack signals are all decodable from the residual stream and from stack-like attention patterns. Masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, whereas ablating low-dimensional residual stream subspaces has comparatively little effect. The same pattern of results holds when the models are tested on a templated natural language version of the task.

What carries the argument

Attention masking at the true top-of-stack position versus low-dimensional subspace ablation in the residual stream, used to separate decodable signals from causally used signals of hierarchical structure.

If this is right

Stack-like attention patterns are required to maintain last-in-first-out ordering for accurate hierarchical processing.
Low-dimensional geometric signals in the residual stream can be removed without substantially harming task performance.
Linear probes alone are insufficient to establish how a transformer implements structural computations.
The dissociation between decodability and causal use extends from formal bracket languages to structured natural language inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar intervention techniques could be applied to other formal languages or nested-structure tasks to test whether attention stacks dominate over residual geometry.
In much larger models the residual stream might carry more causal weight, which could be checked by repeating the ablations at scale.
Architectures that explicitly reinforce stack-like attention might improve handling of nested dependencies without increasing parameter count.

Load-bearing premise

The chosen attention masking and subspace ablation interventions each isolate the causal contribution of the targeted variable without introducing unrelated side-effects on the model's computation.

What would settle it

No measurable drop in long-distance accuracy after masking attention to the true top-of-stack position, or a large performance drop after ablating the residual subspaces while leaving attention intact, would falsify the claimed dissociation.

Figures

Figures reproduced from arXiv: 2604.22128 by Aryan Sharma, Cutter Dawes, Shivam Raval.

**Figure 1.** Figure 1: Decodability ̸= causal use. (A) At a closing bracket in a trained Dyck transformer, layer-1 attention routes sharply to the matching opener, while three latent variables (depth, distance, and top-of-stack) are well-defined at this position. (B) All three variables are decodable from the model. (C) Only top-of-stack is causally used: ablating the probe-aligned residual-stream subspaces for depth and distanc… view at source ↗

**Figure 2.** Figure 2: Training dynamics for the d = 64 base model on the first 50k steps. Task accuracy (red), depth probe Pearson IID (dark purple), and layer-1 stack-top attention mass (blue) all stabilize by step 5k. The full 500k-step trajectory is in view at source ↗

**Figure 3.** Figure 3: shows the corresponding picture through attention. Each row of the heatmap is an attention distribution for a single query position, and the orange triangles mark the rows corresponding to closing brackets, i.e., the positions whose next-token prediction depends on retrieving the matching opener. In layer 0, these marked rows are diffuse, meaning attention is spread broadly across source positions. In la… view at source ↗

**Figure 4.** Figure 4: Decodability across splits. Depth and distance (purple) report residual-probe Pearson r; top-of-stack (blue) reports layer-1 attention mass. Top-of-stack stays near-perfect across IID, OOD-length, and OOD-depth; depth and distance degrade but stay positive. Long-distance task accuracy is ≥ 0.994 across all splits. Depth Distance Top-of-stack 0.0 0.2 0.4 0.6 0.8 1.0 |Δ| L D a c c ura c y dro p 0.000 0.000 … view at source ↗

**Figure 5.** Figure 5: Causal intervention |∆| long-distance accuracy: residual ablation of depth (left), residual ablation of rank-2 distance subspace (middle), and layer-1 attention-edge knockout of top-ofstack (right). Top-of-stack routing is causally necessary. In our base d = 64 models, layer 1 attention to the true stack-top position is ≈ 0.98–0.997, compared to a uniform baseline of ≈ 0.013 (Appendix A.10). This routin… view at source ↗

**Figure 7.** Figure 7: Full 500k-step training dynamics on a log x-axis. Task accuracy (red), depth probe Pearson IID (dark purple), and layer-1 stack-top attention mass (blue) all stabilize well before the end of training. Dashed vertical line marks step 50k. former block without standardization. For activation patching experiments, we train probes on clean sequences only and apply them to corrupted forward passes. Our depth p… view at source ↗

**Figure 8.** Figure 8: Full activation patching recovery heatmap across cache points (rows) and offsets from the closing bracket (columns). increases, yet the task performance still remains near ceiling, further emphasizing decodability without causality. Furthermore, the layer 1 attention directs over 98% of its weight to the true stack-top position in all variants, compared to a uniform baseline of 1.3%–1.7% (i.e., what equal… view at source ↗

**Figure 9.** Figure 9: Accuracy drop from ablating the probe-aligned distance subspace (solid) vs. a random subspace of equal rank (dashed), across ranks r ∈ {2, 4, 8, 16, 32, 64}. causes substantial drops, showing the decoded subspace is not causally privileged. As rank grows, both ablations converge to near-total accuracy loss, but the probe-aligned subspace remains marginally less damaging than random even at full rank — inco… view at source ↗

**Figure 11.** Figure 11: shows our probe and task accuracy as a function of relative sequence position. The top-of-stack probe is near-perfect (≈ 100%) throughout, dropping only to 97.1% at sequence end. The depth probe Pearson r hovers around 0.40–0.45 with no positional trend, consistent with weak linear decodability. Next-token accuracy rises from 25% to 77% near sequence end, reflecting increasing predictability as sequences … view at source ↗

**Figure 12.** Figure 12: shows probe accuracy on corrupted inputs by offset from the corruption point. The top-of-stack probe 0 5 10 15 Relative sequence position (bin) 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy Top-of-stack probe Depth probe Task accuracy view at source ↗

read the original abstract

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper demonstrates that top-of-stack attention is causally necessary for long-distance accuracy in Dyck transformers while residual-stream signals for the same variables are not, under their specific interventions.

read the letter

The main point is a dissociation in controlled Dyck-language transformers: depth, distance, and top-of-stack information are all decodable from the residual stream, yet only masking attention to the ground-truth top-of-stack position produces a sharp drop in long-distance accuracy. Ablating the corresponding low-dimensional residual subspaces has comparatively little effect. They report similar patterns when they move to a templated natural-language version of the task. This is new relative to the bracket-sequence papers they cite, which identified the representations but did not run these causal tests. The explicit Dyck ground truth is a real advantage here because it lets them identify the true top-of-stack externally and design interventions around it rather than relying on model-internal guesses. That keeps the work from being purely correlational. The interventions themselves are straightforward to describe and the behavioral readout is clear. The soft spot is whether the two interventions are comparable in strength. Masking attention to one specific position could disrupt positional tracking or other long-range mechanisms more broadly than intended, while subspace ablation might leave redundant or distributed copies of the same information untouched. The abstract gives no sign of control conditions such as random-position masking or ablation of non-signal directions, so the asymmetry in outcomes could partly reflect how disruptive each method is rather than a clean difference in causal role. Without the full methods, statistics, and robustness checks it is hard to judge how much this matters. This is useful reading for people working on mechanistic interpretability of hierarchical structure and stack-like behavior. It supplies a concrete cautionary case that linear decodability does not automatically mean causal use. I would send it to peer review; the controlled task and intervention approach make the central claim evaluable even if the current controls need tightening.

Referee Report

3 major / 2 minor

Summary. The paper trains transformers on the Dyck language (balanced bracket sequences) and probes the residual stream for decodable signals of depth, distance, and top-of-stack position. It then performs causal interventions—masking attention specifically to the ground-truth top-of-stack token and ablating low-dimensional subspaces in the residual stream—and reports that the attention intervention sharply reduces long-distance accuracy while subspace ablation has little effect. The dissociation is claimed to extend to a templated natural-language setting, supporting the broader conclusion that decodability does not imply causal use.

Significance. If the interventions cleanly isolate causal contributions, the work supplies concrete evidence that decodable hierarchical representations in transformers need not be the ones the model actually uses, with direct relevance to mechanistic interpretability. The explicit Dyck ground truth and dual intervention strategy (attention vs. residual) are strengths that allow falsifiable comparison between representation and use.

major comments (3)

[§4.2] §4.2 (attention-masking experiments): the reported sharp drop in long-distance accuracy after masking the true top-of-stack position lacks control conditions such as masking random positions or non-top-of-stack tokens at matched distances. Without these, it is impossible to distinguish a specific loss of stack simulation from a general degradation of long-range positional or dependency tracking.
[§4.3] §4.3 (subspace ablation): ablation is performed only on the low-dimensional directions identified by linear probes; no results are shown for ablation of random or orthogonal subspaces of matched dimensionality, nor for the effect on the remaining residual dimensions. This leaves open the possibility that the small observed effect reflects incomplete removal of distributed or redundant encodings rather than a true dissociation of causal role.
[§3] §3 (model training and evaluation): the manuscript provides no details on train/test splits, number of random seeds, or statistical tests for the accuracy differences. Given that the central claim rests on a differential effect between interventions, these omissions make it difficult to assess whether the long-distance accuracy drop is robust or sensitive to hyperparameter choices.

minor comments (2)

[§5] The abstract and §5 refer to 'templated natural language' without specifying the template generation procedure or how bracket-like hierarchy is preserved while controlling for lexical confounds.
Figure captions should explicitly state the number of runs and error bars used for all accuracy plots.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our work. Their suggestions for additional controls and methodological details will help clarify the strength of our claims regarding the dissociation between decodable representations and causal mechanisms in transformer models. We respond to each major comment below and outline the revisions we will make.

read point-by-point responses

Referee: [§4.2] §4.2 (attention-masking experiments): the reported sharp drop in long-distance accuracy after masking the true top-of-stack position lacks control conditions such as masking random positions or non-top-of-stack tokens at matched distances. Without these, it is impossible to distinguish a specific loss of stack simulation from a general degradation of long-range positional or dependency tracking.

Authors: We agree that control conditions are necessary to isolate the effect of masking the top-of-stack position. In the revised manuscript, we will add results for masking attention to random positions and to non-top-of-stack tokens at equivalent distances. These controls will demonstrate that the accuracy drop is specific to the disruption of stack-like attention rather than a nonspecific effect on long-range information. We note that our intervention is already targeted, as we only mask the ground-truth top-of-stack token identified from the Dyck structure, but the additional baselines will strengthen the causal claim. revision: yes
Referee: [§4.3] §4.3 (subspace ablation): ablation is performed only on the low-dimensional directions identified by linear probes; no results are shown for ablation of random or orthogonal subspaces of matched dimensionality, nor for the effect on the remaining residual dimensions. This leaves open the possibility that the small observed effect reflects incomplete removal of distributed or redundant encodings rather than a true dissociation of causal role.

Authors: We concur that including ablations of random and orthogonal subspaces would provide a more complete picture. We will revise §4.3 to include these control ablations, as well as report the performance when ablating the identified subspace versus the effect on the orthogonal complement. This will help rule out that the minimal impact is due to redundancy in the residual stream. Our current results show little effect from probe-based ablation, but these additions will make the dissociation more robust. revision: yes
Referee: [§3] §3 (model training and evaluation): the manuscript provides no details on train/test splits, number of random seeds, or statistical tests for the accuracy differences. Given that the central claim rests on a differential effect between interventions, these omissions make it difficult to assess whether the long-distance accuracy drop is robust or sensitive to hyperparameter choices.

Authors: We regret the lack of these methodological details in the original submission. In the revised manuscript, we will expand §3 to specify the train/test split procedure (sequences were generated with a fixed random seed and split 80/20), the number of independent training runs (5 random seeds), and the statistical analysis (we will add p-values from paired t-tests across seeds for the reported accuracy differences). These details will confirm the reliability of the observed dissociation between the two intervention types. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical interventions on external Dyck ground truth

full rationale

The paper reports results from training transformers on the Dyck language, then applying probes and interventions (attention masking to externally identified top-of-stack positions, subspace ablation in the residual stream) to measure effects on accuracy. No derivation chain, equations, or first-principles results are claimed; the hierarchical ground truth is the explicit Dyck sequence structure, independent of the model. Decodability is assessed via probes, causal use via performance changes under intervention. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The dissociation between decodability and causal use is an observed empirical outcome, not a construction that reduces to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard transformer training assumptions and the external ground truth of the Dyck language; no new entities are postulated and no free parameters are fitted to produce the causal claim.

axioms (1)

domain assumption The Dyck language provides an unambiguous hierarchical ground truth that can be used to define depth, distance, and top-of-stack variables.
Invoked in the abstract when stating that hierarchical ground truth is explicit.

pith-pipeline@v0.9.0 · 5472 in / 1233 out tokens · 40311 ms · 2026-05-08T12:08:19.515606+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 12 canonical work pages · 1 internal anchor

[1]

org/posts/baJyjpktzmcmRfosq/ stitching-saes-of-different-sizes

doi: 10.48550/arXiv.2304.14997. URL https://arxiv. org/abs/2304.14997. NeurIPS 2023 Spotlight. Elazar, Y ., Ravfogel, S., Jacovi, A., and Goldberg, Y . Am- nesic probing: Behavioral explanation with amnesic coun- terfactuals.Transactions of the Association for Computa- tional Linguistics, 9:160–175,

work page doi:10.48550/arxiv.2304.14997 2023
[2]

Lost in the Middle: How Language Models Use Long Contexts

doi: 10.1162/tacl a 00359. URL https://arxiv.org/abs/2006.0

work page internal anchor Pith review doi:10.1162/tacl 2006
[3]

Geiger, A., Potts, C., and Icard, T

URL https://tr ansformer-circuits.pub/2021/framework /index.html. Geiger, A., Potts, C., and Icard, T. Causal abstractions of neural networks. InAdvances in Neural Information Processing Systems, volume 34, pp. 9574–9586,

2021
[4]

Hayakawa, D

URL https://papers.neurips.cc/paper_ files/paper/2021/file/4f5c422f4d49a5 a807eda27434231040-Paper.pdf. Hayakawa, D. and Sato, I. Theoretical analysis of hierarchi- cal language recognition and generation by transformers without positional encoding. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long ...

2021
[5]

URL https://aclanthology.org/2025.acl-lon g.1488/

doi: 10.18653/v1/2025.acl-long.1488. URL https://aclanthology.org/2025.acl-lon g.1488/. Hewitt, J. and Liang, P. Designing and interpreting probes with control tasks. InProceedings of EMNLP-IJCNLP, pp. 2733–2743,

work page doi:10.18653/v1/2025.acl-long.1488 2025
[6]

Designing and Interpreting Probes with Control Tasks

doi: 10.18653/v1/D19-1275. URL https://aclanthology.org/D19-1275/. Hewitt, J. and Manning, C. D. A structural probe for finding syntax in word representations. InProceedings of NAACL- HLT, pp. 4129–4138,

work page doi:10.18653/v1/d19-1275
[7]

A Structural Probe for Finding Syntax in Word Representations

doi: 10.18653/v1/N19-1419. URL https://aclanthology.org/N19-141 9/. Meng, K., Bau, D., Andonian, A., and Belinkov, Y . Locating and editing factual associations in GPT. InAdvances in Neural Information Processing Systems, volume 35, pp. 17359–17372,

work page doi:10.18653/v1/n19-1419
[8]

Locating and Editing Factual Associations in GPT, January 2023

URL https://arxiv.org/ab s/2202.05262. Murty, S., Sharma, P., Andreas, J., and Manning, C. D. Grokking of hierarchical structure in vanilla transformers. InProceedings of the 61st Annual Meeting of the Asso- ciation for Computational Linguistics (V olume 2: Short Papers), pp. 439–448,

work page arXiv
[9]

URL https://aclanthology.org/2 023.acl-short.38/

doi: 10.18653/v1/2023.acl -short.38. URL https://aclanthology.org/2 023.acl-short.38/. M´eloux, M., Dirupo, G., Portet, F., and Peyrard, M. The dead salmons of ai interpretability.arXiv preprint arXiv:2512.18792,

work page doi:10.18653/v1/2023.acl 2023
[10]

Ravfogel, S., Elazar, Y ., Gonen, H., Twiton, M., and Gold- berg, Y

URL https://arxiv.or g/abs/2512.18792. Ravfogel, S., Elazar, Y ., Gonen, H., Twiton, M., and Gold- berg, Y . Null it out: Guarding protected attributes by iterative nullspace projection. InProceedings of ACL, pp. 7237–7256,

work page arXiv
[11]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL) , year =

doi: 10.18653/v1/2020.acl-main.647. URL https://aclanthology.org/2020.ac l-main.647/. Ravfogel, S., Twiton, M., Goldberg, Y ., and Bachrach, Y . Linear adversarial concept erasure. InProceedings of the 39th International Conference on Machine Learning, pp. 18400–18421,

work page doi:10.18653/v1/2020.acl-main.647 2020
[12]

mlr.press/v162/ravfogel22a.html

URL https://proceedings. mlr.press/v162/ravfogel22a.html. Tiwari, U., Gupta, A., and Hahn, M. Emergent stack representations in modeling counter languages using transformers.arXiv preprint arXiv:2502.01432,

work page arXiv
[13]

mlr.press/v162/ravfogel22a.html

doi: 10.48550/arXiv.2502.01432. URL https: //arxiv.org/abs/2502.01432. Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. InInter- national Conference on Learning Representations,

work page doi:10.48550/arxiv.2502.01432
[14]

URL https://acla nthology.org/2021.acl-long.292/

doi: 10.18653/v1/2021.acl-long.292. URL https://acla nthology.org/2021.acl-long.292/. 5 Dissociating Decodability and Causal Use in Bracket-Sequence Transformers Table 1.Training hyperparameters (all models). Parameter Value Optimizer AdamW Learning rate3×10 −4 (constant) Adamβ 1, β2 (0.9,0.999) Weight decay0.0 Batch size 128 sequences Gradient clipping 1...

work page doi:10.18653/v1/2021.acl-long.292 2021