Recognition: unknown
Dissociating Decodability and Causal Use in Bracket-Sequence Transformers
Pith reviewed 2026-05-08 12:08 UTC · model grok-4.3
The pith
Transformers use attention to the true top-of-stack position causally for long-distance hierarchical accuracy, while decodable residual stream signals play little causal role.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In transformers trained on the Dyck language, depth, distance, and top-of-stack signals are all decodable from the residual stream and from stack-like attention patterns. Masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, whereas ablating low-dimensional residual stream subspaces has comparatively little effect. The same pattern of results holds when the models are tested on a templated natural language version of the task.
What carries the argument
Attention masking at the true top-of-stack position versus low-dimensional subspace ablation in the residual stream, used to separate decodable signals from causally used signals of hierarchical structure.
If this is right
- Stack-like attention patterns are required to maintain last-in-first-out ordering for accurate hierarchical processing.
- Low-dimensional geometric signals in the residual stream can be removed without substantially harming task performance.
- Linear probes alone are insufficient to establish how a transformer implements structural computations.
- The dissociation between decodability and causal use extends from formal bracket languages to structured natural language inputs.
Where Pith is reading between the lines
- Similar intervention techniques could be applied to other formal languages or nested-structure tasks to test whether attention stacks dominate over residual geometry.
- In much larger models the residual stream might carry more causal weight, which could be checked by repeating the ablations at scale.
- Architectures that explicitly reinforce stack-like attention might improve handling of nested dependencies without increasing parameter count.
Load-bearing premise
The chosen attention masking and subspace ablation interventions each isolate the causal contribution of the targeted variable without introducing unrelated side-effects on the model's computation.
What would settle it
No measurable drop in long-distance accuracy after masking attention to the true top-of-stack position, or a large performance drop after ablating the residual subspaces while leaving attention intact, would falsify the claimed dissociation.
Figures
read the original abstract
When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper trains transformers on the Dyck language (balanced bracket sequences) and probes the residual stream for decodable signals of depth, distance, and top-of-stack position. It then performs causal interventions—masking attention specifically to the ground-truth top-of-stack token and ablating low-dimensional subspaces in the residual stream—and reports that the attention intervention sharply reduces long-distance accuracy while subspace ablation has little effect. The dissociation is claimed to extend to a templated natural-language setting, supporting the broader conclusion that decodability does not imply causal use.
Significance. If the interventions cleanly isolate causal contributions, the work supplies concrete evidence that decodable hierarchical representations in transformers need not be the ones the model actually uses, with direct relevance to mechanistic interpretability. The explicit Dyck ground truth and dual intervention strategy (attention vs. residual) are strengths that allow falsifiable comparison between representation and use.
major comments (3)
- [§4.2] §4.2 (attention-masking experiments): the reported sharp drop in long-distance accuracy after masking the true top-of-stack position lacks control conditions such as masking random positions or non-top-of-stack tokens at matched distances. Without these, it is impossible to distinguish a specific loss of stack simulation from a general degradation of long-range positional or dependency tracking.
- [§4.3] §4.3 (subspace ablation): ablation is performed only on the low-dimensional directions identified by linear probes; no results are shown for ablation of random or orthogonal subspaces of matched dimensionality, nor for the effect on the remaining residual dimensions. This leaves open the possibility that the small observed effect reflects incomplete removal of distributed or redundant encodings rather than a true dissociation of causal role.
- [§3] §3 (model training and evaluation): the manuscript provides no details on train/test splits, number of random seeds, or statistical tests for the accuracy differences. Given that the central claim rests on a differential effect between interventions, these omissions make it difficult to assess whether the long-distance accuracy drop is robust or sensitive to hyperparameter choices.
minor comments (2)
- [§5] The abstract and §5 refer to 'templated natural language' without specifying the template generation procedure or how bracket-like hierarchy is preserved while controlling for lexical confounds.
- Figure captions should explicitly state the number of runs and error bars used for all accuracy plots.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work. Their suggestions for additional controls and methodological details will help clarify the strength of our claims regarding the dissociation between decodable representations and causal mechanisms in transformer models. We respond to each major comment below and outline the revisions we will make.
read point-by-point responses
-
Referee: [§4.2] §4.2 (attention-masking experiments): the reported sharp drop in long-distance accuracy after masking the true top-of-stack position lacks control conditions such as masking random positions or non-top-of-stack tokens at matched distances. Without these, it is impossible to distinguish a specific loss of stack simulation from a general degradation of long-range positional or dependency tracking.
Authors: We agree that control conditions are necessary to isolate the effect of masking the top-of-stack position. In the revised manuscript, we will add results for masking attention to random positions and to non-top-of-stack tokens at equivalent distances. These controls will demonstrate that the accuracy drop is specific to the disruption of stack-like attention rather than a nonspecific effect on long-range information. We note that our intervention is already targeted, as we only mask the ground-truth top-of-stack token identified from the Dyck structure, but the additional baselines will strengthen the causal claim. revision: yes
-
Referee: [§4.3] §4.3 (subspace ablation): ablation is performed only on the low-dimensional directions identified by linear probes; no results are shown for ablation of random or orthogonal subspaces of matched dimensionality, nor for the effect on the remaining residual dimensions. This leaves open the possibility that the small observed effect reflects incomplete removal of distributed or redundant encodings rather than a true dissociation of causal role.
Authors: We concur that including ablations of random and orthogonal subspaces would provide a more complete picture. We will revise §4.3 to include these control ablations, as well as report the performance when ablating the identified subspace versus the effect on the orthogonal complement. This will help rule out that the minimal impact is due to redundancy in the residual stream. Our current results show little effect from probe-based ablation, but these additions will make the dissociation more robust. revision: yes
-
Referee: [§3] §3 (model training and evaluation): the manuscript provides no details on train/test splits, number of random seeds, or statistical tests for the accuracy differences. Given that the central claim rests on a differential effect between interventions, these omissions make it difficult to assess whether the long-distance accuracy drop is robust or sensitive to hyperparameter choices.
Authors: We regret the lack of these methodological details in the original submission. In the revised manuscript, we will expand §3 to specify the train/test split procedure (sequences were generated with a fixed random seed and split 80/20), the number of independent training runs (5 random seeds), and the statistical analysis (we will add p-values from paired t-tests across seeds for the reported accuracy differences). These details will confirm the reliability of the observed dissociation between the two intervention types. revision: yes
Circularity Check
No circularity: empirical interventions on external Dyck ground truth
full rationale
The paper reports results from training transformers on the Dyck language, then applying probes and interventions (attention masking to externally identified top-of-stack positions, subspace ablation in the residual stream) to measure effects on accuracy. No derivation chain, equations, or first-principles results are claimed; the hierarchical ground truth is the explicit Dyck sequence structure, independent of the model. Decodability is assessed via probes, causal use via performance changes under intervention. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The dissociation between decodability and causal use is an observed empirical outcome, not a construction that reduces to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The Dyck language provides an unambiguous hierarchical ground truth that can be used to define depth, distance, and top-of-stack variables.
Reference graph
Works this paper leans on
-
[1]
org/posts/baJyjpktzmcmRfosq/ stitching-saes-of-different-sizes
doi: 10.48550/arXiv.2304.14997. URL https://arxiv. org/abs/2304.14997. NeurIPS 2023 Spotlight. Elazar, Y ., Ravfogel, S., Jacovi, A., and Goldberg, Y . Am- nesic probing: Behavioral explanation with amnesic coun- terfactuals.Transactions of the Association for Computa- tional Linguistics, 9:160–175,
-
[2]
Lost in the Middle: How Language Models Use Long Contexts
doi: 10.1162/tacl a 00359. URL https://arxiv.org/abs/2006.0
work page internal anchor Pith review doi:10.1162/tacl 2006
-
[3]
Geiger, A., Potts, C., and Icard, T
URL https://tr ansformer-circuits.pub/2021/framework /index.html. Geiger, A., Potts, C., and Icard, T. Causal abstractions of neural networks. InAdvances in Neural Information Processing Systems, volume 34, pp. 9574–9586,
2021
-
[4]
Hayakawa, D
URL https://papers.neurips.cc/paper_ files/paper/2021/file/4f5c422f4d49a5 a807eda27434231040-Paper.pdf. Hayakawa, D. and Sato, I. Theoretical analysis of hierarchi- cal language recognition and generation by transformers without positional encoding. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long ...
2021
-
[5]
URL https://aclanthology.org/2025.acl-lon g.1488/
doi: 10.18653/v1/2025.acl-long.1488. URL https://aclanthology.org/2025.acl-lon g.1488/. Hewitt, J. and Liang, P. Designing and interpreting probes with control tasks. InProceedings of EMNLP-IJCNLP, pp. 2733–2743,
-
[6]
Designing and Interpreting Probes with Control Tasks
doi: 10.18653/v1/D19-1275. URL https://aclanthology.org/D19-1275/. Hewitt, J. and Manning, C. D. A structural probe for finding syntax in word representations. InProceedings of NAACL- HLT, pp. 4129–4138,
-
[7]
A Structural Probe for Finding Syntax in Word Representations
doi: 10.18653/v1/N19-1419. URL https://aclanthology.org/N19-141 9/. Meng, K., Bau, D., Andonian, A., and Belinkov, Y . Locating and editing factual associations in GPT. InAdvances in Neural Information Processing Systems, volume 35, pp. 17359–17372,
-
[8]
Locating and Editing Factual Associations in GPT, January 2023
URL https://arxiv.org/ab s/2202.05262. Murty, S., Sharma, P., Andreas, J., and Manning, C. D. Grokking of hierarchical structure in vanilla transformers. InProceedings of the 61st Annual Meeting of the Asso- ciation for Computational Linguistics (V olume 2: Short Papers), pp. 439–448,
-
[9]
URL https://aclanthology.org/2 023.acl-short.38/
doi: 10.18653/v1/2023.acl -short.38. URL https://aclanthology.org/2 023.acl-short.38/. M´eloux, M., Dirupo, G., Portet, F., and Peyrard, M. The dead salmons of ai interpretability.arXiv preprint arXiv:2512.18792,
-
[10]
Ravfogel, S., Elazar, Y ., Gonen, H., Twiton, M., and Gold- berg, Y
URL https://arxiv.or g/abs/2512.18792. Ravfogel, S., Elazar, Y ., Gonen, H., Twiton, M., and Gold- berg, Y . Null it out: Guarding protected attributes by iterative nullspace projection. InProceedings of ACL, pp. 7237–7256,
-
[11]
doi: 10.18653/v1/2020.acl-main.647. URL https://aclanthology.org/2020.ac l-main.647/. Ravfogel, S., Twiton, M., Goldberg, Y ., and Bachrach, Y . Linear adversarial concept erasure. InProceedings of the 39th International Conference on Machine Learning, pp. 18400–18421,
-
[12]
mlr.press/v162/ravfogel22a.html
URL https://proceedings. mlr.press/v162/ravfogel22a.html. Tiwari, U., Gupta, A., and Hahn, M. Emergent stack representations in modeling counter languages using transformers.arXiv preprint arXiv:2502.01432,
-
[13]
mlr.press/v162/ravfogel22a.html
doi: 10.48550/arXiv.2502.01432. URL https: //arxiv.org/abs/2502.01432. Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. InInter- national Conference on Learning Representations,
-
[14]
URL https://acla nthology.org/2021.acl-long.292/
doi: 10.18653/v1/2021.acl-long.292. URL https://acla nthology.org/2021.acl-long.292/. 5 Dissociating Decodability and Causal Use in Bracket-Sequence Transformers Table 1.Training hyperparameters (all models). Parameter Value Optimizer AdamW Learning rate3×10 −4 (constant) Adamβ 1, β2 (0.9,0.999) Weight decay0.0 Batch size 128 sequences Gradient clipping 1...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.