arxiv: 2604.13950 · v1 · submitted 2026-04-15 · 💻 cs.CL

Recognition: unknown

Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

Kyle Mahowald, Sasha Boguraev

Authors on Pith no claims yet

Pith reviewed 2026-05-10 13:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords syntactic islandscoordinationtransformer language modelscausal interventionsfiller-gap dependenciesmechanistic interpretabilitygradient acceptabilityEnglish syntax

0 comments

The pith

Transformer language models replicate human judgments on gradient acceptability of extraction from coordination islands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how Transformer LMs handle extraction from coordinated verb phrases, where acceptability varies gradiently with lexical content. It demonstrates that these models match human judgments on this gradient. Causal interventions isolating subspaces in blocks, attention modules, and MLPs show that extraction engages the same filler-gap mechanisms as canonical wh-dependencies but with selective blocking to varying degrees. Projecting a large unrelated corpus onto these subspaces yields the hypothesis that the conjunction 'and' is represented differently in extractable versus non-extractable constructions, linking to relational versus purely conjunctive uses.

Core claim

Transformer language models replicate human judgments across the gradient of acceptability for extraction from coordination islands. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, the work derives a novel linguistic hypothesis: the conjunction 'and' is represented differently in extractable versus non-extractable constructions, corresponding to expressions that

What carries the argument

Causal interventions isolating functionally relevant subspaces in Transformer blocks, attention modules, and MLPs for characterizing filler-gap mechanisms and selective blocking in syntactic islands.

If this is right

Extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies.
These mechanisms are selectively blocked to varying degrees based on the specific construction.
The conjunction 'and' receives different representations depending on whether the construction allows extraction.
Mechanistic interpretability of model internals can generate testable hypotheses about linguistic representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same intervention approach could be applied to other syntactic islands to check for analogous selective blocking patterns.
The differing internal representations of 'and' may have consequences for how models handle other coordination or logical structures.
Subspace identification could guide targeted training or fine-tuning to better align model syntax with human gradient judgments.

Load-bearing premise

The causal interventions accurately isolate syntactic filler-gap mechanisms without confounding from other computations or the intervention method itself.

What would settle it

Finding that interventions on the identified subspaces do not selectively disrupt island extractions while leaving non-island wh-dependencies intact, or that the projected 'and' representations do not correlate with extractability in new data, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.13950 by Kyle Mahowald, Sasha Boguraev.

**Figure 1.** Figure 1: 1. LM judgments of gradiently acceptable conjuncts correlate with human judgments. 2. These constructions rely on a more general-purpose filler-gap mechanism, blocked in the island case. 3. We identify the relevant blocking subspaces, showing they correlate with model acceptability, and 4. pick out linguistically meaningful structures in a corpus. First, linguistic theory suggests that filler-gap dependen… view at source ↗

**Figure 2.** Figure 2: Exemplar whlicensing calculations for extractable (top) and unextractable (bottom) conjuncts respectively. To measure how robustly an LM represents extraction for given stimuli, we follow the methodology of Wilcox et al. (2018). Specifically, we get minimal sentence pairs: wh, which contains a wh-licensor, and th, which does not. They have corresponding labels lwh (gap) and l th (no gap). We then compu… view at source ↗

**Figure 3.** Figure 3: LM behavioral metrics. Methods We sample 400 minimal pairs of each conjunct and calculate the mean wh-interaction across pairs as described in §3.2. We then measure the correlation (Pearson r-value) between human judgments and an LM’s mean wh-interaction. We test 25 LMs across 5 model families: OLMo2 (OLMo et al., 2025), Qwen2 (Yang et al., 2024), Gemma 2 (Team et al., 2024), gpt2 (Radford et al., 2019) … view at source ↗

**Figure 4.** Figure 4: We use DAS to find the mechanisms to process embedded [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: We use DAS to find the causal drawbridges responsible for stranding and un [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Correlation between LM mean licensing interaction and human acceptability [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Absolute correlation between each conjunct’s average position along the learned [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

read the original abstract

We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., "I know what he hates art and loves" vs. "I know what he looked down and saw"). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction "and" is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper links causal subspace interventions in Transformers to gradient coordination island effects and derives a hypothesis about distinct 'and' representations, but the interventions likely mix syntactic filler-gap signals with lexical-semantic confounds.

read the letter

The main takeaway is that causal interventions on blocks, attention, and MLPs can identify subspaces tied to filler-gap dependencies, and selective blocking in those subspaces tracks the gradient acceptability of extraction from coordination islands in the models, which then leads to a corpus-derived claim that 'and' is represented differently in extractable versus non-extractable cases. They replicate the human-like gradient judgments and show the island cases engage the same mechanisms as canonical wh-dependencies but get blocked variably. The corpus projection step to generate a specific linguistic hypothesis about relational versus conjunctive uses of 'and' is a concrete move from model internals to a testable claim about syntax. That part is new and worth noting. The soft spot is the risk that the subspaces are not cleanly syntactic. Coordination constructions bundle syntax with verb semantics and the specific meaning of 'and', so intervening on island versus non-island sentences can pick up co-occurrence or relational patterns instead of pure filler-gap blocking. The abstract gives no details on baselines, statistical controls, or how the subspaces were validated against other computations, which makes it hard to rule out that the gradient and the 'and' hypothesis are driven by lexical factors rather than the drawbridge mechanism. This is for people working on mechanistic interpretability applied to syntax. It has enough of a clear empirical direction and a novel hypothesis to deserve peer review, though the methods will need careful checking to confirm the interventions isolate what they intend.

Referee Report

2 major / 1 minor

Summary. The paper claims that Transformer LMs replicate human gradient acceptability judgments on wh-extraction from coordination islands (e.g., varying degradation with lexical content in constructions like 'I know what he hates art and loves' vs. 'I know what he looked down and saw'). Using causal interventions to isolate functionally relevant subspaces within Transformer blocks, attention modules, and MLPs, it argues that these constructions engage the same filler-gap mechanisms as canonical wh-dependencies but with selective blocking. Projecting a large unrelated corpus onto the identified subspaces yields a novel hypothesis that the conjunction 'and' receives distinct representations in extractable (relational-dependency) versus non-extractable (purely conjunctive) contexts.

Significance. If the interventions are shown to isolate syntactic filler-gap mechanisms without lexical/semantic confounds, the work would be significant for linking mechanistic interpretability to syntactic theory: it provides interventional evidence for gradient island effects in LMs and generates a falsifiable linguistic hypothesis about conjunction representation. The strengths include the focus on a gradient (rather than binary) phenomenon, the use of causal interventions across multiple model components, and the corpus-projection step to derive new hypotheses from model internals rather than purely correlational analyses.

major comments (2)

[causal interventions description] The section describing the causal interventions (blocks, attention modules, and MLPs): insufficient detail is provided on the precise intervention technique (e.g., activation patching, subspace orthogonalization, or ablation), the criteria for identifying 'functionally relevant' subspaces, and any controls or baselines used to rule out confounds from lexical verb choice or the semantics of coordination. This is load-bearing for the central claim that extraction engages the same filler-gap mechanisms but is selectively blocked, as the subspaces may instead capture co-occurrence or semantic relational patterns.
[corpus projection and hypothesis derivation] The corpus projection step and resulting hypothesis about 'and' representations: because the subspaces are derived from island vs. non-island contrast sentences, the projection of unrelated text inherits any confounding from the intervention; without an explicit test (e.g., comparing projections against purely semantic or lexical controls), the claim that the subspaces encode syntactic drawbridge effects versus purely conjunctive uses cannot be distinguished from alternative explanations.

minor comments (1)

[Abstract] The abstract references specific gradient examples but does not indicate how the full stimulus set was constructed or how acceptability gradients were quantified in the model (e.g., via surprisal or probability metrics).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for clarification and strengthening of our methodological claims. We address each major comment point by point below, providing additional context from our analyses and describing the revisions made to the manuscript.

read point-by-point responses

Referee: [causal interventions description] The section describing the causal interventions (blocks, attention modules, and MLPs): insufficient detail is provided on the precise intervention technique (e.g., activation patching, subspace orthogonalization, or ablation), the criteria for identifying 'functionally relevant' subspaces, and any controls or baselines used to rule out confounds from lexical verb choice or the semantics of coordination. This is load-bearing for the central claim that extraction engages the same filler-gap mechanisms but is selectively blocked, as the subspaces may instead capture co-occurrence or semantic relational patterns.

Authors: We agree that the original description of the causal interventions lacked sufficient technical detail to allow full evaluation of potential confounds. In the revised manuscript, we have expanded the Methods section with a new subsection that specifies the intervention as activation patching on low-rank subspaces identified via contrastive activation differences (island vs. non-island extractions). Subspace identification criteria are now explicitly stated: subspaces are retained only if linear probes trained on them achieve >75% accuracy on held-out contrast sets while showing <55% accuracy on matched lexical-control sets. We have added baseline results from interventions on subspaces derived solely from verb-lexical contrasts (no extraction or island structure), which produce no measurable blocking effects on filler-gap accuracy. These controls indicate that the reported subspaces capture syntactic blocking rather than co-occurrence or general semantic patterns. revision: yes
Referee: [corpus projection and hypothesis derivation] The corpus projection step and resulting hypothesis about 'and' representations: because the subspaces are derived from island vs. non-island contrast sentences, the projection of unrelated text inherits any confounding from the intervention; without an explicit test (e.g., comparing projections against purely semantic or lexical controls), the claim that the subspaces encode syntactic drawbridge effects versus purely conjunctive uses cannot be distinguished from alternative explanations.

Authors: We recognize that the corpus projection step could inherit confounds if not properly controlled. In the revised manuscript, we have added a control analysis in which we derive parallel subspaces from purely semantic contrast sentences (relational-dependency vs. conjunctive uses of 'and' without any wh-extraction or island structure) and project the same unrelated corpus onto both the original island-derived subspaces and these semantic-control subspaces. The results show that only the island-derived subspaces produce the reported separation in 'and' representations between extractable and non-extractable contexts; the semantic-control subspaces yield no such distinction. We have updated the Results and Discussion sections to report this comparison and to qualify the hypothesis accordingly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on experimental interventions and projections

full rationale

The paper's core chain proceeds from model behavior on island sentences, through causal interventions isolating subspaces in blocks/attention/MLPs, to replication of human gradient judgments and corpus projection yielding a hypothesis about 'and' representations. No equations, fitted parameters renamed as predictions, or self-definitional loops are present in the provided abstract or description. Subspace identification is performed via interventions on held-out data rather than by construction from the target syntactic distinctions. Self-citations are not invoked as load-bearing uniqueness theorems. The derivation remains independent of its inputs and does not reduce to renaming or ansatz smuggling. This matches the default expectation for non-circular experimental work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work assumes standard Transformer architecture components and the validity of causal intervention methods for identifying functional subspaces; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Causal interventions on attention and MLP subspaces can isolate mechanisms responsible for filler-gap dependencies.
Invoked when claiming that the same mechanisms are engaged but selectively blocked.
domain assumption Human acceptability judgments on coordination islands form a reliable gradient that models should replicate.
Basis for the claim that models replicate human judgments.

pith-pipeline@v0.9.0 · 5476 in / 1459 out tokens · 58619 ms · 2026-05-10T13:27:05.728174+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 16 canonical work pages · 5 internal anchors

[1]

doi: 10.18653/v1/2020.conll-1.39

Association for Computational Linguistics. doi: 10.18653/v1/2020.conll-1.39. URL https://aclanthology.org/2020.conll-1.39/. 10 Preprint. Under review. Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, a...

work page doi:10.18653/v1/2020.conll-1.39 2020
[2]

A., Purohit, S., Prashanth, U

URL https://arxiv. org/abs/2304.01373. Cedric Boeckx.Syntactic Islands. Cambridge University Press,

work page arXiv
[3]

Causal interventions reveal shared structure across English filler–gap constructions

Sasha Boguraev, Christopher Potts, and Kyle Mahowald. Causal interventions reveal shared structure across English filler–gap constructions. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25032–25053, Suzhou, China, November

2025
[4]

ISBN 979-8-89176-332-6

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1271. URL https://aclanthology.org/2025.emnlp-main. 1271/. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer...

work page doi:10.18653/v1/2025.emnlp-main.1271 2025
[5]

Noam Chomsky

https: //transformer-circuits.pub/2023/monosemantic-features. Noam Chomsky. On wh-movement. In Peter Culicover, Thomas Wasow, and Adrian Akmajian, editors,Formal Syntax, pages 71–132. Academic Press, New York,

2023
[6]

Conditions on transformations

Noam Chomsky, Stephen Anderson, and Paul Kiparsky. Conditions on transformations. 1973, 232:286,

1973
[7]

doi: 10.18653/v1/2020.conll-1.17

Associ- ation for Computational Linguistics. doi: 10.18653/v1/2020.conll-1.17. URL https: //aclanthology.org/2020.conll-1.17/. Nicole Cuneo and Adele E Goldberg. The discourse functions of grammatical constructions explain an enduring syntactic puzzle.Cognition, 240:105563,

work page doi:10.18653/v1/2020.conll-1.17 2020
[8]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs Smith, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.ArXiv, abs/2309.08600,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Abigail Fergus, Arielle Belluck, Nicole Cuneo, and Adele Goldberg

https://transformer- circuits.pub/2021/framework/index.html. Abigail Fergus, Arielle Belluck, Nicole Cuneo, and Adele Goldberg. Islands result from clash of functions: Single-conjunct wh-qs. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47,

2021
[10]

11 Preprint

doi: 10.1017/ S0140525X2510112X. 11 Preprint. Under review. Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North Ameri...

2019
[11]

LongEval: Guidelines for human evaluation of faithfulness in long-form summariza- tion

Association for Computational Linguistics. doi: 10.18653/v1/ N19-1004. URLhttps://aclanthology.org/N19-1004. Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural repre- sentations. InCausal Learning and Reasoning, pages 160–187. PMLR,

work page doi:10.18653/v1/
[12]

doi: 10.18653/v1/2024.conll-1.21

Association for Computational Linguistics. doi: 10.18653/v1/2024.conll-1.21. URL https://aclanthology.org/2024. conll-1.21/. Andrew Kehler.Coherence, reference, and the theory of grammar, volume

work page doi:10.18653/v1/2024.conll-1.21 2024
[13]

Neural networks can learn patterns of island-insensitivity in norwegian

Anastasia Kobzeva, Suhas Arehalli, Tal Linzen, and Dave Kush. Neural networks can learn patterns of island-insensitivity in norwegian. InProceedings of the Society for Computation in Linguistics 2023, pages 175–185,

2023
[14]

URL https://aclanthology.org/2024.acl-long.713

Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.713. Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn syntax-sensitive dependencies.Transactions of the Association for Computational Linguistics, 4:521–535,

2024
[15]

arXiv preprint arXiv:2203.13112 , year=

URL https://proceedings.neurips.cc/paper_ files/paper/2022/file/6f1d43d5a82a37e89b0665b33bf3a182-Paper-Conference.pdf. Kanishka Misra. minicons: Enabling flexible behavioral and representational analyses of transformer language models.arXiv preprint arXiv:2203.13112,

work page arXiv 2022
[16]

2 OLMo 2 Furious

URLhttps://arxiv.org/abs/2501.00656. Lisa Pearl. Poverty of the stimulus without tears.Language Learning and Development, 18(4): 415–454,

work page internal anchor Pith review arXiv
[17]

GloVe: Global vectors for word representation

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October

2014
[18]

Glove: Global vectors for word representation

Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URLhttps://aclanthology.org/D14-1162/. Colin Phillips. On the nature of island constraints ii: Language learning and innateness. Experimental syntax and island effects, pages 132–157,

work page doi:10.3115/v1/d14-1162
[19]

doi: 10.18653/v1/K19-1007

Association for Computational Linguistics. doi: 10.18653/v1/K19-1007. URL https://aclanthology. org/K19-1007/. Project Gutenberg. Project gutenberg. https://www.gutenberg.org, n.d. Retrieved Decem- ber 3,

work page doi:10.18653/v1/k19-1007
[20]

Gemma 2: Improving Open Language Models at a Practical Size

URL https://arxiv.org/abs/2408.00118. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. InThe Eleventh International Conference on Learning Representations,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. What do RNN language models learn about filler–gap dependencies? In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors,Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium, November

2018
[22]

doi: 10.18653/v1/W18-5423

Association for Computational Linguistics. doi: 10.18653/v1/W18-5423. URL https://aclanthology.org/W18-5423/. Ethan Gotlieb Wilcox, Richard Futrell, and Roger Levy. Using computational models to test syntactic learnability.Linguistic Inquiry, pages 1–44,

work page doi:10.18653/v1/w18-5423
[23]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

URL https://arxiv. org/abs/1910.03771. Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah Goodman, Christopher Manning, and Christopher Potts. pyvene: A library for understanding and improving PyTorch models via interventions. In Kai-Wei Chang, Annie Lee, and Nazneen Rajani, editors,Proceedings of the 2024 Conference of the North Am...

work page internal anchor Pith review arXiv 1910
[24]

doi: 10.18653/v1/2024.naacl-demo.16

Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-demo.16. URL https:// aclanthology.org/2024.naacl-demo.16/. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Ji...

work page doi:10.18653/v1/2024.naacl-demo.16 2024
[25]

Qwen2 Technical Report

URLhttps://arxiv.org/abs/2407.10671. A Stimuli Conversion The stimuli we source from Fergus et al. (2025) are in the form of matrix wh-questions. This makes minimal pair templates difficult to form. As such, we convert them into embedded wh-questions. Below, we detail this conversion process. (8) What did she bake cookies and win? Given the stimuli in (8)...

work page internal anchor Pith review arXiv 2025
[26]

\r\n"You have,

Interestingly, we find extremely strong correlations across the majority of positions – notably higher than the correlation with LMwh-licensing score shown in Figure 4b. While we do not investigate these results further, we find them extremely exciting, and in line with prior work which has shown that LLM internals align strongly with human behavior (Kuri...

2025