Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

Ethan Wilcox; Kanishka Misra; Leonie Weissweiler; Nathan Schneider; Wesley Scivetti

arxiv: 2605.31586 · v1 · pith:NMOWQQ3Vnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI

Wesley Scivetti, Ethan Wilcox, Nathan Schneider, Kanishka Misra, Leonie Weissweiler

Pith reviewed 2026-06-28 22:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: language models, constructional semantics, paired-focus constructions, let alone, semantic understanding, training dynamics, world knowledge

The pith

Modestly sized language models grasp rare paired-focus constructions like 'let alone' and link them to world knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models can understand the meanings of uncommon English constructions such as 'let alone' and 'much less', which signal focus on a stronger or weaker alternative. It introduces a new dataset that probes these meanings through scalar adjectives and everyday facts rather than surface cues. Experiments across many models show that several modestly sized open-source systems detect both the forms and the intended meanings, whereas models trained on smaller human-scale data sets fail the meaning tests. Training curves reveal that semantic grasp of these constructions appears after basic syntactic knowledge and aligns with progress in certain world-knowledge areas.

Core claim

Modestly sized open-source models are sensitive to both the forms and the meanings of Paired-Focus constructions; semantic understanding emerges later in training than syntactic knowledge and correlates with gains in some domains of world knowledge, while models trained on human-scale data fail at all meaning evaluations.

What carries the argument

A novel dataset that isolates Paired-Focus construction meanings by combining scalar adjectival semantics with general world knowledge; it forces models to apply the construction's specific semantic contribution rather than lexical or surface patterns.

If this is right

Modestly sized open-source models can acquire understanding of rare constructions without extreme scale.
Semantic knowledge of Paired-Focus constructions develops after their syntactic patterns during training.
Acquisition of Paired-Focus semantics tracks improvements in selected world-knowledge domains.
Models trained only on human-scale data remain unable to handle the meanings of these constructions.
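
The ordering claim above (semantics after syntax) can be made concrete by finding, for each evaluation, the first checkpoint from which accuracy stays above chance. The following is a minimal sketch, not the paper's procedure: the function name `emergence_step`, the 5-point margin, and all accuracy values are illustrative assumptions; only the 50% chance level for non-EWoK evaluations comes from the Figure 2 caption.

```python
from typing import Optional, Sequence


def emergence_step(
    steps: Sequence[int],
    accuracies: Sequence[float],
    chance: float,
    margin: float = 5.0,
) -> Optional[int]:
    """Earliest step from which accuracy stays above chance + margin.

    Returns None if accuracy is at or below the threshold at the final
    checkpoint. The margin is an arbitrary illustrative choice.
    """
    threshold = chance + margin
    onset: Optional[int] = None
    for step, acc in zip(steps, accuracies):
        if acc > threshold:
            if onset is None:
                onset = step  # candidate onset; kept only if never undercut
        else:
            onset = None  # a later dip resets the candidate
    return onset


# Hypothetical accuracy curves (percent), not results from the paper.
steps = [0, 1000, 2000, 3000, 4000]
form_acc = [50.0, 60.0, 72.0, 80.0, 85.0]
meaning_acc = [50.0, 70.0, 52.0, 68.0, 75.0]  # early spike, then a dip

form_onset = emergence_step(steps, form_acc, chance=50.0)
meaning_onset = emergence_step(steps, meaning_acc, chance=50.0)
print(form_onset, meaning_onset)
```

Requiring accuracy to stay above the threshold, rather than merely cross it once, keeps a transient early spike from being counted as emergence.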

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The observed link suggests that constructional semantics and factual knowledge may reinforce each other during learning.
If the correlation is causal, methods that boost general world knowledge could also improve handling of rare constructions.
The result raises the possibility that similar late-emerging semantic patterns hold for other low-frequency constructions.

Load-bearing premise

The new test items using scalar adjectives and world knowledge actually measure constructional semantic understanding instead of unrelated surface patterns or background facts.

What would settle it

A controlled follow-up in which models that pass the current tests are shown to succeed by relying only on word associations rather than the paired-focus meaning, or in which the reported correlation between construction learning and world-knowledge gains disappears under stricter controls.

Figures

Figures reproduced from arXiv: 2605.31586 by Ethan Wilcox, Kanishka Misra, Leonie Weissweiler, Nathan Schneider, Wesley Scivetti.

**Figure 1.** PAIRED-FOCUS Semantic Results by Model. Accuracy is averaged across our four PAIRED-FOCUS constructions: LET-ALONE, MUCH-LESS, NOT-TO-MENTION, and NEVER-MIND. We observe a high rank-order correlation between model parameters and average accuracy (Spearman’s ρ = 0.67).

**Figure 2.** Training dynamics of Pythia-12b on PAIRED-FOCUS evaluations as well as other linguistic benchmarks. Chance performance on EWoK is 25%, while chance performance on all other evaluations is 50%. To operationalize semantic and formal knowledge of PAIRED-FOCUS constructions, we use PAIRED-FOCUS accuracy for semantics (Equation 2) and our syntactic test suite identical to Experiment 1.

**Figure 4.** Learning trajectory correlation scatterplots for Pythia-12b. Each point represents, for a given checkpoint, how much the model improved over the previous checkpoint with respect to a pair of criteria. EWoK physical relations and PAIRED-FOCUS semantic accuracy show moderate correlation.

**Figure 5.** Training dynamics of Ettin-Enc-400M. Axes: training step (x) against accuracy in % (y). Series: BLIMP, COMPS, EWoK, PF Form, PF Function (plausible), PF Function (implausible), with chance lines for EWoK and for the other evaluations.

**Figure 6.** Training dynamics of Ettin-Decoder 1b. For both Ettin models, we observe early spikes in performance on plausible examples that are not accompanied by spikes on implausible examples. Generally, performance on implausible examples does not rise consistently above chance until much later in training.
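
The Figure 1 caption reports a rank-order correlation (Spearman's ρ) between parameter count and average accuracy. As a rough illustration of what that statistic computes, the sketch below ranks both variables (averaging ranks for ties) and takes the Pearson correlation of the ranks. The helper names and the five data points are hypothetical and do not reproduce the paper's models or its ρ = 0.67.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical (parameter count in billions, average accuracy) pairs.
params_b = [0.1, 0.4, 1.0, 7.0, 12.0]
accuracy = [0.50, 0.48, 0.71, 0.69, 0.83]
rho = spearman_rho(params_b, accuracy)
print(round(rho, 2))
```

Because only ranks enter the computation, the result does not depend on whether parameter counts are expressed on a linear or logarithmic scale.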

Original abstract

Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Mid-sized models show sensitivity to these constructions in the new tests, but the items may not cleanly separate constructional semantics from scalar or factual reasoning.

The main point is that several modestly sized open-source models register sensitivity to both the forms and meanings of Paired-Focus constructions such as "let alone" and "much less," while models trained on smaller human-scale data do not. The training-dynamics results also show the semantic sensitivity appearing after syntactic knowledge and alongside gains in certain world-knowledge domains.

The new dataset and the checkpoint analysis are the clearest additions. Running the same items across a wide range of model sizes, architectures, and pretraining scales gives a broader view than most single-model probes, and tracking emergence order during training is a useful angle that goes beyond static accuracy numbers.

The sweep itself looks thorough on the surface. The correlation between Paired-Focus performance and other meaning benchmarks is worth recording even if it stays observational.

The soft spot is the isolation claim. The items combine scalar adjectives with world-knowledge facts, so correct answers could reflect direct application of that knowledge to the content words rather than grasp of the specific form-meaning pairing. The abstract mentions testing both forms and meanings but does not describe controls such as the same scalars paired with neutral connectors. Without those, the evidence that models have acquired the constructions themselves is weaker than the headline conclusion suggests.
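
One way to picture the missing control is to hold the content words fixed and swap only the connector. The sketch below is an assumption about how such matched items could be generated, not a description of the paper's dataset: the `Item` class, the sentence template, and the example sentence are hypothetical. The four Paired-Focus connectors are the ones named in the Figure 1 caption, and 'and'/'but' are the neutral connectors suggested in the referee report.

```python
from dataclasses import dataclass

# Constructions named in the Figure 1 caption.
PAIRED_FOCUS_CONNECTORS = ("let alone", "much less", "not to mention", "never mind")
# Neutral connectors proposed as controls in the referee report.
NEUTRAL_CONNECTORS = ("and", "but")


@dataclass(frozen=True)
class Item:
    """A test sentence split into clause, connector, and second focus."""

    prefix: str
    connector: str
    focus: str

    def text(self) -> str:
        # Hypothetical template; real items may need other punctuation.
        return f"{self.prefix}, {self.connector} {self.focus}."


def neutral_controls(item: Item) -> list[Item]:
    """Copies of the item that differ only in the connector."""
    return [Item(item.prefix, c, item.focus) for c in NEUTRAL_CONNECTORS]


# Hypothetical scalar-adjective item (warm < hot), not from the dataset.
original = Item("The soup was not warm", "let alone", "hot")
controls = neutral_controls(original)
print(original.text())
for control in controls:
    print(control.text())
```

Some mechanically generated controls will read as odd or ungrammatical, so in practice each one would need manual review before its scores are compared with the original item's.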

This is for researchers who work on constructional semantics in LMs or on when different kinds of meaning knowledge appear during training. The new data and timing patterns are concrete enough that a referee should look at the methods and controls in detail.

I would send it for peer review, with the main request being clearer validation that the items target the construction rather than independent scalar or factual reasoning.

Referee Report

2 major / 2 minor

Summary. The paper claims that modestly sized open-source language models can grasp the semantics of rare Paired-Focus constructions (e.g., 'let alone', 'much less') via a novel dataset testing scalar adjectival semantics and world knowledge; several such models succeed on both form and meaning, while human-scale-data models fail; Paired-Focus semantics emerges later than syntax during training and correlates with gains in certain world-knowledge domains.

Significance. If the isolation of constructional semantics holds, the work provides empirical evidence that constructional understanding is not limited to the largest models, links it to broader meaning domains via checkpoint analysis, and supplies a reusable dataset for rare constructions.

major comments (2)

§3 (Dataset): No control conditions are described that hold scalar adjectives and world-knowledge facts constant while replacing the Paired-Focus connector with a neutral one (e.g., 'and' or 'but'); without these, success on the novel items cannot be attributed to the form-meaning pairing rather than independent scalar or factual reasoning.
§4.3 (Training Dynamics): The reported correlation between Paired-Focus semantic performance and world-knowledge benchmarks lacks a control correlation with unrelated semantic tasks or a permutation test; this leaves open whether the link is specific or an artifact of overall capability growth.
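
For context on the second comment, the Figure 4 caption describes each scatterplot point as a checkpoint's improvement over the previous checkpoint on a pair of criteria. A minimal sketch of that kind of analysis follows, correlating per-checkpoint gains rather than raw accuracies. The function names and all accuracy values are hypothetical, and the use of Pearson's r is an assumption, since the source does not say which coefficient the paper uses.

```python
import math
from typing import Sequence


def checkpoint_deltas(scores: Sequence[float]) -> list[float]:
    """Improvement of each checkpoint over the one before it."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies (percent) at five consecutive checkpoints.
paired_focus_acc = [50.0, 52.0, 60.0, 61.0, 70.0]
world_knowledge_acc = [25.0, 27.0, 33.0, 35.0, 42.0]

pf_gain = checkpoint_deltas(paired_focus_acc)
wk_gain = checkpoint_deltas(world_knowledge_acc)
r = pearson_r(pf_gain, wk_gain)
print(pf_gain, wk_gain, round(r, 3))
```

Correlating gains instead of raw scores removes the shared upward trend that any two improving benchmarks would show, although gains can still co-vary because of general capability growth, which is the referee's concern.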

minor comments (2)

Table 1 and §2.2: clarify the exact pretraining data sizes labeled 'human-scale' and list the precise model checkpoints used for the dynamics analysis.
§4.1: report exact item counts per condition and any exclusion criteria for the novel dataset items.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the attribution of our results to constructional semantics. We address each major comment below.

Referee: §3 (Dataset): No control conditions are described that hold scalar adjectives and world-knowledge facts constant while replacing the Paired-Focus connector with a neutral one (e.g., 'and' or 'but'); without these, success on the novel items cannot be attributed to the form-meaning pairing rather than independent scalar or factual reasoning.

Authors: We agree that neutral-connector controls would more cleanly isolate the contribution of the Paired-Focus form-meaning pairing. Our current items were constructed so that the scalar ordering and world-knowledge facts are only licensed under the construction (as described in §3), but we acknowledge the referee's point that this is not sufficient without explicit baselines. We will add matched control items using 'and'/'but' in the revised dataset and re-run the model evaluations to quantify the drop in performance.
revision: yes

Referee: §4.3 (Training Dynamics): The reported correlation between Paired-Focus semantic performance and world-knowledge benchmarks lacks a control correlation with unrelated semantic tasks or a permutation test; this leaves open whether the link is specific or an artifact of overall capability growth.

Authors: The referee is correct that the reported correlations could reflect general capability growth rather than a specific link. While we chose the world-knowledge domains on theoretical grounds (their relevance to scalar reasoning), we did not include unrelated-task controls or permutation tests. In revision we will add (i) correlations against unrelated semantic benchmarks and (ii) a permutation test that shuffles the Paired-Focus scores, to demonstrate specificity.
revision: yes
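
The permutation test mentioned in the response can be sketched as follows: shuffle one score series many times and count how often the shuffled correlation is at least as strong as the observed one. This is an illustrative sketch under stated assumptions, not the authors' planned analysis. The function names, the two-sided |r| statistic, the add-one p-value correction, and the data are all hypothetical choices.

```python
import math
import random
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(
    x: Sequence[float],
    y: Sequence[float],
    n_permutations: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)  # seeded for reproducibility
    observed = abs(pearson_r(x, y))
    shuffled = list(x)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson_r(shuffled, y)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-checkpoint gains, not values from the paper.
paired_focus_gain = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
world_knowledge_gain = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 17.0]
p_value = permutation_p_value(paired_focus_gain, world_knowledge_gain, n_permutations=2000)
print(p_value)
```

Plain shuffling treats checkpoints as exchangeable, which ignores the temporal structure of a training run, so a small p-value here would not by itself rule out general capability growth as the explanation.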

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations

The paper reports results from constructing a novel dataset and running model evaluations on Paired-Focus constructions, including correlations with world-knowledge benchmarks and training dynamics. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear; all central claims rest on observed performance metrics that are measured directly rather than reduced to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the constructed test items measure the intended semantic distinctions; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Test items built from scalar adjectives and world-knowledge facts isolate constructional meaning rather than unrelated surface or factual knowledge.
This premise is required for the meaning evaluations to support the reported conclusions about semantic understanding.

