Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

Claire Hobbs; R. Thomas McCoy

arxiv: 2605.20529 · v1 · pith:BR4HB222new · submitted 2026-05-19 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 06:46 UTC · model grok-4.3

keywords: collocational bootstrapping · subject-verb agreement · language acquisition · neural networks · child-directed speech · statistical learning · syntactic dependencies

The pith

Subject-verb pairings in child-directed speech are about as variable as the synthetic input from which neural networks robustly learn subject-verb agreement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether regularities in how words co-occur can supply cues for syntactic dependencies such as subject-verb agreement. Researchers trained neural networks on artificial datasets that controlled the predictability of subject-verb pairings and identified a range of variability in which the networks reliably acquired the agreement rule. They then measured the actual variability of subject-verb pairings in child-directed language and found that it fell inside the successful range from the simulations. These results indicate that collocational bootstrapping is a workable learning strategy given the statistical properties of the input children hear.

Core claim

Neural networks learn English subject-verb agreement robustly from synthetic input whose subject-verb pairings exhibit moderate variability; child-directed speech contains comparable variability, making collocational bootstrapping a viable route to agreement acquisition.

What carries the argument

Collocational bootstrapping: the use of statistical regularities in word co-occurrence to infer syntactic dependencies.

If this is right

Children could acquire subject-verb agreement from statistical patterns without additional syntactic knowledge.
The same range of input variability supports generalization in both artificial and potentially human learners.
Collocational cues may be sufficient for other local syntactic dependencies in early language acquisition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mechanism might scale to agreement in languages with richer morphology or to other dependency relations such as noun-adjective concord.
If the variability match holds, targeted experiments could test whether children show stronger agreement learning after exposure to input with matched statistical properties.
Models that rely solely on collocational statistics could be compared against models that incorporate additional cues to isolate the contribution of co-occurrence alone.

Load-bearing premise

Neural-network performance on controlled synthetic datasets accurately reflects the mechanisms children use to extract agreement from natural language input.

What would settle it

A measurement showing that subject-verb pairing variability in child-directed corpora lies outside the range that permits robust agreement learning in the neural-network simulations.

Figures

Figures reproduced from arXiv: 2605.20529 by Claire Hobbs, R. Thomas McCoy.

**Figure 1.** Noun probability distributions across α values (log scale). Lower α values produce flatter distributions with more uniform noun usage, while higher α values concentrate probability on fewer nouns. The distributions were truncated at the dotted line to leave some nouns unseen as the subjects of particular verbs.
4.3 Training. For each α value from 0.0 to 3.0 inclusive in increments of 0.1, as well as the c…
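
The construction described in the Figure 1 caption can be sketched in a few lines. The sketch assumes a rank-based Zipfian form, p(r) ∝ 1 / r^α, and models the truncation by zeroing out every rank past a cutoff and renormalizing. The function name `zipf_probs`, the `n_kept` cutoff, and the renormalization step are illustrative assumptions, not details confirmed by the paper.

```python
def zipf_probs(n_items, alpha, n_kept=None):
    """Return probabilities p(r) proportional to 1 / r**alpha for ranks 1..n_items.

    If n_kept is given, ranks beyond it get probability 0 (an assumed way to
    leave some nouns unseen), and the remaining mass is renormalized.
    """
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_items + 1)]
    if n_kept is not None:
        # Hypothetical truncation: nouns past the cutoff never appear.
        weights = [w if i < n_kept else 0.0 for i, w in enumerate(weights)]
    total = sum(weights)
    return [w / total for w in weights]


# Sweep from 0.0 to 3.0 in steps of 0.1, as the training excerpt describes.
# Building the grid from integers avoids accumulated floating-point drift.
alphas = [round(0.1 * i, 1) for i in range(31)]

flat = zipf_probs(10, alpha=0.0)               # uniform over all 10 nouns
peaked = zipf_probs(10, alpha=3.0)             # mass concentrated on rank 1
truncated = zipf_probs(10, alpha=1.4, n_kept=6)  # last 4 nouns unseen
```

With α = 0 every noun is equally likely; raising α moves probability onto the top-ranked nouns, which matches the behavior the caption describes.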

**Figure 2.** Model accuracy vs. Zipfian parameter α across four evaluation conditions. There is an optimal point where α = 1.4 at which models perform robustly in all test conditions. Error bars show one standard deviation.
… the verb not matching the subject’s number. Sample minimal pairs are in …
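
Accuracy on minimal pairs is commonly computed as the fraction of pairs in which the model scores the grammatical sentence above its ungrammatical twin. The sketch below assumes that convention. The paper's exact metric is not stated on this page, and `minimal_pair_accuracy`, `toy_score`, and the example sentences are hypothetical stand-ins rather than the authors' code or data.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence receives a strictly higher score. Ties count as errors."""
    pairs = list(pairs)
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence):
    """Hypothetical scorer standing in for a trained model's log-probability.

    It handles only three-word sentences of the form 'the NOUN VERB' and
    treats a trailing 's' as plural on nouns and singular on verbs.
    """
    _, subject, verb = sentence.split()
    plural_subject = subject.endswith("s")
    plural_verb = not verb.endswith("s")
    return 0.0 if plural_subject == plural_verb else -1.0


# Illustrative minimal pairs: the second member has a number mismatch.
pairs = [
    ("the dog runs", "the dog run"),
    ("the dogs run", "the dogs runs"),
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
```

In a real evaluation, `toy_score` would be replaced by the summed token log-probabilities that the trained network assigns to each sentence.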

**Figure 4.** The fitted Zipf parameter α decreases with child age. The dashed red line indicates the optimal α found in neural network simulations; the dotted purple line indicates the overall corpus α.
… also broke down the analysis by the age of the child being spoken to in order to see whether the best-fitting α value varied by the age of the target child. As shown in …
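
One plausible way to obtain a fitted α like the one plotted in Figure 4 is a grid search that minimizes the mean squared error between the empirical rank-frequency distribution and a Zipfian one. This is only a sketch: the paper's actual fitting procedure is not given on this page (the MSE column reported alongside the per-age α values merely hints at a least-squares criterion), and `fit_alpha`, the 0.01 grid step, and the toy counts are assumptions made for illustration.

```python
def zipf_probs(n_items, alpha):
    """Normalized Zipfian probabilities p(r) proportional to r**-alpha."""
    weights = [rank ** -alpha for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def fit_alpha(counts, grid=None):
    """Return (alpha, mse) minimizing the MSE between the empirical
    rank-frequency distribution and a Zipfian distribution.

    counts: raw frequencies (e.g. how often each subject occurs with a verb).
    grid:   candidate alpha values; defaults to 0.00, 0.01, ..., 3.00.
    """
    ranked = sorted(counts, reverse=True)  # rank 1 = most frequent item
    total = sum(ranked)
    empirical = [c / total for c in ranked]
    if grid is None:
        grid = [i / 100 for i in range(301)]
    best_alpha, best_mse = None, float("inf")
    for alpha in grid:
        predicted = zipf_probs(len(empirical), alpha)
        mse = sum((e - p) ** 2 for e, p in zip(empirical, predicted)) / len(empirical)
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha, best_mse


# Toy check: counts generated from an exact Zipfian with alpha = 1.4
# should be recovered by the grid search.
toy_counts = [1_000_000 * p for p in zipf_probs(50, 1.4)]
alpha_hat, mse_hat = fit_alpha(toy_counts)
```

A grid search keeps the example dependency-free; a continuous optimizer would serve equally well once the objective is fixed.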

**Figure 5.** Distribution of utterances by speaker role in …

**Figure 6.** Training and validation loss at three α values. Loss curves track closely across all conditions, indicating no overfitting.

| Age Group | Utterances | S-V Pairs | Unique Verbs | Unique Subjects | α | MSE |
|---|---|---|---|---|---|---|
| 0–12mo | 182,023 | 113,607 | 1,180 | 1,516 | 1.46 | 9.78e-07 |
| 12–24mo | 671,559 | 350,859 | 3,580 | 6,135 | 1.40 | 2.56e-07 |
| 24–36mo | 1,900,684 | 1,163,974 | 6,775 | 13,577 | 1.44 | 2.15e-07 |
| 36–48mo | 923,858 | 553,675 | 4,940 | 9,817 | 1.38 | 3.90e-07 |
| 48–60mo | 621,805… | | | | | |

Original abstract

In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper hypothesizes 'collocational bootstrapping' as a mechanism in which word co-occurrence regularities provide cues to syntactic dependencies, specifically for acquiring English subject-verb agreement. It tests this via neural network simulations on synthetic datasets that systematically vary the predictability of subject-verb pairings, identifying a range of variability levels supporting robust generalization. A corpus analysis of child-directed language then shows that its subject-verb variability falls inside the same range, leading to the claim that collocational bootstrapping is a viable strategy for the input children receive.

Significance. If the central mapping from synthetic variability range to natural child-directed speech holds, the work supplies a concrete, testable computational account of how low-level statistical signals can support syntactic acquisition without requiring innate syntactic knowledge. The dual approach of controlled simulation plus direct corpus measurement is a strength, as is the focus on a falsifiable prediction about the variability sweet spot.

major comments (2)

§3 (Synthetic experiments): the claim that the identified variability range supports robust agreement learning rests on networks trained only on controlled subject-verb predictability while other factors are held fixed; this construction does not reproduce the correlated cues (semantic number, noun/verb morphology, prosody, sentence position) present in natural input, so the observed range may not be a faithful proxy for the learning problem children face.
§4 (Corpus analysis): matching only a single scalar variability statistic from CHILDES data to the synthetic range is insufficient to establish viability; the paper does not report training or evaluating the same networks on actual child-directed corpora to test whether the robustness and generalization behavior observed in the synthetic setting is preserved.

minor comments (2)

Abstract and §2: the term 'collocational bootstrapping' is introduced without an explicit formal definition or operationalization before the simulations are described.
Methods: network architectures, exact agreement metrics, and statistical criteria for 'robust' learning are described at a high level; additional detail would improve replicability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper where possible to clarify our methodology and acknowledge limitations.

Referee: §3 (Synthetic experiments): the claim that the identified variability range supports robust agreement learning rests on networks trained only on controlled subject-verb predictability while other factors are held fixed; this construction does not reproduce the correlated cues (semantic number, noun/verb morphology, prosody, sentence position) present in natural input, so the observed range may not be a faithful proxy for the learning problem children face.

Authors: We agree that the synthetic experiments isolate subject-verb co-occurrence predictability while holding other factors fixed and therefore do not replicate the full set of correlated cues present in natural child-directed speech. This controlled design was chosen specifically to test the contribution of collocational statistics in isolation, thereby identifying the variability range at which such statistics alone can support robust generalization. We view the resulting range as a conservative estimate, since additional cues in natural input would be expected to make learning easier rather than harder. In the revised manuscript we have added a dedicated paragraph in the discussion section that explicitly addresses this point, clarifies the scope of the synthetic results, and notes that the identified range represents a lower bound under minimal-cue conditions.

Revision: partial

Referee: §4 (Corpus analysis): matching only a single scalar variability statistic from CHILDES data to the synthetic range is insufficient to establish viability; the paper does not report training or evaluating the same networks on actual child-directed corpora to test whether the robustness and generalization behavior observed in the synthetic setting is preserved.

Authors: We acknowledge that directly training and evaluating the networks on actual CHILDES corpora would constitute a stronger and more direct test of whether the generalization behavior transfers from the synthetic setting. Our corpus analysis was intentionally limited to measuring the subject-verb variability statistic in child-directed speech so that it could be compared against the range established in the controlled simulations; this matching was the central link between the two parts of the study. We recognize that the absence of end-to-end training on naturalistic data is a genuine limitation. In the revised manuscript we have expanded the discussion section to state this limitation explicitly, to explain why full retraining on CHILDES lies outside the current scope, and to outline how such experiments could be pursued in future work.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity: independent simulations and corpus analysis

The paper's derivation consists of two independent empirical components: (1) training neural networks on synthetic datasets that systematically vary subject-verb predictability to identify a viable range for robust agreement learning, and (2) measuring the actual variability of subject-verb pairings in child-directed language (CHILDES) and checking whether it falls inside that range. Neither step presupposes the outcome of the other; the synthetic experiments discover the range rather than fitting it to the target conclusion, and the corpus analysis is a separate measurement. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations are present in the provided abstract or described chain. The argument is therefore self-contained and externally benchmarked against real linguistic data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the modeling assumption that neural network performance on synthetic data generalizes to human acquisition and that the chosen variability metric captures the relevant statistical cues; the abstract details no free parameters beyond the variability level and no invented entities beyond the hypothesis itself.

free parameters (1)

variability levels of subject-verb pairings
The paper varies predictability in synthetic datasets to identify a supportive range, implying parameter choices for simulation conditions.

axioms (1)

domain assumption: Neural network learning on controlled synthetic input can serve as a proxy for human statistical learning of syntax
This premise is invoked when extrapolating simulation results to viability for children receiving similar input.

invented entities (1)

collocational bootstrapping (no independent evidence)
purpose: Mechanism by which word co-occurrence regularities cue syntactic dependencies like subject-verb agreement
Newly hypothesized process introduced to explain the learning pathway supported by the simulations and corpus analysis.

pith-pipeline@v0.9.0 · 5677 in / 1425 out tokens · 46216 ms · 2026-05-21T06:46:46.854813+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we varied the extent to which a subject could be predicted from its verb... f(r) = K / r^α ... α≈1.4 ... best-fitting value of α was α=1.43"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "collocational bootstrapping... neural networks... CHILDES"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
