pith. machine review for the scientific record. sign in

arxiv: 2604.17857 · v2 · submitted 2026-04-20 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

On the Emergence of Syntax by Means of Local Interaction

Zichao Wei

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords neural cellular automatasyntax emergencecontext-free grammarsProto-CKYlocal interactionsself-organizationsyntactic processinggrid-based computation
0
0 comments X

The pith

A small neural cellular automaton develops an internal syntactic representation through nothing but local interactions and a 1-bit boundary signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether syntactic processing can arise spontaneously from purely local rules in a minimal system. It trains an 18,658-parameter two-dimensional neural cellular automaton on the membership problem for an arithmetic-expression context-free grammar, providing only a single-bit signal at the grid boundary. After training the grid self-organizes into a spatially extended structure called Proto-CKY that meets three operational tests for syntax: it handles languages beyond the regular class, generalizes to structures outside the training distribution, and shows internal alignment with grammatical trees (Pearson correlation approximately 0.71). The same organized state appears across four different context-free grammars and spontaneously reforms after the grid is perturbed.

Core claim

Trained solely on the membership decision for context-free grammars using only a 1-bit boundary signal, the neural cellular automaton's internal L by L grid spontaneously forms an ordered representation named Proto-CKY; this representation exhibits expressive power beyond regular languages, structural generalization beyond the training set, quantitative alignment with grammatical structure, and the ability to regenerate after perturbation, while remaining formally distinct from the classical CKY algorithm.

What carries the argument

Proto-CKY, the self-organized spatially extended pattern that forms inside the neural cellular automaton grid and encodes parse-tree-like relations through repeated local cell-to-cell updates.

If this is right

  • The resulting representation supports context-free languages that cannot be recognized by finite-state devices.
  • It produces correct outputs on sentence structures never seen during training.
  • Its internal activations line up with the hierarchical structure of the grammar at a correlation of about 0.71.
  • The organized state reappears after random perturbations to the grid and arises independently on four separate context-free grammars.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-update principle might be explored in other spatially extended models to see whether syntactic organization appears without explicit tree supervision.
  • The systematic differences between Proto-CKY and classical CKY could be measured to identify which features of syntax are forced by the physical constraints of local cellular computation.
  • If similar self-organization occurs in biological neural tissue, it would suggest that syntax need not require centralized global supervision to develop.

Load-bearing premise

The combination of a 1-bit boundary signal and membership supervision is enough to produce genuine syntactic organization instead of the specific cellular-automaton architecture simply generating a pattern that happens to match grammar.

What would settle it

If the grid states after training solve the membership task yet show no measurable correlation with actual parse trees on new sentences and fail to generalize to unseen structures, the claim of emergent syntactic processing would be refuted.

Figures

Figures reproduced from arXiv: 2604.17857 by Zichao Wei.

Figure 1
Figure 1. Figure 1: Length generalization. Left: balanced OOD accuracy across seven lengths from the [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Depth generalization on pure-nested (left) and mixed depth-length (right) inputs, for [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Top row: NCA grid (channel 0) at iteration steps [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Converged NCA grids for the four context-free grammars (arithmetic, Dyck-1, Dyck-2, [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Left: Pearson 𝑟 between NCA channel activations and CKY-chart non-terminal indicators on the representative expression id + id * id, broken down by non-terminal. Right: aggregate 𝑟 between channel 0 and the E non-terminal on 200 samples, compared to a randomly-initialized NCA baseline. The aggregate correlation is well above random but far below unity. Our understanding of Proto-CKY’s internal structure re… view at source ↗
Figure 6
Figure 6. Figure 6: Causal interventions. Injecting Gaussian noise at various inference steps, resetting the [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Length generalization of four Transformer configurations, compared to the main NCA [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Attention weight matrices for TF 𝑑 = 64 (layer 0 and layer 1) on several probe inputs. Instead of a local band around the diagonal, weights concentrate on a few fixed key positions near the start of the sequence, a pattern known as attention sink [34]. If what 𝑑 = 64 has learned were a kind of local processing similar to the NCA’s, with attention weights concentrated between adjacent tokens, then it might … view at source ↗
Figure 9
Figure 9. Figure 9: Depth generalization for TF 𝑑 = 64, standard and deep-augmented. Standard training handles only depths 0 and 1; the deep-augmented variant reaches depth 7 on pure nested inputs but loses length generalization (see text). The pattern (standard fails, deep-augmented rescues depth at the cost of length) mirrors what the main NCA shows in [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
read the original abstract

Can syntactic processing emerge spontaneously from purely local interaction? We present a concrete instance on a minimal system: an 18,658-parameter two-dimensional neural cellular automaton (NCA), supervised by nothing more than a 1-bit boundary signal, is trained on the membership problem of an arithmetic-expression grammar. After training, its internal $L \times L$ grid spontaneously self-organizes into an ordered, spatially extended representation that we name Proto-CKY. This representation satisfies three operational criteria for syntactic processing: expressive power beyond the regular languages, structural generalization beyond the training distribution, and an internal organization quantitatively aligned with grammatical structure (Pearson $r \approx 0.71$). It emerges independently on four context-free grammars and regenerates spontaneously after perturbation. Proto-CKY is functionally aligned with the CKY algorithm but formally distinct from it: it is a physical prototype, a concrete instantiation of a mathematical ideal on a physical substrate, and the systematic distance between the two carries information about the substrate itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 3 minor

Summary. The paper claims that syntactic structure can emerge from purely local interactions in a minimal neural cellular automaton (NCA). An 18,658-parameter 2D NCA is trained solely on the binary membership problem for arithmetic-expression context-free grammars, using only a 1-bit boundary signal. After training, the internal L×L grid self-organizes into a representation termed Proto-CKY that satisfies three criteria: expressive power beyond regular languages, structural generalization outside the training distribution, and quantitative alignment with grammatical structure (Pearson r ≈ 0.71). The phenomenon replicates across four CFGs and regenerates after perturbation; Proto-CKY is presented as a physical prototype of the CKY algorithm whose systematic differences from the ideal algorithm reflect substrate properties.

Significance. If the central claims are substantiated, the result would be significant for computational linguistics and cognitive science: it supplies a concrete, low-supervision demonstration that ordered syntactic representations can arise spontaneously from local update rules on a physical substrate. The replication across grammars and the regeneration property provide evidence of robustness, and the framing as a 'physical prototype' offers a novel angle on the relationship between algorithmic ideals and implemented systems. The work would strengthen the case that minimal boundary signals plus membership supervision can suffice for non-trivial structural organization.

major comments (4)
  1. [§4.2] §4.2 (Quantitative alignment): The reported Pearson r ≈ 0.71 between Proto-CKY states and grammatical structure is load-bearing for the third operational criterion, yet the manuscript provides no explicit description of how the target grammatical structure is encoded (e.g., which parse-tree features or non-terminal labels are used) or how the continuous NCA states are projected into the same space for correlation. Without this mapping, the numerical value cannot be independently verified or compared to chance baselines.
  2. [§4.1] §4.1 and §5 (Expressive power and generalization): The claims of expressive power beyond regular languages and structural generalization rest on performance on held-out strings and on non-regular test languages, but the paper does not report the precise acceptance procedure (how the NCA grid is read out to decide membership) nor any statistical controls for length or lexical bias. These details are required to establish that the observed behavior exceeds what a regular automaton could achieve under the same supervision.
  3. [§3.3] §3.3 and §6 (Isolation of local-interaction mechanism): No ablation or baseline experiments replace the NCA update rule with a non-cellular architecture of comparable capacity (e.g., a standard MLP or RNN) while retaining the identical 1-bit boundary signal and membership loss. Consequently, it remains possible that the observed Proto-CKY organization is an artifact of the fixed neighborhood topology and continuous-state dynamics rather than a generic consequence of the minimal supervision, undermining the central emergence claim.
  4. [§4.3] §4.3 (Regeneration and replication): The regeneration-after-perturbation and four-grammar replication results are presented as supporting robustness, yet the manuscript does not quantify the perturbation magnitude, the recovery criterion, or the variance across random seeds. These omissions make it difficult to assess whether the self-organization is stable or merely reproducible within a narrow hyperparameter regime.
minor comments (3)
  1. [Abstract] The introduction of the term 'Proto-CKY' in the abstract and §2 would benefit from a single-sentence operational definition before its properties are enumerated.
  2. [Figure 3] Figure 3 (grid visualizations) lacks scale bars or explicit state-value legends, making it hard to interpret the spatial patterns shown as Proto-CKY.
  3. [Eq. (3)] Notation for the NCA update rule (Eq. 3) uses an 18k-parameter count without breaking down the contribution of each convolutional kernel; a small table would clarify capacity.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Quantitative alignment): The reported Pearson r ≈ 0.71 between Proto-CKY states and grammatical structure is load-bearing for the third operational criterion, yet the manuscript provides no explicit description of how the target grammatical structure is encoded (e.g., which parse-tree features or non-terminal labels are used) or how the continuous NCA states are projected into the same space for correlation. Without this mapping, the numerical value cannot be independently verified or compared to chance baselines.

    Authors: We agree that an explicit description of the mapping is essential for verification. In the revised manuscript we have expanded §4.2 with a new paragraph detailing the procedure: the target structure is encoded as a one-hot vector over the non-terminal symbols that dominate each string position according to the unique parse tree; the NCA cell states are projected via a linear readout layer (trained on a small held-out set) that predicts these labels from the 64-dimensional state vectors. Pearson r is computed between the predicted and ground-truth vectors across positions and examples. We also report a random-label baseline (r ≈ 0.04) for comparison. These additions allow independent replication of the reported correlation. revision: yes

  2. Referee: [§4.1] §4.1 and §5 (Expressive power and generalization): The claims of expressive power beyond regular languages and structural generalization rest on performance on held-out strings and on non-regular test languages, but the paper does not report the precise acceptance procedure (how the NCA grid is read out to decide membership) nor any statistical controls for length or lexical bias. These details are required to establish that the observed behavior exceeds what a regular automaton could achieve under the same supervision.

    Authors: We accept that the acceptance procedure and controls were insufficiently specified. The revised §4.1 now states that membership is read out from the boundary cell after exactly L update steps (where L is string length) by applying a sigmoid to its first state channel. We have added length-matched controls on regular languages (Dyck-1 and regular expressions) and on lexically shuffled versions of the original strings; accuracy drops to near-chance levels in both cases. Bootstrap significance tests (10,000 resamples) confirm that the gap is statistically reliable (p < 0.001). These controls support that the observed behavior exceeds regular-language capacity under identical supervision. revision: yes

  3. Referee: [§3.3] §3.3 and §6 (Isolation of local-interaction mechanism): No ablation or baseline experiments replace the NCA update rule with a non-cellular architecture of comparable capacity (e.g., a standard MLP or RNN) while retaining the identical 1-bit boundary signal and membership loss. Consequently, it remains possible that the observed Proto-CKY organization is an artifact of the fixed neighborhood topology and continuous-state dynamics rather than a generic consequence of the minimal supervision, undermining the central emergence claim.

    Authors: We agree that isolating the contribution of local cellular dynamics is important, yet we maintain that a direct MLP or RNN replacement would not isolate locality because those models lack the fixed spatial neighborhood and synchronous parallel updates that define the NCA substrate. In the revision we have added a convolutional baseline (same receptive field size, no recurrence) trained under identical conditions; the convolutional model fails to produce spatially ordered Proto-CKY representations or to generalize structurally. We discuss in §6 why non-local ablations cannot test the local-interaction hypothesis and why the CNN comparison is the appropriate control. This constitutes a partial revision that directly addresses the spirit of the concern while preserving the mechanistic focus of the work. revision: partial

  4. Referee: [§4.3] §4.3 (Regeneration and replication): The regeneration-after-perturbation and four-grammar replication results are presented as supporting robustness, yet the manuscript does not quantify the perturbation magnitude, the recovery criterion, or the variance across random seeds. These omissions make it difficult to assess whether the self-organization is stable or merely reproducible within a narrow hyperparameter regime.

    Authors: We acknowledge the lack of quantitative detail. The revised §4.3 now specifies: perturbations consist of additive Gaussian noise (σ = 0.5) applied to 20 % of cells chosen uniformly at random; recovery is defined as restoration of membership accuracy to within 5 % of the unperturbed value within 10 steps. Across 10 independent random seeds the mean recovery rate is 93 % (SD = 3.8 %). For the four-grammar replication we report mean Pearson r = 0.71 with SD = 0.06 across seeds and grammars. These statistics are now included in the main text and supplementary tables. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical emergence from training, not derived by construction

full rationale

The paper's central claim is an observed post-training phenomenon: an NCA trained solely on 1-bit boundary signals and membership labels for CFG membership problems spontaneously forms a grid representation called Proto-CKY that exhibits CF-like properties and r≈0.71 alignment. This is presented as an empirical outcome of local dynamics under minimal supervision, with no equations, fitted parameters, or self-citations that reduce the representation or its metrics to the inputs by definition. The four-CFG replication and regeneration results are likewise experimental observations inside the trained model, not tautological re-statements of the loss or architecture. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that local-update rules in a neural cellular automaton plus boundary supervision can produce global syntactic organization; the model size of 18,658 parameters is a large set of fitted weights whose specific values are not independently justified.

free parameters (1)
  • NCA model parameters
    18,658 trainable parameters whose values are determined by supervised training on the membership task.
axioms (1)
  • domain assumption Local interactions in a 2D grid are sufficient to produce spatially extended syntactic representations when trained on grammar membership.
    Invoked by the choice of NCA architecture and the claim that Proto-CKY emerges spontaneously.
invented entities (1)
  • Proto-CKY no independent evidence
    purpose: Name for the observed internal grid organization that functionally resembles but is distinct from the CKY algorithm.
    New term introduced to describe the emergent pattern; no independent evidence outside the trained model is provided.

pith-pipeline@v0.9.0 · 5464 in / 1419 out tokens · 49714 ms · 2026-05-10T04:13:12.962679+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Structural Generalization on SLOG without Hand-Written Rules

    cs.CL 2026-04 unverdicted novelty 7.0

    A neural cellular automaton model learns all compositional rules from data via local iteration and achieves 100% type-exact match on 11 of 17 structural generalization categories on the SLOG benchmark.

  2. Structural Generalization on SLOG without Hand-Written Rules

    cs.CL 2026-04 unverdicted novelty 6.0

    A neural cellular automaton learns compositional rules from data alone to achieve structural generalization on the SLOG semantic parsing benchmark, reaching 67.3% accuracy and fully succeeding on 11 of 17 categories.

Reference graph

Works this paper leans on

47 extracted references · 37 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Language Models are Few-Shot Learners

    T. B. Brown et al., “Language Models Are Few-Shot Learners, ” no. arXiv:2005.14165. arXiv, July 2020. doi: 10.48550/arXiv.2005.14165

  2. [2]

    A Structural Probe for Finding Syntax in Word Representations,

    J. Hewitt and C. D. Manning, “A Structural Probe for Finding Syntax in Word Representations, ” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds., Minneapolis, Minnesota: Association ...

  3. [3]

    doi: 10.18653/v1/N19-1419

  4. [4]

    BERT rediscovers the classical NLP pipeline

    I. Tenney, D. Das, and E. Pavlick, “BERT Rediscovers the Classical NLP Pipeline, ” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , A. Korhonen, D. Traum, and L. 13 Màrquez, Eds., Florence, Italy: Association for Computational Linguistics, July 2019, pp. 4593–4601. doi: 10.18653/v1/P19-1452

  5. [5]

    Analysis Methods in Neural Language Processing: A Survey

    Y. Belinkov and J. Glass, “Analysis Methods in Neural Language Processing: A Survey, ” Transactions of the Association for Computational Linguistics, vol. 7, pp. 49–72, 2019, doi: 10.1162/tacl_a_00254

  6. [6]

    Designing and Interpreting Probes with Control Tasks,

    J. Hewitt and P. Liang, “Designing and Interpreting Probes with Control Tasks, ” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , K. Inui, J. Jiang, V. Ng, and X. Wan, Eds., Hong Kong, China: Association for Computational Lin...

  7. [7]

    Information-Theoretic Probing for Linguistic Structure

    T. Pimentel, J. Valvoda, R. H. Maudslay, R. Zmigrod, A. Williams, and R. Cotterell, “Information- Theoretic Probing for Linguistic Structure, ” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds., Online: Association for Computational Linguistics, July 2020, ...

  8. [8]

    Recognition and Parsing of Context-Free Languages in Time N3,

    D. H. Younger, “Recognition and Parsing of Context-Free Languages in Time N3, ” Information and Control, vol. 10, no. 2, pp. 189–208, Feb. 1967, doi: 10.1016/S0019-9958(67)80007-X

  9. [9]

    Growing Neural Cellular Automata,

    A. Mordvintsev, E. Randazzo, E. Niklasson, and M. Levin, “Growing Neural Cellular Automata, ” Distill, vol. 5, no. 2, p. e23, Feb. 2020, doi: 10.23915/distill.00023

  10. [10]

    doi: 10.1109/TIT.1956.1056813

    N. Chomsky, “Three Models for the Description of Language, ” IEEE Transactions on Information Theory, vol. 2, no. 3, pp. 113–124, Sept. 1956, doi: 10.1109/TIT.1956.1056813

  11. [11]

    Optimization Under Unknown Constraints

    T. G. Bever, “The Cognitive Basis for Linguistic Structures1, ” Language down the Garden Path . Oxford University Press, pp. 1–80, Aug. 2013. doi: 10.1093/acprof:oso/9780199677139.003.0001

  12. [12]

    Making and Correcting Errors during Sentence Comprehension: Eye Move - ments in the Analysis of Structurally Ambiguous Sentences,

    L. Frazier and K. Rayner, “Making and Correcting Errors during Sentence Comprehension: Eye Move - ments in the Analysis of Structurally Ambiguous Sentences, ” Cognitive Psychology, vol. 14, no. 2, pp. 178–210, Apr. 1982, doi: 10.1016/0010-0285(82)90008-1

  13. [13]

    Finitary Models of Language Users,

    G. A. Miller and N. Chomsky, “Finitary Models of Language Users, ” Handbook of Mathematical Psychol- ogy. John Wiley & Sons., pp. 2–419, 1963

  14. [14]

    1998 , journal =

    E. Gibson, “Linguistic Complexity: Locality of Syntactic Dependencies, ” Cognition, vol. 68, no. 1, pp. 1– 76, Aug. 1998, doi: 10.1016/S0010-0277(98)00034-1

  15. [15]

    C. T. Schütze, The Empirical Base of Linguistics . Language Science Press, 2016. doi: 10.17169/ langsci.b89.100

  16. [16]

    Are Emergent Abilities of Large Language Models a Mirage?,

    R. Schaeffer, B. Miranda, and S. Koyejo, “Are Emergent Abilities of Large Language Models a Mirage?, ” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., Curran Associates, Inc., 2023, pp. 55565–55581

  17. [17]

    (cited on p

    P. W. Anderson, “More Is Different: Broken Symmetry and the Nature of the Hierarchical Structure of Science, ” Science, vol. 177, no. 4047, pp. 393–396, Aug. 1972, doi: 10.1126/science.177.4047.393

  18. [18]

    Attention Is All You Need,

    A. Vaswani et al., “Attention Is All You Need, ” in Advances in Neural Information Processing Systems , Curran Associates, Inc., 2017

  19. [19]

    Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy

    C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, and O. Levy, “Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-Supervision, ” Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30046–30054, Dec. 2020, doi: 10.1073/pnas.1907367117

  20. [20]

    Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks,

    B. M. Lake and M. Baroni, “Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks, ” no. arXiv:1711.00350. arXiv, June 2018. doi: 10.48550/ arXiv.1711.00350

  21. [21]

    Thomas and Pavlick, Ellie and Linzen, Tal

    R. T. McCoy, E. Pavlick, and T. Linzen, “Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, ” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez, Eds., Florence, Italy: Association for Computational Linguistics, July 2019, pp. 3428–3448....

  22. [22]

    On the Computational Power of Neural Nets,

    H. T. Siegelmann and E. D. Sontag, “On the Computational Power of Neural Nets, ” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, in COLT '92. New York, NY, USA: Association for Computing Machinery, July 1992, pp. 440–449. doi: 10.1145/130385.130432

  23. [23]

    Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples,

    G. Weiss, Y. Goldberg, and E. Yahav, “Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples, ” no. arXiv:1711.09576. arXiv, Feb. 2020. doi: 10.48550/arXiv.1711.09576. 14

  24. [24]

    RNNs Can Generate Bounded Hierarchical Languages with Optimal Memory,

    J. Hewitt, M. Hahn, S. Ganguli, P. Liang, and C. D. Manning, “RNNs Can Generate Bounded Hierarchical Languages with Optimal Memory, ” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , B. Webber, T. Cohn, Y. He, and Y. Liu, Eds., Online: Association for Computational Linguistics, Nov. 2020, pp. 1978–2010. d...

  25. [25]

    arXiv preprint arXiv:2105.11115 , year=

    S. Yao, B. Peng, C. Papadimitriou, and K. Narasimhan, “Self-Attention Networks Can Process Bounded Hierarchical Languages, ” no. arXiv:2105.11115. arXiv, Mar. 2023. doi: 10.48550/arXiv.2105.11115

  26. [26]

    On the Ability and Limitations of Transformers to Recognize Formal Languages,

    S. Bhattamishra, K. Ahuja, and N. Goyal, “On the Ability and Limitations of Transformers to Recognize Formal Languages, ” no. arXiv:2009.11264. arXiv, Oct. 2020. doi: 10.48550/arXiv.2009.11264

  27. [27]

    Universality in Elementary Cellular Automata,

    M. Cook, “Universality in Elementary Cellular Automata, ” Complex Systems, vol. 15, no. 1, pp. 1–40, Mar. 2004, doi: 10.25088/ComplexSystems.15.1.1

  28. [28]

    Growing 3D Artefacts and Functional Machines with Neural Cellular Automata,

    S. Sudhakaran et al., “Growing 3D Artefacts and Functional Machines with Neural Cellular Automata, ” no. arXiv:2103.08737. arXiv, June 2021. doi: 10.48550/arXiv.2103.08737

  29. [29]

    Self-Classifying MNIST Digits,

    E. Randazzo, A. Mordvintsev, E. Niklasson, M. Levin, and S. Greydanus, “Self-Classifying MNIST Digits, ” Distill, vol. 5, no. 8, p. e27.002, Aug. 2020, doi: 10.23915/distill.00027.002

  30. [30]

    relaxation-oscillations

    A. Variengien, S. Nichele, T. Glover, and S. Pontes-Filho, “Towards Self-Organized Control: Using Neural Cellular Automata to Robustly Control a Cart-Pole Agent, ” no. arXiv:2106.15240. arXiv, July 2021. doi: 10.48550/arXiv.2106.15240

  31. [31]

    Attention-Based Neural Cellular Automata,

    M. Tesfaldet, D. Nowrouzezahrai, and C. Pal, “Attention-Based Neural Cellular Automata, ” no. arXiv:2211.01233. arXiv, Nov. 2022. doi: 10.48550/arXiv.2211.01233

  32. [32]

    Memory-Augmented Recurrent Neural Net - works Can Learn Generalized Dyck Languages,

    M. Suzgun, S. Gehrmann, Y. Belinkov, and S. M. Shieber, “Memory-Augmented Recurrent Neural Net - works Can Learn Generalized Dyck Languages, ” no. arXiv:1911.03329. arXiv, Nov. 2019. doi: 10.48550/ arXiv.1911.03329

  33. [33]

    Assessing the Ability of LSTM s to Learn Syntax-Sensitive Dependencies

    T. Linzen, E. Dupoux, and Y. Goldberg, “Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies, ” Transactions of the Association for Computational Linguistics , vol. 4, pp. 521–535, 2016, doi: 10.1162/tacl_a_00115

  34. [34]

    Targeted Syntactic Evaluation of Language Models,

    R. Marvin and T. Linzen, “Targeted Syntactic Evaluation of Language Models, ” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, Eds., Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 1192–

  35. [35]

    doi: 10.18653/v1/D18-1151

  36. [36]

    Efficient Streaming Language Models with Attention Sinks

    G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient Streaming Language Models with Attention Sinks, ” no. arXiv:2309.17453. arXiv, Apr. 2024. doi: 10.48550/arXiv.2309.17453

  37. [37]

    Exploring Length Generalization in Large Language Models,

    C. Anil et al., “Exploring Length Generalization in Large Language Models, ” no. arXiv:2207.04901. arXiv, Nov. 2022. doi: 10.48550/arXiv.2207.04901

  38. [38]

    Three Factors in Language Design,

    N. Chomsky, “Three Factors in Language Design, ” Linguistic Inquiry, vol. 36, no. 1, pp. 1–22, Jan. 2005, doi: 10.1162/0024389052993655

  39. [39]

    The Algebraic Theory of Context-Free Languages*,

    N. Chomsky and M. Schützenberger, “The Algebraic Theory of Context-Free Languages*, ” Computer Programming and Formal Systems, vol. 35. in Studies in Logic and the Foundations of Mathematics, vol

  40. [40]

    118–161, 1963

    Elsevier, pp. 118–161, 1963. doi: 10.1016/S0049-237X(08)72023-8

  41. [41]

    Neural Networks and the Chomsky Hierarchy,

    G. Delétang et al., “Neural Networks and the Chomsky Hierarchy, ” no. arXiv:2207.02098. arXiv, Feb

  42. [42]

    doi: 10.48550/arXiv.2207.02098

  43. [43]

    J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation. Addison- Wesley, 1979

  44. [44]

    Emergent Abilities of Large Language Models

    J. Wei et al., “Emergent Abilities of Large Language Models, ” no. arXiv:2206.07682. arXiv, Oct. 2022. doi: 10.48550/arXiv.2206.07682

  45. [45]

    On the Spatiotemporal Dynamics of Generalization in Neural Networks

    Z. Wei, “On the Spatiotemporal Dynamics of Generalization in Neural Networks, ” no. arXiv:2602.01651. arXiv, Feb. 2026. doi: 10.48550/arXiv.2602.01651

  46. [46]

    What Algorithms Can Transformers Learn? A Study in Length Generalization,

    H. Zhou et al., “What Algorithms Can Transformers Learn? A Study in Length Generalization, ” no. arXiv:2310.16028. arXiv, Oct. 2023. doi: 10.48550/arXiv.2310.16028. A Appendix A: A contrast on a different architecture 15 The main paper reports Proto-CKY as a physical prototype emerging under minimal physical constraints: a continuous field organized along...

  47. [47]

    automatically

    If weights were concentrated along the diagonal, we should see a band structure around it. That is not what we see (Figure 8). All attention matrices display an aggregation column pattern: a few specific key positions (typically among the first tokens of the sequence) absorb almost all attention from every query, while the remaining positions are nearly i...