Structure Before Collapse: Transient semantic geometry in next-token prediction

Christos Thrampoulidis; Isabel Papadimitriou; Yize Zhao

arXiv: 2606.26749 · v1 · pith:2R326XCGnew · submitted 2026-06-25 · cs.LG · cs.CL

Yize Zhao, Isabel Papadimitriou, Christos Thrampoulidis

Pith reviewed 2026-06-26 05:11 UTC · model grok-4.3

keywords: neural collapse · next-token prediction · semantic geometry · transient structure · Gram matrix · language model representations · unconstrained features model

The pith

Semantic geometry emerges early in next-token prediction before models reach the symmetric neural collapse state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Next-token prediction trains on one-hot labels that neural collapse theory says should push all representations to equal separation based only on their labels. The paper tests this tension in three synthetic settings where inputs share latent semantic factors yet map to unique one-hot targets. It shows that gradient descent first produces clusters reflecting those shared factors even without explicit semantic supervision. The clusters are temporary: continued training with sufficient capacity drives the representations to the symmetric equidistant state. Gram matrix analysis tracks the transition, and a modification to the unconstrained features model is introduced to describe the early phase.

Core claim

In three controlled synthetic settings with latent semantic factors mapped to distinct one-hot labels, gradient descent produces representations that cluster according to shared attributes early in training. These clusters are visible in Gram matrix analysis and persist until sufficient training drives the model to the symmetric configuration where all class representations are equidistant, consistent with neural collapse predictions. A modification to the unconstrained features model is proposed to account for this emergent geometry.
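
For reference, the symmetric end state can be written as a Gram matrix over the K centered, unit-norm class means. This is the standard simplex equiangular tight frame (ETF) form from the Neural Collapse literature, not a formula quoted from this paper; the symbols K, \mu_k, and G are notation assumed here.

```latex
G_{jk} = \langle \mu_j, \mu_k \rangle =
\begin{cases}
1 & j = k,\\[2pt]
-\dfrac{1}{K-1} & j \neq k,
\end{cases}
\qquad
G = \frac{K}{K-1}\left(I_K - \frac{1}{K}\mathbf{1}\mathbf{1}^{\top}\right),
\qquad
\lVert \mu_j - \mu_k \rVert^2 = \frac{2K}{K-1} \quad (j \neq k).
```

Every off-diagonal entry is the same, so the geometry carries no information about which classes share latent attributes. Any block structure in an empirical Gram matrix is therefore a departure from this state.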

What carries the argument

The transient semantic geometry in which representations cluster by latent input attributes before equalizing, tracked through Gram matrix analysis.
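
A minimal sketch of how such tracking might look, assuming representations are compared to two hypothesis matrices by correlating off-diagonal Gram entries. The function names `gram` and `alignment`, the toy attribute assignment, and the hand-built "early" and "late" representations are all illustrative assumptions; nothing here is trained, and the paper's exact similarity measure is not stated in the material above.

```python
import numpy as np

def gram(H):
    """Cosine-similarity Gram matrix of centered row-wise representations H (n x d)."""
    Hc = H - H.mean(axis=0, keepdims=True)
    Hn = Hc / np.linalg.norm(Hc, axis=1, keepdims=True)
    return Hn @ Hn.T

def alignment(G, hypothesis):
    """Pearson correlation between the off-diagonal entries of two n x n matrices."""
    mask = ~np.eye(G.shape[0], dtype=bool)
    return float(np.corrcoef(G[mask], hypothesis[mask])[0, 1])

# Toy setup (assumed): K = 4 labels, 3 samples each, two labels per shared attribute.
K, per_class = 4, 3
labels = np.repeat(np.arange(K), per_class)
attr = labels // 2

same_label = labels[:, None] == labels[None, :]
same_attr = attr[:, None] == attr[None, :]
sem_rsm = same_attr.astype(float)                      # semantic hypothesis
etf_rsm = np.where(same_label, 1.0, -1.0 / (K - 1))    # label-only hypothesis

# Hand-built stand-ins for checkpoints, not outputs of any trained model.
label_dirs = np.eye(K + 2)[labels]       # one direction per label
attr_dirs = np.eye(K + 2)[K + attr]      # one direction per attribute value
H_early = attr_dirs + 0.1 * label_dirs   # dominated by the shared attribute
H_late = label_dirs                      # depends on the label only

early = (alignment(gram(H_early), sem_rsm), alignment(gram(H_early), etf_rsm))
late = (alignment(gram(H_late), sem_rsm), alignment(gram(H_late), etf_rsm))
print("early (sem, etf):", early)
print("late  (sem, etf):", late)
```

In this toy case the attribute-dominated representations align more closely with the semantic hypothesis and the label-only representations align with the ETF hypothesis. Logging both scores at each checkpoint would give one curve per hypothesis.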

If this is right

Representations initially reflect shared latent attributes among inputs rather than depending solely on their one-hot labels.
The symmetric equidistant state predicted by neural collapse is reached only after an initial structured phase.
Gram matrix analysis can detect the duration of the semantic clustering phase.
A modified unconstrained features model can reproduce both the early clustering and the later collapse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If similar early clustering occurs during pretraining on natural text, it could explain how language models acquire semantic categories despite sparse label overlap.
Varying model capacity or stopping training early might preserve more of the semantic structure for downstream use.
The same transient clustering could appear in other one-hot supervised tasks that contain hidden similarities among inputs.

Load-bearing premise

The three synthetic controlled settings with latent semantic factors and distinct one-hot labels sufficiently capture the relevant training dynamics of next-token prediction in language models on natural data.

What would settle it

Training a next-token model on one of the described synthetic datasets and observing either no early clustering by latent factors in the representations or no later transition to equal separation.
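
A sketch of the kind of data such a test needs, assuming a hypothetical two-attribute language. The attribute names, the 2 × 4 split, and `make_dataset` are illustrative and are not the paper's generator; the sizes are chosen only so that n = 800 and K = 8 match the Figure 1 caption. Inputs are represented here by their latent attribute ids alone, with tokenization left out.

```python
import itertools
import numpy as np

COLORS = ["red", "blue"]                          # hypothetical attribute values
SHAPES = ["circle", "square", "star", "cross"]    # hypothetical attribute values

def make_dataset(samples_per_class=100, seed=0):
    """Return shuffled labels and the (color id, shape id) pair behind each sample."""
    rng = np.random.default_rng(seed)
    classes = np.array(list(itertools.product(range(len(COLORS)), range(len(SHAPES)))))
    labels = np.repeat(np.arange(len(classes)), samples_per_class)
    labels = labels[rng.permutation(len(labels))]
    return labels, classes[labels]

labels, attrs = make_dataset()
K = len(COLORS) * len(SHAPES)
Y = np.eye(K)[labels]                             # one-hot targets, one class per attribute pair

target_overlap = Y @ Y.T                          # 1 if two samples share a label, else 0
shared_attrs = (attrs[:, None, :] == attrs[None, :, :]).sum(axis=-1)

# Samples that share exactly one attribute still have orthogonal targets.
print(target_overlap[shared_attrs == 1].max())
```

The supervision signal sees only `Y`, so any clustering by `shared_attrs` in learned representations would have to come from the inputs and the training dynamics rather than from the targets.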

Figures

Figures reproduced from arXiv: 2606.26749 by Christos Thrampoulidis, Isabel Papadimitriou, Yize Zhao.

**Figure 1.** Transient semantic geometry emerges before collapse (Experiment 1). Sample-level n × n RSMs over the full training set, ordered by output label (n = 800, K = 8). Left column: hypothesis RSMs. Sem-RSM encodes latent semantic overlap between class pairs, with values 2, 1, and 0 equal to the number of shared attributes. ETF-RSM encodes the label-driven collapsed geometry predicted by Neural Collapse. Middle c…

**Figure 2.** Linear language from Experiment 2: an illustrative example 20-token vocab

**Figure 3.** Transient semantic geometry in the linear template Zipfian language (Experiment 2). Sample-level n × n RSMs over the full training set, ordered by semantic and output label (n = 10k, K = 100). Left: Sem-RSM, where similarity is determined by the number of shared latent categories. Because the label distribution is imbalanced, block sizes are unequal (see inset). Middle: empirical RSMs (Emp-RSM) at epochs …

**Figure 4.** An illustrative example of our Exp 3 hierarchical language with

**Figure 5.** Transient semantic geometry in the hierarchical grammar (Experiment 3). Sample-level n × n RSMs over the full training set, ordered by output label (n = 2000, K = 16). Left: Sem-RSM, where similarity is 1 if two contexts share the same semantic category and 0 otherwise. Middle: empirical RSMs (Emp-RSM) at epochs 200 and 10k. Early in training, Emp-RSM exhibits clear category-level block structure matching …

**Figure 6.** Neural Collapse diagnostics across three experiments. NC1 measures within

**Figure 7.** Recreation of

**Figure 8.** Cross-label semantic diagnostics for all three experiments. The left column

**Figure 9.** Full representational similarity results for Experiment 2. This figure is the Experiment 2 analogue of

**Figure 10.** Full representational similarity results for Experiment 3. This figure is the Experiment 3 analogue of

**Figure 11.** Layerwise RSA dynamics for Experiment 1. For four representative models, we

**Figure 12.** Layerwise RSA dynamics for Experiment 2.

**Figure 13.** Layerwise RSA dynamics for Experiment 3.

**Figure 14.** Numerical simulation of the Spherical BoW dynamics on the color-shape setup. Left column: hypothesis RSMs. Sem-RSM encodes the semantic similarity structure, while ETF-RSM encodes the label-driven collapsed geometry. Middle and Right columns: Simulations of the Spherical BoW objective under two settings: fixing W to an ETF readout, and training W jointly with H. In each setting, we show a representative e…

**Figure 15.** Spherical BoW on the full Color-Shape Categories language. This figure is the Spherical BoW analogue of

Original abstract

Neural Collapse predicts that balanced one-hot classification pushes model representations to be equally far from each other; a symmetric configuration that depends only on the output label and ignores any semantic similarity in the inputs. This creates a puzzle: next-token prediction language models are trained predominantly (as context length increases) with one-hot labels: the same context is very unlikely to appear twice in training with different labels. However, they clearly learn latent structural features. That is, despite the one-hot training regime, a language model's contextual embeddings represent the fact that the next word in "Mary broke the ___" is likely to be filled by tokens in the latent classes of a) medium-sized, b) rigid, c) inanimate nouns. How does gradient descent find such categorical semantic structure when co-occurrence statistics collapse to one-hot sparsity, eliminating any shared next-tokens among different contexts? To investigate this tension we identify three synthetic controlled settings where inputs have latent semantic factors but are mapped to distinct one-hot labels. We find that semantic geometry emerges early in training, and that representations cluster by shared attributes despite receiving no explicit supervision to do so. This structure is transient: with sufficient capacity and time, the model eventually reaches the predicted symmetric state where all representations are equally separated. We study this phase transition through Gram matrix analysis and propose a preliminary modification to the commonly used unconstrained features model to capture the emergent semantic geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper observes transient semantic clustering by latent factors in synthetic next-token tasks before neural collapse sets in, but the strict one-hot separation in the experiments leaves open whether this explains real language model behavior.

The main takeaway is that semantic geometry appears early in these controlled next-token settings and then fades as the model reaches the symmetric state predicted by neural collapse. The authors isolate the puzzle by creating three synthetic cases where inputs share latent attributes but map to completely distinct one-hot targets, then track the effect with Gram matrix analysis and a proposed change to the unconstrained features model.

This is new relative to the neural collapse work they cite. The controlled setups let them show clustering by shared attributes without any explicit supervision, and the transient nature is a clean observation.

The experiments are limited by design. Every label is a unique one-hot with no token overlap across contexts, which removes the partial co-occurrence patterns that exist in natural text. That choice makes the setup clean but also means the reported phase transition might depend on the absence of those overlaps rather than on next-token prediction itself. The abstract gives no quantitative metrics, error bars, or dataset sizes, so the strength of the effect is hard to judge from what's here.

The work is for people studying representation dynamics and neural collapse in language models. A reader already thinking about that tension would find the setup useful as a starting point.

It should go to peer review. The core idea is worth checking against fuller methods and against settings that include some token sharing.

Referee Report

2 major / 2 minor

Summary. The paper claims that despite the one-hot nature of next-token prediction (which should drive Neural Collapse to symmetric, label-dependent representations), semantic geometry emerges early in training: in three synthetic controlled settings with latent semantic factors mapped to distinct one-hot labels, representations cluster by shared attributes with no explicit supervision. This structure is transient; with sufficient capacity and time the model reaches the predicted symmetric state. The phase transition is analyzed via Gram matrices, and a preliminary modification to the unconstrained-features model is proposed to capture the emergent geometry.

Significance. If the central empirical observation holds, the work addresses a genuine tension between Neural Collapse predictions and the semantic structure learned by language models. The controlled synthetic experiments and Gram-matrix analysis isolate the effect of latent factors under one-hot supervision and identify a transient phase before collapse; this is a concrete, falsifiable contribution to the NC literature. Credit is due for the reproducible synthetic protocol and the attempt to extend the unconstrained-features model.

major comments (2)

[Abstract] Abstract and experimental sections: the central claim that the observed early clustering explains semantic structure in next-token prediction on natural data rests on the assumption that the three synthetic settings (distinct one-hot targets for inputs sharing latent factors) capture the relevant dynamics. This assumption is load-bearing because the Gram-matrix analysis and the proposed unconstrained-features modification are derived under strict one-hot separation; any residual next-token overlap present in natural text could alter the predicted phase transition.
[Results] Experimental results: the abstract and results report only qualitative observations of clustering and collapse. No quantitative metrics (e.g., within-cluster vs. between-cluster distances, silhouette scores, or Gram-matrix eigenvalue trajectories with error bars across random seeds) are supplied, making it impossible to assess the statistical reliability or timing of the reported transient geometry.
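
The kind of metric requested in the second comment could be computed along the following lines. This is a hedged sketch: `within_between_ratio` is a hypothetical helper, the clustered representations are random toy data rather than model outputs, and the ratio is only one of several reasonable choices.

```python
import numpy as np

def within_between_ratio(H, groups):
    """Mean pairwise distance inside groups divided by mean distance across groups."""
    H = np.asarray(H, dtype=float)
    groups = np.asarray(groups)
    dist = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    same = groups[:, None] == groups[None, :]
    off_diag = ~np.eye(len(H), dtype=bool)
    return dist[same & off_diag].mean() / dist[~same].mean()

# Toy clustered data (assumed): two latent groups, five random seeds.
groups = np.repeat([0, 1], 20)
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
ratios = []
for seed in range(5):
    rng = np.random.default_rng(seed)
    H = centers[groups] + 0.1 * rng.standard_normal((40, 2))
    ratios.append(within_between_ratio(H, groups))
mean_ratio, std_ratio = np.mean(ratios), np.std(ratios, ddof=1)

# Equidistant points give a ratio of 1 whatever the grouping.
equidistant_ratio = within_between_ratio(np.eye(6), np.repeat([0, 1, 2], 2))
print(mean_ratio, std_ratio, equidistant_ratio)
```

Grouping by latent factor rather than by label is the relevant choice here. A ratio well below 1 indicates clustering by the factor, and a ratio near 1 is what an equidistant configuration produces. Reporting the mean and standard deviation across seeds addresses the error-bar request.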

minor comments (2)

The description of the proposed modification to the unconstrained-features model would benefit from an explicit equation or pseudocode showing how the semantic-factor term is added to the loss.
Dataset generation details (exact latent-factor cardinalities, context lengths, and vocabulary sizes for the three synthetic settings) should be stated explicitly to allow reproduction.
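
For context on the first minor comment, the baseline that any modification would be stated against is the standard unconstrained features model, usually written as below with cross-entropy loss and weight decay. This is the textbook form from the Neural Collapse literature with assumed notation (W for the readout, H for the free features, \lambda_W and \lambda_H for the regularization strengths). The paper's semantic-factor term is not given in the material here and is deliberately not guessed.

```latex
\min_{W \in \mathbb{R}^{K \times d},\; H \in \mathbb{R}^{d \times n}}
\;\frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_{\mathrm{CE}}\!\left(W h_i,\, y_i\right)
+ \frac{\lambda_W}{2}\lVert W \rVert_F^2
+ \frac{\lambda_H}{2}\lVert H \rVert_F^2
```

Because each feature h_i is a free variable, this objective depends on the inputs only through their labels y_i. That is why it predicts the symmetric geometry and cannot by itself produce attribute-based clusters.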

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's potential contribution and for the constructive major comments. We respond to each point below.

Referee: [Abstract] Abstract and experimental sections: the central claim that the observed early clustering explains semantic structure in next-token prediction on natural data rests on the assumption that the three synthetic settings (distinct one-hot targets for inputs sharing latent factors) capture the relevant dynamics. This assumption is load-bearing because the Gram-matrix analysis and the proposed unconstrained-features modification are derived under strict one-hot separation; any residual next-token overlap present in natural text could alter the predicted phase transition.

Authors: The synthetic protocols are constructed precisely to enforce the one-hot separation that defines the Neural Collapse regime while still embedding latent semantic factors, thereby isolating the mechanism that could resolve the stated tension. We do not claim these settings are a complete proxy for natural text; rather, they demonstrate that transient semantic geometry is possible under the strict one-hot condition that dominates next-token prediction. Any effect of residual overlaps in natural data would be a natural extension, but lies outside the scope of the controlled study. We will revise the abstract and discussion sections to state this scope more explicitly. Revision: partial.

Referee: [Results] Experimental results: the abstract and results report only qualitative observations of clustering and collapse. No quantitative metrics (e.g., within-cluster vs. between-cluster distances, silhouette scores, or Gram-matrix eigenvalue trajectories with error bars across random seeds) are supplied, making it impossible to assess the statistical reliability or timing of the reported transient geometry.

Authors: The referee correctly notes the absence of quantitative metrics. In the revised manuscript we will add silhouette scores, within- versus between-cluster distance ratios, and Gram-matrix eigenvalue trajectories plotted with error bars across multiple random seeds to provide statistical support for the timing and reliability of the observed phase transition. Revision: yes.
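
One of the diagnostics named in this response, the Gram-matrix eigenvalue spectrum, can be sketched as follows. The helper `gram_spectrum` and the two hand-built sets of class means are assumptions for illustration; they are not the authors' code or data.

```python
import numpy as np

def gram_spectrum(M):
    """Eigenvalues, in descending order, of the Gram matrix of centered class means M (K x d)."""
    Mc = M - M.mean(axis=0, keepdims=True)
    return np.linalg.eigvalsh(Mc @ Mc.T)[::-1]

# Label-only geometry: one orthogonal direction per class (a simplex ETF after centering).
etf_means = np.eye(4)
# Attribute-only geometry: classes 0,1 share one direction and classes 2,3 share another.
sem_means = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

etf_spec = gram_spectrum(etf_means)
sem_spec = gram_spectrum(sem_means)
print(etf_spec)
print(sem_spec)
```

A collapsed configuration over K classes has K − 1 equal nonzero eigenvalues, while attribute-driven clustering concentrates the spectrum in a few dominant directions. Recording the spectrum at each checkpoint and seed would yield the trajectories with error bars described above.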

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical synthetic experiments and analysis.

The paper investigates the tension between Neural Collapse and semantic structure in next-token prediction via three synthetic controlled settings where inputs have latent factors but map to distinct one-hot labels. It reports empirical observations of early clustering by attributes, followed by eventual symmetric collapse, analyzed via Gram matrices and a proposed modification to the unconstrained features model. No load-bearing step reduces by construction to its inputs, self-citation chains, or fitted parameters renamed as predictions; the central results are direct experimental findings under the stated one-hot regime rather than derived equivalences.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; claims rely on empirical observations in synthetic settings without detailed modeling assumptions stated.

Reference graph

Works this paper leans on

300 extracted references · 9 canonical work pages

[1]

Information and Inference: A Journal of the IMA , volume=

Robust implicit regularization via weight normalization , author=. Information and Inference: A Journal of the IMA , volume=. 2024 , publisher=

2024
[2]

The Thirty Sixth Annual Conference on Learning Theory , pages=

The implicit bias of batch normalization in linear models and two-layer linear convolutional neural networks , author=. The Thirty Sixth Annual Conference on Learning Theory , pages=. 2023 , organization=

2023
[3]

arXiv preprint arXiv:2410.04887 , year=

Wide neural networks trained with weight decay provably exhibit neural collapse , author=. arXiv preprint arXiv:2410.04887 , year=

arXiv
[4]

arXiv preprint arXiv:2505.15239 , year=

Neural collapse is globally optimal in deep regularized resnets and transformers , author=. arXiv preprint arXiv:2505.15239 , year=

arXiv
[5]

Journal of Machine Learning Research , volume=

The implicit bias of gradient descent on separable data , author=. Journal of Machine Learning Research , volume=
[6]

Advances in Neural Information Processing Systems , volume=

Norm matters: efficient and accurate normalization schemes in deep networks , author=. Advances in Neural Information Processing Systems , volume=
[7]

International Conference on Machine Learning , pages=

Are neurons actually collapsed? on the fine-grained structure in neural representations , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[8]

Advances in Neural Information Processing Systems , volume=

Linguistic collapse: Neural collapse in (large) language models , author=. Advances in Neural Information Processing Systems , volume=
[9]

arXiv preprint arXiv:2505.00661 , year=

On the generalization of language models from in-context learning and finetuning: a controlled study , author=. arXiv preprint arXiv:2505.00661 , year=

arXiv
[10]

Journal of Artificial Intelligence Research , volume=

Compositionality decomposed: How do neural networks generalise? , author=. Journal of Artificial Intelligence Research , volume=
[11]

Transactions of the Association for Computational Linguistics , volume=

Does syntax need to grow on trees? sources of hierarchical inductive bias in sequence-to-sequence networks , author=. Transactions of the Association for Computational Linguistics , volume=. 2020 , publisher=

2020
[12]

arXiv preprint arXiv:2109.12036 , year=

Transformers generalize linearly , author=. arXiv preprint arXiv:2109.12036 , year=

arXiv
[13]

Transactions of the Association for Computational Linguistics , volume=

Learning syntax without planting trees: Understanding hierarchical generalization in transformers , author=. Transactions of the Association for Computational Linguistics , volume=. 2025 , publisher=

2025
[14]

arXiv preprint arXiv:2503.21676 , year=

How do language models learn facts? dynamics, curricula and hallucinations , author=. arXiv preprint arXiv:2503.21676 , year=

arXiv
[15]

arXiv preprint arXiv:2505.18091 , year=

Data mixing can induce phase transitions in knowledge acquisition , author=. arXiv preprint arXiv:2505.18091 , year=

Pith/arXiv arXiv
[16]

arXiv preprint arXiv:2601.22510 , year=

Shattered Compositionality: Counterintuitive Learning Dynamics of Transformers for Arithmetic , author=. arXiv preprint arXiv:2601.22510 , year=

arXiv
[17]

Advances in Neural Information Processing Systems , volume=

Towards a theory of how the structure of language is acquired by deep neural networks , author=. Advances in Neural Information Processing Systems , volume=
[18]

arXiv preprint arXiv:2410.11474 , year=

How transformers get rich: Approximation and dynamics analysis , author=. arXiv preprint arXiv:2410.11474 , year=

arXiv
[19]

arXiv preprint arXiv:2412.04619 , year=

Sometimes i am a tree: Data drives unstable hierarchical generalization , author=. arXiv preprint arXiv:2412.04619 , year=

arXiv
[20]

Psychonomic bulletin & review , volume=

Zipf’s word frequency law in natural language: A critical review and future directions , author=. Psychonomic bulletin & review , volume=. 2014 , publisher=

2014
[21]

, author=

N-gram Counts and Language Models from the Common Crawl. , author=
[22]

Computer Speech & Language , volume=

A bit of progress in language modeling , author=. Computer Speech & Language , volume=. 2001 , publisher=

2001
[23]

The Thirteenth International Conference on Learning Representations,

Ximing Lu and Melanie Sclar and Skyler Hallinan and Niloofar Mireshghallah and Jiacheng Liu and Seungju Han and Allyson Ettinger and Liwei Jiang and Khyathi Raghavi Chandu and Nouha Dziri and Yejin Choi , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025
[24]

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , month =

Papadimitriou, Isabel and Jurafsky, Dan , editor =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , month =. 2020 , address =. doi:10.18653/v1/2020.emnlp-main.554 , pages =

work page doi:10.18653/v1/2020.emnlp-main.554 2020
[25]

arXiv preprint arXiv:2603.10055 , year=

Training Language Models via Neural Cellular Automata , author=. arXiv preprint arXiv:2603.10055 , year=

arXiv
[26]

2023 , volume=

Physics of Language Models: Part 1, Learning Hierarchical Language Structures , author=. 2023 , volume=

2023
[27]

The Eleventh International Conference on Learning Representations , year=

Characterizing intrinsic compositionality in transformers with Tree Projections , author=. The Eleventh International Conference on Learning Representations , year=
[28]

arXiv preprint arXiv:2510.02524 , year=

Unraveling Syntax: How Language Models Learn Context-Free Grammars , author=. arXiv preprint arXiv:2510.02524 , year=

Pith/arXiv arXiv
[29]

Implicit representations of meaning in neural language models , author=. Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (Volume 1: Long papers) , pages=
[30]

The Eleventh International Conference on Learning Representations , year =

Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task , author =. The Eleventh International Conference on Learning Representations , year =
[31]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Between circuits and chomsky: Pre-pretraining on formal languages imparts linguistic biases , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[32]

2023 , journal =

Injecting structural hints: Using language models to study inductive biases in language learning , author =. 2023 , journal =

2023
[33]

and Marzen, Sarah E

Shai, Adam S. and Marzen, Sarah E. and Teixeira, Lucas and Oldenziel, Alexander Gietelink and Riechers, Paul M. , booktitle =. Transformers Represent Belief State Geometry in their Residual Stream , url =. 2024 , pdf =. doi:10.52202/079017-2387 , editor =

work page doi:10.52202/079017-2387 2024
[34]

Proceedings of the 2024 conference on empirical methods in natural language processing , pages=

Language models learn rare phenomena from less rare phenomena: The case of the missing AANNs , author=. Proceedings of the 2024 conference on empirical methods in natural language processing , pages=

2024
[35]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month =

Mission: Impossible Language Models , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month =. 2024 , address =. doi:10.18653/v1/2024.acl-long.787 , pages =

work page doi:10.18653/v1/2024.acl-long.787 2024
[36]

Transactions of the Association for Computational Linguistics , volume=

Filtered corpus training (FiCT) shows that language models can generalize from indirect evidence , author=. Transactions of the Association for Computational Linguistics , volume=. 2024 , publisher=

2024
[37]

Semantics and linguistic theory , pages=

More structural analogies between pronouns and tenses , author=. Semantics and linguistic theory , pages=
[38]

1985 , publisher=

Logical form: Its structure and derivation , author=. 1985 , publisher=

1985
[39]

, author=

Polarity sensitivity as inherent scope relations. , author=. 1979 , publisher=

1979
[40]

Chomsky, Noam , title =
[41]

1976 , school=

The syntactic domain of anaphora , author=. 1976 , school=

1976
[42]

PLoS computational biology , volume=

A toolbox for representational similarity analysis , author=. PLoS computational biology , volume=. 2014 , publisher=

2014
[43]

Frontiers in systems neuroscience , volume=

Representational similarity analysis-connecting the branches of systems neuroscience , author=. Frontiers in systems neuroscience , volume=. 2008 , publisher=

2008
[44]

arXiv preprint arXiv:2505.08348 , year=

Geometry of Semantics in Next-token Prediction , author=. arXiv preprint arXiv:2505.08348 , year=

arXiv
[45]

arXiv preprint arXiv:2110.02796 , year=

An unconstrained layer-peeled perspective on neural collapse , author=. arXiv preprint arXiv:2110.02796 , year=

arXiv
[46]

arXiv preprint arXiv:2502.04664 , year=

Implicit Bias of SignGD and Adam on Multiclass Separable Data , author=. arXiv preprint arXiv:2502.04664 , year=

arXiv
[47]

Linear algebra and its applications , volume=

Concavity of certain maps on positive definite matrices and applications to Hadamard products , author=. Linear algebra and its applications , volume=. 1979 , publisher=

1979
[48]

2013 , publisher=

Matrix analysis , author=. 2013 , publisher=

2013
[49]

Advances in neural information processing systems , volume=

Neural word embedding as implicit matrix factorization , author=. Advances in neural information processing systems , volume=
[50]

arXiv preprint arXiv:1811.02564 , year=

On exponential convergence of sgd in non-convex over-parametrized learning , author=. arXiv preprint arXiv:1811.02564 , year=

Pith/arXiv arXiv
[51]

arXiv preprint arXiv:2402.05738 , year=

Implicit bias and fast convergence rates for self-attention , author=. arXiv preprint arXiv:2402.05738 , year=

arXiv
[52]

First Conference on Language Modeling , year=

Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations , author=. First Conference on Language Modeling , year=
[53]

arXiv preprint arXiv:2406.10650 , year=

The Implicit Bias of Adam on Separable Data , author=. arXiv preprint arXiv:2406.10650 , year=

arXiv
[54]

arXiv preprint arXiv:2405.13718 , year=

Upper and lower memory capacity bounds of transformers for next-token prediction , author=. arXiv preprint arXiv:2405.13718 , year=

arXiv
[55]

arXiv e-prints , pages=

The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains , author=. arXiv e-prints , pages=
[56]

arXiv preprint arXiv:2404.04454 , year=

Implicit Bias of AdamW: ell\_ infty Norm Constrained Optimization , author=. arXiv preprint arXiv:2404.04454 , year=

arXiv
[57]

The 22nd international conference on artificial intelligence and statistics , pages=

Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron , author=. The 22nd international conference on artificial intelligence and statistics , pages=. 2019 , organization=

2019
[58]

Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=

Glove: Global vectors for word representation , author=. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=

2014
[59]

Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? , author=
[60]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

2021
[61]

2023 , url=

Speech and Language Processing , author=. 2023 , url=

2023
[62]

Gemini: A Family of Highly Capable Multimodal Models , author=. , year=
[63]

arXiv preprint arXiv:2110.06914 , year=

What Happens after SGD Reaches Zero Loss?--A Mathematical Framework , author=. arXiv preprint arXiv:2110.06914 , year=

[64] Label noise SGD provably prefers flat global minimizers. Advances in Neural Information Processing Systems.

[65] Attention with Markov: A framework for principled analysis of transformers via Markov chains. arXiv preprint arXiv:2402.04161.

[66] xLSTM: Extended Long Short-Term Memory. arXiv preprint arXiv:2405.04517.

[67] Mechanics of next token prediction with self-attention. International Conference on Artificial Intelligence and Statistics, 2024.

[68] A mathematical perspective on Transformers. arXiv preprint arXiv:2312.10794.

[69] Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052.

[70] Same pre-training loss, better downstream: Implicit bias matters for language models. International Conference on Machine Learning, 2023.

[71] Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[72] Prediction and entropy of printed English. Bell System Technical Journal, 1951.

[73] The growing energy footprint of artificial intelligence. Joule, 2023.

[74] GPT-4 technical report. arXiv.

[75] Deep Ganguli, Nicholas Schiefer, Marina Favaro, and Jack Clark. 2023.

[76] ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.

[77] Language models are unsupervised multitask learners. OpenAI blog.

[78] Robust speech recognition via large-scale weak supervision. International Conference on Machine Learning, 2023.

[79] Can Generative Artificial Intelligence Improve Social Science?

[80] Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting. arXiv preprint arXiv:2309.08108.

