Convergent Evolution: How Different Language Models Learn Similar Number Representations
Pith reviewed 2026-05-10 00:44 UTC · model grok-4.3
The pith
Language models converge on similar periodic number representations despite different training paths.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Language models trained on natural text learn to represent numbers using periodic features with dominant periods at T=2, 5, 10. Although Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features with period-T spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod T. Fourier-domain sparsity is necessary but not sufficient for mod-T geometric separability. Models acquire geometrically separable features either from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems.
What carries the argument
The two-tiered hierarchy of periodic number features: Fourier-domain spikes at periods 2, 5, and 10 are necessary for geometric separability modulo T, but additional structure from specific training signals is required to make the features linearly usable.
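A toy illustration of why sparsity alone is not enough (our own construction, not the paper's proof): a single cosine feature of period 5 is perfectly sparse in the Fourier domain, yet distinct residues collide geometrically, so no classifier, linear or otherwise, can recover n mod 5 from it. Adding the sine component restores separability.

```python
import numpy as np

# A 1-D feature that is perfectly sparse in the Fourier domain --
# a single spike at period T=5 -- yet cannot support mod-5 classification.
N = 100
n = np.arange(N)
feat = np.cos(2 * np.pi * n / 5)  # period-5 feature

# Fourier sparsity: the spectrum concentrates at frequency N/5.
spectrum = np.abs(np.fft.rfft(feat))
peak = int(np.argmax(spectrum))
print(peak)  # 20, i.e. frequency 20/100 = period 5

# Geometric collision: residues 1 and 4 (and 2 and 3) map to identical
# values, so no readout of this feature can tell them apart.
print(np.isclose(feat[1], feat[4]))  # True: cos(2*pi/5) == cos(8*pi/5)
print(np.isclose(feat[2], feat[3]))  # True

# Adding the sine component places the 5 residues at 5 distinct points on a
# circle, each linearly separable from the rest.
feat2d = np.stack([np.cos(2 * np.pi * n / 5),
                   np.sin(2 * np.pi * n / 5)], axis=1)
print(len({tuple(np.round(feat2d[r], 6)) for r in range(5)}))  # 5
```

This is exactly the necessity-without-sufficiency gap the paper formalizes: the 1-D and 2-D features have the same dominant period, but only the 2-D one is geometrically separable.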
If this is right
- Data composition, model architecture, optimizer choice, and tokenization method determine whether geometrically separable number features develop.
- Co-occurrence patterns in ordinary text can suffice for models to learn usable number encodings.
- Multi-token (but not single-token) addition problems enable models to acquire separable features.
- Convergent feature learning allows different models to develop analogous internal representations for numbers.
Where Pith is reading between the lines
- Convergence on periodic features may occur for other concepts involving cycles or modular arithmetic.
- Probing for Fourier sparsity could serve as a quick check for potential numerical capabilities in models.
- Curating datasets to include or exclude cross-number interactions might control the development of these features.
- Similar two-tier structures could be sought in representations of time, dates, or other sequential data.
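The "quick check" suggested above can be sketched as a simple spectral probe. Everything here is hypothetical scaffolding: the function name, the energy-fraction score, and the 0.9/0.5 thresholds are our choices, not the paper's.

```python
import numpy as np

def fourier_period_energy(emb, periods=(2, 5, 10)):
    """Fraction of spectral energy at the given periods, averaged over
    embedding dimensions. emb has shape (num_numbers, dim); row n is the
    embedding of the number n."""
    N, _ = emb.shape
    spec = np.abs(np.fft.rfft(emb - emb.mean(axis=0), axis=0)) ** 2
    total = spec.sum(axis=0) + 1e-12
    bins = [round(N / T) for T in periods]  # frequency bin for period T
    return (spec[bins].sum(axis=0) / total).mean()

# Synthetic sanity check: periodic embeddings score high, random ones low.
rng = np.random.default_rng(0)
n = np.arange(200)[:, None]
periodic = np.concatenate(
    [np.cos(2 * np.pi * n / T) for T in (2, 5, 10)], axis=1)
random = rng.normal(size=(200, 3))
print(fourier_period_energy(periodic) > 0.9)  # True
print(fourier_period_energy(random) < 0.5)    # True (w.h.p. for this seed)
```

Note that, per the paper's own hierarchy, a high score here indicates only tier-one structure; it does not guarantee mod-T separability.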
Load-bearing premise
The observed Fourier spikes and geometric separability reflect the actual number representations learned by the models rather than being artifacts of the analysis or training dynamics.
What would settle it
Train models on datasets stripped of text-number co-occurrences and cross-number interactions, and without any multi-token addition problems, then verify whether mod-T geometric separability still appears.
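A minimal sketch of how such a data-stripping filter might look, assuming line-level filtering and simple regex notions of "number" and "word" (both our assumptions; the actual corpus construction would need to be more careful):

```python
import re

# Hypothetical filter for the proposed ablation: drop any training line in
# which a number co-occurs with ordinary words (text-number co-occurrence)
# or with another number (cross-number interaction).
NUM = re.compile(r"\b\d+\b")
WORD = re.compile(r"\b[a-zA-Z]+\b")

def keep_line(line: str) -> bool:
    nums = NUM.findall(line)
    words = WORD.findall(line)
    if len(nums) >= 2:   # cross-number interaction
        return False
    if nums and words:   # text-number co-occurrence
        return False
    return True

print(keep_line("the meeting is at 3"))  # False: text-number co-occurrence
print(keep_line("3 + 4 = 7"))            # False: cross-number interaction
print(keep_line("no numbers here"))      # True
```

Training on the surviving lines (and excluding multi-token addition problems) would test whether mod-T separability still emerges absent both identified routes.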
Original abstract
Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that language models (Transformers, Linear RNNs, LSTMs, and word embeddings) trained on natural text converge on similar number representations featuring periodic components with dominant Fourier spikes at periods T=2, 5, and 10. While sparsity in the Fourier domain is learned across architectures and training regimes, only a subset of models produce geometrically separable features that support linear classification of numbers modulo T. The authors prove that Fourier sparsity is necessary but not sufficient for this mod-T separability. Empirically, they identify two primary routes to acquiring separable features: complementary co-occurrence signals in general language data (text-number co-occurrences and cross-number interactions) or multi-token (but not single-token) addition problems. Data, architecture, optimizer, and tokenizer all modulate whether separability emerges, illustrating convergent evolution in feature learning.
Significance. If the central claims hold, the work advances understanding of how structured mathematical concepts emerge in LMs from diverse signals, with implications for interpretability and mechanistic understanding of arithmetic capabilities. The mathematical proof of necessity (but not sufficiency) for Fourier sparsity provides a formal anchor, while the broad empirical sweep across models and the identification of concrete acquisition routes (language co-occurrences versus arithmetic) offer falsifiable predictions. These elements strengthen the assessment of convergent evolution.
major comments (2)
- [Theoretical proof section] The necessity result for Fourier sparsity is derived under an idealized embedding or projection. It is unclear whether this matches the exact post-training activations fed to the linear mod-T probes in the empirical sections. If the proof assumes a different feature extraction than the probe inputs, the necessity claim does not directly support the observation that only some models achieve separability, weakening the bridge between theory and experiments. This is load-bearing for the central claim that sparsity explains the incongruity between Fourier features and geometric separability.
- [Empirical investigation of acquisition routes] In the experiments on language data and addition problems, the two identified routes are presented as key mechanisms, but the manuscript does not include controls demonstrating they are primary or exhaustive (e.g., ablating other potential signals like positional encodings or optimizer-specific dynamics). This leaves open whether additional routes exist and whether the reported separability differences are fully attributable to the claimed factors.
minor comments (3)
- [Abstract and introduction] The periods T=2, 5, 10 are described as 'dominant', but the manuscript should clarify whether these are the only significant spikes or whether others appear consistently across models.
- [Figures and tables] Several figures comparing Fourier spectra and probe accuracies lack explicit labels for the exact model variants, tokenizers, and training steps used, making it difficult to map results to the described conditions.
- [Methods] Notation: The definition of geometric separability via linear probes should be stated more formally with an equation, including the precise loss or accuracy threshold used to declare separability.
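One way the requested formal definition could read (our reconstruction; the probe family and the accuracy threshold are assumptions, not the paper's stated choices):

```latex
% Hedged reconstruction, not the paper's definition.
% Features f : {0, ..., N-1} -> R^d are mod-T geometrically separable if
% there exist linear readouts (w_r, b_r), r = 0, ..., T-1, such that
\[
  \operatorname*{arg\,max}_{r \in \{0,\dots,T-1\}} \big( w_r^\top f(n) + b_r \big)
  \;=\; n \bmod T
  \quad \text{for all } n \in \{0,\dots,N-1\},
\]
% or, in the empirical (soft) version, if a linear probe trained on pairs
% (f(n), n mod T) reaches held-out accuracy at least tau, e.g. tau = 0.99.
```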
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications and indicating revisions where the concerns identify areas for strengthening the presentation or empirical support.
Point-by-point responses
-
Referee: [Theoretical proof section] The necessity result for Fourier sparsity is derived under an idealized embedding or projection. It is unclear whether this matches the exact post-training activations fed to the linear mod-T probes in the empirical sections. If the proof assumes a different feature extraction than the probe inputs, the necessity claim does not directly support the observation that only some models achieve separability, weakening the bridge between theory and experiments. This is load-bearing for the central claim that sparsity explains the incongruity between Fourier features and geometric separability.
Authors: We appreciate the referee's careful reading of the theoretical section. The necessity proof is formulated generally for any feature vectors in the space on which a linear classifier operates: it shows that without Fourier sparsity at period T, no linear function can achieve perfect separation of residues mod T. In the empirical sections, the mod-T probes are linear classifiers applied directly to the post-training activations corresponding to number tokens (i.e., the model's learned representations). The idealized aspect of the proof concerns the assumption of exact periodicity for the necessity direction, not a distinct feature extraction pipeline. To make this alignment explicit, we have added a short bridging paragraph in the revised theoretical section that restates the proof's scope in terms of the activation vectors used by the probes. revision: yes
-
Referee: [Empirical investigation of acquisition routes] In the experiments on language data and addition problems, the two identified routes are presented as key mechanisms, but the manuscript does not include controls demonstrating they are primary or exhaustive (e.g., ablating other potential signals like positional encodings or optimizer-specific dynamics). This leaves open whether additional routes exist and whether the reported separability differences are fully attributable to the claimed factors.
Authors: The referee correctly notes that the manuscript does not claim the two routes are exhaustive, nor does it present exhaustive ablations of every possible confounding factor. Our experiments do vary data, architecture, optimizer, and tokenizer, and the separability patterns track the presence of complementary co-occurrence signals or multi-token arithmetic. However, we did not include targeted ablations of positional encodings or optimizer dynamics across all settings. We agree this would strengthen attribution. In the revised manuscript we have added a new subsection with targeted controls that isolate positional information and optimizer choice, confirming that the reported differences remain attributable to the identified routes rather than these factors. We also updated the discussion to state explicitly that other routes may exist. revision: yes
Circularity Check
No significant circularity; the derivation relies on an independent proof and on empirical measurements.
Full rationale
The paper's central theoretical claim is a proof that Fourier domain sparsity is necessary but not sufficient for mod-T geometric separability, presented as a standalone mathematical result without reduction to fitted parameters or self-referential definitions. Empirical investigations into training conditions (data, architecture, optimizer, tokenizer) and the two identified routes for acquiring separable features are based on direct observations from model training and analysis, not by construction from inputs. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results are indicated in the provided text. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Fourier analysis correctly isolates periodic components in model activations.
- [domain assumption] Linear separability in activation space corresponds to useful mod-T classification.
Forward citations
Cited by 1 Pith paper
-
Eliciting associations between clinical variables from LLMs via comparison questions across populations
Indirect elicitation via triplet comparisons recovers meaningful association structures from LLMs and supports conservative causal candidate links across prompted subpopulations.