pith. machine review for the scientific record. sign in

arxiv: 2605.06216 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

TIDE: Every Layer Knows the Token Beneath the Context

Ajay Jaiswal, Duc Hoang, Han-Byul Kim, Lauren Hannah, Mehrdad Farajtabar, Minsik Cho

Pith reviewed 2026-05-08 10:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords transformertoken embeddingrare tokenscontextual collapsememory blockslanguage modelinggradient signalsoftmax router
0
0 comments X

The pith

Token embeddings injected only at input cause rare tokens to be under-trained and similar tokens to collapse in hidden states; TIDE routes fresh memory vectors to every layer to fix both.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers look up each token's embedding once at the input and then discard the raw identity for the rest of the computation. This single-injection choice produces two failures: rare tokens receive far less gradient signal than common ones because of vocabulary imbalance, and tokens that occur in similar contexts end up with nearly identical representations deeper in the network. TIDE adds an ensemble of memory blocks that hold context-free token vectors and injects them into every layer through a depth-aware router that can also choose to ignore the signal. If the approach succeeds, models can maintain token distinctions throughout their depth and give uncommon words adequate training updates without enlarging the overall parameter count. Readers care because these fixes target structural limits that appear whenever language data follows a long-tail distribution.

Core claim

The single-injection assumption in standard transformers induces the Rare Token Problem, in which rare tokens receive only a fraction of the cumulative gradient signal, and the Contextual Collapse Problem, in which distributionally similar tokens map to indistinguishable hidden states. TIDE augments the transformer with EmbeddingMemory, an ensemble of K independent MemoryBlocks that store context-free semantic vectors for token indices and inject them at every layer via a depth-conditioned softmax router equipped with a learnable null bank, thereby restoring per-layer token identity and yielding measurable gains on language modeling and downstream tasks.

What carries the argument

EmbeddingMemory, an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors and deliver them to every transformer layer through a depth-conditioned softmax router with a learnable null bank.

If this is right

  • Rare tokens receive gradient updates proportional to their appearance at every layer rather than only at the input embedding.
  • Hidden states preserve distinctions between tokens that share contextual patterns, reducing collapse as depth increases.
  • Language modeling perplexity and downstream task accuracy improve across multiple benchmarks.
  • The router with null bank allows the model to selectively apply or suppress the memory signal per layer and per token.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models with limited capacity could handle long-tail vocabularies more effectively by reusing the same memory blocks across layers instead of allocating more parameters to the initial embedding table.
  • The same per-layer identity injection principle might be tested in non-transformer sequence architectures where early loss of token identity occurs.
  • Training dynamics could be monitored by tracking how often the router activates the null bank, providing a diagnostic for when token memory is most needed.

Load-bearing premise

The problems arise mainly from looking up token embeddings only once at the input, and that supplying independent context-free vectors at every layer will correct the gradient imbalance and collapse without introducing new optimization or capacity problems.

What would settle it

Run controlled experiments comparing gradient norms for rare tokens and pairwise hidden-state similarities between distributionally similar tokens in a baseline transformer versus a TIDE model; if neither metric improves, the central claim does not hold.

read the original abstract

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type distribution of vocabulary causes rare-token embeddings are chronically under-trained due to receiving a fraction of the cumulative gradient signal compared to common tokens; and (ii) the Contextual Collapse Problem, where limited parameters models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection as well as improve performance across multiple language modeling and downstream tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies single-injection of token embeddings at the input layer as the root cause of the Rare Token Problem (under-training of rare tokens due to Zipfian gradient imbalance) and the Contextual Collapse Problem (indistinguishable hidden states for distributionally similar tokens). It proposes TIDE, which augments the transformer with an EmbeddingMemory module consisting of K independent MemoryBlocks that compute context-free semantic vectors; these are injected into every layer via a depth-conditioned softmax router equipped with a learnable null bank. The authors claim this design yields both theoretical advantages and empirical gains on language modeling and downstream tasks.

Significance. If the per-layer injection mechanism can be shown to drive the gains independently of the added parameters, TIDE would represent a meaningful architectural shift for improving gradient flow to rare tokens and preserving token identity across depth. The approach directly targets two widely observed but under-analyzed limitations in current LLMs.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method description): the central causal claim attributes both the Rare Token Problem and Contextual Collapse Problem to the single-injection design, yet the proposed TIDE necessarily increases total parameter count via the K MemoryBlocks, router, and null bank. No ablation is described that holds parameter count fixed while varying only injection frequency, leaving open the possibility that observed improvements stem from capacity rather than the per-layer mechanism.
  2. [Abstract] Abstract: theoretical benefits are asserted without any equations, proof sketches, or derivation details for how the depth-conditioned router or null bank resolves the gradient imbalance or collapse; this makes it impossible to verify whether the claims are load-bearing or reduce to properties of the newly introduced free parameters (K and null-bank weights).
minor comments (2)
  1. [§3] Notation for the router and MemoryBlocks should be introduced with explicit equations rather than descriptive prose to allow reproducibility.
  2. [§3] The manuscript should clarify the precise definition of 'context-free semantic vectors' and how they differ from standard embeddings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the causal claims and presentation without altering the core contributions of TIDE.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method description): the central causal claim attributes both the Rare Token Problem and Contextual Collapse Problem to the single-injection design, yet the proposed TIDE necessarily increases total parameter count via the K MemoryBlocks, router, and null bank. No ablation is described that holds parameter count fixed while varying only injection frequency, leaving open the possibility that observed improvements stem from capacity rather than the per-layer mechanism.

    Authors: We acknowledge the absence of a parameter-controlled ablation isolating injection frequency. In the revised manuscript we will add this experiment: a baseline transformer augmented with parameter-equivalent capacity (via expanded FFN dimensions or redundant embeddings) but retaining single-injection, compared directly against TIDE. This will show that gains on rare-token perplexity and downstream tasks arise from depth-conditioned per-layer injection rather than capacity. Even with matched parameters, single-injection cannot supply layer-specific token vectors, which is the mechanism addressing Zipfian gradient imbalance and contextual collapse. revision: yes

  2. Referee: [Abstract] Abstract: theoretical benefits are asserted without any equations, proof sketches, or derivation details for how the depth-conditioned router or null bank resolves the gradient imbalance or collapse; this makes it impossible to verify whether the claims are load-bearing or reduce to properties of the newly introduced free parameters (K and null-bank weights).

    Authors: The abstract follows standard conventions for brevity and does not contain equations. The theoretical analysis—including derivations showing how the depth-conditioned router and learnable null bank enable adaptive injection of context-free semantic vectors to mitigate gradient imbalance and preserve token distinguishability—is provided in §3 with supporting equations. We will revise the abstract to include a concise reference to this analysis (e.g., “theoretically establishing that depth-conditioned routing resolves single-injection limitations”). The claimed benefits follow from the multi-layer injection architecture, not merely the addition of free parameters. revision: partial

Circularity Check

0 steps flagged

No circularity; derivation self-contained against external benchmarks

full rationale

The paper defines the Rare Token Problem and Contextual Collapse Problem directly from the standard transformer's single-injection design choice, then proposes TIDE's EmbeddingMemory, per-layer injection, depth-conditioned router, and null bank as an independent architectural augmentation. No equations or claims in the provided text reduce a prediction to a fitted parameter by construction, invoke a self-citation as the sole justification for a uniqueness theorem, or rename a known result. The theoretical and empirical establishment of benefits is presented as separate from the input assumptions, making the central claims non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the new EmbeddingMemory components and the assumption that single-injection is the root cause; no independent evidence for the new entities is supplied in the abstract.

free parameters (2)
  • K
    Number of independent MemoryBlocks chosen by the authors; value not specified in abstract.
  • learnable null bank parameters
    Additional parameters introduced for the router's null option.
axioms (2)
  • domain assumption Token indices follow a Zipf-type distribution causing rare tokens to receive fractionally less gradient signal.
    Invoked to define the Rare Token Problem.
  • domain assumption Limited-parameter models map distributionally similar tokens to indistinguishable hidden states under single injection.
    Invoked to define the Contextual Collapse Problem.
invented entities (2)
  • EmbeddingMemory no independent evidence
    purpose: Ensemble of K MemoryBlocks providing context-free semantic vectors for per-layer injection.
    New component introduced to solve the identified problems.
  • depth-conditioned softmax router with learnable null bank no independent evidence
    purpose: Mechanism to inject the memory vectors into every layer.
    New routing component with no prior independent validation mentioned.

pith-pipeline@v0.9.0 · 5488 in / 1468 out tokens · 55208 ms · 2026-05-08T10:30:45.906533+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

121 extracted references · 40 canonical work pages · 16 internal anchors

  1. [1]

    and Kaiser,

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser,. Attention Is All You Need , booktitle =

  2. [2]

    Long Short-Term Memory , journal =

    Hochreiter, Sepp and Schmidhuber, J. Long Short-Term Memory , journal =

  3. [3]

    Proceedings of EMNLP , pages =

    Cho, Kyunghyun and van Merrienboer, Bart and Gulcehre, Caglar and Bahdanau, Dzmitry and Bougares, Fethi and Schwenk, Holger and Bengio, Yoshua , title =. Proceedings of EMNLP , pages =

  4. [4]

    Proceedings of NAACL-HLT , pages =

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , title =. Proceedings of NAACL-HLT , pages =

  5. [5]

    OpenAI Blog , volume =

    Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya , title =. OpenAI Blog , volume =

  6. [6]

    and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others , title =

    Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D. and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others , title =. Advances in Neural Information Processing Systems , volume =

  7. [7]

    Journal of Machine Learning Research , volume =

    Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and others , title =. Journal of Machine Learning Research , volume =

  8. [8]

    Advances in Neural Information Processing Systems , volume =

    Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and others , title =. Advances in Neural Information Processing Systems , volume =

  9. [9]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and others , title =. arXiv preprint arXiv:2307.09288 , year =

  10. [10]

    arXiv preprint arXiv:2303.08774 , year =

  11. [11]

    Scaling Laws for Neural Language Models

    Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeff and Amodei, Dario , title =. arXiv preprint arXiv:2001.08361 , year =

  12. [12]

    Scaling Laws for Autoregressive Generative Modeling

    Henighan, Tom and Kaplan, Jared and Katz, Mor and Chen, Mark and Hesse, Christopher and Jackson, Jacob and others , title =. arXiv preprint arXiv:2010.14701 , year =

  13. [13]

    and Zhou, Denny , title =

    Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc V. and Zhou, Denny , title =. Advances in Neural Information Processing Systems , volume =

  14. [14]

    Evaluating Large Language Models Trained on Code

    Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde de Oliveira and Kaplan, Jared and others , title =. arXiv preprint arXiv:2107.03374 , year =

  15. [15]

    Zipf, George Kingsley , title =

  16. [16]

    Scientific reports , volume=

    Bias in Zipf’s law estimators , author=. Scientific reports , volume=. 2021 , publisher=

  17. [17]

    Rare Words: A Major Problem for Contextualized Embeddings and How to Fix It by Attentive Mimicking , booktitle =

    Schick, Timo and Sch. Rare Words: A Major Problem for Contextualized Embeddings and How to Fix It by Attentive Mimicking , booktitle =

  18. [18]

    arXiv preprint arXiv:2107.02137 , year=

    Sun, Yu and Wang, Shuohuan and Feng, Shikun and Ding, Siyu and Pang, Chao and Shang, Junyuan and others , title =. arXiv preprint arXiv:2107.02137 , year =

  19. [19]

    Proceedings of ICML , pages =

    Gong, Linyuan and He, Di and Li, Zhuohan and Qin, Tao and Wang, Liwei and Liu, Tie-Yan , title =. Proceedings of ICML , pages =

  20. [20]

    Advances in neural information processing systems , volume=

    Lipschitz regularity of deep neural networks: analysis and efficient estimation , author=. Advances in neural information processing systems , volume=

  21. [21]

    Proceedings of EMNLP , pages =

    Geva, Mor and Schuster, Roei and Berant, Jonathan and Levy, Omer , title =. Proceedings of EMNLP , pages =

  22. [22]

    Proceedings of EMNLP , pages =

    Geva, Mor and Caciularu, Avi and Wang, Kevin Ro and Goldberg, Yoav , title =. Proceedings of EMNLP , pages =

  23. [23]

    Advances in Neural Information Processing Systems , volume =

    Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , title =. Advances in Neural Information Processing Systems , volume =

  24. [24]

    Proceedings of ACL , pages =

    Dai, Damai and Dong, Li and Hao, Yaru and Sui, Zhifang and Chang, Baobao and Wei, Furu , title =. Proceedings of ACL , pages =

  25. [25]

    Proceedings of ICLR , year =

    Lan, Zhenzhong and Chen, Mingda and Goodman, Sebastian and Gimpel, Kevin and Sharma, Piyush and Soricut, Radu , title =. Proceedings of ICLR , year =

  26. [26]

    Retrieval-Augmented Generation for Knowledge-Intensive

    Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems , volume =

  27. [27]

    Proceedings of ICML , pages =

    Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Ming-Wei , title =. Proceedings of ICML , pages =

  28. [28]

    Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering , booktitle =

    Izacard, Gautier and Grave,. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering , booktitle =

  29. [29]

    Journal of Machine Learning Research , volume =

    Izacard, Gautier and Lewis, Patrick and Lomeli, Maria and Hosseini, Lucas and Petroni, Fabio and Schick, Timo and others , title =. Journal of Machine Learning Research , volume =

  30. [30]

    Proceedings of ICML , pages =

    Borgeaud, Sebastian and Mensch, Arthur and Hoffmann, Jordan and Cai, Trevor and Rutherford, Eliza and Millican, Katie and others , title =. Proceedings of ICML , pages =

  31. [31]

    Large Memory Layers with Product Keys , booktitle =

    Lample, Guillaume and Sablayrolles, Alexandre and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J. Large Memory Layers with Product Keys , booktitle =

  32. [32]

    and Hinton, Geoffrey E

    Shazeer, Noam and Mirhoseini, Azalia and Maziarz, Krzysztof and Davis, Andy and Le, Quoc V. and Hinton, Geoffrey E. and Dean, Jeff , title =. Proceedings of ICLR , year =

  33. [33]

    Journal of Machine Learning Research , volume =

    Fedus, William and Zoph, Barret and Shazeer, Noam , title =. Journal of Machine Learning Research , volume =

  34. [34]

    Proceedings of ICLR , year =

    Lepikhin, Dmitry and Lee, HyoukJoong and Xu, Yuanzhong and Chen, Dehao and Firat, Orhan and Huang, Yanping and others , title =. Proceedings of ICLR , year =

  35. [35]

    and Tong, Simon and Lepikhin, Dmitry and Xu, Yuanzhong and Krikun, Maxim and others , title =

    Du, Nan and Huang, Yanping and Dai, Andrew M. and Tong, Simon and Lepikhin, Dmitry and Xu, Yuanzhong and Krikun, Maxim and others , title =. Proceedings of ICML , pages =

  36. [38]

    Hybrid Computing Using a Neural Network with Dynamic External Memory , journal =

    Graves, Alex and Wayne, Greg and Reynolds, Malcolm and Harley, Tim and Danihelka, Ivo and Grabska-Barwi. Hybrid Computing Using a Neural Network with Dynamic External Memory , journal =

  37. [39]

    and Hutchins, DeLesley and Szegedy, Christian , title =

    Wu, Yuhuai and Rabe, Markus N. and Hutchins, DeLesley and Szegedy, Christian , title =. Proceedings of ICLR , year =

  38. [40]

    Proceedings of ICML , pages =

    Houlsby, Neil and Giurgiu, Andrei and Jastrzebski, Stanislaw and Morrone, Bruna and de Laroussilhe, Quentin and Gesmundo, Andrea and Attariyan, Mona and Gelly, Sylvain , title =. Proceedings of ICML , pages =

  39. [41]

    Proceedings of ACL , pages =

    Li, Xiang Lisa and Liang, Percy , title =. Proceedings of ACL , pages =

  40. [42]

    and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , title =

    Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , title =. Proceedings of ICLR , year =

  41. [43]

    and McNaughton, Bruce L

    McClelland, James L. and McNaughton, Bruce L. and O'Reilly, Randall C. , title =. Psychological Review , volume =

  42. [44]

    and McClelland, James L

    Rogers, Timothy T. and McClelland, James L. , title =

  43. [45]

    Neural Networks , volume =

    Elfwing, Stefan and Uchibe, Eiji and Doya, Kenji , title =. Neural Networks , volume =

  44. [46]

    Advances in Neural Information Processing Systems , volume =

    Zhang, Biao and Sennrich, Rico , title =. Advances in Neural Information Processing Systems , volume =

  45. [47]

    GLU Variants Improve Transformer

    Shazeer, Noam , title =. arXiv preprint arXiv:2002.05202 , year =

  46. [48]

    Proceedings of CVPR , pages =

    He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian , title =. Proceedings of CVPR , pages =

  47. [49]

    Transformer Circuits Thread , year =

    Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and others , title =. Transformer Circuits Thread , year =

  48. [50]

    Proceedings of ICML , pages =

    Kornblith, Simon and Norouzi, Mohammad and Lee, Honglak and Hinton, Geoffrey , title =. Proceedings of ICML , pages =

  49. [51]

    Proceedings of ICLR , year =

    Nguyen, Thao and Raghu, Maithra , title =. Proceedings of ICLR , year =

  50. [52]

    Advances in Neural Information Processing Systems , volume =

    Raghu, Maithra and Gilmer, Justin and Yosinski, Jason and Sohl-Dickstein, Jascha , title =. Advances in Neural Information Processing Systems , volume =

  51. [53]

    Distill , year =

    Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =

  52. [54]

    Proceedings of ICLR , year =

    Nanda, Neel and Chan, Lawrence and Lieberum, Tom and Smith, Jess and Steinhardt, Jacob , title =. Proceedings of ICLR , year =

  53. [55]

    and Lynch, Aengus and Heimersheim, Stefan and Garriga-Alonso, Adri

    Conmy, Arthur and Mavor-Parker, Augustine N. and Lynch, Aengus and Heimersheim, Stefan and Garriga-Alonso, Adri. Towards Automated Circuit Discovery for Mechanistic Interpretability , booktitle =

  54. [56]

    Proceedings of ICLR , year =

    Meng, Kevin and Sharma, Arnab Sen and Andonian, Alex and Belinkov, Yonatan and Bau, David , title =. Proceedings of ICLR , year =

  55. [57]

    Transactions on Machine Learning Research , year =

    Gurnee, Wes and Nanda, Neel and Pauly, Matthew and Harvey, Katherine and Troitskii, Dmitrii and Bertsimas, Dimitris , title =. Transactions on Machine Learning Research , year =

  56. [58]

    , title =

    Luong, Minh-Thang and Pham, Hieu and Manning, Christopher D. , title =. Proceedings of EMNLP , pages =

  57. [59]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan , title =. arXiv preprint arXiv:1909.08053 , year =

  58. [60]

    arXiv preprint arXiv:2211.05100 , year =

  59. [61]

    Journal of Machine Learning Research , volume =

    Duchi, John and Hazan, Elad and Singer, Yoram , title =. Journal of Machine Learning Research , volume =

  60. [62]

    Parallel Coordinate Descent Methods for Big Data Optimization , journal =

    Richt. Parallel Coordinate Descent Methods for Big Data Optimization , journal =

  61. [63]

    and Levy, O

    Goldberg, Yoav and Levy, Omer , title =. arXiv preprint arXiv:1402.3722 , year =

  62. [64]

    , title =

    Clark, Kevin and Khandelwal, Urvashi and Levy, Omer and Manning, Christopher D. , title =. Proceedings of the 2019 ACL Workshop BlackboxNLP , year =

  63. [67]

    Proceedings of ACL , pages =

    Dar, Guy and Geva, Mor and Gupta, Ankit and Berant, Jonathan , title =. Proceedings of ACL , pages =

  64. [69]

    Pointer Sentinel Mixture Models

    Pointer sentinel mixture models , author=. arXiv preprint arXiv:1609.07843 , year=

  65. [70]

    A multiscale visualization of attention in the transformer model

    A multiscale visualization of attention in the transformer model , author=. arXiv preprint arXiv:1906.05714 , year=

  66. [71]

    ArXiv , year=

    GLU Variants Improve Transformer , author=. ArXiv , year=

  67. [72]

    The Llama 3 Herd of Models

    The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

  68. [73]

    Mixtral of Experts

    Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

  69. [74]

    Instella: Fully open language models with stellar performance.arXiv preprint arXiv:2511.10628, 2025

    Instella: Fully open language models with stellar performance , author=. arXiv preprint arXiv:2511.10628 , year=

  70. [75]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model , author=. arXiv preprint arXiv:2405.04434 , year=

  71. [76]

    Nemotron-4 340b technical report

    Nemotron-4 340b technical report , author=. arXiv preprint arXiv:2406.11704 , year=

  72. [77]

    Apple intelligence foundation language models

    Apple intelligence foundation language models , author=. arXiv preprint arXiv:2407.21075 , year=

  73. [78]

    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

    Transformer feed-forward layers are key-value memories , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

  74. [79]
  75. [80]

    Advances in neural information processing systems , volume=

    End-to-end memory networks , author=. Advances in neural information processing systems , volume=

  76. [81]

    Neural Turing Machines

    Neural turing machines , author=. arXiv preprint arXiv:1410.5401 , year=

  77. [82]

    Advances in Neural Information Processing Systems , volume=

    Large memory layers with product keys , author=. Advances in Neural Information Processing Systems , volume=

  78. [83]

    Open-Domain Question Answering

    Chen, Danqi and Yih, Wen-tau. Open-Domain Question Answering. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts. 2020. doi:10.18653/v1/2020.acl-tutorials.8

  79. [84]

    arXiv preprint arXiv:2407.04153 , year=

    Mixture of a million experts , author=. arXiv preprint arXiv:2407.04153 , year=

  80. [85]

    How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020

    How much knowledge can you pack into the parameters of a language model? , author=. arXiv preprint arXiv:2002.08910 , year=

Showing first 80 references.