arXiv: 1907.01470 · v1 · pith:UVZI3KTGnew · submitted 2019-07-02 · cs.LG · cs.CL · stat.ML

Augmenting Self-attention with Persistent Memory

Pith reviewed 2026-05-25 10:58 UTC · model grok-4.3

classification cs.LG · cs.CL · stat.ML
keywords transformer · self-attention · persistent memory · feed-forward layer · language modeling · sequence modeling

The pith

Persistent memory vectors let transformers drop their feed-forward layers without losing performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard transformers combine self-attention layers with separate feed-forward layers. The paper tests whether learned persistent memory vectors added directly inside self-attention can take over the work of those feed-forward layers. If they can, the entire model can be reduced to a stack of attention layers only. Experiments on character-level and word-level language modeling show that the simplified architecture reaches the same level of performance as the original transformer. A reader would care because the result questions whether feed-forward layers are essential or merely one convenient way to transform representations between attention steps.

Core claim

By augmenting the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.

What carries the argument

Persistent memory vectors: fixed learned vectors that are concatenated with the input keys and values inside each self-attention layer and thereby supply the transformation previously performed by the feed-forward sub-layer.
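
A minimal sketch of that concatenation, assuming a single head, a single query, and one joint softmax over context and memory slots. The function and argument names are hypothetical, and learned projections, causal masking, and position information are deliberately left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def attend_with_persistent_memory(query, ctx_keys, ctx_values, mem_keys, mem_values):
    """Attend over context keys/values plus persistent (learned) keys/values.

    ctx_keys / ctx_values are derived from the input sequence and change per input.
    mem_keys / mem_values stand in for learned parameters that are the same for
    every input (assumption: they share the query's dimension).
    """
    keys = ctx_keys + mem_keys          # concatenate along the slot axis
    values = ctx_values + mem_values
    scale = 1.0 / math.sqrt(len(query))  # usual 1/sqrt(d) scaling
    weights = softmax([scale * dot(query, k) for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy call: two context slots and one persistent slot.
out, attn = attend_with_persistent_memory(
    query=[0.0, 0.0],
    ctx_keys=[[1.0, 0.0], [0.0, 1.0]],
    ctx_values=[[1.0, 0.0], [0.0, 1.0]],
    mem_keys=[[1.0, 1.0]],
    mem_values=[[2.0, 2.0]],
)
```

Because the memory slots do not depend on the input, the same call still produces an output when both context lists are empty, which is one way to see how such vectors could act as an input-independent transformation.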

If this is right

  • The resulting architecture contains only attention operations yet matches the original transformer on language modeling tasks.
  • Both character-level and word-level benchmarks can be solved without dedicated feed-forward sub-layers.
  • Self-attention with added memory is sufficient to capture the long-range dependencies that previously required the two-module design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Uniform attention-only stacks may simplify hardware mapping or gradient flow compared with mixed attention-plus-MLP blocks.
  • The same memory-augmentation trick could be tested in other attention-based sequence models that currently rely on position-wise feed-forward layers.
  • If memory vectors can substitute for feed-forward transformations, future work could explore whether the number or placement of such vectors can be learned rather than fixed per layer.

Load-bearing premise

The persistent memory vectors can play a similar functional role to the feed-forward layer in transforming representations across layers.
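
One way to motivate this premise is to write a feed-forward sub-layer next to an attention read over fixed slots. The notation below is introduced here for illustration and is an assumption, not something defined on this page: U and V are the two weight matrices, b and c the biases, v_i the i-th row of V, u_i the i-th column of U, and N the hidden width.

```latex
\begin{align}
\mathrm{FF}(x) &= U\,\sigma(Vx + b) + c,
  \qquad \sigma = \mathrm{ReLU} \\
\mathrm{FF}'(x) &= U\,\mathrm{softmax}(Vx)
  = \sum_{i=1}^{N} \frac{\exp(v_i^{\top} x)}{\sum_{j=1}^{N} \exp(v_j^{\top} x)}\; u_i
\end{align}
```

The second line drops the biases and swaps the ReLU for a softmax. In that form the sub-layer is an attention operation with query x, keys v_i, and values u_i that do not depend on the input, which is the sense in which learned key and value vectors can stand in for a feed-forward transformation. The swap changes the function being computed, so this is an analogy rather than an exact equivalence.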

What would settle it

Train the memory-augmented attention-only model on the same character and word language-modeling benchmarks; if its perplexity is materially worse than the baseline transformer that still contains feed-forward layers, the claim is false.
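
A small sketch of how such a comparison might be scored, assuming both models report a mean negative log-likelihood in nats per token. The helper names and the relative tolerance are hypothetical; what counts as "materially worse" is a judgment the text above leaves open.

```python
import math


def perplexity(mean_nll_nats):
    """Perplexity from a mean negative log-likelihood in nats per token."""
    return math.exp(mean_nll_nats)


def bits_per_token(mean_nll_nats):
    """Bits per token (bits per character when tokens are characters)."""
    return mean_nll_nats / math.log(2)


def materially_worse(candidate, baseline, rel_tol=0.01):
    """Hypothetical decision rule for a lower-is-better metric.

    Returns True when the candidate exceeds the baseline by more than
    rel_tol (a fraction of the baseline). The default of 1% is an
    illustrative assumption, not a threshold taken from the paper.
    """
    return candidate > baseline * (1.0 + rel_tol)
```

The same rule applies to either metric, since both are lower-is-better; only the tolerance would need to be chosen per benchmark.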

Figures

Figures reproduced from arXiv: 1907.01470 by Armand Joulin, Edouard Grave, Guillaume Lample, Herve Jegou, Sainbayar Sukhbaatar.

Figure 1: On the left panel, the standard transformer layer is composed of a self-attention sublayer …
Figure 2: The performance of our large model on Text8 as we vary (left) the number of persistent vectors, or (right) the way persistent vectors integrate with self-attention.
… importance of feedforward layers in transformer models. However, it maintains decent performance because it still has a lot of parameters (38M) in the W_q, W_k, W_v, W_o matrices. We also compare several different ways of integrating persistent vect…
Original abstract

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper proposes augmenting self-attention layers in Transformers with persistent memory vectors that substitute for the role of feed-forward layers, allowing their removal without degrading performance on character- and word-level language modeling benchmarks.

Significance. If the empirical results hold under controlled conditions, the work would be significant for simplifying Transformer architectures and clarifying the functional contribution of feed-forward layers versus attention. The approach introduces a new architectural primitive (persistent memory vectors) whose parameter count is explicitly listed as a free variable, and the evaluation on external benchmarks provides a falsifiable test of the central claim.

minor comments (1)
  1. Abstract: the statement that evaluation 'shows the benefits' is not accompanied by any quantitative numbers, baseline comparisons, or dataset names, making it impossible to assess the magnitude of the claimed result from the provided text alone.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review. The provided summary accurately captures the core contribution of the work. No specific major comments appear in the report, so we have no point-by-point responses at this time. We remain available to supply additional controlled experiments or clarifications that would help resolve the uncertainty in the recommendation.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines a new architecture by augmenting self-attention layers with persistent memory vectors that are proposed to play a role similar to feed-forward layers, allowing their removal. This is presented as an architectural choice evaluated empirically on external character- and word-level language modeling benchmarks. No equations, derivations, or steps are visible in the abstract or described claims that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claim rests on performance comparisons rather than any internal loop where a prediction is forced by the inputs or prior self-work. This is the most common honest finding for an architecture paper with independent empirical validation.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 1 invented entities

The central claim rests on the architectural choice that a fixed set of memory vectors can functionally replace the learned transformations performed by feed-forward layers; this choice is introduced without derivation from first principles.

free parameters (1)
  • number and dimension of persistent memory vectors
    Hyperparameters chosen to match or exceed baseline performance; their values are not derived from the model equations.
invented entities (1)
  • persistent memory vectors (no independent evidence)
    purpose: Augment self-attention computation and substitute for feed-forward layers
    New component introduced by the paper; no independent evidence outside the reported experiments is provided.
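
To make the free-parameter entry above concrete, here is a rough parameter-count sketch. It assumes a feed-forward sub-layer with two weight matrices and optional biases, and it assumes each attention head owns its own set of persistent key and value vectors of size d_model / n_heads. Both the function names and that per-head layout are assumptions for illustration, not details confirmed on this page.

```python
def ffn_params(d_model, d_ff, bias=True):
    """Parameters in a two-matrix feed-forward sub-layer (d_model -> d_ff -> d_model)."""
    count = 2 * d_model * d_ff
    if bias:
        count += d_ff + d_model
    return count


def persistent_params(d_model, n_heads, n_vectors, per_head=True):
    """Parameters in n_vectors persistent key vectors plus n_vectors value vectors.

    Assumption: vectors live in head space (d_model // n_heads). With
    per_head=True every head has its own set; otherwise one set is shared.
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = d_model // n_heads
    sets = n_heads if per_head else 1
    return 2 * sets * n_vectors * head_dim


# Hypothetical sizes chosen only to show the arithmetic.
ffn = ffn_params(d_model=512, d_ff=2048, bias=False)
mem = persistent_params(d_model=512, n_heads=8, n_vectors=2048)
```

Under these assumptions the two counts coincide when the number of persistent vectors equals the feed-forward width, which is why the number of vectors behaves like a capacity knob comparable to d_ff.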



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG · 2025-02 · unverdicted · novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  2. Deep sequence models tend to memorize geometrically; it is unclear why

    cs.LG · 2025-10 · unverdicted · novelty 6.0

    Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.

  3. Titans: Learning to Memorize at Test Time

    cs.LG · 2024-12 · unverdicted · novelty 6.0

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  4. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
