Transformer Feed-Forward Layers Are Key-Value Memories
Pith reviewed 2026-05-13 23:30 UTC · model grok-4.3
The pith
Transformer feed-forward layers function as key-value memories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. The learned patterns are human-interpretable, with lower layers capturing shallow patterns and upper layers learning more semantic ones. The values complement the keys by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern. The output of a feed-forward layer is a composition of its memories, refined throughout the model via residual connections.
What carries the argument
Key-value memory pairs inside each feed-forward layer, where a key detects an input pattern and a value supplies a next-token distribution.
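The key/value reading above can be made concrete with a toy layer. The following is a minimal sketch, not the paper's code: it assumes a ReLU activation, omits biases, and reads a value as a next-token distribution by projecting it through a hypothetical output embedding `E` and applying a softmax. All names and numbers (`keys`, `values`, `E`, `vocab`) are illustrative.

```python
import math

def relu(z):
    return max(0.0, z)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes (assumed): hidden dimension 2, two memories, three-token vocabulary.
keys = [[1.0, 0.0],    # k_0 responds to inputs aligned with the first axis
        [0.0, 1.0]]    # k_1 responds to inputs aligned with the second axis
values = [[2.0, 0.0],  # v_0
          [0.0, 2.0]]  # v_1
vocab = ["the", "cat", "sat"]
# Hypothetical output embedding, one row per vocabulary token.
E = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]

def memory_coefficients(x):
    """How strongly each key fires on the input x."""
    return [relu(dot(k, x)) for k in keys]

def value_distribution(v):
    """Distribution over the vocabulary induced by a single value vector."""
    return softmax([dot(v, e) for e in E])

x = [3.0, -1.0]
coeffs = memory_coefficients(x)             # only key 0 fires on this input
dist0 = value_distribution(values[0])       # distribution induced by v_0
top_token = vocab[dist0.index(max(dist0))]  # token favored by v_0
```

In this toy setup only the first key fires on `x`, and its value puts most of its probability mass on the first vocabulary token.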
Load-bearing premise
The correlations between learned keys and input patterns, and between values and output distributions, reflect the actual computation the model performs at inference time.
What would settle it
Alter the weights of one specific key-value pair and measure whether the model's next-token predictions shift only for inputs that match the corresponding pattern.
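The intervention described above can be sketched on a toy layer. This is an illustration under stated assumptions, not an experiment from the paper: the layer is bias-free with ReLU, the "edit" is zeroing one value vector, and the helper names (`ff`, `ablate`, `shift`) are hypothetical. With ReLU, a key that does not fire contributes exactly zero, so the selectivity here is exact by construction; in a trained model with a smooth activation and downstream layers, one would expect a graded effect instead.

```python
def relu(z):
    return max(0.0, z)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ff(x, keys, values):
    """FF(x) = sum_i relu(k_i . x) * v_i, biases omitted."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        c = relu(dot(k, x))
        out = [o + c * vj for o, vj in zip(out, v)]
    return out

def ablate(values, i):
    """Copy of `values` with the value vector of memory i set to zero."""
    return [[0.0] * len(v) if j == i else list(v) for j, v in enumerate(values)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 1.0], [-1.0, 3.0]]
edited = ablate(values, 0)  # knock out memory 0

matching = [2.0, -1.0]      # activates key 0 only
non_matching = [-1.0, 2.0]  # activates key 1 only

def shift(x):
    """Largest absolute change in the layer output caused by the edit."""
    before = ff(x, keys, values)
    after = ff(x, keys, edited)
    return max(abs(b - a) for b, a in zip(before, after))

shift_matching = shift(matching)          # output changes
shift_non_matching = shift(non_matching)  # output is untouched
```

The settling observation would be the same contrast on a real model: a large shift for pattern-matching inputs and a negligible one elsewhere.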
Original abstract
Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys' input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model's layers via residual connections to produce the final output distribution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that feed-forward layers in transformer language models function as key-value memories: keys correlate with human-interpretable textual patterns from the training data (shallow in lower layers, semantic in upper layers), values induce complementary next-token distributions, and the layer output is a composition of activated memories that is refined via residual connections to produce the final distribution.
Significance. If the interpretation holds, it supplies a concrete mechanistic account of two-thirds of transformer parameters, grounded in empirical probing that reveals interpretable patterns and input-output complementarity. This could support targeted model editing and deeper understanding of how transformers store and retrieve information.
major comments (2)
- [§4] §4 (pattern extraction and activation analysis): the reported correlations between keys and n-gram patterns are statistical matches only; without causal interventions such as key ablation, activation patching, or counterfactual input edits, it remains possible that the observed associations are side-effects rather than the operative mechanism in the forward pass W2 · f(W1x).
- [§3.2] §3.2 (memory composition claim): the assertion that the FF output is exactly a composition of memories is not fully reconciled with the non-linearity f; the paper should show (via expansion or controlled experiments) that multiple simultaneously activated keys combine linearly in the effective computation rather than through non-linear interactions.
minor comments (2)
- [Figure 3] Figure 3 and Table 1: the value-distribution visualizations would be clearer with an explicit random-key baseline to quantify how much the reported concentration exceeds chance.
- Notation: the mapping from matrix rows/columns to keys and values is introduced without a compact equation; adding a single-line definition (e.g., key_i = row i of W1) would aid readability.
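One compact form such a definition could take is sketched below. It assumes the bias-free layer W2 · f(W1x) quoted in the first major comment, with W1 of shape d_m × d and W2 of shape d × d_m; the symbols k_i, v_i, m_i, d, and d_m are notational assumptions introduced here, not necessarily the paper's.

```latex
\[
k_i = (W_1)_{i,:} \in \mathbb{R}^{d}, \qquad
v_i = (W_2)_{:,i} \in \mathbb{R}^{d}, \qquad
m_i(x) = f(k_i \cdot x), \qquad i = 1, \dots, d_m,
\]
\[
\mathrm{FF}(x) = W_2\, f(W_1 x) = \sum_{i=1}^{d_m} m_i(x)\, v_i .
\]
```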
Simulated Author's Rebuttal
We thank the referee for the insightful comments and the recommendation for major revision. We provide detailed responses to each major comment below, indicating where we will revise the manuscript to address the concerns.
Point-by-point responses
-
Referee: [§4] §4 (pattern extraction and activation analysis): the reported correlations between keys and n-gram patterns are statistical matches only; without causal interventions such as key ablation, activation patching, or counterfactual input edits, it remains possible that the observed associations are side-effects rather than the operative mechanism in the forward pass W2 · f(W1x).
Authors: We acknowledge that the primary evidence consists of strong statistical correlations between the keys and specific textual patterns, identified by finding inputs that highly activate each key. These correlations are not merely side-effects: in the forward pass, a high key activation directly scales the contribution of the associated value to the layer output. Nevertheless, to provide stronger causal evidence, we will add experiments that ablate specific keys and measure the impact on the model's predictions for inputs containing the corresponding patterns. revision: yes
-
Referee: [§3.2] §3.2 (memory composition claim): the assertion that the FF output is exactly a composition of memories is not fully reconciled with the non-linearity f; the paper should show (via expansion or controlled experiments) that multiple simultaneously activated keys combine linearly in the effective computation rather than through non-linear interactions.
Authors: The non-linearity f is applied element-wise to the pre-activations, meaning each key's activation scalar is computed independently as f(key_i · x). The layer output is then the linear combination sum_i activation_i * value_i. Therefore, the memories combine linearly once activated, with the non-linearity affecting only the activation strength of each memory individually. We will revise §3.2 to include this explicit mathematical expansion and present controlled experiments where we compare the actual FF output to the linear combination of individually computed memory contributions. revision: yes
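The expansion in this response can be checked numerically on a toy layer. The sketch below is an illustration, not the authors' promised experiment: it assumes a bias-free layer, uses GELU as one possible choice of f, and draws random weights with the standard library. The names `W1`, `W2`, `ff_matrix`, and `ff_memories` are hypothetical.

```python
import math
import random

def gelu(z):
    """Exact GELU, applied to a single scalar pre-activation."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

random.seed(0)
d, d_m = 4, 6  # toy hidden size and number of memories (assumed)
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d_m)]  # rows are keys
W2 = [[random.gauss(0, 1) for _ in range(d_m)] for _ in range(d)]  # columns are values
x = [random.gauss(0, 1) for _ in range(d)]

# Matrix form: W2 . f(W1 x)
hidden = [gelu(z) for z in matvec(W1, x)]
ff_matrix = matvec(W2, hidden)

# Memory form: sum_i f(key_i . x) * value_i, each memory computed on its own.
ff_memories = [0.0] * d
for i in range(d_m):
    key_i = W1[i]
    value_i = [W2[r][i] for r in range(d)]
    activation_i = gelu(sum(k * xi for k, xi in zip(key_i, x)))
    ff_memories = [acc + activation_i * v for acc, v in zip(ff_memories, value_i)]

# The two forms should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(ff_matrix, ff_memories))
```

Agreement here is an algebraic identity that holds for any element-wise f. It shows that memories combine linearly within one layer, but it does not by itself address interactions introduced by later layers or normalization.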
Circularity Check
No circularity: empirical correlations from trained models do not reduce to self-definition or fitted inputs
full rationale
The paper's central claim rests on post-training analysis of existing transformer weights: identifying input patterns that strongly activate specific rows of the first FF matrix (treated as keys) and observing that the corresponding columns of the second matrix induce next-token distributions (treated as values). These are measured correlations on held-out data and activation statistics, not quantities defined in terms of each other or obtained by fitting a parameter whose value is then relabeled as a prediction. No equations are shown to be equivalent by construction, no uniqueness theorem is imported from the authors' prior work to force the interpretation, and the residual composition argument is demonstrated via direct layer-wise ablation rather than assumed. The derivation chain is therefore self-contained against external benchmarks (the trained models themselves).
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Transformers are trained via next-token prediction on large corpora
invented entities (1)
-
key-value memory structure inside feed-forward layers
no independent evidence
Forward citations
Cited by 23 Pith papers
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
-
Uncovering Entity Identity Confusion in Multimodal Knowledge Editing
Multimodal knowledge editing causes models to confuse original and edited entity identities in text queries by failing to update image-entity bindings and instead overfitting entity-entity shortcuts.
-
Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval
Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.
-
How Language Models Process Negation
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
-
A framework for analyzing concept representations in neural models
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled fr...
-
A Parametric Memory Head for Continual Generative Retrieval
A product-key parametric memory head with selective sparse updates mitigates catastrophic forgetting in generative retrieval models during sequential addition of new documents.
-
One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging
Merging fine-tuned models for multilingual translation fails because fine-tuning redistributes language-specific neurons rather than sharpening them, increasing representational divergence in output-generating layers.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases
LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics
Symmetric spectral diagnostics on attention are structurally blind to flow direction, with asymmetry G as the sole control parameter, yielding a two-axis test that distinguishes bottleneck versus diffuse hallucination...
-
Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments
LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.
-
The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...
-
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
-
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning
BID-LoRA uses bi-directional low-rank adapters with retain/new/unlearn pathways and escape unlearning to enable continual learning and unlearning while minimizing knowledge leakage and parameter updates.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Automated Attention Pattern Discovery at Scale in Large Language Models
AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.
-
The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse
Bidirectional objectives mitigate reversal by requiring explicit source-as-target signals and storing directions as distinct representations instead of inducing latent generalization.
-
The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.
-
From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models
HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.