hub

Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy · 2020 · cs.CL · DOI 10.18653/v1/2021.emnlp-main.446 · arXiv 2012.14913

21 Pith papers cite this work, alongside 186 external citations. Polarity classification is still indexing.

21 Pith papers citing it

186 external citations · Pith

open full Pith review browse 21 citing papers arXiv PDF

abstract

Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys' input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model's layers via residual connections to produce the final output distribution.

hub tools

JSON dossier citing papers JSON publisher DOI arXiv source

representative citing papers

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Uncovering Entity Identity Confusion in Multimodal Knowledge Editing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

Multimodal knowledge editing causes models to confuse original and edited entity identities in text queries by failing to update image-entity bindings and instead overfitting entity-entity shortcuts.

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

How Language Models Process Negation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

A framework for analyzing concept representations in neural models

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.

A Parametric Memory Head for Continual Generative Retrieval

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

A product-key parametric memory head with selective sparse updates mitigates catastrophic forgetting in generative retrieval models during sequential addition of new documents.

One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

Merging fine-tuned models for multilingual translation fails because fine-tuning redistributes language-specific neurons rather than sharpening them, increasing representational divergence in output-generating layers.

Eliciting Latent Predictions from Transformers with the Tuned Lens

cs.LG · 2023-03-14 · accept · novelty 7.0

Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Symmetric spectral diagnostics on attention are structurally blind to flow direction, with asymmetry G as the sole control parameter, yielding a two-axis test that distinguishes bottleneck versus diffuse hallucination modes with opposite polarity.

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

cs.CL · 2026-05-05 · unverdicted · novelty 6.0

LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

cs.LG · 2026-04-26 · conditional · novelty 6.0 · 2 refs

Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

cs.AI · 2026-04-20 · conditional · novelty 6.0

Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

Representation-Guided Parameter-Efficient LLM Unlearning

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

BID-LoRA uses bi-directional low-rank adapters with retain/new/unlearn pathways and escape unlearning to enable continual learning and unlearning while minimizing knowledge leakage and parameter updates.

In-Place Test-Time Training

cs.LG · 2026-04-07 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Automated Attention Pattern Discovery at Scale in Large Language Models

cs.LG · 2026-04-04 · unverdicted · novelty 6.0

AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.

citing papers explorer

Showing 21 of 21 citing papers.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small cs.LG · 2022-11-01 · conditional · none · ref 51 · internal anchor
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
Uncovering Entity Identity Confusion in Multimodal Knowledge Editing cs.CL · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
Multimodal knowledge editing causes models to confuse original and edited entity identities in text queries by failing to update image-entity bindings and instead overfitting entity-entity shortcuts.
Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval stat.ML · 2026-05-06 · unverdicted · none · ref 15 · internal anchor
Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.
How Language Models Process Negation cs.CL · 2026-05-04 · unverdicted · none · ref 57 · internal anchor
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
A framework for analyzing concept representations in neural models cs.CL · 2026-05-02 · unverdicted · none · ref 271 · internal anchor
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.
A Parametric Memory Head for Continual Generative Retrieval cs.IR · 2026-04-25 · unverdicted · none · ref 8 · internal anchor
A product-key parametric memory head with selective sparse updates mitigates catastrophic forgetting in generative retrieval models during sequential addition of new documents.
One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging cs.CL · 2026-04-03 · unverdicted · none · ref 11 · internal anchor
Merging fine-tuned models for multilingual translation fails because fine-tuning redistributes language-specific neurons rather than sharpening them, increasing representational divergence in output-generating layers.
Eliciting Latent Predictions from Transformers with the Tuned Lens cs.LG · 2023-03-14 · accept · none · ref 36 · internal anchor
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 143 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases cs.LG · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics cs.LG · 2026-05-06 · unverdicted · none · ref 12 · internal anchor
Symmetric spectral diagnostics on attention are structurally blind to flow direction, with asymmetry G as the sole control parameter, yielding a two-axis test that distinguishes bottleneck versus diffuse hallucination modes with opposite polarity.
Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments cs.CL · 2026-05-05 · unverdicted · none · ref 14 · internal anchor
LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.
The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation cs.LG · 2026-04-26 · conditional · none · ref 12 · 2 links · internal anchor
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization cs.CL · 2026-04-21 · unverdicted · none · ref 19 · internal anchor
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks cs.AI · 2026-04-20 · conditional · none · ref 26 · internal anchor
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
Representation-Guided Parameter-Efficient LLM Unlearning cs.CL · 2026-04-19 · unverdicted · none · ref 183 · internal anchor
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning cs.LG · 2026-04-14 · unverdicted · none · ref 23 · internal anchor
BID-LoRA uses bi-directional low-rank adapters with retain/new/unlearn pathways and escape unlearning to enable continual learning and unlearning while minimizing knowledge leakage and parameter updates.
In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 22 · internal anchor
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Automated Attention Pattern Discovery at Scale in Large Language Models cs.LG · 2026-04-04 · unverdicted · none · ref 9 · internal anchor
AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.
From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models cs.CV · 2026-04-20 · unverdicted · none · ref 77 · internal anchor
HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.

Transformer Feed-Forward Layers Are Key-Value Memories

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer