Transformer Feed-Forward Layers Are Key-Value Memories

Canonical reference. 86% of citing Pith papers cite this work as background.

55 citing Pith papers · 186 external citations (Crossref) · cited as background in 86% of classified citations
abstract

Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys' input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model's layers via residual connections to produce the final output distribution.
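
The key-value reading described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the common formulation FF(x) = f(x · Kᵀ) · V with a ReLU activation and no bias terms, and the function name `ffn_as_memory`, the 2-dimensional vectors, and all numbers are hypothetical.

```python
# Sketch: a feed-forward layer read as a key-value memory, FF(x) = f(x . K^T) . V.
# Assumptions: ReLU activation, no bias terms, toy sizes and made-up numbers.

def relu(z):
    return max(0.0, z)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ffn_as_memory(x, keys, values):
    """Return (output, coefficients) for a single input vector x.

    keys:   one vector per memory cell (rows of the first weight matrix)
    values: one vector per memory cell (rows of the second weight matrix)
    """
    # Memory coefficient of each cell: how strongly its key matches x.
    coeffs = [relu(dot(x, k)) for k in keys]
    # The layer output is the coefficient-weighted sum of the value vectors.
    d = len(values[0])
    out = [sum(m * v[j] for m, v in zip(coeffs, values)) for j in range(d)]
    return out, coeffs

# Three hypothetical memory cells in a 2-dimensional hidden space.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
values = [[2.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
x = [0.5, 2.0]

out, coeffs = ffn_as_memory(x, keys, values)
print(coeffs)  # [0.5, 2.0, 0.0] -> the third key does not fire
print(out)     # [1.0, 6.0]      -> mix of the first two values only
```

In this toy case the third cell's key has a negative match with `x`, so its value contributes nothing; the output is a composition of the two active memories. In a full model this output would be added to the residual stream and refined by later layers.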

citation-role summary

background 7

citation-polarity summary

background 6 · support 1

representative citing papers

A framework for analyzing concept representations in neural models

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.

Norm Anchors Make Model Edits Last

cs.LG · 2026-01-30 · conditional · novelty 7.0

Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.

Improving Dictionary Learning with Gated Sparse Autoencoders

cs.LG · 2024-04-24 · unverdicted · novelty 7.0

Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.

Cross-Lingual Exploration for Parametric Knowledge

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

Cross-lingual prompt exploration improves factual recall and consistency in LLMs across 17 languages more efficiently than native-language scaling.

Variable-Width Transformers

cs.CL · 2026-06-16 · conditional · novelty 6.0

×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.

Inside the LLM Word Factory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.

Multi-component Causal Tracing in Large Language Models

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

A unified multi-component causal tracing method that uses soft interventions and a metric transformation to efficiently select critical LLM components for a target performance metric.
