arXiv: 2012.14913 · v2 · submitted 2020-12-29 · 💻 cs.CL

Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy

Pith reviewed 2026-05-13 23:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords transformers · feed-forward layers · key-value memories · language models · model interpretability · neural network analysis

The pith

Transformer feed-forward layers function as key-value memories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that feed-forward layers, which hold two-thirds of a transformer's parameters, act as key-value memories. Each key matches particular textual patterns seen during training, while each value produces a distribution over likely next tokens. Lower layers focus on simple surface patterns and upper layers on semantic ones, with the layer output combining multiple such memories. Residual connections then refine the combined result into the final prediction.

Core claim

Feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. The learned patterns are human-interpretable, with lower layers capturing shallow patterns and upper layers learning more semantic ones. The values complement the keys by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern. The output of a feed-forward layer is a composition of its memories, refined throughout the model via residual connections.

What carries the argument

Key-value memory pairs inside each feed-forward layer, where a key detects an input pattern and a value supplies a next-token distribution.
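A minimal sketch of this reading, in plain Python with toy numbers. It assumes ReLU as the non-linearity and uses a hypothetical three-token vocabulary and output embedding `E`; the vectors are made up for illustration and do not come from the paper.

```python
import math

def relu(z):
    return max(0.0, z)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy layer: 2 memories, hidden size 2.
keys = [[1.0, 0.0],    # k_0 responds to the first input feature
        [0.0, 1.0]]    # k_1 responds to the second input feature
values = [[2.0, 0.0],  # v_0
          [0.0, 2.0]]  # v_1

# Hypothetical output embedding: one row per vocabulary token.
vocab = ["cat", "dog", "the"]
E = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

def memory_coefficients(x):
    """Activation of each key on input x: m_i = relu(x . k_i)."""
    return [relu(dot(x, k)) for k in keys]

def ff_output(x):
    """Layer output as a coefficient-weighted sum of value vectors."""
    coeffs = memory_coefficients(x)
    d = len(values[0])
    return [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(d)]

def value_distribution(v):
    """Distribution over the vocabulary induced by a single value vector."""
    return softmax([dot(v, e) for e in E])

x = [1.0, 0.0]                   # an input that matches k_0 only
coeffs = memory_coefficients(x)  # only memory 0 is active
out = ff_output(x)               # equals 1.0 * v_0
p0 = value_distribution(values[0])
top = max(range(len(vocab)), key=lambda i: p0[i])
```

Here the key decides how strongly a memory fires, and the value decides which tokens receive probability mass when it does; in this toy setup `vocab[top]` is the token most favored by `v_0`.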

Load-bearing premise

The correlations between learned keys and input patterns, and between values and output distributions, reflect the actual computation the model performs at inference time.

What would settle it

Alter the weights of one specific key-value pair and measure whether the model's next-token predictions shift only for inputs that match the corresponding pattern.
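The proposed test can be sketched on a toy layer. The code below is an illustration under stated assumptions, not an experiment from the paper: it assumes ReLU, a single feed-forward layer with a residual connection, a hypothetical output embedding `E`, and total variation distance as the (arbitrarily chosen) measure of prediction shift.

```python
import math

def relu(z):
    return max(0.0, z)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ff(x, keys, values):
    """FF(x) = sum_i relu(x . k_i) * v_i."""
    coeffs = [relu(dot(x, k)) for k in keys]
    return [sum(c * v[j] for c, v in zip(coeffs, values))
            for j in range(len(values[0]))]

def next_token_probs(x, keys, values, E):
    """Residual update h = x + FF(x), then project onto the vocabulary."""
    h = [xi + oi for xi, oi in zip(x, ff(x, keys, values))]
    return softmax([dot(h, e) for e in E])

def ablate(values, i):
    """Zero the i-th value vector; all other memories are left unchanged."""
    return [[0.0] * len(v) if j == i else list(v) for j, v in enumerate(values)]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical toy weights and inputs.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 2.0]]
E = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
matching = [1.0, 0.0]      # activates key 0
non_matching = [0.0, 1.0]  # does not activate key 0

ablated = ablate(values, 0)
shift_match = total_variation(
    next_token_probs(matching, keys, values, E),
    next_token_probs(matching, keys, ablated, E))
shift_other = total_variation(
    next_token_probs(non_matching, keys, values, E),
    next_token_probs(non_matching, keys, ablated, E))
```

In this idealized case the ablation moves the prediction only for the matching input. In a trained model keys are unlikely to be this cleanly separated, so the informative quantity would be how much larger `shift_match` is than `shift_other` across many inputs.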

Original abstract

Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys' input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model's layers via residual connections to produce the final output distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Feed-forward layers look like key-value memories here, with interpretable patterns and output effects, but the evidence is correlational rather than causal.

The main thing to know is that this paper frames transformer feed-forward layers as key-value memories: keys match textual patterns from training data, and values bias the output distribution toward likely next tokens, with lower layers capturing shallower patterns and upper layers more semantic ones. The layer output is presented as a composition of these memories that residual connections then refine. This is a useful shift because it targets the bulk of the parameters instead of just attention.

They show the patterns are human-interpretable through activation checks on n-grams and by examining the distributions the value vectors induce over the vocabulary. Those observations line up with the claims and give a concrete way to inspect what the weights have stored. The work is grounded in probing trained models rather than assuming the structure upfront.

The softer spot is exactly the one the stress-test note flags. The matches are statistical, but there is no causal test showing these parameters are read out and used that way in the actual forward pass. The layer equation W2 · f(W1x) could support other computations whose side effects happen to correlate with the observed patterns. Without interventions like targeted edits or ablations that change behavior as predicted, the mechanistic claim stays interpretive.

This is for people working on mechanistic interpretability or knowledge editing in language models. It has enough new empirical detail and reproducible observations to deserve a serious referee, though the review would probably push for causal checks to tighten the argument.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that feed-forward layers in transformer language models function as key-value memories: keys correlate with human-interpretable textual patterns from the training data (shallow in lower layers, semantic in upper layers), values induce complementary next-token distributions, and the layer output is a composition of activated memories that is refined via residual connections to produce the final distribution.

Significance. If the interpretation holds, it supplies a concrete mechanistic account of two-thirds of transformer parameters, grounded in empirical probing that reveals interpretable patterns and input-output complementarity. This could support targeted model editing and deeper understanding of how transformers store and retrieve information.

major comments (2)

[§4] Pattern extraction and activation analysis: the reported correlations between keys and n-gram patterns are statistical matches only; without causal interventions such as key ablation, activation patching, or counterfactual input edits, it remains possible that the observed associations are side-effects rather than the operative mechanism in the forward pass W2 · f(W1x).
[§3.2] Memory composition claim: the assertion that the FF output is exactly a composition of memories is not fully reconciled with the non-linearity f; the paper should show (via expansion or controlled experiments) that multiple simultaneously activated keys combine linearly in the effective computation rather than through non-linear interactions.

minor comments (2)

[Figure 3, Table 1] The value-distribution visualizations would be clearer with an explicit random-key baseline to quantify how much the reported concentration exceeds chance.
Notation: the mapping from matrix rows/columns to keys and values is introduced without a compact equation; adding a single-line definition (e.g., key_i = row i of W1) would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments and the recommendation for major revision. We provide detailed responses to each major comment below, indicating where we will revise the manuscript to address the concerns.

Referee: [§4] Pattern extraction and activation analysis: the reported correlations between keys and n-gram patterns are statistical matches only; without causal interventions such as key ablation, activation patching, or counterfactual input edits, it remains possible that the observed associations are side-effects rather than the operative mechanism in the forward pass W2 · f(W1x).

Authors: We acknowledge that the primary evidence consists of strong statistical correlations between the keys and specific textual patterns, identified by finding inputs that highly activate each key. These correlations are not merely side-effects, as they directly correspond to the computation in the forward pass where high key activation leads to the associated value contributing to the output. Nevertheless, to provide stronger causal evidence, we will add experiments involving the ablation of specific keys and measure the impact on the model's predictions for inputs containing the corresponding patterns.

Revision: yes

Referee: [§3.2] Memory composition claim: the assertion that the FF output is exactly a composition of memories is not fully reconciled with the non-linearity f; the paper should show (via expansion or controlled experiments) that multiple simultaneously activated keys combine linearly in the effective computation rather than through non-linear interactions.

Authors: The non-linearity f is applied element-wise to the pre-activations, meaning each key's activation scalar is computed independently as f(key_i · x). The layer output is then the linear combination sum_i activation_i * value_i. Therefore, the memories combine linearly once activated, with the non-linearity affecting only the activation strength of each memory individually. We will revise §3.2 to include this explicit mathematical expansion and present controlled experiments where we compare the actual FF output to the linear combination of individually computed memory contributions.

Revision: yes
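The expansion described in the second response can be written out as follows. The notation is an assumption made for this sketch: W_1 has one row per memory, W_2 has one column per memory, f acts element-wise, and bias terms are omitted.

```latex
% Assumed notation: W_1 \in \mathbb{R}^{d_m \times d}, W_2 \in \mathbb{R}^{d \times d_m};
% k_i is row i of W_1 (a key), v_i is column i of W_2 (a value).
\begin{align*}
\mathrm{FF}(x) &= W_2 \, f(W_1 x) \\
               &= \sum_{i=1}^{d_m} \bigl[ f(W_1 x) \bigr]_i \, v_i \\
               &= \sum_{i=1}^{d_m} f(k_i \cdot x) \, v_i .
\end{align*}
```

The sum is linear in the value vectors, while each coefficient f(k_i · x) is a non-linear function of the input. The identity holds for any element-wise f, so it settles the algebraic form of the composition but not whether individual memories are interpretable.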

Circularity Check

0 steps flagged

No circularity: empirical correlations from trained models do not reduce to self-definition or fitted inputs

The paper's central claim rests on post-training analysis of existing transformer weights: identifying input patterns that strongly activate specific rows of the first FF matrix (treated as keys) and observing that the corresponding columns of the second matrix induce next-token distributions (treated as values). These are measured correlations on held-out data and activation statistics, not quantities defined in terms of each other or obtained by fitting a parameter whose value is then relabeled as a prediction. No equations are shown to be equivalent by construction, no uniqueness theorem is imported from the authors' prior work to force the interpretation, and the residual composition argument is supported by direct layer-wise analysis rather than assumed. The derivation chain is therefore self-contained against external benchmarks (the trained models themselves).
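The key-side analysis described here, finding the inputs that most strongly activate a given key, can be sketched as a top-t retrieval. This is an illustrative sketch only: `prefix_reprs`, `key`, and `top_triggers` are hypothetical names, the vectors are invented, and ReLU is assumed as the activation.

```python
def relu(z):
    return max(0.0, z)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical layer inputs for a handful of prefixes (made-up vectors).
prefix_reprs = {
    "it was a part of": [0.9, 0.1],
    "he lived in": [0.1, 0.8],
    "she was a part of": [0.7, 0.0],
    "the capital of": [0.2, 0.6],
}

# Hypothetical key: one row of the first feed-forward matrix.
key = [1.0, -0.5]

def top_triggers(key, prefix_reprs, t=2):
    """Return the t prefixes with the highest activation relu(x . key)."""
    scored = [(relu(dot(x, key)), prefix) for prefix, x in prefix_reprs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [prefix for _, prefix in scored[:t]]

triggers = top_triggers(key, prefix_reprs)
```

With these toy vectors the two retrieved prefixes share the suffix "a part of", which is the kind of shallow pattern a human annotator would then be asked to name.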

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on empirical observations from trained transformers rather than new free parameters or invented physical entities; standard language-modeling assumptions are used.

axioms (1)

domain assumption: Transformers are trained via next-token prediction on large corpora.
Invoked implicitly when linking keys to training patterns and values to output distributions.

invented entities (1)

key-value memory structure inside feed-forward layers (no independent evidence)
purpose: Interpretive lens to explain layer behavior
This is a conceptual reframing of existing weights, not a new postulated object with independent falsifiable predictions.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
cs.LG · 2022-11 · conditional · novelty 8.0
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Uncovering Entity Identity Confusion in Multimodal Knowledge Editing
cs.CL · 2026-05 · unverdicted · novelty 7.0
Multimodal knowledge editing causes models to confuse original and edited entity identities in text queries by failing to update image-entity bindings and instead overfitting entity-entity shortcuts.

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval
stat.ML · 2026-05 · unverdicted · novelty 7.0
Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

How Language Models Process Negation
cs.CL · 2026-05 · unverdicted · novelty 7.0
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

A framework for analyzing concept representations in neural models
cs.CL · 2026-05 · unverdicted · novelty 7.0
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled fr...

A Parametric Memory Head for Continual Generative Retrieval
cs.IR · 2026-04 · unverdicted · novelty 7.0
A product-key parametric memory head with selective sparse updates mitigates catastrophic forgetting in generative retrieval models during sequential addition of new documents.

One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging
cs.CL · 2026-04 · unverdicted · novelty 7.0
Merging fine-tuned models for multilingual translation fails because fine-tuning redistributes language-specific neurons rather than sharpening them, increasing representational divergence in output-generating layers.

Eliciting Latent Predictions from Transformers with the Tuned Lens
cs.LG · 2023-03 · accept · novelty 7.0
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG · 2026-05 · unverdicted · novelty 6.0
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases
cs.LG · 2026-05 · unverdicted · novelty 6.0
LLMs exhibit three geometric phases in next-token prediction (seeding multiplexing, hoisting overriding, and focal convergence) where predictive subspaces rise, stabilize, and converge across layers.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
cs.LG · 2026-05 · unverdicted · novelty 6.0
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics
cs.LG · 2026-05 · unverdicted · novelty 6.0
Symmetric spectral diagnostics on attention are structurally blind to flow direction, with asymmetry G as the sole control parameter, yielding a two-axis test that distinguishes bottleneck versus diffuse hallucination...

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments
cs.CL · 2026-05 · unverdicted · novelty 6.0
LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
cs.LG · 2026-04 · conditional · novelty 6.0
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
cs.CL · 2026-04 · unverdicted · novelty 6.0
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
cs.AI · 2026-04 · conditional · novelty 6.0
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

Representation-Guided Parameter-Efficient LLM Unlearning
cs.CL · 2026-04 · unverdicted · novelty 6.0
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning
cs.LG · 2026-04 · unverdicted · novelty 6.0
BID-LoRA uses bi-directional low-rank adapters with retain/new/unlearn pathways and escape unlearning to enable continual learning and unlearning while minimizing knowledge leakage and parameter updates.

In-Place Test-Time Training
cs.LG · 2026-04 · conditional · novelty 6.0
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Automated Attention Pattern Discovery at Scale in Large Language Models
cs.LG · 2026-04 · unverdicted · novelty 6.0
AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.

The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse
cs.CL · 2026-03 · unverdicted · novelty 6.0
Bidirectional objectives mitigate reversal by requiring explicit source-as-target signals and storing directions as distinct representations instead of inducing latent generalization.

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
cs.LG · 2026-04 · unverdicted · novelty 5.0
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models
cs.CV · 2026-04 · unverdicted · novelty 5.0
HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.
