Learning the Cue or Learning the Word? Analyzing Generalization in Metaphor Detection for Verbs

Alexander Fraser; Sabine Schulte im Walde; Sinan Kurtyigit

arxiv: 2604.13713 · v1 · submitted 2026-04-15 · 💻 cs.CL

Pith reviewed 2026-05-10 12:58 UTC · model grok-4.3

classification 💻 cs.CL

keywords metaphor detection · generalization · lexical hold-out · contextual patterns · verb lemmas · RoBERTa · VU Amsterdam Metaphor Corpus

The pith

Metaphor detection models generalize mainly by learning transferable sentence patterns rather than memorizing specific verbs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether strong performance in verb metaphor detection comes from picking up general contextual signals or from storing knowledge about particular verbs. A lexical hold-out design withholds every instance of chosen verb lemmas from fine-tuning, then measures results on those unseen verbs versus verbs the model saw during training. The model keeps solid accuracy on the held-out verbs. Using sentence context by itself reaches the same level as the full model for those verbs, but static verb embeddings do not. These patterns indicate that generalization rests chiefly on learning reusable contextual cues, while exposure to a verb supplies only an extra increment.

Core claim

A lexical hold-out setup on the VU Amsterdam Metaphor Corpus, using RoBERTa, excludes all instances of selected verb lemmas from fine-tuning. The model maintains robust metaphor detection on these held-out lemmas. Sentence context alone matches full-model performance on held-out lemmas, whereas static verb-level embeddings do not. Generalization is therefore driven primarily by learning transferable contextual patterns, with verb-specific memorization supplying an additive boost only when lexical exposure is available.

What carries the argument

Lexical hold-out setup that withholds every instance of chosen verb lemmas from fine-tuning and compares performance on held-out versus exposed lemmas to isolate contextual pattern learning from verb memorization.
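
A lexical hold-out split of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Instance` record, the `lexical_holdout_split` helper, the toy sentences, and the choice of seeded random sampling for lemma selection are all hypothetical (the referee report notes that the paper's selection criteria are not stated).

```python
import random
from typing import NamedTuple


class Instance(NamedTuple):
    """One annotated verb occurrence (hypothetical record layout)."""
    sentence: str
    verb_lemma: str
    is_metaphor: bool


def lexical_holdout_split(instances, n_heldout, seed=0):
    """Withhold every instance of n_heldout lemmas from the fine-tuning set.

    Lemma selection by seeded random sampling is an assumption made
    for illustration only.
    """
    lemmas = sorted({inst.verb_lemma for inst in instances})
    rng = random.Random(seed)
    heldout_lemmas = set(rng.sample(lemmas, n_heldout))
    # Fine-tuning data: only instances whose lemma was NOT selected.
    train = [i for i in instances if i.verb_lemma not in heldout_lemmas]
    # Held-out evaluation data: every instance of the selected lemmas.
    heldout_eval = [i for i in instances if i.verb_lemma in heldout_lemmas]
    return heldout_lemmas, train, heldout_eval


# Toy data, invented for this sketch (not drawn from the corpus).
data = [
    Instance("She attacked every point in his argument", "attack", True),
    Instance("The dog attacked the intruder", "attack", False),
    Instance("He grasped the idea quickly", "grasp", True),
    Instance("He grasped the rope", "grasp", False),
    Instance("They devoured the novel", "devour", True),
    Instance("They devoured the pizza", "devour", False),
]

heldout_lemmas, train, heldout_eval = lexical_holdout_split(data, n_heldout=1)
```

The split operates on lemmas rather than sentences, so no instance of a selected verb can reach fine-tuning. It does nothing about exposure during pretraining, which is outside the reach of any split applied at the fine-tuning stage.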

Load-bearing premise

Completely excluding all instances of selected verb lemmas from fine-tuning isolates contextual pattern learning without any residual verb-specific memorization or data leakage.

What would settle it

If providing only sentence context to the model produced markedly lower performance on held-out lemmas than the full model, or if hidden representations still encoded verb identity despite the hold-out, the claim that context alone drives generalization would be falsified.

Figures

Figures reproduced from arXiv: 2604.13713 by Alexander Fraser, Sabine Schulte im Walde, Sinan Kurtyigit.

Original abstract

Metaphor detection models achieve strong benchmark performance, yet it remains unclear whether this reflects transferable generalization or lexical memorization. To address this, we analyze generalization in metaphor detection through RoBERTa, the shared backbone of many state-of-the-art systems, focusing on English verbs using the VU Amsterdam Metaphor Corpus. We introduce a controlled lexical hold-out setup where all instances of selected target lemmas are strictly excluded from fine-tuning, and compare predictions on these Held-out lemmas against Exposed lemmas (verbs seen during fine-tuning). While the model performs best on Exposed lemmas, it maintains robust performance on Held-out lemmas. Further analysis reveals that sentence context alone is sufficient to match full-model performance on Held-out lemmas, whereas static verb-level embeddings are not. Together, these results suggest that generalization is primarily driven by "learning the cue" (transferable contextual patterns), while "learning the word" (verb-specific memorization) provides an additive boost when lexical exposure is available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The lexical hold-out points to contextual cues driving generalization in verb metaphor detection, though pretraining may complicate the isolation.

The punchline: the authors' lexical hold-out experiment indicates that generalization in verb metaphor detection comes mainly from learning contextual patterns, with verb-specific exposure providing only an extra lift. They reach this by fine-tuning RoBERTa on the VU Amsterdam Metaphor Corpus while holding out all instances of selected lemmas, and still obtaining robust results on those lemmas.

What is new here is the strict separation of lemmas between fine-tuning and test sets, plus the follow-up showing that sentence context alone matches the full model's performance on held-out cases whereas static verb embeddings do not. This is a direct test of cue versus word learning that goes beyond standard benchmarks. The paper handles the contrast cleanly and focuses on a sensible target: verbs in an established corpus. The qualitative conclusion follows from the setup without obvious circularity.

The soft spot is pretraining. RoBERTa has almost certainly encountered the held-out verbs during its original training, so performance on them could reflect pretrained verb knowledge working with the fine-tuned context encoder rather than purely transferable cues learned from scratch in this task. The additive effect for exposed lemmas is more convincing, but the isolation of 'learning the cue' during fine-tuning is not complete. The work is also limited to one English corpus and one backbone model, which keeps the scope narrow.

This paper suits readers interested in probing what language models learn in figurative language tasks and in designing better generalization tests. It is worth sending to a serious referee because the experimental idea is useful and the results could guide future robustness work, even if the interpretation needs tightening around pretraining effects. I would recommend peer review with attention to that point.

Referee Report

2 major / 2 minor

Summary. The paper examines whether strong performance in verb metaphor detection with RoBERTa on the VU Amsterdam Metaphor Corpus stems from learning transferable contextual patterns ('the cue') or verb-specific memorization ('the word'). It introduces a lexical hold-out protocol that excludes all instances of selected target lemmas from fine-tuning and compares model predictions on these held-out lemmas versus exposed lemmas seen during fine-tuning. The authors report robust performance on held-out lemmas, show that sentence context alone suffices to match full-model results for held-out cases while static verb embeddings do not, and conclude that generalization is driven primarily by cue learning with an additive boost from word exposure.

Significance. If the lexical hold-out successfully isolates contextual pattern learning acquired during fine-tuning, the work would offer a useful empirical decomposition of generalization mechanisms in metaphor detection and similar tasks, informing whether future models should prioritize contextual feature engineering or lexical specialization. The controlled contrast between held-out and exposed conditions is a methodological strength that enables direct comparison.

major comments (2)

[Experimental design / lexical hold-out] The lexical hold-out setup (described in the experimental design) excludes target lemmas only from the fine-tuning stage but leaves intact any verb-specific representations acquired during RoBERTa's pretraining on large corpora that almost certainly contain the held-out lemmas in both literal and metaphorical contexts. Consequently, the observed robust performance and sufficiency of sentence context on held-out lemmas may reflect interaction between pretrained verb knowledge and fine-tuned contextual patterns rather than purely transferable cue learning; the additive boost for exposed lemmas would then be only the marginal fine-tuning effect. This directly affects the central claim that generalization is 'primarily driven by learning the cue.'
[Results / ablation studies] The ablation showing that 'sentence context alone' matches full-model performance on held-out lemmas (results section) does not specify the exact implementation (e.g., whether the target verb token is masked, replaced by a generic embedding, or removed entirely). Without this detail it is unclear whether the context-only condition still benefits from any residual verb-specific information encoded in the pretrained transformer layers, weakening the contrast with static verb-level embeddings.
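
The options named in the second major comment differ in what the encoder actually sees. The sketch below contrasts two of them, masking the target verb versus removing it. Both helper functions and the example sentence are hypothetical; `<mask>` is RoBERTa's mask token, and nothing here is taken from the paper, which does not specify its procedure.

```python
def mask_target(tokens, target_index, mask_token="<mask>"):
    """Replace the target verb with the mask token; sentence length is kept."""
    return tokens[:target_index] + [mask_token] + tokens[target_index + 1:]


def drop_target(tokens, target_index):
    """Remove the target verb entirely; sentence length shrinks by one."""
    return tokens[:target_index] + tokens[target_index + 1:]


# Invented example sentence; index 1 is the target verb "attacked".
tokens = "She attacked every point in his argument".split()

masked = mask_target(tokens, 1)
dropped = drop_target(tokens, 1)
```

Masking preserves the verb's position, so the model still knows where a word is missing. Removal leaves a sentence that may be ungrammatical and shifts every later token by one position. The two variants can therefore yield different context-only scores, which is why the exact implementation matters for interpreting the ablation.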

minor comments (2)

[Abstract] The abstract states the qualitative conclusions but omits all quantitative results (accuracy, F1, or statistical significance) for the held-out versus exposed conditions; including these numbers would allow readers to assess effect sizes immediately.
[Methods] Lemma selection criteria for the hold-out set (frequency, metaphoricity distribution, or random sampling) are not stated explicitly; adding this information would improve reproducibility.
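
A per-condition comparison of the kind requested in the first minor comment could be computed as follows. This is a sketch under assumptions: the `(condition, gold, predicted)` record layout and both function names are hypothetical, and the toy records are invented to exercise the code. They are not results from the paper.

```python
def f1_score(gold, pred):
    """F1 for the positive (metaphorical) class from parallel boolean lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_condition(records):
    """Group (condition, gold, pred) records and score each condition."""
    groups = {}
    for condition, gold, pred in records:
        golds, preds = groups.setdefault(condition, ([], []))
        golds.append(gold)
        preds.append(pred)
    return {c: f1_score(g, p) for c, (g, p) in groups.items()}


# Toy records, invented for illustration only.
records = [
    ("exposed", True, True),
    ("exposed", False, False),
    ("exposed", True, True),
    ("held-out", True, True),
    ("held-out", True, False),
    ("held-out", False, False),
]

scores = f1_by_condition(records)
```

Reporting the two scores side by side, ideally with a significance test over several random seeds, would let readers judge the size of the gap between Exposed and Held-out lemmas directly.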

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and thoughtful review of our paper. Their comments highlight important aspects of our experimental design and results presentation. We provide point-by-point responses below and indicate where revisions will be made to the manuscript.

Referee: [Experimental design / lexical hold-out] The lexical hold-out setup (described in the experimental design) excludes target lemmas only from the fine-tuning stage but leaves intact any verb-specific representations acquired during RoBERTa's pretraining on large corpora that almost certainly contain the held-out lemmas in both literal and metaphorical contexts. Consequently, the observed robust performance and sufficiency of sentence context on held-out lemmas may reflect interaction between pretrained verb knowledge and fine-tuned contextual patterns rather than purely transferable cue learning; the additive boost for exposed lemmas would then be only the marginal fine-tuning effect. This directly affects the central claim that generalization is 'primarily driven by learning the cue.'

Authors: We appreciate this observation regarding the pretraining phase. Our lexical hold-out is designed to prevent task-specific exposure to the target lemmas during fine-tuning on the metaphor detection task, thereby testing whether the model can generalize using contextual cues learned from other verbs. While we acknowledge that RoBERTa’s pretraining likely includes occurrences of these lemmas, the comparison between held-out and exposed conditions isolates the additional effect of fine-tuning on specific lemmas. The strong performance on held-out lemmas indicates that transferable cue learning occurs during fine-tuning, independent of task-specific lexical memorization. The additive boost for exposed lemmas supports the contribution of word learning. To clarify this distinction, we will add a paragraph in the discussion section explicitly addressing the role of pretraining and how our protocol controls for task-specific generalization. revision: yes
Referee: [Results / ablation studies] The ablation showing that 'sentence context alone' matches full-model performance on held-out lemmas (results section) does not specify the exact implementation (e.g., whether the target verb token is masked, replaced by a generic embedding, or removed entirely). Without this detail it is unclear whether the context-only condition still benefits from any residual verb-specific information encoded in the pretrained transformer layers, weakening the contrast with static verb-level embeddings.

Authors: We agree that the implementation details of the 'sentence context alone' condition are insufficiently specified in the current manuscript. We will revise the results section to clearly describe the procedure used for this ablation, including how the target verb is handled and how predictions are made from context alone. This will allow readers to better assess whether residual verb information from pretraining affects the results and will strengthen the comparison to the static verb embeddings condition. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical hold-out experiments are self-contained

The paper conducts an empirical analysis by training RoBERTa on the VU Amsterdam Metaphor Corpus under a lexical hold-out protocol that excludes all instances of selected lemmas from fine-tuning, then directly compares performance metrics between Held-out and Exposed lemmas. Claims about 'learning the cue' versus 'learning the word' are inferences drawn from these controlled comparisons (sentence context sufficiency, static embeddings insufficiency) rather than any derivation, equation, or fitted parameter that reduces to the input data by construction. No self-citations are load-bearing for the central result, and the setup is externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions about data distribution and model behavior plus the domain assumption that the hold-out protocol cleanly separates contextual from lexical learning.

axioms (2)

domain assumption: The VU Amsterdam Metaphor Corpus contains representative verb metaphor instances suitable for generalization testing.
The entire analysis is performed on this single corpus.
domain assumption: Excluding all instances of a lemma from fine-tuning prevents any verb-specific memorization.
This is the core premise of the held-out versus exposed comparison.

pith-pipeline@v0.9.0 · 5470 in / 1306 out tokens · 62834 ms · 2026-05-10T12:58:06.713914+00:00 · methodology

