
arxiv: 2604.08156 · v1 · submitted 2026-04-09 · 💻 cs.CL

Training Data Size Sensitivity in Unsupervised Rhyme Recognition

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords: rhyme recognition, unsupervised learning, RhymeTagger, inter-annotator agreement, multilingual poetry, large language models, training data sensitivity, phonetic patterns

The pith

RhymeTagger exceeds human inter-annotator agreement on rhyme identification once given enough training data across seven languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how much unlabeled poetry data RhymeTagger needs to recognize rhymes reliably without language-specific rules. It runs experiments on Czech, German, English, French, Italian, Russian, and Slovene corpora and measures results against expert human agreement on the same poems. The authors also test three large language models in a one-shot setting. With sufficient data the tool surpasses typical human consistency while the LLMs perform poorly because they lack phonetic representations. The work matters because rhyme is historically variable and people disagree on borderline cases, so a data-driven benchmark helps separate genuine pattern detection from annotation noise.

Core claim

RhymeTagger, a language-independent tool that identifies rhymes by detecting repeating sound patterns in poetry corpora, reaches and then exceeds inter-annotator agreement levels once supplied with adequate training data. The same evaluation shows that large language models using one-shot prompting without explicit phonetic information remain far below both human agreement and the unsupervised tool. Performance varies with language and training size, and human disagreements themselves correlate with phonetic similarity between candidate rhymes and their distance within the poem.

What carries the argument

RhymeTagger, which locates rhymes through unsupervised detection of repeating phonological patterns across an entire poetry corpus without hand-coded rules or phonetic dictionaries.
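
To make "repeating patterns" concrete, the following is a toy sketch of corpus-level pattern counting. It is not RhymeTagger's algorithm: the function names, the last-three-letters stand-in for a phonetic ending, the line window, and the pointwise-mutual-information score are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def ending(line, n=3):
    """Crude stand-in for a phonetic ending: last n letters of the last word."""
    words = [w.strip(".,;:!?\"'").lower() for w in line.split()]
    words = [w for w in words if w]
    return words[-1][-n:] if words else ""


def train_pair_scores(poems, window=1, n=3):
    """Score pairs of line endings by how often they co-occur in nearby lines.

    poems: list of poems, each a list of line strings.
    Returns a dict mapping frozenset({ending_a, ending_b}) to a PMI-style score.
    Hypothetical illustration only, not the method used in the paper.
    """
    single = Counter()
    pair = Counter()
    for poem in poems:
        ends = [e for e in (ending(line, n) for line in poem) if e]
        single.update(ends)
        for i, j in combinations(range(len(ends)), 2):
            if j - i <= window:
                pair[frozenset((ends[i], ends[j]))] += 1
    total_single = sum(single.values())
    total_pair = sum(pair.values())
    scores = {}
    for key, count in pair.items():
        items = sorted(key)
        a, b = items[0], items[-1]  # a == b when both lines share one ending
        p_ab = count / total_pair
        p_a = single[a] / total_single
        p_b = single[b] / total_single
        scores[key] = math.log(p_ab / (p_a * p_b))
    return scores
```

With more poems, ending pairs that recur in neighbouring lines accumulate higher scores than incidental neighbours, which is the intuition behind the claim that accuracy depends on training corpus size.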

If this is right

  • Rhyme recognition accuracy rises with training corpus size until a language-specific sufficiency threshold is crossed.
  • Once above that threshold the unsupervised method becomes more consistent than human experts on the same material.
  • Large language models without phonetic input cannot match the pattern-based tool even in one-shot settings.
  • Phonetic similarity and line distance between words are measurable sources of human annotation disagreement.
  • Language differences affect both the minimum data needed and the final accuracy ceiling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern-matching approach could be applied to other historically variable poetic features such as meter or alliteration.
  • Adding phonetic embeddings to large language models might close the performance gap observed here.
  • Digital archives could use the method to tag rhyme structures at scale and enable quantitative studies of rhyme evolution across centuries.
  • Minimum training sizes per language could serve as practical guidelines for building similar tools for under-resourced poetic traditions.

Load-bearing premise

Inter-annotator agreement on a manually annotated subset of poems is a suitable upper-bound benchmark for what counts as reliable automated rhyme recognition.
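
As a rough illustration of what such a benchmark measures, the sketch below computes two common agreement statistics for a pair of annotators: a symmetric F1 over the sets of line pairs each marked as rhyming, and Cohen's kappa over binary labels. How the paper itself computes agreement is not stated on this page, so the function names and input shapes here are assumptions.

```python
def f1_between(pairs_a, pairs_b):
    """F1 between two annotators' sets of rhyme pairs (symmetric in A and B)."""
    a, b = set(pairs_a), set(pairs_b)
    if not a and not b:
        return 1.0  # both marked nothing: treated as perfect agreement
    overlap = len(a & b)
    return 2 * overlap / (len(a) + len(b))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of binary labels (1 = rhyme)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    # Agreement expected by chance if both annotators labelled independently.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Scoring a tool "against human annotators" with the same F1 function, using one annotator's pairs as the reference, puts the tool and the second annotator on a common scale. That is the sense in which agreement can serve as a benchmark, and also why its status as an upper bound is a premise rather than a given.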

What would settle it

If RhymeTagger trained on the largest available corpora still falls short of the measured inter-annotator agreement on the held-out annotated poems, the central performance claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.08156 by Antonina Martynenko, Artjoms Šeļa, Ben Nagy, Lara Nugues, Luca Giovannini, Mirella De Sisto, Neža Kočnik, Petr Plecháč, Robert Kolár, Silvie Cinková.

Figure 1. F1-scores achieved by RhymeTagger against human annotators. 100 models per single model size.

… en and fr, the results remain consistent across the entire range, from 1k-line models to those trained on 1M lines. We assume this is due to the predominance of perfect rhyme matches in these languages, which require little learning from the training data for RhymeTagger to recognize. In other languages, by contrast…
Original abstract

Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words are rhymed or not. This complicates automated rhyme recognition and evaluation, especially in a multilingual context. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic similarity between rhyming words and their distance from each other in a poem. We also compare RhymeTagger to three large language models using a one-shot learning strategy. Our findings show that, once provided with sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs lacking phonetic representation significantly struggle with the task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RhymeTagger, an unsupervised, language-independent tool that detects rhymes via repeating phonetic patterns in poetry corpora. It reports experiments across seven languages (Czech, German, English, French, Italian, Russian, Slovene) that vary training-corpus size, benchmark the tool against inter-annotator agreement (IAA) measured on a manually annotated poem subset, analyze disagreement factors (phonetic similarity and poetic distance), and compare one-shot LLM performance. The central claim is that, once sufficient training data is supplied, RhymeTagger exceeds IAA while LLMs without explicit phonetic representations perform poorly.

Significance. If the central claim is substantiated, the work supplies a practical, data-scalable unsupervised baseline for multilingual rhyme detection that can exceed human consistency; this would be useful for digital humanities pipelines and for testing phonetic representations in LLMs. The multilingual design and explicit IAA benchmark are positive features. The result is not parameter-free or machine-checked, but the empirical scope is non-trivial.

major comments (3)
  1. [Evaluation section (IAA benchmark paragraph)] The claim that outperforming IAA demonstrates 'superior reliability' is load-bearing for the main conclusion, yet the manuscript does not report RhymeTagger accuracy broken down on the subset of pairs where the two human annotators disagree. Without this split, it is impossible to distinguish whether the tool resolves genuine ambiguity or simply adopts a more consistent (but not necessarily more correct) labeling rule.
  2. [Results (training-size curves, Table 2 or equivalent)] The 'sufficient training data' threshold at which RhymeTagger exceeds IAA is reported per language, but no statistical test or confidence interval is given for the crossing point; given the acknowledged language-specific phonetic and poetic-distance confounds, the sensitivity analysis risks over-interpreting small-sample fluctuations as a general data-size effect.
  3. [LLM comparison paragraph] The one-shot prompt is presented as a fair test of 'LLMs lacking phonetic representation,' but the manuscript does not include an ablation that supplies phonetic transcriptions or IPA to the same models; without that control, the performance gap cannot be attributed specifically to the absence of phonetic features rather than prompt engineering or model scale.
minor comments (2)
  1. [Methods] The definition of 'repeating patterns' used by RhymeTagger is referenced but not formalized with pseudocode or an equation; a short algorithmic sketch would improve reproducibility.
  2. [Results figures] Figure captions for the training-size plots do not state the exact number of poems or rhyme pairs per language, making it difficult to assess whether the reported curves are comparable across languages.
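
The breakdown requested in major comment 1 could look roughly like the sketch below. It assumes two annotators and a tool, each giving a binary label per candidate pair; the function name, the dictionary inputs, and the returned keys are hypothetical. Because disagreed pairs have no single gold label, the sketch reports which annotator the tool sides with rather than an accuracy.

```python
def split_by_agreement(annot_a, annot_b, predicted):
    """Compare tool output with humans, separately where humans agree and disagree.

    Each argument maps a candidate pair id to a binary label (1 = rhyme).
    Hypothetical helper for illustration; not taken from the paper.
    """
    agreed_hits = agreed_total = 0
    sided_with_a = disagreed_total = 0
    for pair, label_a in annot_a.items():
        label_b = annot_b[pair]
        guess = predicted[pair]
        if label_a == label_b:
            agreed_total += 1
            agreed_hits += int(guess == label_a)
        else:
            # No consensus label exists here, so only record whose side the tool takes.
            disagreed_total += 1
            sided_with_a += int(guess == label_a)
    return {
        "n_agreed": agreed_total,
        "accuracy_on_agreed": agreed_hits / agreed_total if agreed_total else None,
        "n_disagreed": disagreed_total,
        "share_siding_with_a": sided_with_a / disagreed_total if disagreed_total else None,
    }
```

A tool that scores well on agreed pairs but always sides with the same annotator on disagreed pairs would look "more consistent than humans" without resolving any ambiguity, which is the distinction the referee asks the authors to report.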

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the strength of our claims. We address each major point below, indicating revisions where we agree changes are needed to improve the manuscript.

  1. Referee: Evaluation section (IAA benchmark paragraph): the claim that outperforming IAA demonstrates 'superior reliability' is load-bearing for the main conclusion, yet the manuscript does not report RhymeTagger accuracy broken down on the subset of pairs where the two human annotators disagree. Without this split, it is impossible to distinguish whether the tool resolves genuine ambiguity or simply adopts a more consistent (but not necessarily more correct) labeling rule.

    Authors: We agree this breakdown is necessary to substantiate the reliability claim. In the revised version, we will report RhymeTagger performance separately on the disagreed pairs from the IAA subset. This analysis will use the existing manual annotations and RhymeTagger outputs on those pairs to show whether the model improves on ambiguous cases or merely enforces consistency. revision: yes

  2. Referee: Results (training-size curves, Table 2 or equivalent): the 'sufficient training data' threshold at which RhymeTagger exceeds IAA is reported per language, but no statistical test or confidence interval is given for the crossing point; given the acknowledged language-specific phonetic and poetic-distance confounds, the sensitivity analysis risks over-interpreting small-sample fluctuations as a general data-size effect.

    Authors: This observation is correct and we will strengthen the analysis. The revision will incorporate bootstrap-derived confidence intervals on the performance curves and permutation tests to evaluate the statistical significance of the crossing points. We will also emphasize that language-specific factors are already analyzed in the discussion section. revision: yes

  3. Referee: LLM comparison paragraph: the one-shot prompt is presented as a fair test of 'LLMs lacking phonetic representation,' but the manuscript does not include an ablation that supplies phonetic transcriptions or IPA to the same models; without that control, the performance gap cannot be attributed specifically to the absence of phonetic features rather than prompt engineering or model scale.

    Authors: We accept that an IPA ablation would provide stronger causal attribution. However, such an experiment requires non-trivial changes to prompting and input handling for the LLMs, which falls outside the scope of the current one-shot comparison focused on standard text-based usage. We will revise the text to explicitly acknowledge this limitation and note that the observed gap may involve multiple factors while still illustrating the value of explicit phonetic modeling. revision: partial
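
The bootstrap intervals promised in response 2 are not specified further. One common variant is the percentile bootstrap, sketched here for the mean of a list of per-poem scores (or per-poem differences between the tool's F1 and the human agreement F1). The function name, the resample count, and the choice of resampling poems as units are assumptions, not details from the paper or the rebuttal.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the resampled mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Applied to per-poem differences at each training size, an interval that excludes zero would support the claim that the tool has crossed the human agreement level at that size; an interval that straddles zero would mark the crossing point as uncertain.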

Circularity Check

0 steps flagged

No significant circularity; empirical comparisons to independent annotations.


The paper conducts an empirical study measuring RhymeTagger performance (pattern-based, unsupervised) against inter-annotator agreement on a manually annotated subset and against LLM outputs. No equations, derivations, or self-citations reduce the reported accuracies or data-size sensitivities to quantities fitted from the evaluation data itself. IAA serves as an external benchmark derived from human judgments independent of the tool's pattern extraction; varying training corpus sizes and comparing outputs does not create self-definitional or fitted-input circularity. The central claims rest on direct, falsifiable comparisons rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based solely on abstract; no free parameters, invented entities, or non-standard axioms are mentioned. Relies on standard domain assumptions in unsupervised NLP and annotation studies.

axioms (2)
  • domain assumption Rhymes can be reliably identified from repeating phonetic patterns in poetry corpora without language-specific rules
    Core premise of the RhymeTagger tool as described in the abstract.
  • domain assumption Inter-annotator agreement on rhyme labels constitutes a meaningful performance ceiling for automated systems
    Used to establish the realistic benchmark referenced in the findings.

pith-pipeline@v0.9.0 · 5545 in / 1388 out tokens · 76737 ms · 2026-05-10T17:08:49.871543+00:00 · methodology

