Training Data Size Sensitivity in Unsupervised Rhyme Recognition
Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3
The pith
RhymeTagger exceeds human inter-annotator agreement on rhyme identification once given enough training data across seven languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RhymeTagger, a language-independent tool that identifies rhymes by detecting repeating sound patterns in poetry corpora, reaches and then exceeds inter-annotator agreement levels once supplied with adequate training data. The same evaluation shows that large language models using one-shot prompting without explicit phonetic information remain far below both human agreement and the unsupervised tool. Performance varies with language and training size, and human disagreements themselves correlate with phonetic similarity between candidate rhymes and their distance within the poem.
What carries the argument
RhymeTagger, which locates rhymes through unsupervised detection of repeating phonological patterns across an entire poetry corpus without hand-coded rules or phonetic dictionaries.
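A minimal sketch of how repeating line-ending patterns can be scored without rules or dictionaries. This is not RhymeTagger's actual algorithm: the function names, the use of the last few letters as a stand-in for a phonetic ending, the line window, and the PMI-style association score are all assumptions made for illustration.

```python
import math
from collections import Counter


def line_ending(line, n=3):
    """Crude stand-in for a phonetic ending: last n letters of the final word."""
    words = [w.strip(".,;:!?\"'()").lower() for w in line.split()]
    words = [w for w in words if w]
    return words[-1][-n:] if words else ""


def ending_pair_scores(poems, window=2, n=3):
    """PMI-style scores for pairs of endings that occur within `window` lines.

    `poems` is a list of poems, each a list of line strings (hypothetical input
    format). Higher scores mean the two endings co-occur more often than their
    individual frequencies would predict.
    """
    single = Counter()
    pair = Counter()
    for poem in poems:
        endings = [line_ending(line, n) for line in poem]
        for i, a in enumerate(endings):
            if not a:
                continue
            single[a] += 1
            for b in endings[i + 1:i + 1 + window]:
                if b:
                    pair[tuple(sorted((a, b)))] += 1

    total_single = sum(single.values())
    total_pair = sum(pair.values())
    scores = {}
    for (a, b), count in pair.items():
        p_ab = count / total_pair
        p_a = single[a] / total_single
        p_b = single[b] / total_single
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

On a toy corpus of couplets, identical endings in adjacent lines should score higher than endings that only meet across couplet boundaries. A real system would work on phonetic transcriptions and far larger corpora.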
If this is right
- Rhyme recognition accuracy rises with training corpus size until a language-specific sufficiency threshold is crossed.
- Once above that threshold the unsupervised method becomes more consistent than human experts on the same material.
- Large language models without phonetic input cannot match the pattern-based tool even in one-shot settings.
- Phonetic similarity and line distance between words are measurable sources of human annotation disagreement.
- Language differences affect both the minimum data needed and the final accuracy ceiling.
Where Pith is reading between the lines
- The same pattern-matching approach could be applied to other historically variable poetic features such as meter or alliteration.
- Adding phonetic embeddings to large language models might close the performance gap observed here.
- Digital archives could use the method to tag rhyme structures at scale and enable quantitative studies of rhyme evolution across centuries.
- Minimum training sizes per language could serve as practical guidelines for building similar tools for under-resourced poetic traditions.
Load-bearing premise
Inter-annotator agreement on a manually annotated subset of poems is a suitable upper-bound benchmark for what counts as reliable automated rhyme recognition.
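To make the benchmark concrete, here is a minimal sketch of one common agreement statistic, Cohen's kappa, computed over two annotators' labels for the same candidate pairs. The text above does not say which agreement measure the paper uses, so the choice of kappa, the function name, and the label encoding (1 = rhyme, 0 = no rhyme) are assumptions.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal rate per category.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance, which matters when most candidate pairs are trivially non-rhymes.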
What would settle it
If RhymeTagger trained on the largest available corpora still falls short of the measured inter-annotator agreement on the held-out annotated poems, the central performance claim would be falsified.
Original abstract
Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words are rhymed or not. This complicates automated rhyme recognition and evaluation, especially in a multilingual context. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic similarity between rhyming words and their distance from each other in a poem. We also compare RhymeTagger to three large language models using a one-shot learning strategy. Our findings show that, once provided with sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs lacking phonetic representation significantly struggle with the task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RhymeTagger, an unsupervised, language-independent tool that detects rhymes via repeating phonetic patterns in poetry corpora. It reports experiments across seven languages (Czech, German, English, French, Italian, Russian, Slovene) that vary training-corpus size, benchmark the tool against inter-annotator agreement (IAA) measured on a manually annotated poem subset, analyze disagreement factors (phonetic similarity and poetic distance), and compare one-shot LLM performance. The central claim is that, once sufficient training data is supplied, RhymeTagger exceeds IAA while LLMs without explicit phonetic representations perform poorly.
Significance. If the central claim is substantiated, the work supplies a practical, data-scalable unsupervised baseline for multilingual rhyme detection that can exceed human consistency; this would be useful for digital humanities pipelines and for testing phonetic representations in LLMs. The multilingual design and explicit IAA benchmark are positive features. The result is not parameter-free or machine-checked, but the empirical scope is non-trivial.
major comments (3)
- [Evaluation section (IAA benchmark paragraph)] The claim that outperforming IAA demonstrates 'superior reliability' is load-bearing for the main conclusion, yet the manuscript does not report RhymeTagger accuracy broken down on the subset of pairs where the two human annotators disagree. Without this split, it is impossible to distinguish whether the tool resolves genuine ambiguity or simply adopts a more consistent (but not necessarily more correct) labeling rule.
- [Results (training-size curves, Table 2 or equivalent)] The 'sufficient training data' threshold at which RhymeTagger exceeds IAA is reported per language, but no statistical test or confidence interval is given for the crossing point; given the acknowledged language-specific phonetic and poetic-distance confounds, the sensitivity analysis risks over-interpreting small-sample fluctuations as a general data-size effect.
- [LLM comparison paragraph] The one-shot prompt is presented as a fair test of 'LLMs lacking phonetic representation,' but the manuscript does not include an ablation that supplies phonetic transcriptions or IPA to the same models; without that control, the performance gap cannot be attributed specifically to the absence of phonetic features rather than prompt engineering or model scale.
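The split requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: the record layout (annotator A label, annotator B label, tool label) and all names are hypothetical, and because disagreed pairs have no single gold label, the sketch reports which annotator the tool sides with rather than an accuracy.

```python
def split_by_agreement(records):
    """Summarize tool behaviour on agreed vs. disagreed candidate pairs.

    `records` is an iterable of (label_a, label_b, tool_label) tuples,
    where 1 means "rhyme" and 0 means "no rhyme" (assumed encoding).
    """
    records = list(records)
    agreed = [(a, t) for a, b, t in records if a == b]
    disagreed = [(a, b, t) for a, b, t in records if a != b]

    def rate(hits, total):
        # None signals an empty subset instead of dividing by zero.
        return hits / total if total else None

    return {
        "n_agreed": len(agreed),
        # On agreed pairs the shared human label serves as the gold label.
        "accuracy_on_agreed": rate(sum(a == t for a, t in agreed), len(agreed)),
        "n_disagreed": len(disagreed),
        "sides_with_a_on_disagreed": rate(
            sum(a == t for a, _, t in disagreed), len(disagreed)
        ),
        "sides_with_b_on_disagreed": rate(
            sum(b == t for _, b, t in disagreed), len(disagreed)
        ),
    }
```

If the tool sides with one annotator far more often than the other on disagreed pairs, that would suggest it has adopted a consistent labeling rule rather than resolved the ambiguity.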
minor comments (2)
- [Methods] The definition of 'repeating patterns' used by RhymeTagger is referenced but not formalized with pseudocode or an equation; a short algorithmic sketch would improve reproducibility.
- [Results figures] Figure captions for the training-size plots do not state the exact number of poems or rhyme pairs per language, making it difficult to assess whether the reported curves are comparable across languages.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the strength of our claims. We address each major point below, indicating revisions where we agree changes are needed to improve the manuscript.
- Referee: Evaluation section (IAA benchmark paragraph): the claim that outperforming IAA demonstrates 'superior reliability' is load-bearing for the main conclusion, yet the manuscript does not report RhymeTagger accuracy broken down on the subset of pairs where the two human annotators disagree. Without this split, it is impossible to distinguish whether the tool resolves genuine ambiguity or simply adopts a more consistent (but not necessarily more correct) labeling rule.
Authors: We agree this breakdown is necessary to substantiate the reliability claim. In the revised version, we will report RhymeTagger performance separately on the disagreed pairs from the IAA subset. This analysis will use the existing manual annotations and RhymeTagger outputs on those pairs to show whether the model improves on ambiguous cases or merely enforces consistency. revision: yes
- Referee: Results (training-size curves, Table 2 or equivalent): the 'sufficient training data' threshold at which RhymeTagger exceeds IAA is reported per language, but no statistical test or confidence interval is given for the crossing point; given the acknowledged language-specific phonetic and poetic-distance confounds, the sensitivity analysis risks over-interpreting small-sample fluctuations as a general data-size effect.
Authors: This observation is correct and we will strengthen the analysis. The revision will incorporate bootstrap-derived confidence intervals on the performance curves and permutation tests to evaluate the statistical significance of the crossing points. We will also emphasize that language-specific factors are already analyzed in the discussion section. revision: yes
- Referee: LLM comparison paragraph: the one-shot prompt is presented as a fair test of 'LLMs lacking phonetic representation,' but the manuscript does not include an ablation that supplies phonetic transcriptions or IPA to the same models; without that control, the performance gap cannot be attributed specifically to the absence of phonetic features rather than prompt engineering or model scale.
Authors: We accept that an IPA ablation would provide stronger causal attribution. However, such an experiment requires non-trivial changes to prompting and input handling for the LLMs, which falls outside the scope of the current one-shot comparison focused on standard text-based usage. We will revise the text to explicitly acknowledge this limitation and note that the observed gap may involve multiple factors while still illustrating the value of explicit phonetic modeling. revision: partial
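The bootstrap confidence intervals promised in the second response could look roughly like this. It is a sketch only: the percentile method, the resampling unit (one value per poem, for example the tool's score minus the IAA level), and all names and defaults are assumptions, not the authors' planned procedure.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

If the interval for the per-poem difference at a given training size excludes zero, the crossing of the IAA level at that size is less likely to be a small-sample fluctuation. Resampling whole poems rather than individual rhyme pairs keeps dependent pairs within a poem together.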
Circularity Check
No significant circularity; empirical comparisons to independent annotations.
The paper conducts an empirical study measuring RhymeTagger performance (pattern-based, unsupervised) against inter-annotator agreement on a manually annotated subset and against LLM outputs. No equations, derivations, or self-citations reduce the reported accuracies or data-size sensitivities to quantities fitted from the evaluation data itself. IAA serves as an external benchmark derived from human judgments independent of the tool's pattern extraction; varying training corpus sizes and comparing outputs does not create self-definitional or fitted-input circularity. The central claims rest on direct, falsifiable comparisons rather than internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Rhymes can be reliably identified from repeating phonetic patterns in poetry corpora without language-specific rules
- domain assumption: Inter-annotator agreement on rhyme labels constitutes a meaningful performance ceiling for automated systems