arxiv: 2605.19575 · v1 · pith:FZNQ26HCnew · submitted 2026-05-19 · 💻 cs.CL

A Data-Driven Approach to Idiomaticity Based on Experts' Criteria in Theoretical Linguistics

Pith reviewed 2026-05-20 05:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords idiomaticity · multi-word expressions · expert annotation · lexical criteria · grammatical criteria · theoretical linguistics · data-driven analysis

The pith

Expert ratings of 286 multi-word expressions show none qualify as absolutely idiomatic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects 286 multi-word expressions from theoretical linguistics sources and has experts annotate them using 16 lexical, grammatical, and other criteria for idiomaticity. The distribution of annotations reveals that lexical criteria exert the strongest influence on judgments, while grammatical criteria only apply under specific conditions. No expression receives ratings that would classify it as fully idiomatic across all criteria. Obsolete words and grammar further reduce the likelihood that an expression can be replaced by a single word. These patterns indicate that idiomaticity operates as a graded property shaped by particular linguistic features rather than an all-or-nothing category.

Core claim

The central claim is that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; and the presence of obsolete words and grammar influences whether a multi-word expression (MWE) can be replaced with one word.

What carries the argument

The 16 criteria drawn from theoretical linguistics sources, applied through expert annotation to MWEs collected from the same sources.

If this is right

  • Lexical criteria provide the primary signal for distinguishing degrees of idiomaticity in multi-word expressions.
  • Grammatical criteria only contribute to idiomaticity judgments when specific contextual conditions are met.
  • Expressions containing obsolete words or grammar are less likely to be replaceable by a single word.
  • Idiomaticity should be treated as a matter of degree in linguistic analysis rather than a binary property.
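
The graded reading in the bullets above can be illustrated with a minimal sketch. It assumes that each expert annotation is a score between 0 and 1 per criterion; the expressions, criterion names, and values below are hypothetical and are not taken from the paper. Summing per expression gives a degree of idiomaticity, and summing per criterion gives a rough ranking of influence, similar in spirit to the sums reported in Figure 3.

```python
from typing import Dict, List, Tuple

# Hypothetical annotations: MWE -> {criterion: score in [0, 1]}.
# Names and values are illustrative assumptions, not the paper's data.
annotations: Dict[str, Dict[str, float]] = {
    "kick the bucket": {"lexical_a": 1.0, "grammatical_a": 0.0, "replacement": 1.0},
    "by and large": {"lexical_a": 1.0, "grammatical_a": 1.0, "replacement": 0.5},
    "take a walk": {"lexical_a": 0.5, "grammatical_a": 0.0, "replacement": 0.0},
}


def mwe_totals(data: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum the criterion scores of each MWE (its degree of idiomaticity)."""
    return {mwe: sum(scores.values()) for mwe, scores in data.items()}


def criterion_totals(data: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Sum each criterion over all MWEs, sorted from highest to lowest."""
    totals: Dict[str, float] = {}
    for scores in data.values():
        for criterion, value in scores.items():
            totals[criterion] = totals.get(criterion, 0.0) + value
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(mwe_totals(annotations))
    print(criterion_totals(annotations))
```

A plain sum treats all criteria as equally weighted, which is itself an assumption; the paper may aggregate scores differently.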

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The annotation method could be adapted to train computational systems for grading idiomaticity in large text corpora.
  • The absence of absolute idiomaticity may affect how language models handle multi-word expressions during parsing or generation.
  • Testing the same criteria on expressions drawn from non-theoretical sources would check whether the patterns hold beyond the original collection.

Load-bearing premise

That the 16 criteria drawn from theoretical sources are sufficient to capture the full notion of idiomaticity, and that expert annotations provide a reliable, unbiased measurement of those criteria.

What would settle it

Finding even one multi-word expression that multiple independent groups of linguistics experts unanimously rate as satisfying every one of the 16 criteria would falsify the claim that no absolutely idiomatic expressions exist.
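
The falsification test described above can be sketched as a simple check. This is a minimal sketch under stated assumptions: annotations are binary (1 means the criterion is satisfied), the 16 criteria carry the placeholder names `c1` to `c16`, and each expression has one rating set per independent expert group. None of these names or data come from the paper.

```python
from typing import Dict, List

# Placeholder names for the 16 criteria (hypothetical, not the paper's labels).
CRITERIA: List[str] = [f"c{i}" for i in range(1, 17)]


def satisfies_all(rating: Dict[str, int]) -> bool:
    """True if a single group's rating is positive on every criterion."""
    return all(rating.get(criterion, 0) == 1 for criterion in CRITERIA)


def is_counterexample(group_ratings: List[Dict[str, int]]) -> bool:
    """True if at least two independent groups each rate all 16 criteria positive."""
    return len(group_ratings) >= 2 and all(satisfies_all(r) for r in group_ratings)


if __name__ == "__main__":
    full = {criterion: 1 for criterion in CRITERIA}
    partial = {**full, "c7": 0}  # fails exactly one criterion
    print(is_counterexample([full, full]))     # True
    print(is_counterexample([full, partial]))  # False
```

Requiring every group to agree keeps a single lenient annotation from counting as a counterexample.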

Figures

Figures reproduced from arXiv: 2605.19575 by Aleksander Zhmykhov, Anastasiya Drozdova, Anastasiya Vishnyakova, Elena Mikhalkova, Polina Gavin, Timofey Protasov.

Figure 1
Figure 2
Figure 2. Classification of MWEs in terms of their idiomaticity, by (Baldwin and Kim, 2010).
… approach classification of MWEs from another perspective: they enlist properties of MWEs and annotate several examples with these properties, acquiring a matrix of feature distribution from which they conclude about the probability of “MWEhood”, see fig. 2. In our opinion, this approach could be called data-driven, as class…
Figure 3
Figure 3. Number of MWEs scoring from 3 to 13 (left). Sum of scores for each of the 16 categories, ranged (right).
Replacement. Counter-intuitively, this category seems to be in a conflict with Grammatical change and Obsolescence. Often, if an MWE scores high in it, it scores low (or medium) in one or both. The example in…
Figure 4
Figure 4. 3D scatter plot of the sums of scores of MWEs in each category.
Future work. Future work requires a larger annotated collection. The main obstacle here is that manual annotation that we performed is very time-consuming. We can see several more ways, beside the mentioned ones, of making it partially automatic, e.g. impossibility of translating an MWE word by word as well as translation with one word can be…

Original abstract

The article observes data analysis of 286 multi-word expressions (MWEs) based on 16 lexical, grammatical and other criteria described in theoretical books and papers on the notion of idiomaticity. MWEs were collected from the same theoretical sources, and a set of experts in linguistics annotated them with these categories. The distribution of categories shows that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; presence of obsolete words and grammar influence ability of an MWE to be replaced with one word.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper collects 286 multi-word expressions (MWEs) from theoretical linguistics sources and has linguistics experts annotate them according to 16 lexical, grammatical, and other criteria drawn from the literature. Analysis of the resulting annotation distributions supports the claims that no MWEs qualify as absolutely idiomatic, that lexical criteria exert the strongest influence, that grammatical criteria apply only under specific conditions, and that the presence of obsolete words or grammar affects an MWE's replaceability by a single word.

Significance. If the annotation protocol were fully operationalized and shown to be reliable, the work would offer a useful empirical test of theoretical criteria for idiomaticity and could inform both linguistic theory and computational models of MWEs. At present the absence of a clear mapping from the 16 criteria to the notion of 'absolute idiomaticity' and the lack of reliability metrics limit the strength of the distributional conclusions.

major comments (2)
  1. [Abstract] Abstract: the claim that 'there are no absolutely idiomatic expressions' is not logically entailed by the reported distributions without an explicit threshold or combination rule (e.g., positive annotation on all 16 criteria, or a minimum aggregate score) that defines absolute idiomaticity. The manuscript must specify this mapping before the zero-count observation can support the stated conclusion.
  2. [Annotation procedure] Annotation procedure (presumably §3 or §4): no information is given on inter-annotator agreement, the precise operational definitions applied to each of the 16 criteria, how disagreements were resolved, or any statistical tests used to rank criterion influence. These details are load-bearing for the reliability of the distribution claims and the assertion that lexical criteria are most influential.
minor comments (1)
  1. [Abstract] The abstract and summary statements would benefit from a brief table or figure summarizing the 16 criteria and their observed frequencies across the 286 MWEs.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful and constructive review of our manuscript. Below we provide point-by-point responses to the major comments and indicate how we plan to revise the paper.

  1. Referee: [Abstract] Abstract: the claim that 'there are no absolutely idiomatic expressions' is not logically entailed by the reported distributions without an explicit threshold or combination rule (e.g., positive annotation on all 16 criteria, or a minimum aggregate score) that defines absolute idiomaticity. The manuscript must specify this mapping before the zero-count observation can support the stated conclusion.

    Authors: We agree that specifying the mapping from the criteria to absolute idiomaticity is essential for the validity of our conclusion. We will revise the manuscript to explicitly state that an expression is considered absolutely idiomatic only if it receives annotations indicating idiomaticity on every one of the 16 criteria. Given that our annotations of the 286 MWEs yielded no instances meeting this criterion, the data supports the claim of no absolutely idiomatic expressions. This clarification will be added to the abstract and the results section. revision: yes

  2. Referee: [Annotation procedure] Annotation procedure (presumably §3 or §4): no information is given on inter-annotator agreement, the precise operational definitions applied to each of the 16 criteria, how disagreements were resolved, or any statistical tests used to rank criterion influence. These details are load-bearing for the reliability of the distribution claims and the assertion that lexical criteria are most influential.

    Authors: The referee is correct that the annotation procedure section lacks several key details. We will expand this section in the revised manuscript to include the precise operational definitions for each of the 16 criteria, which are based directly on the theoretical linguistics literature cited in the paper. We will also detail how disagreements among the expert annotators were resolved through iterative discussions leading to consensus. Additionally, we will describe the analytical approach used to assess the relative influence of lexical versus grammatical criteria, including the distributional comparisons performed. However, formal inter-annotator agreement statistics were not computed as part of the original study, limiting our ability to report them. revision: partial

standing simulated objections not resolved
  • Formal inter-annotator agreement metrics
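
The unresolved objection concerns a statistic that is straightforward to compute once the raw per-annotator labels are available. One common choice for more than two annotators is Fleiss' kappa; the review does not say which statistic the authors would use, so this is only a minimal sketch. It assumes every expression was labeled by the same number of annotators with categorical labels, and the labels below are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items that each have the same number of ratings.

    ratings[i] holds the labels given to item i, one label per annotator.
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("no items to score")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set()
    for item_counts in counts:
        categories.update(item_counts)

    # Observed agreement: mean share of agreeing annotator pairs per item.
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(n * n for n in item_counts.values()) - n_raters) / pairs
        for item_counts in counts
    ) / n_items

    # Chance agreement from the overall share of each label.
    total = n_items * n_raters
    p_exp = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one label is used")
    return (p_bar - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # Hypothetical labels from three annotators for one criterion on four MWEs.
    labels = [
        ["yes", "yes", "yes"],
        ["no", "no", "no"],
        ["yes", "yes", "no"],
        ["no", "no", "yes"],
    ]
    print(round(fleiss_kappa(labels), 3))  # about 0.333
```

Computing the statistic separately for each of the 16 criteria would show which criteria the annotators apply consistently and which they do not.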

Circularity Check

0 steps flagged

Empirical annotation study with no circular derivation

The paper selects 286 MWEs and 16 criteria from existing theoretical linguistics sources, then obtains fresh expert annotations on those criteria for the collected expressions. The central claim that no expressions are absolutely idiomatic is presented as following directly from the resulting category distributions in this new annotated dataset. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the result; the observations rest on independent expert judgments rather than reducing to the paper's own inputs or prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on domain assumptions about the adequacy of the chosen criteria and the reliability of expert judgment; no numerical free parameters are fitted and no new entities are postulated.

axioms (2)
  • domain assumption: Expert linguists can reliably and consistently apply the 16 theoretical criteria to MWEs.
    The distribution results and influence rankings depend directly on the quality of the expert annotations described in the abstract.
  • domain assumption: The 286 MWEs collected from theoretical sources form a representative sample for studying idiomaticity.
    All examples originate from the same books and papers that supplied the criteria, creating a closed loop whose representativeness is not independently verified.

pith-pipeline@v0.9.0 · 5646 in / 1432 out tokens · 47887 ms · 2026-05-20T05:54:15.097589+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

  1. [1]

    A Comparative Evaluation of Collocation Extraction Techniques

    Pearce, Darren. A Comparative Evaluation of Collocation Extraction Techniques. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02). 2002

  2. [2]

    Multiword Expressions

    Baldwin, Timothy and Su Nam Kim. Multiword Expressions. Handbook of natural language processing

  3. [3]

    Multiword expressions: A pain in the neck for NLP

    Multiword expressions: A pain in the neck for NLP. Computational Linguistics and Intelligent Text Processing: Third International Conference, CICLing 2002, Mexico City, Mexico, February 17–23, 2002, Proceedings 3. 2002

  4. [4]

    Accurate methods for the statistics of surprise and coincidence

    Accurate methods for the statistics of surprise and coincidence. Computational Linguistics. 1993

  5. [5]

    Word association norms, mutual information, and lexicography

    Word association norms, mutual information, and lexicography. Computational Linguistics

  6. [6]

    A comparative study of automatic extraction of collocations from corpora: Mutual information vs. cost criteria

    A comparative study of automatic extraction of collocations from corpora: Mutual information vs. cost criteria. Journal of Natural Language Processing. 1994

  7. [7]

    Using conceptual similarity for collocation extraction. Proc. of the 4th UK Special Interest Group for Computational Linguistics (CLUK4)

  8. [8]

    Comparing and combining a semantic tagger and a statistical tool for MWE extraction

    Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech & Language. 2005

  9. [9]

    A machine learning approach to multiword expression extraction

    A machine learning approach to multiword expression extraction. Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008). 2008

  10. [10]

    Proceedings of the LREC Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008)

    Proceedings of the LREC Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008). 2008

  11. [11]

    Proceedings of the 18th Workshop on Multiword Expressions @LREC2022. 2022

  12. [12]

    Automatic extraction of Arabic multiword expressions

    Automatic extraction of Arabic multiword expressions. Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

  13. [13]

    CollFrEn: Rich Bilingual English–French Collocation Resource

    Fisas, Beatriz and Espinosa Anke, Luis and Codina-Filbà, Joan and Wanner, Leo. CollFrEn: Rich Bilingual English–French Collocation Resource. Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons. 2020

  14. [14]

    AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations

    Han, Lifeng and Jones, Gareth and Smeaton, Alan. AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations. Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons. 2020

  15. [15]

    Multi-word Expressions for Abusive Speech Detection in Serbian

    Stanković, Ranka and Mitrović, Jelena and Jokić, Danka and Krstev, Cvetana. Multi-word Expressions for Abusive Speech Detection in Serbian. Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons. 2020

  16. [16]

    Hear about Verbal Multiword Expressions in the Bulgarian and the Romanian Wordnets Straight from the Horse's Mouth

    Barbu Mititelu, Verginica and Stoyanova, Ivelina and Leseva, Svetlozara and Mitrofan, Maria and Dimitrova, Tsvetana and Todorova, Maria. Hear about Verbal Multiword Expressions in the Bulgarian and the Romanian Wordnets Straight from the Horse's Mouth. Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019). 2019. doi:10.1...

  17. [17]

    Ilfhocail: A Lexicon of Irish MWEs

    Walsh, Abigail and Lynn, Teresa and Foster, Jennifer. Ilfhocail: A Lexicon of Irish MWEs. Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019). 2019. doi:10.18653/v1/W19-5120

  18. [18]

    Rule-Based Translation of Spanish Verb-Noun Combinations into Basque

    Iñurrieta, Uxoa and Aduriz, Itziar and Díaz de Ilarraza, Arantza and Labaka, Gorka and Sarasola, Kepa. Rule-Based Translation of Spanish Verb-Noun Combinations into Basque. Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017). 2017. doi:10.18653/v1/W17-1720

  19. [19]

    Extraction and Recognition of Polish Multiword Expressions using Wikipedia and Finite-State Automata

    Chrząszcz, Paweł. Extraction and Recognition of Polish Multiword Expressions using Wikipedia and Finite-State Automata. Proceedings of the 12th Workshop on Multiword Expressions. 2016. doi:10.18653/v1/W16-1815

  20. [20]

    Phrase translation using a bilingual dictionary and n-gram data: A case study from Vietnamese to English

    Lam, Khang Nhut and Al Tarouti, Feras and Kalita, Jugal. Phrase translation using a bilingual dictionary and n-gram data: A case study from Vietnamese to English. Proceedings of the 11th Workshop on Multiword Expressions. 2015. doi:10.3115/v1/W15-0911

  21. [21]

    A Multiword Unit Analysis COCA Multiword Unit List 20 and ColloGram

    Shin, Dongkwang and Chon, Yuah. A Multiword Unit Analysis COCA Multiword Unit List 20 and ColloGram. Journal of Asia TEFL

  22. [22]

    English word-formation

    English word-formation. 1983

  23. [23]

    Comprehensive annotation of multiword expressions in a social web corpus

    Comprehensive annotation of multiword expressions in a social web corpus. 2014

  24. [24]

    SemEval-2016 task 10: Detecting minimal semantic units and their meanings (DiMSUM)

    SemEval-2016 task 10: Detecting minimal semantic units and their meanings (DiMSUM). Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

  25. [25]

    Comparing different word embeddings for multiword expression identification. Natural Language Processing and Information Systems: 24th International Conference on Applications of Natural Language to Information Systems, NLDB 2019, Salford, UK, June 26–28, 2019, Proceedings 24. 2019

  26. [26]

    Multiword expression processing: A survey

    Multiword expression processing: A survey. Computational Linguistics. 2017

  27. [27]

    Foundations of statistical natural language processing

    Foundations of statistical natural language processing. 1999

  28. [28]

    Vinogradov, V.V.

  29. [29]

    Soviet Phraseology: Problems in the analysis and teaching of idioms

    Soviet Phraseology: Problems in the analysis and teaching of idioms. Linguistics and Language Pedagogy: The State of the Art

  30. [30]

    On the terms "stability" and "idiomaticity" (O terminakh "ustoichivost" i "idiomatichnost")

    On the terms "stability" and "idiomaticity" (O terminakh "ustoichivost" i "idiomatichnost"). Voprosy Yazykoznaniya

  31. [31]

    Data-driven identification of idioms in song lyrics

    Data-driven identification of idioms in song lyrics. Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021)

  32. [32]

    Measure clustering approach to MWE extraction

    Measure clustering approach to MWE extraction. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii

  33. [33]

    Finding New Multiword Expressions for Existing Thesaurus

    Rossyaykin, Petr and Loukachevitch, Natalia. Finding New Multiword Expressions for Existing Thesaurus. Artificial Intelligence and Natural Language. 2020

  34. [34]

    Clustering-based approach to multiword expression extraction and ranking

    Clustering-based approach to multiword expression extraction and ranking. Proceedings of the 11th Workshop on Multiword Expressions

  35. [35]

    Modeling the internal variability of multiword expressions through a pattern-based method

    Modeling the internal variability of multiword expressions through a pattern-based method. ACM Transactions on Speech and Language Processing (TSLP). 2013

  36. [36]

    Multi-word expressions: A novel computational approach to their bottom-up statistical extraction

    Multi-word expressions: A novel computational approach to their bottom-up statistical extraction. Lexical collocation analysis: advances and applications. 2018

  37. [37]

    Evaluating Distributional Features for Multiword Expression Recognition

    Loukachevitch, Natalia and Parkhomenko, Ekaterina. Evaluating Distributional Features for Multiword Expression Recognition. Text, Speech, and Dialogue. 2018

  38. [38]

    Multi-word units (and tokenization more generally): a multi-dimensional and largely information-theoretic approach. Lexis. Journal in English Lexicology. 2022

  39. [39]

    Phraseological competence and written proficiency

    Phraseological competence and written proficiency. British studies in applied linguistics. 1996