A Data-Driven Approach to Idiomaticity Based on Experts' Criteria in Theoretical Linguistics
Pith reviewed 2026-05-20 05:54 UTC · model grok-4.3
The pith
Expert ratings of 286 multi-word expressions show none qualify as absolutely idiomatic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; and the presence of obsolete words and grammar influences whether a multi-word expression (MWE) can be replaced with one word.
What carries the argument
The 16 criteria drawn from theoretical linguistics sources, applied through expert annotation to MWEs collected from the same sources.
If this is right
- Lexical criteria provide the primary signal for distinguishing degrees of idiomaticity in multi-word expressions.
- Grammatical criteria only contribute to idiomaticity judgments when specific contextual conditions are met.
- The presence of obsolete words or grammar in an expression affects whether it can be replaced by a single word.
- Idiomaticity should be treated as a matter of degree in linguistic analysis rather than a binary property.
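Treating idiomaticity as a matter of degree can be sketched as a simple score: the share of criteria an expression satisfies. This is a minimal illustration, not the paper's procedure. It assumes each criterion is annotated as a yes/no value, and the criterion names and toy expressions below are hypothetical, since the 16 criteria are not listed on this page.

```python
from typing import Dict, List


def idiomaticity_score(annotation: Dict[str, bool]) -> float:
    """Return the share of criteria marked as satisfied (0.0 to 1.0)."""
    if not annotation:
        raise ValueError("annotation must contain at least one criterion")
    # True counts as 1 and False as 0, so the sum is the number of satisfied criteria.
    return sum(annotation.values()) / len(annotation)


def rank_by_idiomaticity(annotated: Dict[str, Dict[str, bool]]) -> List[str]:
    """Order expressions from most to least idiomatic by their score."""
    return sorted(
        annotated,
        key=lambda mwe: idiomaticity_score(annotated[mwe]),
        reverse=True,
    )


# Toy data with three made-up criteria (assumption: not taken from the paper).
toy = {
    "strong tea": {
        "non_compositional": False,
        "fixed_word_order": True,
        "obsolete_word": False,
    },
    "kick the bucket": {
        "non_compositional": True,
        "fixed_word_order": True,
        "obsolete_word": False,
    },
}

ranking = rank_by_idiomaticity(toy)
```

Under this reading, no expression needs to reach the maximum score for the ranking to be informative, which is the practical consequence of a graded rather than binary notion.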
Where Pith is reading between the lines
- The annotation method could be adapted to train computational systems for grading idiomaticity in large text corpora.
- The absence of absolute idiomaticity may affect how language models handle multi-word expressions during parsing or generation.
- Testing the same criteria on expressions drawn from non-theoretical sources would check whether the patterns hold beyond the original collection.
Load-bearing premise
That the 16 criteria drawn from theoretical sources are sufficient to capture the full notion of idiomaticity, and that expert annotations provide a reliable, unbiased measurement of those criteria.
What would settle it
Finding even one multi-word expression that multiple independent groups of linguistics experts unanimously rate as satisfying every one of the 16 criteria would falsify the claim that no absolutely idiomatic expressions exist.
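The falsification test above can be written down as a check over annotation data. The sketch below is illustrative only: it assumes each annotator supplies one yes/no judgment per criterion, and the function names and data layout are hypothetical rather than taken from the paper.

```python
from typing import Dict, Sequence

N_CRITERIA = 16  # number of criteria reported in the paper


def is_absolutely_idiomatic(
    ratings: Sequence[Sequence[bool]], n_criteria: int = N_CRITERIA
) -> bool:
    """True only if every annotator marks every criterion as satisfied."""
    if not ratings:
        raise ValueError("need at least one annotator")
    for row in ratings:
        if len(row) != n_criteria:
            raise ValueError(f"expected {n_criteria} criteria, got {len(row)}")
    return all(all(row) for row in ratings)


def count_absolute(
    dataset: Dict[str, Sequence[Sequence[bool]]], n_criteria: int = N_CRITERIA
) -> int:
    """Count expressions that pass the unanimous all-criteria rule."""
    return sum(is_absolutely_idiomatic(r, n_criteria) for r in dataset.values())


# Made-up example: three annotators, one of whom rejects a single criterion.
unanimous = [[True] * N_CRITERIA for _ in range(3)]
one_dissent = [[True] * N_CRITERIA for _ in range(3)]
one_dissent[1][4] = False

example = {"expression A": unanimous, "expression B": one_dissent}
n_absolute = count_absolute(example)
```

A count of zero over a dataset is what the paper reports for its 286 MWEs; a single expression passing this check across independent annotator groups would be the counterexample described above.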
Original abstract
The article observes data analysis of 286 multi-word expressions (MWEs) based on 16 lexical, grammatical and other criteria described in theoretical books and papers on the notion of idiomaticity. MWEs were collected from the same theoretical sources, and a set of experts in linguistics annotated them with these categories. The distribution of categories shows that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; presence of obsolete words and grammar influence ability of an MWE to be replaced with one word.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper collects 286 multi-word expressions (MWEs) from theoretical linguistics sources and has linguistics experts annotate them according to 16 lexical, grammatical, and other criteria drawn from the literature. Analysis of the resulting annotation distributions supports the claims that no MWEs qualify as absolutely idiomatic, that lexical criteria exert the strongest influence, that grammatical criteria apply only under specific conditions, and that the presence of obsolete words or grammar affects an MWE's replaceability by a single word.
Significance. If the annotation protocol were fully operationalized and shown to be reliable, the work would offer a useful empirical test of theoretical criteria for idiomaticity and could inform both linguistic theory and computational models of MWEs. At present the absence of a clear mapping from the 16 criteria to the notion of 'absolute idiomaticity' and the lack of reliability metrics limit the strength of the distributional conclusions.
major comments (2)
- [Abstract] Abstract: the claim that 'there are no absolutely idiomatic expressions' is not logically entailed by the reported distributions without an explicit threshold or combination rule (e.g., positive annotation on all 16 criteria, or a minimum aggregate score) that defines absolute idiomaticity. The manuscript must specify this mapping before the zero-count observation can support the stated conclusion.
- [Annotation procedure] Annotation procedure (presumably §3 or §4): no information is given on inter-annotator agreement, the precise operational definitions applied to each of the 16 criteria, how disagreements were resolved, or any statistical tests used to rank criterion influence. These details are load-bearing for the reliability of the distribution claims and the assertion that lexical criteria are most influential.
minor comments (1)
- [Abstract] The abstract and summary statements would benefit from a brief table or figure summarizing the 16 criteria and their observed frequencies across the 286 MWEs.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review of our manuscript. Below we provide point-by-point responses to the major comments and indicate how we plan to revise the paper.
- Referee: [Abstract] Abstract: the claim that 'there are no absolutely idiomatic expressions' is not logically entailed by the reported distributions without an explicit threshold or combination rule (e.g., positive annotation on all 16 criteria, or a minimum aggregate score) that defines absolute idiomaticity. The manuscript must specify this mapping before the zero-count observation can support the stated conclusion.
  Authors: We agree that specifying the mapping from the criteria to absolute idiomaticity is essential for the validity of our conclusion. We will revise the manuscript to explicitly state that an expression is considered absolutely idiomatic only if it receives annotations indicating idiomaticity on every one of the 16 criteria. Given that our annotations of the 286 MWEs yielded no instances meeting this criterion, the data supports the claim of no absolutely idiomatic expressions. This clarification will be added to the abstract and the results section.
  revision: yes
- Referee: [Annotation procedure] Annotation procedure (presumably §3 or §4): no information is given on inter-annotator agreement, the precise operational definitions applied to each of the 16 criteria, how disagreements were resolved, or any statistical tests used to rank criterion influence. These details are load-bearing for the reliability of the distribution claims and the assertion that lexical criteria are most influential.
  Authors: The referee is correct that the annotation procedure section lacks several key details. We will expand this section in the revised manuscript to include the precise operational definitions for each of the 16 criteria, which are based directly on the theoretical linguistics literature cited in the paper. We will also detail how disagreements among the expert annotators were resolved through iterative discussions leading to consensus. Additionally, we will describe the analytical approach used to assess the relative influence of lexical versus grammatical criteria, including the distributional comparisons performed. However, formal inter-annotator agreement statistics were not computed as part of the original study, limiting our ability to report them.
  revision: partial
- Formal inter-annotator agreement metrics
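For readers unfamiliar with the missing metric, the sketch below computes Cohen's kappa for two annotators on one binary criterion. It is a generic textbook formulation, not something the paper reports: the number of annotators is not stated on this page, and with more than two annotators a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the more natural choice. The example labels are made up.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two annotators giving binary labels to the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: share of items where the two labels coincide.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal rate of positive labels.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for one criterion over four expressions.
annotator_1 = [True, True, False, False]
annotator_2 = [True, False, False, False]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Values near 1 indicate agreement well above chance and values near 0 indicate agreement no better than chance. Reporting such a figure per criterion would address the referee's concern more directly than consensus discussion alone.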
Circularity Check
Empirical annotation study with no circular derivation
The paper selects 286 MWEs and 16 criteria from existing theoretical linguistics sources, then obtains fresh expert annotations on those criteria for the collected expressions. The central claim that no expressions are absolutely idiomatic is presented as following directly from the resulting category distributions in this new annotated dataset. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the result; the observations rest on independent expert judgments rather than reducing to the paper's own inputs or prior author work by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Expert linguists can reliably and consistently apply the 16 theoretical criteria to MWEs.
- domain assumption: The 286 MWEs collected from theoretical sources form a representative sample for studying idiomaticity.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The distribution of categories shows that there are no absolutely idiomatic expressions. ... vector sum ... higher is the vector sum of an annotated MWE, the more idiomatic an MWE is."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.