pith. machine review for the scientific record.

arxiv: 2604.16593 · v1 · submitted 2026-04-17 · 💻 cs.CL

Recognition: unknown

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

Chao Huang, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Yang Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords semantic · semanticqa · reasoning · tasks · assess · benchmark · evaluation · language

The pith

SemanticQA is a unified benchmark that reveals substantial performance gaps in language models on semantic reasoning tasks involving multiword expressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many phrases do not mean what their individual words suggest, such as idioms or fixed word combinations. This paper gathers existing tests for these phrases into one collection called SemanticQA. It checks how well different AI language models can find, sort, and explain such phrases, including when a model must do several related tasks one after another. The results show that models vary widely in how well they handle these meanings, especially when a literal word-by-word reading is not enough.

Core claim

Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning.

Load-bearing premise

That reorganizing existing multiword expression resources into unified tasks accurately measures semantic reasoning without introducing selection or annotation biases from the source datasets.

Figures

Figures reproduced from arXiv: 2604.16593 by Chao Huang, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Yang Liu.

Figure 1. Atomic task exemplars of idiomatic expressions.
Figure 2. Overview of SemanticQA for benchmarking LMs on lexical phenomena.
Figure 3. The coverage of coarse- and fine-grain semantic …
Figure 4. Overall best performance (i.e., capacity triangle △) of models on SemanticQA.
Figure 5. Grouped bars represent the mean performance of each model, while circular markers denote the population …
Figure 6. The ability of semantic relation categorization of …
Figure 7. A data example of idiomaticity detection (IED).
Figure 8. A data example of idiom extraction (IEE).
Figure 11. A data example of lexical collocation extraction.
Figure 12. A data example of lexical collocation interpretation.
Figure 13. A data example of noun compound composi…
Figure 14. A data example of noun compound extraction.
Figure 15. A data example of noun compound interpretation.
Figure 16. A data example of VMWE extraction.
Figure 17. Unified prompt template used in the work.
Figure 18. Oracle prompt template used in the work.
read the original abstract

We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.
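The abstract's task taxonomy (extraction, classification, interpretation) suggests an operation-aligned evaluation loop. The sketch below is hypothetical: the prompt wording, function names, and exact-match scoring are illustrative assumptions, not the released harness at the GitHub link.

```python
# Hypothetical sketch of an operation-aligned evaluation loop over
# SemanticQA-style tasks. Task names mirror the abstract's taxonomy;
# the model interface and scoring are placeholders.

OPERATIONS = ["extraction", "classification", "interpretation"]

def build_prompt(operation: str, phrase: str, context: str) -> str:
    """Fill a unified template with an operation and a phrase in context."""
    return (
        f"Task: {operation} of the semantic phrase.\n"
        f"Context: {context}\n"
        f"Phrase: {phrase}\n"
        f"Answer:"
    )

def evaluate(model, examples):
    """Score a model per operation; exact match is a stand-in metric."""
    scores = {op: [] for op in OPERATIONS}
    for ex in examples:
        prompt = build_prompt(ex["operation"], ex["phrase"], ex["context"])
        prediction = model(prompt)
        scores[ex["operation"]].append(prediction.strip() == ex["gold"])
    return {op: sum(v) / len(v) for op, v in scores.items() if v}

# Toy run with a stub model that always answers "idiom".
examples = [
    {"operation": "classification", "phrase": "kick the bucket",
     "context": "He kicked the bucket last year.", "gold": "idiom"},
]
print(evaluate(lambda p: "idiom", examples))  # {'classification': 1.0}
```

Per-operation averages of this kind are what a "substantial performance variation" claim would be read off from, one number per task family.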

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SemanticQA, a benchmark that consolidates and reorganizes existing multiword expression (MWE) resources into a unified suite of tasks covering extraction, classification, interpretation, and sequential task compositions. It evaluates language models of varying architectures and scales on these tasks, claiming to reveal substantial performance variation—particularly on those requiring semantic reasoning—and provides public access to the evaluation harness and data.

Significance. If the benchmark construction successfully isolates semantic reasoning differences without inheriting source-dataset artifacts, SemanticQA could offer a valuable standardized testbed for diagnosing LM limitations on non-compositional phrases and guiding improvements in semantic comprehension. The public release of code and data supports reproducibility and is a clear strength.

major comments (2)
  1. [Benchmark construction] Benchmark construction (likely §3): The reorganization of prior MWE corpora (idioms, noun compounds, verbal constructions, collocations) into unified extraction/classification/interpretation tasks is presented without any description of new validation steps, adversarial controls, inter-annotator agreement checks, or bias audits on the reorganized splits. This leaves open the possibility that the observed performance variation reflects source-specific annotation guidelines, frequency imbalances, or domain skews rather than differences in semantic reasoning efficacy.
  2. [Results and evaluation] Results and evaluation sections: The abstract asserts 'substantial performance variation' and 'differences in reasoning efficacy' but supplies no quantitative metrics, error analysis, per-task breakdowns, or statistical significance tests. Without these details, it is impossible to assess whether the variation is large enough, consistent across models, or genuinely attributable to semantic reasoning rather than task formulation artifacts.
minor comments (2)
  1. [Data availability] The GitHub repository link is provided, but the paper should explicitly document the exact train/dev/test splits, licensing of source resources, and any preprocessing steps applied during reorganization.
  2. [Task definitions] Notation for task compositions (e.g., sequential pipelines) could be clarified with a small diagram or pseudocode to improve readability for readers unfamiliar with MWE literature.
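On the second minor comment, the sequential composition could indeed be conveyed in a few lines of pseudocode. The Python sketch below is one hypothetical rendering (extract, then classify, then interpret); the stage prompts and the stub model are invented for illustration and are not taken from the paper.

```python
# Hypothetical rendering of a sequential task composition
# (extract -> classify -> interpret); stage names follow the paper's
# task taxonomy, but the chaining logic here is illustrative only.

def compose(model, sentence: str) -> dict:
    """Run three atomic operations in sequence, feeding each output forward."""
    phrase = model(f"Extract the multiword expression in: {sentence}")
    category = model(f"Classify the expression '{phrase}' in: {sentence}")
    meaning = model(f"Interpret '{phrase}' (a {category}) in: {sentence}")
    return {"phrase": phrase, "category": category, "meaning": meaning}

# A stub model keyed on the leading letter of each prompt shows the data flow.
def stub(prompt: str) -> str:
    return {"E": "spill the beans", "C": "idiom",
            "I": "reveal a secret"}[prompt[0]]

print(compose(stub, "She spilled the beans about the party."))
```

The point of the diagram or pseudocode would be exactly this data flow: each stage's output becomes part of the next stage's input, so errors propagate through the composition.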

Circularity Check

0 steps flagged

No circularity: empirical benchmark relies on external resources without internal derivations or self-referential fitting

full rationale

The paper constructs SemanticQA by consolidating and reorganizing existing multiword expression resources into unified tasks for extraction, classification, interpretation, and composition. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text or abstract. All load-bearing elements rest on external corpora and observed LM performance differences rather than any reduction to the paper's own inputs by construction. This is a standard empirical benchmark setup with no self-citation chains or ansatz smuggling that would trigger circularity under the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that consolidated MWE resources validly test semantic reasoning and that performance differences reflect model comprehension rather than dataset artifacts.

axioms (1)
  • domain assumption Existing multiword expression resources can be reorganized into a unified testbed that preserves their original semantic properties.
    Invoked when the paper states it consolidates and reorganizes resources into SemanticQA.

pith-pipeline@v0.9.0 · 5449 in / 1069 out tokens · 43206 ms · 2026-05-10T08:50:39.767845+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

90 extracted references · 49 canonical work pages · 5 internal anchors

  1. [1]


    Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, and Marek Rei. 2025. https://arxiv.org/abs/2508.19988 Agentcoma: A compositional benchmark mixing commonsense and mathematical reasoning in real-world scenarios . Preprint, arXiv:2508.19988

  2. [2]

    Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, and Shuang Zhou. 2025. https://arxiv.org/abs/2510.26768 Amo-bench: Large language models still struggle in high school math competitions . Preprint, arXiv:2510.26768

  3. [3]

    Anthropic. 2024. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf The claude 3 model family: Opus, sonnet, haiku . In Anthropic Blog

  4. [4]

    Anthropic. 2025. Anthropic. https://www.anthropic.com/news/claude-sonnet-4-5. September 30, 2025

  5. [5]

    Yuki Arase and Jun'ichi Tsujii. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.125 Compositional phrase alignment and beyond. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1611--1623, Online. Association for Computational Linguistics

  6. [6]


    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. https://api.semanticscholar.org/CorpusID:237142385 Program synthesis with large language models . ArXiv, abs/2108.07732

  7. [7]

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. 2025. https://matharena.ai/ Matharena: Evaluating llms on uncontaminated math competitions

  8. [8]

    Ekaba Bisong. 2019. Google colaboratory. Building machine learning and deep learning models on google cloud platform: a comprehensive guide for beginners, pages 59--64

  9. [9]

    Lars Buijtelaar and Sandro Pezzelle. 2023. https://doi.org/10.18653/v1/2023.eacl-main.163 A psycholinguistic analysis of BERT's representations of compounds. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2230--2241, Dubrovnik, Croatia. Association for Computational Linguistics

  10. [10]

    Tuhin Chakrabarty, Yejin Choi, and Vered Shwartz. 2022a. https://doi.org/10.1162/tacl_a_00478 It's not rocket science: Interpreting figurative language in narratives. Transactions of the Association for Computational Linguistics, 10:589--606

  11. [11]

    Tuhin Chakrabarty, Yejin Choi, and Vered Shwartz. 2022b. It's not rocket science: Interpreting figurative language in narratives. Transactions of the Association for Computational Linguistics, 10:589--606

  12. [12]

    I-Hsuan Chen, Yunfei Long, Qin Lu, and Chu-Ren Huang. 2017. https://doi.org/10.18653/v1/K17-1006 Leveraging eventive information for better metaphor detection and classification. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 36--46, Vancouver, Canada. Association for Computational Linguistics

  13. [13]

    Albert Coil and Vered Shwartz. 2023. https://doi.org/10.18653/v1/2023.findings-acl.169 From chocolate bunny to chocolate crocodile: Do language models understand noun compounds? In Findings of the Association for Computational Linguistics: ACL 2023, pages 2698--2710, Toronto, Canada. Association for Computational Linguistics

  14. [14]

    Mathieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke Van Der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017a. Multiword expression processing: A survey. Computational Linguistics, 43(4):837--892

  15. [15]

    Mathieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017b. https://doi.org/10.1162/COLI_a_00302 Survey: Multiword expression processing: A survey. Computational Linguistics, 43(4):837--892

  16. [16]

    DeepSeek. 2025. https://doi.org/10.1038/s41586-025-09422-z Deepseek-r1 incentivizes reasoning in llms through reinforcement learning . Nature, 645:633--638

  17. [17]

    Michael Denkowski and Alon Lavie. 2014. https://doi.org/10.3115/v1/W14-3348 Meteor universal: Language specific translation evaluation for any target language . In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376--380, Baltimore, Maryland, USA. Association for Computational Linguistics

  18. [18]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171--4186

  19. [19]

    Luis Espinosa-Anke, Joan Codina-Filba, and Leo Wanner. 2021. https://doi.org/10.18653/v1/2021.eacl-main.120 Evaluating language models for the retrieval and categorization of lexical collocations . In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1406--1417, Online. Associat...

  20. [20]

    Luis Espinosa-Anke, Steven Schockaert, and Leo Wanner. 2019. https://doi.org/10.18653/v1/P19-1576 Collocation classification with unsupervised relation vectors . In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5765--5772, Florence, Italy. Association for Computational Linguistics

  21. [21]

    Luis Espinosa-Anke, Alexander Shvets, Alireza Mohammadshahi, James Henderson, and Leo Wanner. 2022. https://doi.org/10.18653/v1/2022.starsem-1.8 Multilingual extraction and categorization of lexical collocations with graph-aware transformers . In Proceedings of the 11th Joint Conference on Lexical and Computational Semantics, pages 89--100, Seattle, Washi...

  22. [22]

    Afsaneh Fazly, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1):61--103

  23. [23]

    Beatriz Fisas, Luis Espinosa-Anke, Joan Codina-Filbá, and Leo Wanner. 2020. https://aclanthology.org/2020.mwe-1.1 CollFrEn: Rich bilingual English--French collocation resource. In Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 1--12, online. Association for Computational Linguistics

  24. [24]

    Thierry Fontenelle. 1997. Turning a bilingual dictionary into a lexical-semantic database. De Gruyter

  25. [25]

    Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, and Aline Villavicencio. 2021. https://doi.org/10.18653/v1/2021.acl-long.212 Assessing the representations of idiomaticity in vector models with a noun compound dataset labeled at type and token levels . In Proceedings of the 59th Annual Meeting of the Association for Computational Lingui...

  26. [26]

    Alexander Gelbukh and 1 others. 2012. Semantic analysis of verbal collocations with lexical functions, volume 414. Springer

  27. [27]

    Gemma. 2025. https://goo.gle/Gemma3Report Gemma 3

  28. [28]

    Google. 2025. Google. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025. Mar 25, 2025

  29. [29]

    Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, and Aline Villavicencio. 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.294 AStitchInLanguageModels: Dataset and methods for the exploration of idiomaticity in pre-trained language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3464--3477, ...

  30. [30]

    Adi Haviv, Ido Cohen, Jacob Gidron, Roei Schuster, Yoav Goldberg, and Mor Geva. 2023. https://doi.org/10.18653/v1/2023.eacl-main.19 Understanding transformer memorization recall through idioms . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 248--264, Dubrovnik, Croatia. Association fo...

  31. [31]

    Iris Hendrickx, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, and Tony Veale. 2013. https://aclanthology.org/S13-2025 SemEval-2013 task 4: Free paraphrases of noun compounds. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semanti...

  32. [32]

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations

  33. [33]

    Xuming Hu, Junzhe Chen, Xiaochuan Li, Yufei Guo, Lijie Wen, Philip Yu, and Zhijiang Guo. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/7b7d7985f62284060d65f532ed2ea5fa-Paper-Conference.pdf Towards understanding factual knowledge of large language models . In International Conference on Representation Learning, volume 2024, pages 28680--28715

  34. [34]

    Sirui Huang, Yanggan Gu, Zhonghao Li, Xuming Hu, Li Qing, and Guandong Xu. 2025. https://doi.org/10.18653/v1/2025.findings-acl.391 StructFact: Reasoning factual knowledge from structured data with large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7521--7552, Vienna, Austria. Association for Computatio...

  35. [35]

    Kimi. 2025. https://arxiv.org/abs/2507.20534 Kimi k2: Open agentic intelligence. Preprint, arXiv:2507.20534

  36. [36]

    Filip Klubička, Vasudevan Nedumpozhimana, and John Kelleher. 2023. https://doi.org/10.18653/v1/2023.mwe-1.8 Idioms, probing and dangerous things: Towards structural probing for idiomaticity in vector space. In Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023), pages 45--57, Dubrovnik, Croatia. Association for Computational Linguistics

  37. [37]

    Olga Kolesnikova. 2020. Automatic detection of lexical functions in context. Computación y sistemas, 24(3):1337--1352

  38. [38]

    Keshav Kolluru, Gabriel Stanovsky, and Mausam. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.711 "Covid vaccine is against covid but Oxford vaccine is made at Oxford!" Semantic interpretation of proper noun compounds. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10407--10420, Abu Dhabi, Unite...

  39. [39]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  40. [40]

    Jia Li, Ge Li, Xuanming Zhang, Yunfei Zhao, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, and Yongbin Li. 2024. https://doi.org/10.52202/079017-1837 Evocodebench: An evolving code generation benchmark with domain-specific evaluations . In Advances in Neural Information Processing Systems, volume 37, pages 57619--57641. Curran Associates, Inc

  41. [41]

    Jiaqi Li, Xinyi Dong, Yang Liu, Zhizhuo Yang, Quansen Wang, Xiaobo Wang, Song-Chun Zhu, Zixia Jia, and Zilong Zheng. 2025. https://doi.org/10.18653/v1/2025.findings-acl.871 ReflectEvo: Improving meta introspection of small LLMs by learning self-reflection. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16948--16966, Vie...

  42. [42]

    Chin-Yew Lin. 2004. https://aclanthology.org/W04-1013 ROUGE : A package for automatic evaluation of summaries . In Text Summarization Branches Out, pages 74--81, Barcelona, Spain. Association for Computational Linguistics

  43. [43]

    Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. 2024. https://arxiv.org/abs/2405.12209 Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark . Preprint, arXiv:2405.12209

  44. [44]

    Yang Liu, Jiaqi Li, and Zilong Zheng. 2026a. https://openreview.net/forum?id=MQV4TJyqnb Rulereasoner: Reinforced rule-based reasoning via domain-aware dynamic sampling. In The Fourteenth International Conference on Learning Representations

  45. [45]

    Yang Liu, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li, and Lingyong Yan. 2026b. https://doi.org/10.18653/v1/2026.eacl-long.1 LM-Lexicon: Improving definition modeling via harmonizing semantic experts. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1--...

  46. [46]

    Thang Luong, Dawsen Hwang, Hoang H Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu Hoang Trinh, Quoc V Le, and Junehyuk Jung. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1794 Towards robust ma...

  47. [47]

    Igor A. Mel'čuk. 1998. Collocations and lexical functions. Phraseology. Theory, analysis, and applications, pages 23--53

  48. [48]

    Igor A. Mel'čuk. 2023. General phraseology: Theory and practice. John Benjamins

  49. [49]

    Filip Miletić and Sabine Schulte im Walde. 2024. https://doi.org/10.1162/tacl_a_00657 Semantics of multiword expressions in transformer-based models: A survey. Transactions of the Association for Computational Linguistics, 12:593--612

  50. [50]


    Geoffrey Nunberg, Ivan A. Sag, and Thomas Wasow. 1994. http://www.jstor.org/stable/416483 Idioms . Language, 70(3):491--538

  51. [51]

    OpenAI. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . https://arxiv.org/pdf/2303.08774.pdf. Preprint, arXiv:2303.08774

  52. [52]

    OpenAI. 2025a. OpenAI. https://openai.com/index/introducing-gpt-5. Accessed: August 7, 2025

  53. [53]

    OpenAI. 2025b. OpenAI. https://openai.com/index/introducing-o3-and-o4-mini. April 16, 2025

  54. [54]

    Caroline Pasquer, Agata Savary, Carlos Ramisch, and Jean-Yves Antoine. 2020. https://doi.org/10.18653/v1/2020.coling-main.296 Verbal multiword expression identification: Do we need a sledgehammer to crack a nut? In Proceedings of the 28th International Conference on Computational Linguistics, pages 3333--3345, Barcelona, Spain (Online). International Comm...

  55. [55]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and 1 others. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32

  56. [56]

    Thang Pham, Seunghyun Yoon, Trung Bui, and Anh Nguyen. 2023. https://doi.org/10.18653/v1/2023.eacl-main.1 P i C : A phrase-in-context dataset for phrase understanding and semantic search . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1--26, Dubrovnik, Croatia. Association for Computa...

  57. [57]

    Qwen. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report. Preprint, arXiv:2505.09388

  58. [58]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

  59. [59]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485--5551

  60. [60]

    Parikshit Ram, Tim Klinger, and Alexander G. Gray. 2024. https://doi.org/10.24963/ijcai.2024/533 What makes models compositional? a theoretical view . In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24

  61. [61]

    Carlos Ramisch. 2023. https://theses.hal.science/tel-04216223 Multiword expressions in computational linguistics. Habilitation à diriger des recherches, Aix-Marseille Université (AMU)

  62. [62]

    Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm Lichte, Chaya Liebeskind, Johanna Monti, Renata Ramisch, Sara Stymne, Abigail Walsh, and Hongzhi Xu. 2020. https://aclanthology.org/2020.mwe-1.14 ...

  63. [63]

    Carlos Ramisch, Abigail Walsh, Thomas Blanchard, and Shiva Taslimipoor. 2023a. A survey of MWE identification experiments: The devil is in the details. In Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023), pages 106--120

  64. [64]

    Carlos Ramisch, Abigail Walsh, Thomas Blanchard, and Shiva Taslimipoor. 2023b. https://doi.org/10.18653/v1/2023.mwe-1.15 A survey of MWE identification experiments: The devil is in the details. In Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023), pages 106--120, Dubrovnik, Croatia. Association for Computational Linguistics

  65. [65]

    María A. Barrios Rodríguez. 2003. The domain of the lexical functions fact0, causfact0 and real1. learning, page 64

  66. [66]

    Ivan A Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002a. Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing: Third International Conference, CICLing 2002 Mexico City, Mexico, February 17--23, 2002 Proceedings 3, pages 1--15. Springer

  67. [67]


    Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002b. Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing, pages 1--15, Berlin, Heidelberg. Springer Berlin Heidelberg

  68. [68]

    Manfred Sailer and Stella Markantonatou. 2018. Multiword expressions: Insights from a multi-lingual perspective. Language Science Press

  69. [69]

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938--4947

  70. [70]

    Agata Savary, Cherifa Ben Khelil, Carlos Ramisch, Voula Giouli, Verginica Barbu Mititelu, Najet Hadj Mohamed, Cvetana Krstev, Chaya Liebeskind, Hongzhi Xu, Sara Stymne, Tunga Güngör, Thomas Pickard, Bruno Guillaume, Eduard Bejček, Archna Bhatia, Marie Candito, Polona Gantar, Uxoa Iñurrieta, Albert Gatt, and 9 others. 2023. https://doi.org/10...

  71. [71]

    Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, and Antoine Doucet. 2017. https://doi.org/10.18653/v1/W17-1704 The PARSEME shared task on automatic identification of verbal multiword expressions . In Proceedings of the 13th Workshop on Mult...

  72. [72]

    Alexander Shvets and Leo Wanner. 2022. https://doi.org/10.3390/math10203831 The relation dimension in the identification and classification of lexically restricted word co-occurrences in text corpora . Mathematics, 10(20)

  73. [73]

    Vered Shwartz and Ido Dagan. 2019. https://doi.org/10.1162/tacl_a_00277 Still a pain in the neck: Evaluating text representations on lexical composition . Transactions of the Association for Computational Linguistics, 7:403--419

  74. [74]

    Giorgos Spathas and Dimitris Michelioudakis. 2021. https://doi.org/10.1007/s11049-020-09496-6 States in the decomposition of verbal predicates . Natural Language & Linguistic Theory, 39(4):1253--1306

  75. [75]

    Joshua Tanner and Jacob Hoffman. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.14 MWE as WSD : Solving multiword expression identification with word sense disambiguation . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 181--193, Singapore. Association for Computational Linguistics

  76. [76]

    Simone Tedeschi, Federico Martelli, and Roberto Navigli. 2022. https://doi.org/10.18653/v1/2022.findings-naacl.208 ID10M: Idiom identification in 10 languages. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2715--2726, Seattle, United States. Association for Computational Linguistics

  77. [77]

    Stephen Tratz and Eduard Hovy. 2010. https://aclanthology.org/P10-1070/ A taxonomy, dataset, and classifier for automatic noun compound interpretation . In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 678--687, Uppsala, Sweden. Association for Computational Linguistics

  78. [78]

    Robert Vacareanu, Marco A. Valenzuela-Escárcega, Rebecca Sharp, and Mihai Surdeanu. 2020. https://doi.org/10.18653/v1/2020.coling-main.297 An unsupervised method for learning representations of multi-word expressions for semantic classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3346--3356, Barcel...

  79. [79]

    Takashi Wada, Yuji Matsumoto, Timothy Baldwin, and Jey Han Lau. 2023. https://doi.org/10.18653/v1/2023.findings-acl.290 Unsupervised paraphrasing of multiword expressions . In Findings of the Association for Computational Linguistics: ACL 2023, pages 4732--4746, Toronto, Canada. Association for Computational Linguistics

  80. [80]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: Sta...

Showing first 80 references.