How Hypocritical Is Your LLM Judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
Pith reviewed 2026-05-10 08:39 UTC · model grok-4.3
The pith
LLMs perform substantially better as pragmatic listeners judging language than as speakers generating it, revealing weak alignment between the two roles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers.
Load-bearing premise
That the three chosen pragmatic settings and the specific judgment/generation tasks accurately measure pragmatic competence without introducing systematic biases that favor listening over speaking.
Original abstract
Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs' performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
Empirical benchmarking study with no derivation chain or self-referential structure
Full rationale
The paper is a direct empirical comparison of LLM performance in pragmatic listener (judgment) versus speaker (generation) roles across three settings. It reports observed asymmetries from model evaluations without any equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. The central claim rests on experimental results rather than any closed logical loop or renaming of prior findings. This matches the default case of a self-contained benchmarking study against external model outputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, and 8 others. 2024. https://arxiv.org/abs/2412.08905 Phi-4 technic...
- [2]
-
[3]
Anthropic. 2025. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5. Accessed: 2025-12-22
-
[4]
Nicholas Asher and Alex Lascarides. 2003. Logics of conversation
-
[5]
Raha Askari, Sina Zarrieß, Özge Alaçam, and Judith Sieker. 2025. https://doi.org/10.18653/v1/2025.babylm-main.4 Are BabyLMs deaf to Gricean maxims? A pragmatic evaluation of sample-efficient language models. In Proceedings of the First BabyLM Workshop, pages 52--65, Suzhou, China. Association for Computational Linguistics
-
[6]
Tara Azin, Daniel Dumitrescu, Diana Inkpen, and Raj Singh. 2025. https://caiac.pubpub.org/pub/keh8ij01 Let's CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition. Proceedings of the Canadian Conference on Artificial Intelligence
-
[7]
Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. https:...
-
[8]
Marc H. Bornstein and Charleene Hendricks. 2012. https://doi.org/10.1017/S0305000911000407 Basic language comprehension and production in >100,000 young children from sixteen developing nations. Journal of Child Language, 39(4):899--918
-
[9]
Nitay Calderon, Roi Reichart, and Rotem Dror. 2025. https://doi.org/10.18653/v1/2025.acl-long.782 The alternative annotator test for LLM-as-a-judge: How to statistically justify replacing human annotators with LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16051--16081, Vi...
-
[10]
Tyler A. Chang and Benjamin K. Bergen. 2023. https://arxiv.org/abs/2303.11504 Language model behavior: A comprehensive survey. Preprint, arXiv:2303.11504
-
[11]
Fernanda Ferreira and Victor S. Ferreira. 2024. https://oecs.mit.edu/pub/y1uhdz0y Psycholinguistics. MIT Press
-
[12]
Suzanne Flynn. 1986. https://doi.org/10.1017/S0272263100006057 Production vs. comprehension: Differences in underlying competences. Studies in Second Language Acquisition, 8(2):135--164
-
[13]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, and 1 others. 2024. https://arxiv.org/abs/2407.21783 The Llama 3 herd of models. Preprint, arXiv:2407.21783
-
[14]
Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024. https://doi.org/10.18653/v1/2024.acl-long.841 OLMo: Acceleratin...
-
[15]
Irene Heim. 1991. https://doi.org/10.1515/9783110126969.7.487 Artikel und Definitheit, pages 487--535. De Gruyter Mouton, Berlin, New York
-
[16]
Jennifer Hu, Sammy Floyd, Olessia Jouravlev, Evelina Fedorenko, and Edward Gibson. 2023. https://doi.org/10.18653/v1/2023.acl-long.230 A fine-grained comparison of pragmatic language understanding in humans and language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4194--...
-
[17]
Jennifer Hu and Roger Levy. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.306 Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5040--5060, Singapore. Association for Computational Linguistics
- [18]
-
[19]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/abs/2310.0...
-
[20]
Jad Kabbara and Jackie Chi Kit Cheung. 2022. https://aclanthology.org/2022.coling-1.65/ Investigating the performance of transformer-based NLI models on presuppositional inferences. In Proceedings of the 29th International Conference on Computational Linguistics, pages 779--785, Gyeongju, Republic of Korea. International Committee on Computational Linguistics
-
[21]
Clara Lachenmaier, Judith Sieker, and Sina Zarrieß. 2025. https://doi.org/10.18653/v1/2025.acl-long.728 Can LLMs ground when they (don't) know: A study on direct and loaded political questions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14956--14975, Vienna, Austria. Associ...
-
[22]
Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. https://arxiv.org/abs/2412.05579 LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods. Preprint, arXiv:2412.05579
-
[23]
Bolei Ma, Yuting Li, Wei Zhou, Ziwei Gong, Yang Janet Liu, Katja Jasinskaja, Annemarie Friedrich, Julia Hirschberg, Frauke Kreuter, and Barbara Plank. 2025. https://doi.org/10.18653/v1/2025.acl-long.425 Pragmatics in the era of large language models: A survey on datasets, evaluation, opportunities and challenges. In Proceedings of the 63rd Annual Meeting...
-
[24]
Antje S. Meyer, Falk Huettig, and Willem J.M. Levelt. 2016. https://doi.org/10.1016/j.jml.2016.03.002 Same, different, or closely related: What is the relationship between language production and comprehension? Journal of Memory and Language, 89:1--7. Speaking and Listening: Relationships Between Language Production and Comprehension
-
[25]
Mistral AI. 2023. Mixtral of experts. https://mistral.ai/news/mixtral-of-experts/. Accessed: 2025-12-22
-
[26]
Philipp Mondorf and Barbara Plank. 2024. https://doi.org/10.18653/v1/2024.acl-long.508 Comparing inferential strategies of humans and large language models in deductive reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9370--9402, Bangkok, Thailand. Association for Computa...
-
[27]
OpenAI. 2024. GPT-4o. https://openai.com/de-DE/index/hello-gpt-4o. Accessed: 2025-12-22
-
[28]
OpenAI. 2025a. GPT-4.1. https://platform.openai.com/docs/models/gpt-4.1. Accessed: 2025-12-22
-
[29]
OpenAI. 2025b. GPT-5. https://platform.openai.com/docs/models/gpt-5. Accessed: 2025-12-22
-
[30]
Walter Paci, Alessandro Panunzi, and Sandro Pezzelle. 2025. https://doi.org/10.18653/v1/2025.findings-acl.804 They want to pretend not to understand: The limits of current LLMs in interpreting implicit content of political discourse. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15569--15593, Vienna, Austria. Association ...
-
[31]
Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, and Sungeun Lee. 2024. https://doi.org/10.18653/v1/2024.genbench-1.7 MultiPragEval: Multilingual pragmatic evaluation of large language models. In Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP, pages 96--119, Miami, Florida...
-
[32]
Orin Percus. 2006. https://semanticsarchive.net/Archive/GI3YzhlM/AntipresuppositionsVersion1.pdf Antipresuppositions. Theoretical and Empirical Studies of Reference and Anaphora: Toward the establishment of generative grammar as an empirical science, pages 52--73
- [33]
- [34]
-
[35]
Linlu Qiu, Cedegao E. Zhang, Joshua B. Tenenbaum, Yoon Kim, and Roger P. Levy. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1008 On the same wavelength? Evaluating pragmatic reasoning in language models across broad concepts. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19924--19946, Suzhou, China....
-
[36]
Qwen Team. 2025. Qwen3: Think Deeper, Act Faster. https://qwenlm.github.io/blog/qwen3/. Accessed: 2025-12-22
-
[37]
Cosima Schneider, Carolin Schonard, Michael Franke, Gerhard Jäger, and Markus Janczyk. 2019. https://doi.org/10.1016/j.cognition.2019.104024 Pragmatic processing: An investigation of the (anti-)presuppositions of determiners using mouse-tracking. Cognition, 193:104024
-
[38]
Judith Sieker, Oliver Bott, Torgrim Solstad, and Sina Zarrieß. 2023. https://doi.org/10.18653/v1/2023.inlg-main.15 Beyond the bias: Unveiling the quality of implicit causality prompt continuations in language models. In Proceedings of the 16th International Natural Language Generation Conference, pages 206--220, Prague, Czechia. Association for Computati...
-
[39]
Judith Sieker, Clara Lachenmaier, and Sina Zarrieß. 2025. https://escholarship.org/uc/item/4932r1hx LLMs struggle to reject false presuppositions when misinformation stakes are high. Proceedings of the Annual Meeting of the Cognitive Science Society, 47
-
[40]
Judith Sieker and Sina Zarrieß. 2023. https://doi.org/10.18653/v1/2023.blackboxnlp-1.14 When your language model cannot even do determiners right: Probing for anti-presuppositions and the maximize presupposition! principle. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 180--198, Singapore. Asso...
-
[41]
Damien Sileo, Philippe Muller, Tim Van de Cruys, and Camille Pradel. 2022. https://aclanthology.org/2022.lrec-1.255/ A pragmatics-centered evaluation framework for natural language understanding. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2382--2394, Marseille, France. European Language Resources Association
-
[42]
Settaluri Sravanthi, Meet Doshi, Pavan Tankala, Rudra Murthy, Raj Dabre, and Pushpak Bhattacharyya. 2024. https://doi.org/10.18653/v1/2024.findings-acl.719 PUB: A pragmatics understanding benchmark for assessing LLMs' pragmatics capabilities. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12075--12097, Bangkok, Thailand. ...
-
[43]
Robert Stalnaker. 1973. https://doi.org/10.1007/bf00262951 Presuppositions. Journal of Philosophical Logic, 2(4):447--457
-
[44]
Robert Stalnaker. 1978. Assertion. Syntax and Semantics (New York: Academic Press), 9:315--332
-
[45]
Andreas Stephan, Dawei Zhu, Matthias Aßenmacher, Xiaoyu Shen, and Benjamin Roth. 2025. https://aclanthology.org/2025.gem-1.65/ From calculation to adjudication: Examining LLM judges on mathematical reasoning tasks. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM), pages 759--773, Vienna, Austria and virtual meeting. Ass...
-
[46]
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2025. https://aclanthology.org/2025.gem-1.33/ Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM), pages 404--430, Vienna, Austria and v...
-
[47]
Jean-Baptiste Van der Henst, Yingrui Yang, and P.N. Johnson-Laird. 2002. https://doi.org/10.1207/s15516709cog2604_2 Strategies in sentential reasoning. Cognitive Science, 26(4):425--468
-
[48]
Shengguang Wu, Shusheng Yang, Zhenglun Chen, and Qi Su. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1258 Rethinking pragmatics in large language models: Towards open-ended evaluation and preference tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22583--22599, Miami, Florida, USA. Association ...
- [49]