pith. sign in

arxiv: 2606.17688 · v1 · pith:Q7D2BN5Fnew · submitted 2026-06-16 · 💻 cs.CL

LLMs Infer Cultural Context but Fail to Apply It When Responding

Pith reviewed 2026-06-27 00:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLMscultural adaptationcultural inferenceCAPRI datasetmeasurement unitspragmatic responsescultural context
0
0 comments X

The pith

LLMs can infer a user's cultural background from conversation cues but fail to adapt their responses to match local conventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models apply cultural knowledge during response generation. It creates the CAPRI dataset of conversations that include varying amounts of cultural signals. Experiments reveal that current models correctly detect cultural background and know the associated conventions such as local measurement units. Yet the models seldom adjust their actual answers to fit those conventions unless the prompt requires them to handle inference and adaptation as two separate sequential steps. Adaptation improves when more cues are supplied, but default responses often carry biases tied to the model's training origins.

Core claim

State-of-the-art LLMs infer cultural background and recall conventions from cues in conversations, but fail to utilize this information to adapt responses to cultural conventions such as measurement units, time, and quantity expressions, unless explicitly prompted to perform the tasks sequentially.

What carries the argument

The CAPRI dataset of conversations with varying levels of cultural cues, used to test the separation between cultural inference and application in generated responses.

If this is right

  • Models increasingly adapt their answers as cultural cues accumulate during a conversation.
  • Model priors are not culture-neutral and sometimes align with the country of the model's origin.
  • Explicit sequential prompting allows models to bridge the gap between knowing conventions and applying them.
  • CAPRI can serve as a benchmark for methods aimed at closing the inference-to-application gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures or training objectives that embed cultural adaptation as an implicit step may reduce reliance on explicit prompting.
  • Similar evaluation setups could be applied to additional cultural dimensions such as social norms or humor styles.
  • The observed gap could lead to systematic misalignment in real deployments where users expect culturally attuned answers without spelling out the steps.

Load-bearing premise

The cultural cues and tasks in the CAPRI dataset, such as choice of measurement units or interpretation of time and quantity expressions, are representative of real user interactions and isolate cultural adaptation without major confounds.

What would settle it

An experiment in which models receive conversations containing only implicit cultural cues and still produce adapted responses without sequential prompting would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.17688 by Jian Zhu, Vered Shwartz, Yisong Miao.

Figure 1
Figure 1. Figure 1: Formalization of CAPRI: the model should infer the user’s background from cultural cues in the conversation (BG; Task 1), and adapt its answer in line with that background (VQA; Task 2). personalize LLM outputs for a given user’s cul￾ture (Cao et al., 2024, 2025a). This approach is not a panacea; overly personalizing model outputs may lead to inadvertently amplifying echo cham￾bers, ignoring cultural nuanc… view at source ↗
Figure 2
Figure 2. Figure 2: A conversation scaffold across the six cue lev [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: RQ2: VQA performance under CoT reasoning. Both Plain CoT and Pragmatic CoT outperform direct prediction significantly, with Pragmatic CoT giving the larger gain and producing deeper reasoning [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: RQ1: Direct-prediction performance on BG (solid) vs. VQA (dashed): models show a significant gap between the two tasks. E2B, E4B, and 31B, all of which support both di￾rect and thinking modes (details in Appendix B.1). Implementation Details. We employ a mini￾mal prompt: (1) for the primary VQA task, we ask the models to role-play the chatbot and an￾swer the user’s question in line with their cul￾tural bac… view at source ↗
Figure 5
Figure 5. Figure 5: RQ3: Effect of model scaling across the Qwen, Gemma, and Gemini-3.1 families: Larger models narrow the gap between VQA and BG (the shaded area). tive. (2) Plain CoT yields shallower reasoning over￾all, though larger models still reach the deeper lev￾els, suggesting that pragmatic reasoning capability scales with model size (detailed in Appendix B.4). 3.4 Benefit of Scaling (RQ3) We further find that larger… view at source ↗
Figure 6
Figure 6. Figure 6: RQ4: Models’ cultural prior at NULL, across the two tasks. Gemini-3.1 are the most “diverse” models, with notable shares for Japan, UK, India, and others. Models’ prior on unit prediction tells a different story (Figure 6b shows Gemini-3.1; other mod￾els are similar). Although background prediction leans toward the US, unit predictions are dominated by the metric system, the opposite of US conven￾tion. Whi… view at source ↗
Figure 7
Figure 7. Figure 7: RQ5: The impact of cultural cues on models’ inter-culture divergence (DInter): more cues yield more diverse answers. We first study whether models show more di￾vergence as cues accumulate. Given that there is no ground-truth answer for subjective dimensions, we instead examine how models’ predictions shift as cultural cues accumulate, by defining the inter￾culture divergence score: DInter = 1 N 2  X {B,B′… view at source ↗
Figure 8
Figure 8. Figure 8: Annotation interface, Step 1: Cultural Priming. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Annotation interface, Step 2: Annotation Instruction. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Annotation interface, Step 3: Annotation Instance. [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Distribution of cultural-reasoning depth (L0 to L3) under different reasoning settings. Pragmatic CoT [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Gemini-3.1’s predicted time of day (morning / noon / afternoon / evening / night) on 10 clock images (11 AM to 8 PM, columns) under each cue level from NULL to EXPLICITFULL (rows), across all 10 countries. These cases concretize the per-culture diver￾gence scores DImpl and DExpl from §4.2: each shifted cell in our prediction tables contributes to one of those scores. Low divergence (the answer unchanged f… view at source ↗
read the original abstract

Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models' ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user's perceived cultural background. We introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues. Experiments with state-of-the-art LLMs show that models can infer cultural background and recall relevant conventions, but often fail to utilize the information to adapt their answers to the relevant cultural conventions, unless explicitly prompted to perform the tasks sequentially. We further evaluate adaptation to the interpretation of time and quantity expressions, two subjective language grounding dimensions that are affected by culture. We find that models increasingly adapt their answers as cultural cues accumulate, but their priors are not culture-neutral, sometimes aligning with the model's country of origin. Overall, CAPRI provides a resource for future research aimed at narrowing the gap between cultural knowledge and culturally adaptive language generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the CAPRI dataset of conversations with varying cultural cues and evaluates state-of-the-art LLMs on inferring cultural background and applying relevant conventions (e.g., local measurement units, time/quantity expressions). The central claim is that models can infer backgrounds and recall conventions but often fail to adapt responses accordingly unless explicitly prompted to perform inference and application sequentially; adaptation increases with accumulating cues but is influenced by non-neutral model priors such as country of origin.

Significance. If the empirical results hold after detailed verification, the work is significant for identifying a specific inference-application gap in cultural adaptation for LLMs, beyond mere overrepresentation of dominant cultures. The CAPRI dataset offers a concrete benchmark resource for future research on culturally adaptive generation.

major comments (2)
  1. Abstract: the central empirical claim that models 'often fail to utilize the information to adapt their answers' is presented without any sample sizes, model versions, statistical tests, effect sizes, or exclusion criteria, which is load-bearing for assessing whether the data supports the stated distinction between inference and application.
  2. Evaluation setup (implied in abstract description of CAPRI): the paper does not address potential confounds from training data leakage or prompt design that could undermine the isolation of cultural adaptation, which directly affects the validity of the claim that models fail to apply inferred context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to strengthen the presentation of empirical claims and evaluation details.

read point-by-point responses
  1. Referee: Abstract: the central empirical claim that models 'often fail to utilize the information to adapt their answers' is presented without any sample sizes, model versions, statistical tests, effect sizes, or exclusion criteria, which is load-bearing for assessing whether the data supports the stated distinction between inference and application.

    Authors: We agree that the abstract should be more self-contained with quantitative anchors for the central claim. In the revised manuscript we have updated the abstract to report the size of the CAPRI dataset (number of conversations and cue levels), the specific model versions evaluated, and the key percentages showing the inference-application gap (with reference to the statistical tests and effect sizes reported in Sections 4.2 and 5). The full exclusion criteria and per-model breakdowns remain in the experimental sections, but the abstract now provides the minimal numerical context needed to evaluate the claim. revision: yes

  2. Referee: Evaluation setup (implied in abstract description of CAPRI): the paper does not address potential confounds from training data leakage or prompt design that could undermine the isolation of cultural adaptation, which directly affects the validity of the claim that models fail to apply inferred context.

    Authors: We acknowledge that the original submission did not explicitly discuss data-leakage risks or the rationale for the prompt templates. We have added a dedicated paragraph in the Limitations section that (a) describes the construction of novel, post-cutoff conversations to reduce leakage, (b) reports the model knowledge cutoffs relative to dataset release, and (c) explains the controlled prompt variations used to isolate sequential versus joint inference-application. While complete leakage elimination is impossible for web-scale models, these additions make the design choices and their limitations transparent. revision: yes

Circularity Check

0 steps flagged

Empirical prompting study; no derivations or load-bearing self-citations

full rationale

The paper reports experimental results on LLMs' cultural inference vs. application using the CAPRI dataset of conversations. No equations, parameters, or derivation chains appear in the abstract or described structure. Claims rest on direct prompting experiments and observed model outputs rather than any self-referential reduction, fitted inputs renamed as predictions, or uniqueness theorems. Self-citations, if present, are not load-bearing for the central empirical observation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper; no mathematical free parameters, axioms, or invented physical entities. The CAPRI dataset is a methodological tool rather than an invented entity under the ledger definition.

pith-pipeline@v0.9.1-grok · 5699 in / 1059 out tokens · 30296 ms · 2026-06-27T00:45:53.288185+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 27 canonical work pages

  1. [1]

    Jacob Andreas and Dan Klein. 2016. https://doi.org/10.18653/v1/D16-1125 Reasoning about pragmatics with neural listeners and speakers . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1173--1182, Austin, Texas. Association for Computational Linguistics

  2. [2]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

  3. [3]

    Tadesse Destaw Belay, Ahmed Haj Ahmed, Alvin Grissom II, Iqra Ameer, Grigori Sidorov, Olga Kolesnikova, and Seid Muhie Yimam. 2025. https://doi.org/10.18653/v1/2025.acl-long.925 CULEMO : Cultural lenses on emotion - benchmarking LLM s for cross-cultural emotion understanding . In Proceedings of the 63rd Annual Meeting of the Association for Computational ...

  4. [4]

    Yong Cao, Min Chen, and Daniel Hershcovich. 2024. https://doi.org/10.18653/v1/2024.findings-eacl.63 Bridging cultural nuances in dialogue agents through cultural value surveys . In Findings of the Association for Computational Linguistics: EACL 2024, pages 929--945, St. Julian ' s, Malta. Association for Computational Linguistics

  5. [5]

    Yong Cao, Haijiang Liu, Arnav Arora, Isabelle Augenstein, Paul R \"o ttger, and Daniel Hershcovich. 2025 a . https://doi.org/10.18653/v1/2025.naacl-long.162 Specializing large language models to simulate survey response distributions for global populations . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association fo...

  6. [6]

    Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, and Daniel Hershcovich. 2023. https://doi.org/10.18653/v1/2023.c3nlp-1.7 Assessing cross-cultural alignment between C hat GPT and human societies: An empirical study . In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 53--67, Dubrovnik, Croatia. Association ...

  7. [7]

    Zhuchen Cao, Sven Apel, Adish Singla, and Vera Demberg. 2025 b . Pragmatic reasoning improves llm code generation. arXiv preprint arXiv:2502.15835

  8. [8]

    Khyathi Raghavi Chandu, Yonatan Bisk, and Alan W Black. 2021. https://doi.org/10.18653/v1/2021.findings-acl.375 Grounding `grounding' in NLP . In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4283--4305, Online. Association for Computational Linguistics

  9. [9]

    Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, and Yejin Choi. 2025. https://doi.org/10.18653/v1/2025.acl-long.1247 C ultural B ench: A robust, diverse and challenging benchmark for measuring LM s' cultural knowledge through human- AI red-teaming . ...

  10. [10]

    Reuben Cohn-Gordon, Noah Goodman, and Christopher Potts. 2018. https://doi.org/10.18653/v1/N18-2070 Pragmatically informative image captioning with character-level inference . In Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) , pages 4...

  11. [11]

    Yan Cong. 2024. Manner implicatures in large language models. Scientific Reports, 14(1):29113

  12. [12]

    Esin DURMUS, Karina Nguyen, Thomas Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. 2024. https://openreview.net/forum?id=zl16jLb91v Towards measuring the representation...

  13. [13]

    Yanai Elazar and Yoav Goldberg. 2019. https://doi.org/10.1162/tacl_a_00280 Where ' s my head? D efinition, data set, and models for numeric fused-head identification and resolution . Transactions of the Association for Computational Linguistics, 7:519--535

  14. [14]

    Lautaro Estienne, Gabriel Ben Zenou, Nona Naderi, Jackie CK Cheung, and Pablo Piantanida. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1145 Collaborative rational speech act: Pragmatic reasoning for multi-turn dialog . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22509--22523, Suzhou, China. Associa...

  15. [15]

    Michael C Frank and Noah D Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998--998

  16. [16]

    Daniel Fried, Nicholas Tomlin, Jennifer Hu, Roma Patel, and Aida Nematzadeh. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.840 Pragmatics in language grounding: Phenomena, tasks, and modeling approaches . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12619--12640, Singapore. Association for Computational Linguistics

  17. [17]

    Aina Gar \'i Soler and Marianna Apidianaki. 2021. https://doi.org/10.18653/v1/2021.naacl-main.370 Scalar adjective identification and multilingual ranking . In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4653--4660, Online. Association for Computation...

  18. [18]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  19. [19]

    Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, Katerina Margatina, Phillip Rust, and Anders S gaard. 2022. https://doi.org/10.18653/v1/2022.acl-long.482 Challenges and strategies in cross-cultural NLP . In Pro...

  20. [20]

    Jennifer Hu, Sammy Floyd, Olessia Jouravlev, Evelina Fedorenko, and Edward Gibson. 2023. https://doi.org/10.18653/v1/2023.acl-long.230 A fine-grained comparison of pragmatic language understanding in humans and language models . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4194--...

  21. [21]

    Siddharth

    Mingyue Jian and N. Siddharth. 2024. https://arxiv.org/abs/2411.01562 Are llms good pragmatic speakers? Preprint, arXiv:2411.01562

  22. [22]

    Mohsinul Kabir, Ajwad Abrar, and Sophia Ananiadou. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.2 Break the checkbox: Challenging closed-style evaluations of cultural alignment in LLM s . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24--51, Suzhou, China. Association for Computational Linguistics

  23. [23]

    Anjali Kantharuban, Jeremiah Milbauer, Maarten Sap, Emma Strubell, and Graham Neubig. 2025. https://doi.org/10.18653/v1/2025.findings-acl.1254 Stereotype or personalization? user identity biases chatbot recommendations . In Findings of the Association for Computational Linguistics: ACL 2025, pages 24418--24436, Vienna, Austria. Association for Computation...

  24. [24]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611--626

  25. [25]

    Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. 2025 a . https://doi.org/10.1162/tacl_a_00760 Culturally aware and adapted NLP : A taxonomy and a survey of the state of the art . Transactions of the Association for Computational Linguistics, 13:652--689

  26. [26]

    Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, and Hannaneh Hajishirzi. 2024. Infini-gram: Scaling unbounded n-gram language models to a trillion tokens. In First Conference on Language Modeling

  27. [27]

    Nelson, and Vered Shwartz

    Zhuozhuo Joy Liu, Farhan Samir, Mehar Bhatia, Laura K. Nelson, and Vered Shwartz. 2025 b . https://arxiv.org/abs/2505.18322 Is it bad to work all the time? cross-cultural evaluation of social norm biases in gpt-4 . Preprint, arXiv:2505.18322

  28. [28]

    Guy Mor-Lan, Omer Goldman, Matan Eyal, Adi Mayrav Gilady, Sivan Eiger, Idan Szpektor, Avinatan Hassidim, Yossi Matias, and Reut Tsarfaty. 2026. https://arxiv.org/abs/2604.19292 Location not found: Exposing implicit local and global biases in multilingual llms . Preprint, arXiv:2604.19292

  29. [29]

    Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, Victor Gutierrez Basulto, Yazmin Ibanez-Garcia, Hwaran Lee, Shamsuddeen Hassan Muhammad, Kiwoong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, and 3 others. 2024. https...

  30. [30]

    Allen Nie, Reuben Cohn-Gordon, and Christopher Potts. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.173 Pragmatic issue-sensitive image captioning . In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1924--1938, Online. Association for Computational Linguistics

  31. [31]

    Aida Ramezani and Yang Xu. 2023. https://doi.org/10.18653/v1/2023.acl-long.26 Knowledge of cultural moral norms in large language models . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 428--446, Toronto, Canada. Association for Computational Linguistics

  32. [32]

    Abhinav Rao, Akhila Yerukola, Vishwa Shah, Katharina Reinecke, and Maarten Sap. 2025. https://doi.org/10.18653/v1/2025.naacl-long.120 N orm A d: A framework for measuring the cultural adaptability of large language models . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human ...

  33. [33]

    William A Gaviria Rojas, Sudnya Diamos, Keertan Ranjan Kini, David Kanter, Vijay Janapa Reddi, and Cody Coleman. 2022. https://openreview.net/forum?id=qnfYsave0U4 The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world . In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  34. [34]

    Laura Eline Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rockt \"a schel, and Edward Grefenstette. 2023. https://openreview.net/forum?id=5bWW9Eop7l The goldilocks of pragmatic understanding: Fine-tuning strategy matters for implicature resolution by LLM s . In Thirty-seventh Conference on Neural Information Processing Systems

  35. [35]

    Vered Shwartz. 2022. https://doi.org/10.18653/v1/2022.findings-acl.224 Good night at 4 pm?! time expressions in different cultures . In Findings of the Association for Computational Linguistics: ACL 2022, pages 2842--2853, Dublin, Ireland. Association for Computational Linguistics

  36. [36]

    Vered Shwartz. 2025. Lost in Automatic Translation. Cambridge University Press

  37. [37]

    Judith Sieker and Sina Zarrieß. 2026. https://arxiv.org/abs/2604.15873 How hypocritical is your llm judge? listener-speaker asymmetries in the pragmatic competence of large language models . Preprint, arXiv:2604.15873

  38. [38]

    Settaluri Sravanthi, Meet Doshi, Pavan Tankala, Rudra Murthy, Raj Dabre, and Pushpak Bhattacharyya. 2024. https://doi.org/10.18653/v1/2024.findings-acl.719 PUB : A pragmatics understanding benchmark for assessing LLM s' pragmatics capabilities . In Findings of the Association for Computational Linguistics: ACL 2024, pages 12075--12097, Bangkok, Thailand. ...

  39. [39]

    Penka Stateva, Arthur Stepanov, Viviane D \'e prez, Ludivine Emma Dupuy, and Anne Colette Reboul. 2019. Cross-linguistic variation in the meaning of quantifiers: Implications for pragmatic enrichment. Frontiers in Psychology, 10:957

  40. [40]

    Yan Tao, Olga Viberg, Ryan S Baker, and Ren \'e F Kizilcec. 2024. Cultural bias and cultural alignment of large language models. PNAS nexus, 3(9):pgae346

  41. [41]

    Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and Michael Lyu. 2024. https://doi.org/10.18653/v1/2024.acl-long.345 Not all countries celebrate thanksgiving: On the cultural dominance in large language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pap...

  42. [42]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  43. [43]

    Isadora White, Sashrika Pandey, and Michelle Pan. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.711 Communicate to play: Pragmatic reasoning for efficient cross-cultural communication . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12201--12216, Miami, Florida, USA. Association for Computational Linguistics

  44. [44]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: Sta...

  45. [45]

    Hugh Mee Wong, Rick Nouwen, and Albert Gatt. 2025. https://doi.org/10.18653/v1/2025.findings-acl.619 VAQUUM : Are vague quantifiers grounded in visual data? In Findings of the Association for Computational Linguistics: ACL 2025, pages 11966--11982, Vienna, Austria. Association for Computational Linguistics

  46. [46]

    Senqi Yang, Dongyu Zhang, Jing Ren, Ziqi Xu, Xiuzhen Zhang, Yiliao Song, Hongfei Lin, and Feng Xia. 2025. https://doi.org/10.18653/v1/2025.acl-long.1275 Cultural bias matters: A cross-cultural benchmark dataset and sentiment-enriched model for understanding multimodal metaphors . In Proceedings of the 63rd Annual Meeting of the Association for Computation...

  47. [47]

    Wenlong Zhao, Debanjan Mondal, Niket Tandon, Danica Dillion, Kurt Gray, and Yuling Gu. 2024. https://aclanthology.org/2024.lrec-main.1539/ W orld V alues B ench: A large-scale benchmark dataset for multi-cultural value awareness of language models . In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources ...