pith. sign in

arxiv: 2607.02049 · v1 · pith:2ZWSE3ALnew · submitted 2026-07-02 · 💻 cs.CL · cs.AI· cs.CY

SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses

Pith reviewed 2026-07-03 14:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CY
keywords cross-lingual evaluationempathy in LLMscultural groundingUkrainian responsesemotional support benchmarksLLM-as-a-jurycrisis responsemultilingual LLMs
0
0 comments X

The pith

Two of three tested LLMs produce weaker empathetic and culturally grounded responses in Ukrainian than in English on crisis prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SPLIT, a benchmark of 500 prompts across five crisis categories, to measure whether large language models maintain emotional support quality when switching from English to Ukrainian. It evaluates three models on empathetic accuracy, linguistic naturalness, and contextual cultural grounding, then compares human and AI raters. The central result is that Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct decline in Ukrainian while DeepSeek-V3 stays roughly stable, and that human and AI judges diverge most on cultural grounding. The authors conclude that generating Ukrainian text is not the same as generating culturally appropriate Ukrainian emotional support.

Core claim

We introduce SPLIT, a 500-prompt benchmark designed to evaluate LLM consistency in generating emotionally grounded responses across five categories: Stress, Panic, Loneliness, Internal Displacement, and Tension. We evaluate three technically diverse LLMs across three dimensions: Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding. Our findings reveal that Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade when transitioning to Ukrainian, while DeepSeek-V3 remains comparatively stable within our benchmark. We additionally find that human and AI evaluators agree weakly on empathy and naturalness but diverge on cultural grounding. We further argue that producing Uk

What carries the argument

The SPLIT benchmark: 500 prompts in five crisis categories scored on Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding in both English and Ukrainian.

If this is right

  • Model selection for multilingual emotional support should favor architectures that preserve performance across languages rather than those that degrade.
  • Evaluation of crisis-related LLM output must include explicit cultural-grounding metrics beyond language fluency.
  • LLM-as-a-jury methods are unreliable for cultural aspects and should be supplemented by human raters from the target culture.
  • Future training or fine-tuning of LLMs for low-to-mid-resource languages should target emotional-support contexts separately from general text generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment of current LLMs for crisis support in Ukrainian-speaking regions carries a risk of culturally mismatched advice that English-only testing would miss.
  • The same degradation pattern may appear in other mid-resource languages; repeating the SPLIT design for Polish or Romanian would test this.
  • The observed human-AI rater divergence implies that purely automated leaderboards will systematically under-detect cultural failures in emotional support tasks.

Load-bearing premise

The 500 prompts and the three scoring dimensions validly measure cross-lingual empathy and cultural grounding for the chosen crisis categories.

What would settle it

A replication using a fresh set of 500 prompts drawn from the same categories but with Ukrainian cultural experts as primary raters that finds no performance drop for Gemini or LLaMA.

Figures

Figures reproduced from arXiv: 2607.02049 by Anna Chorna.

Figure 1
Figure 1. Figure 1: Cross-lingual performance trajectories showing macro-average human evaluation scores from EN to UA. 1 arXiv:2607.02049v1 [cs.CL] 2 Jul 2026 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the SPLIT evaluation framework. 6 [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Human Evaluation Baseline scores across the three evaluated dimensions in English (EN) and Ukrainian (UA). Based on the results above, several observations can be made to rigorously explain the behavioral pat￾terns of each of the models, and outline the potential reasons for the specific outcome. Our interpretation is informed by differences in model architectures and training data. DeepSeek-V3 demonstrate… view at source ↗
Figure 4
Figure 4. Figure 4: Automated Baseline scores across the three evaluated dimensions in English (EN) and Ukrainian (UA). Linguistic Naturalness remains another question￾able dimension, with the AI jury being partly unable to identify a range of conversational discrepancies. While assigning lower scores for LLaMA when tran￾sitioning to Ukrainian, the difference still remains insignificant, experiencing a drop of 0.156. Although… view at source ↗
Figure 5
Figure 5. Figure 5: 9 [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overall agreement between Automated and Human Evaluation Baselines Our SPLIT benchmark serves as a logical supple￾ment to prior studies in the realm of Cross-Lingual Empathy Divergence. Our findings suggest that LLMs showcase strong capabilities in maintaining coherent cross-lingual interactions and preserving linguistic fluency, thus correlating modestly with humans in general-purpose tasks. However, our … view at source ↗
read the original abstract

Large Language Models are increasingly deployed in emotional-support contexts and crisis-related situations. Nevertheless, their cross-lingual abilities in these circumstances remain underexplored. Existing benchmarks emphasize multilingual performance but rarely examine crisis-related empathy and cultural grounding in low-to-mid-resource languages. We introduce SPLIT, a 500-prompt benchmark designed to evaluate LLM consistency in generating emotionally grounded responses across five categories: Stress, Panic, Loneliness, Internal Displacement, and Tension. We evaluate three technically diverse LLMs across three dimensions: Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding. The framework aims to assess and compare the quality of LLM responses in both English and Ukrainian languages, as well as to explore the reliability of the LLM-as-a-jury paradigm. Our findings reveal that Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade when transitioning to Ukrainian, while DeepSeek-V3 remains comparatively stable within our benchmark. We additionally find that human and AI evaluators agree weakly on empathy and naturalness but diverge on cultural grounding. We further argue that producing Ukrainian text is not equivalent to producing Ukrainian emotional support. Our findings may assist in the future development of more culturally tailored benchmark designs, as well as encourage a stronger emphasis on human-centered evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SPLIT, a 500-prompt benchmark spanning five crisis categories (Stress, Panic, Loneliness, Internal Displacement, Tension) to evaluate three LLMs (Gemini-2.5-Flash, LLaMA-3.3-70B-Instruct, DeepSeek-V3) on Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding in both English and Ukrainian. It reports degradation for the first two models when switching to Ukrainian, relative stability for DeepSeek-V3, and weak human-AI evaluator agreement on empathy/naturalness with divergence on cultural grounding, while arguing that Ukrainian text generation does not equate to culturally appropriate emotional support.

Significance. If the benchmark and evaluation framework validly isolate the targeted constructs, the work would usefully highlight gaps in cross-lingual empathy for crisis-related support in mid-resource languages and question the reliability of LLM-as-judge paradigms. The explicit comparison of human and automated evaluation is a constructive element that could guide more robust multilingual benchmark design.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods (SPLIT benchmark design): The paper states the five crisis categories and three scoring dimensions but supplies no information on prompt authorship (native-speaker cultural adaptation versus translation), rubric operationalization, or any reliability metric such as inter-annotator agreement; this directly undermines the load-bearing claim that the observed language-gap and evaluator-divergence results reflect model behavior rather than benchmark artifacts.
  2. [Evaluation framework / Results] Evaluation framework and Results sections: No details are provided on the scoring rubrics for Contextual & Cultural Grounding, the statistical tests supporting degradation claims, or the correlation metrics and sample sizes used to establish 'weak agreement' between human and AI evaluators; without these, the central findings cannot be verified.
  3. [Methods] § on Ukrainian prompt construction: The absence of explicit criteria tying the cultural-grounding rubric to Ukrainian-specific norms (e.g., internal displacement or tension contexts) leaves open the possibility that the reported stability of DeepSeek-V3 and divergence on cultural grounding are artifacts of literal translation rather than genuine cross-lingual capability.
minor comments (2)
  1. [Abstract] The abstract is information-dense; separating the benchmark description, model results, and evaluator-agreement findings into distinct sentences would improve readability.
  2. [Methods] Consider adding a table summarizing the three dimensions with example rubric items to clarify the evaluation criteria for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments highlight important areas where additional methodological transparency will strengthen the manuscript. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods (SPLIT benchmark design): The paper states the five crisis categories and three scoring dimensions but supplies no information on prompt authorship (native-speaker cultural adaptation versus translation), rubric operationalization, or any reliability metric such as inter-annotator agreement; this directly undermines the load-bearing claim that the observed language-gap and evaluator-divergence results reflect model behavior rather than benchmark artifacts.

    Authors: We agree that explicit details on benchmark construction are essential. Prompts were developed through collaboration with native Ukrainian speakers who created culturally adapted scenarios rather than performing literal translations from English. Rubrics were operationalized drawing on established empathy and cultural psychology literature, with dimension-specific criteria. We will add a dedicated Methods subsection describing the full prompt authorship process, rubric operationalization, and explicitly note that inter-annotator agreement was not computed (single primary annotator with spot checks); this limitation and its implications will be discussed. These additions will directly address the concern about potential benchmark artifacts. revision: yes

  2. Referee: [Evaluation framework / Results] Evaluation framework and Results sections: No details are provided on the scoring rubrics for Contextual & Cultural Grounding, the statistical tests supporting degradation claims, or the correlation metrics and sample sizes used to establish 'weak agreement' between human and AI evaluators; without these, the central findings cannot be verified.

    Authors: We acknowledge the omission of these specifics. The full scoring rubric for Contextual & Cultural Grounding (including example anchors for Ukrainian cultural elements) will be added to an appendix. Degradation claims were supported by paired statistical comparisons (to be named explicitly, e.g., Wilcoxon signed-rank tests with effect sizes). Human-AI agreement used Cohen's kappa and Pearson correlation on the full set of 500 prompts evaluated by three human raters and the LLM judge; exact sample sizes and coefficients will be reported. These details will be inserted into the Evaluation framework and Results sections. revision: yes

  3. Referee: [Methods] § on Ukrainian prompt construction: The absence of explicit criteria tying the cultural-grounding rubric to Ukrainian-specific norms (e.g., internal displacement or tension contexts) leaves open the possibility that the reported stability of DeepSeek-V3 and divergence on cultural grounding are artifacts of literal translation rather than genuine cross-lingual capability.

    Authors: We will revise the Ukrainian prompt construction subsection to explicitly map each element of the cultural-grounding rubric to Ukrainian-specific norms. This includes references to post-2022 internal displacement experiences, culturally normative expressions of tension and emotional restraint, and community support structures, drawn from consultations with Ukrainian cultural experts. The revision will clarify that prompts and rubrics were designed for cultural appropriateness rather than surface-level translation, thereby strengthening the interpretation of model differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained

full rationale

The paper introduces the SPLIT benchmark (500 prompts across five crisis categories) and reports LLM performance on three human-defined scoring dimensions for English vs. Ukrainian responses. No equations, fitted parameters, or derivations are present. The central claims (model degradation patterns and human-AI evaluator agreement) are direct empirical measurements on the newly constructed dataset; they do not reduce to prior outputs, self-citations, or ansatzes by construction. The benchmark design itself is presented as an input rather than a derived result, and no load-bearing step relies on self-referential uniqueness theorems or renamed known patterns. This is a standard empirical model-comparison study whose validity rests on external human evaluation rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical benchmark study; central claim rests on domain assumptions about what the prompts and scoring dimensions measure rather than on mathematical derivations or new postulated entities.

axioms (1)
  • domain assumption The five categories and 500 prompts adequately represent real-world crisis-related empathy and cultural grounding needs in English and Ukrainian.
    Benchmark design choice stated in abstract; no external validation or justification provided.

pith-pipeline@v0.9.1-grok · 5761 in / 1317 out tokens · 34127 ms · 2026-07-03T14:39:40.966008+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Empathic grounding: Explorations using multimodal interaction and large language models with conversational agents

    Mehdi Arjmand, Farnaz Nouraei, Ian Steenstra, and Timothy Bickmore. Empathic grounding: Explorations using multimodal interaction and large language models with conversational agents. InProceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA ’24, New York, NY, USA, 2024. Association for Computing Machinery

  2. [2]

    Multi- lingual routing in mixture-of-experts, 2026

    Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, and Nanyun Peng. Multi- lingual routing in mixture-of-experts, 2026

  3. [3]

    Llms instead of human judges? a large scale empirical study across 20 nlp evalu- ation tasks, 2025

    Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, et al. Llms instead of human judges? a large scale empirical study across 20 nlp evalu- ation tasks, 2025

  4. [4]

    Judgesense: A benchmark for prompt sensitivity in llm-as-a-judge systems, 2026

    Rohith Reddy Bellibatlu, Edward Raff, and Wen- bin Zhang. Judgesense: A benchmark for prompt sensitivity in llm-as-a-judge systems, 2026

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Bal- can, and H. Lin, editors,Advances in Neural In- formation Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020

  6. [6]

    Yu, Qiang Yang, and Xing Xie

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models, 2023

  7. [7]

    Benchmarking llm-as-a-judge for long-form output evaluation, 2026

    Junjie Chen, Yuxi Dong, Haitao Li Graves, Wei- hang Su, Yujia Zhou, Min Zhang, Yiqun Liu, and Qinyao Ai. Benchmarking llm-as-a-judge for long-form output evaluation, 2026

  8. [8]

    Reduc- ing tokenization premiums for low-resource lan- guages, 2026

    Geoffrey Churchill and Steven Skiena. Reduc- ing tokenization premiums for low-resource lan- guages, 2026

  9. [9]

    A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

    Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

  10. [10]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capa- bilities, 2025

    Gheorghe Comanici, Eric Bieber, Mike Schaeker- mann, Ice Pasupat, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capa- bilities, 2025

  11. [11]

    Deepseek-v3 technical report, 2025

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, et al. Deepseek-v3 technical report, 2025

  12. [12]

    Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent

    Markus Freitag, Nitika Mathur, Chi-kiu Lo, et al. Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors,Proceedings of the Eighth Conference on Machine Translation, pages 578–628, Singapore, December 2023. Association for Computational Linguistics

  13. [13]

    How reliable is multilin- gual llm-as-a-judge?, 2025

    Xiyan Fu and Wei Liu. How reliable is multilin- gual llm-as-a-judge?, 2025

  14. [14]

    The llama 3 herd of models, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The llama 3 herd of models, 2024. 16

  15. [15]

    Judge’s verdict: A comprehensive analysis of llm judge capability through human agreement, 2025

    Steve Han, Gilberto Titericz Junior, Tom Balough, and Wenfei Zhou. Judge’s verdict: A comprehensive analysis of llm judge capability through human agreement, 2025

  16. [16]

    Mubench: Assessment of multilingual capabilities of large language models across 61 languages, 2025

    Wenhan Han, Yifan Zhang, Zhixun Chen, Binbin Liu, Haobin Lin, Bingni Zhang, Taifeng Wang, Mykola Pechenizkiy, Meng Fang, and Yin Zheng. Mubench: Assessment of multilingual capabilities of large language models across 61 languages, 2025

  17. [17]

    Xcomps: A mul- tilingual benchmark of conceptual minimal pairs, 2025

    Linyang He, Ercong Nie, Sukru Samet Dindar, Arsalan Firoozi, Adrian Florea, Van Nguyen, Corentin Puffay, Riki Shimizu, Haotian Ye, Jonathan Brennan, Helmut Schmid, Hinrich Sch¨ utze, and Nima Mesgarani. Xcomps: A mul- tilingual benchmark of conceptual minimal pairs, 2025

  18. [18]

    Op- timizing ai language models: A study of chatgpt-4 vs chatgpt-4o.ODU Digital Commons: Electrical & Computer Engineering Faculty Publications,

    MD Fayaz Bin Hossen, Muhammad Enayetur Rahman, Muhammad Rezaur Rahman, et al. Op- timizing ai language models: A study of chatgpt-4 vs chatgpt-4o.ODU Digital Commons: Electrical & Computer Engineering Faculty Publications,

  19. [19]

    Accessed: June 15, 2026

    Preprint. Accessed: June 15, 2026

  20. [20]

    Evaluating the effectiveness of large language models in automated news article sum- marization, 2025

    Lionel Richy Panlap Houamegni and Fatih Gedikli. Evaluating the effectiveness of large language models in automated news article sum- marization, 2025

  21. [21]

    An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge model is not a general substitute for gpt-4, 2025

    Hui Huang, Xingyuan Bu, Hongli Zhou, Yingqi Qu, Jing Liu, Muyun Yang, Bing Xu, and Tiejun Zhao. An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge model is not a general substitute for gpt-4, 2025

  22. [22]

    Olympicarena medal ranks: Who is the most intelligent ai so far?, 2024

    Zhen Huang, Zengzhi Wang, Shijie Xia, and Pengfei Liu. Olympicarena medal ranks: Who is the most intelligent ai so far?, 2024

  23. [23]

    Improbable bigrams expose vulnerabilities of incomplete to- kens in byte-level tokenizers.arXiv preprint arXiv:2410.23684, 2024

    Eugene Jang, Kimin Lee, Jin-Woo Chung, Ke- untae Park, and Seungwon Shin. Improbable bigrams expose vulnerabilities of incomplete to- kens in byte-level tokenizers.arXiv preprint arXiv:2410.23684, 2024

  24. [24]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chap- lot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L´ elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth´ ee Lacroix, and William El Sayed. Mistral 7b, 2023

  25. [25]

    The state and fate of linguistic diversity and inclusion in the nlp world, 2021

    Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the nlp world, 2021

  26. [26]

    Practicing with language models cultivates human empathic communication, 2026

    Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Bruce Lambert, and Matthew Groh. Practicing with language models cultivates human empathic communication, 2026

  27. [27]

    From generation to judgment: Opportunities and challenges of llm-as-a-judge, 2025

    Dawei Li, Bohan Jiang, Liangjie Huang, Alimo- hammad Beigi, Chengshuai Zhao, Zhen Tan, Am- rita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. From generation to judgment: Opportunities and challenges of llm-as-a-judge, 2025

  28. [28]

    Preference leakage: A contamination problem in llm-as-a-judge, 2026

    Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, and Huan Liu. Preference leakage: A contamination problem in llm-as-a-judge, 2026

  29. [30]

    Holistic evaluation of language models, 2023

    Percy Liang, Rishi Bommasani, Tony Lee, et al. Holistic evaluation of language models, 2023

  30. [31]

    Language-specific latent process hinders cross-lingual performance.arXiv preprint arXiv:2505.13141, 2025

    Zheng Wei Lim, Alham Fikri Aji, and Trevor Cohn. Understanding cross-lingual inconsis- tency in large language models.arXiv preprint arXiv:2505.13141, 2025

  31. [32]

    Are multilingual llms culturally-diverse reasoners? an investigation into multicultural proverbs and sayings, 2024

    Chen Cecilia Liu, Fajri Koto, Timothy Bald- win, and Iryna Gurevych. Are multilingual llms culturally-diverse reasoners? an investigation into multicultural proverbs and sayings, 2024. 17

  32. [33]

    Paths not taken: Understand- ing and mending the multilingual factual recall pipeline, 2025

    Meng Lu, Ruochen Zhang, Carsten Eickhoff, and Ellie Pavlick. Paths not taken: Understand- ing and mending the multilingual factual recall pipeline, 2025

  33. [34]

    Emo- tional intelligence in large language models is fragmented across perception, cognition, and in- teraction, 2026

    Minghao Lv, Lu Chen, Enchang Zhang, Anji Zhou, Xiaoran Xue, Hanyi Zhang, Fenghua Tang, Zhuo Rachel Han, and Mengyue Wu. Emo- tional intelligence in large language models is fragmented across perception, cognition, and in- teraction, 2026

  34. [35]

    Indicparam: Bench- mark to evaluate llms on low-resource indic lan- guages, 2026

    Ayush Maheshwari, Kaushal Sharma, Vivek Pa- tel, and Aditya Maheshwari. Indicparam: Bench- mark to evaluate llms on low-resource indic lan- guages, 2026

  35. [36]

    Kar- naze, and Mai ElSherief

    Ananya Malik, Nazanin Sabri, Melissa M. Kar- naze, and Mai ElSherief. Are LLMs empa- thetic to all? investigating the influence of multi-demographic personas on a model’s em- pathy. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Compu- tational Linguistics: EMNLP 2025, pages 24938...

  36. [37]

    Bridging minds and machines: Toward an integration of ai and cognitive science, 2025

    Rui Mao, Qian Liu, Xiao Li, Erik Cambria, and Amir Hussain. Bridging minds and machines: Toward an integration of ai and cognitive science, 2025

  37. [38]

    Masoud, Ziquan Liu, Martin Ferianc, Philip Treleaven, and Miguel Rodrigues

    Reem I. Masoud, Ziquan Liu, Martin Ferianc, Philip Treleaven, and Miguel Rodrigues. Cul- tural alignment in large language models: An explanatory analysis based on hofstede’s cultural dimensions, 2024

  38. [39]

    Poli Nemkova, Amrit Adhikari, Matthew Pearson, Vamsi Krishna Sadu, and Mark V. Albert. Cross- lingual stability and bias in instruction-tuned language models for humanitarian nlp, 2025

  39. [40]

    Gpt-4 technical report, 2024

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. Gpt-4 technical report, 2024

  40. [41]

    Goucher, et al

    OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, et al. Gpt-4o system card, 2024

  41. [42]

    The tokenizer tax across 25 european languages: Domain invariance, cross- lingual few-shot effects, and the ukrainian penalty, 2026

    Volodymyr Ovcharov. The tokenizer tax across 25 european languages: Domain invariance, cross- lingual few-shot effects, and the ukrainian penalty, 2026

  42. [43]

    Samuel J. Paech. Eq-bench: An emotional in- telligence benchmark for large language models, 2024

  43. [44]

    Natural language processing applications for low-resource languages.Natural Language Processing, 31(2):183–197, 2025

    Partha Pakray, Alexander Gelbukh, and Sivaji Bandyopadhyay. Natural language processing applications for low-resource languages.Natural Language Processing, 31(2):183–197, 2025

  44. [45]

    Too polite to be human: Evaluating LLM empa- thy in Korean conversations via a DCT-based framework

    Seoyoon Park, Jaehee Kim, and Hansaem Kim. Too polite to be human: Evaluating LLM empa- thy in Korean conversations via a DCT-based framework. In James Hale, Brian Deuksin Kwon, and Ritam Dutt, editors,Proceedings of the Third Workshop on Social Influence in Conversations (SICon 2025), pages 76–89, Vienna, Austria, jul

  45. [46]

    Association for Computational Linguistics

  46. [47]

    Jos´ e Pombal, Ricardo Rei, and Andr´ e F. T. Mar- tins. Self-preference bias in rubric-based evalu- ation of large language models.arXiv preprint arXiv:2604.06996, 2026

  47. [48]

    Multilingual != multicultural: Evaluating gaps between multilingual capabilities and cul- tural alignment in llms, 2025

    Jonathan Rystrøm, Hannah Rose Kirk, and Scott Hale. Multilingual != multicultural: Evaluating gaps between multilingual capabilities and cul- tural alignment in llms, 2025

  48. [49]

    Lund, Nishith Reddy Mannuru, Muhammad Arbab Arshad, Kadhim Hayawi, Ravi Varma Kumar Bevara, Aashrith Mannuru, and Laiba Batool

    Sakib Shahriar, Brady D. Lund, Nishith Reddy Mannuru, Muhammad Arbab Arshad, Kadhim Hayawi, Ravi Varma Kumar Bevara, Aashrith Mannuru, and Laiba Batool. Putting gpt-4o to the sword: A comprehensive evaluation of lan- guage, vision, speech, and multimodal proficiency. Applied Sciences, 14(17), 2024

  49. [50]

    Lin, Adam S

    Ashish Sharma, Inna W. Lin, Adam S. Miner, David C. Atkins, and Tim Althoff. Towards fa- cilitating empathic conversations in online men- tal health support: A reinforcement learning ap- proach, 2021

  50. [51]

    Julius Sim and Chris C. Wright. The kappa statistic in reliability studies: Use, interpretation, 18 and sample size requirements.Physical Therapy, 85(3):257–268, 2005

  51. [52]

    Judging the judges: A systematic evaluation of bias mitigation strate- gies in llm-as-a-judge pipelines, 2026

    Sadman Kabir Soumik. Judging the judges: A systematic evaluation of bias mitigation strate- gies in llm-as-a-judge pipelines, 2026

  52. [53]

    Beyond the imitation game: Quantify- ing and extrapolating the capabilities of language models, 2023

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the imitation game: Quantify- ing and extrapolating the capabilities of language models, 2023

  53. [54]

    Empowering smaller models: Tuning llama and gemma with chain-of-thought for ukrainian exam tasks, 2025

    Mykyta Syromiatnikov, Victoria Ruvinskaya, and Nataliia Komleva. Empowering smaller models: Tuning llama and gemma with chain-of-thought for ukrainian exam tasks, 2025

  54. [55]

    Syromiatnikov and Victoria M

    Mykyta V. Syromiatnikov and Victoria M. Ruvin- skaya. Ua-code-bench: A competitive program- ming benchmark for evaluating large language models code generation in ukrainian.Informatics Culture Technology, 2:308–314, nov 2025

  55. [56]

    Comparative analysis of automatic literature review using mistral large language model and human reviewers, 2024

    Hsiao-Ching Tsai, Yueh-Fen Huang, and Chih- Wei Kuo. Comparative analysis of automatic literature review using mistral large language model and human reviewers, 2024. Research Square Preprint

  56. [57]

    Atten- tion is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Atten- tion is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Ad- vances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  57. [58]

    All languages matter: Evalu- ating lmms on culturally diverse 100 languages, 2025

    Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, et al. All languages matter: Evalu- ating lmms on culturally diverse 100 languages, 2025

  58. [59]

    Vygotsky.Mind in Society: The Develop- ment of Higher Psychological Processes

    Lev S. Vygotsky.Mind in Society: The Develop- ment of Higher Psychological Processes. Harvard University Press, Cambridge, MA, 1978

  59. [60]

    Auxiliary-loss-free load balancing strategy for mixture-of-experts, 2024

    Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts, 2024

  60. [61]

    Hearti- ficial intelligence: Exploring empathy in language models, 2025

    Victoria Williams and Benjamin Rosman. Hearti- ficial intelligence: Exploring empathy in language models, 2025

  61. [62]

    Evaluating knowledge- based cross-lingual inconsistency in large lan- guage models, 2024

    Xiaolin Xing, Zhiwei He, Haoyu Xu, Xing Wang, Rui Wang, and Yu Hong. Evaluating knowledge- based cross-lingual inconsistency in large lan- guage models, 2024

  62. [63]

    Multi-dimensional evaluation of empathetic dialog responses, 2024

    Zhichao Xu and Jiepu Jiang. Multi-dimensional evaluation of empathetic dialog responses, 2024

  63. [64]

    Llama beyond en- glish: An empirical study on language capability transfer, 2024

    Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, and Xuanjing Huang. Llama beyond en- glish: An empirical study on language capability transfer, 2024

  64. [65]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Sto- ica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

  65. [66]

    Un- derstanding and enhancing the planning capabil- ity of language models via multi-token prediction, 2025

    Qimin Zhong, Hao Liao, Siwei Wang, Mingyang Zhou, Xiaoqun Wu, Rui Mao, and Wei Chen. Un- derstanding and enhancing the planning capabil- ity of language models via multi-token prediction, 2025. 19