XCR-Bench: Benchmarking Cross-Cultural Reasoning in LLMs via Culture-Specific Items and Hall's Triad

Hassan Alhuzali; Jimin Huang; Md Mezbaur Rahman; Mohsinul Kabir; Shaoxiong Ji; Sophia Ananiadou; Tasnim Ahmed; Yuechen Jiang

arxiv: 2601.14063 · v2 · pith:W3QA5B4Onew · submitted 2026-01-20 · 💻 cs.CL · cs.AI · cs.CY

Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Shaoxiong Ji, Hassan Alhuzali, Yuechen Jiang, Jimin Huang, Sophia Ananiadou

classification 💻 cs.CL cs.AI cs.CY

keywords cultural · across · cross-cultural · reasoning · csis · llms · models · xcr-bench

Cross-cultural competence in large language models (LLMs) requires understanding and adapting Culture-Specific Items (CSIs) across varying cultural contexts. However, progress in evaluating this capability remains limited by the lack of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. We introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark containing 4.1k parallel sentences and 1,098 CSIs across three reasoning tasks. XCR-Bench integrates Newmark's CSI framework with Hall's Triad of Culture, enabling evaluation across levels of cultural visibility -- from observable practices to implicit social norms and values. Experiments on eight multilingual LLMs show that state-of-the-art models exhibit consistent weaknesses in identifying and adapting specific categories of CSIs, revealing a gap between surface-level recall and explicit cultural reasoning. Performance declines significantly on culturally sensitive categories and deeper cultural levels (p<0.005, 8/8 models), and adaptation quality varies systematically across target cultures and Bengali regional variants, indicating encoded regional and ethno-religious biases even within a single linguistic setting. We publicly release the corpus and code to support future research on cross-cultural NLP.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs
cs.CL · 2026-06 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.