
arxiv: 2503.18018 · v2 · submitted 2025-03-23 · 💻 cs.AI · cs.LG

Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Pith reviewed 2026-05-22 22:24 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords large language models · mathematical reasoning · cultural adaptation · GSM8K · model evaluation · cultural sensitivity

The pith

LLMs score lower on mathematically identical problems when the cultural references in them are unfamiliar.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests fourteen models on six culturally rewritten versions of the GSM8K benchmark, each created by swapping in names, foods, and places from Haiti, Moldova, Pakistan, Solomon Islands, Somalia, or Suriname while leaving every number and operation unchanged. Depending on the model, accuracy falls by 0.3 to 5.9 percent on the adapted versions, and McNemar tests confirm the drops as statistically significant (p < 0.01). The authors conclude that mathematical reasoning in current LLMs is not culturally neutral and depends in part on which cultural contexts the model was exposed to during training.

Core claim

Systematically replacing the cultural entities in GSM8K questions, while preserving every mathematical operation and numerical value, reduces accuracy by between 0.3 percent (Claude 3.5 Sonnet) and 5.9 percent (LLaMA 3.1-8B) across 14 models, with p < 0.01 under McNemar tests.
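
To make the paired test behind this claim concrete, here is a minimal sketch of a two-sided exact McNemar test on per-question outcomes. It assumes each model's results are available as two aligned lists of booleans (original vs. adapted question); the function name `mcnemar_exact` and the toy data are illustrative, and the source does not say whether the authors used the exact or the chi-square form of the test.

```python
from math import comb


def mcnemar_exact(orig_correct, variant_correct):
    """Two-sided exact McNemar test on paired per-question outcomes.

    Returns (b, c, p_value), where b counts questions answered correctly
    only in the original form and c counts those answered correctly only
    in the adapted form.
    """
    if len(orig_correct) != len(variant_correct):
        raise ValueError("outcome lists must be paired and equal in length")

    pairs = list(zip(orig_correct, variant_correct))
    # Only discordant pairs carry information about a difference.
    b = sum(1 for o, v in pairs if o and not v)
    c = sum(1 for o, v in pairs if v and not o)
    n = b + c
    if n == 0:
        return b, c, 1.0

    # Under H0 each discordant pair falls either way with probability 0.5,
    # so min(b, c) follows Binomial(n, 0.5); double the lower tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (not from the paper): 8 questions lost, 1 gained, 5 unchanged.
original = [True] * 8 + [False] + [True] * 5
adapted = [False] * 8 + [True] + [True] * 5
b, c, p = mcnemar_exact(original, adapted)
print(f"b={b}, c={c}, p={p:.4f}")
```

Because the test looks only at discordant pairs, it needs the same questions in both conditions, which the paired design of the adapted variants provides.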

What carries the argument

Culturally adapted GSM8K variants, formed by replacing cultural entities in 1,198 questions while keeping all operations and values fixed.
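
The replacement step can be sketched as filling a placeholder template with one culture's entities and then checking that the numbers survive unchanged. This is a hedged illustration, not the authors' pipeline: the template, the entity dictionaries, and the helper names `localize` and `same_numbers` are all hypothetical.

```python
import re

# Matches integers and simple decimals such as 3 or 4.50.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def localize(symbolic_question, entities):
    """Fill {placeholder} slots with one culture's entities.

    str.format raises KeyError if a placeholder has no replacement,
    so an incomplete entity dictionary fails loudly.
    """
    return symbolic_question.format(**entities)


def same_numbers(text_a, text_b):
    """True if both texts contain the same numbers in the same order."""
    return NUMBER.findall(text_a) == NUMBER.findall(text_b)


# Hypothetical template and entity sets, for illustration only.
SYMBOLIC = (
    "{person} bought 3 {food} at a shop in {place} and then 4 more. "
    "How many {food} does {person} have?"
)
BASELINE = {"person": "Emily", "food": "bagels", "place": "Boston"}
PAKISTAN = {"person": "Ayesha", "food": "samosas", "place": "Lahore"}

original = localize(SYMBOLIC, BASELINE)
variant = localize(SYMBOLIC, PAKISTAN)
print(variant)
print("numbers preserved:", same_numbers(original, variant))
```

The number check guards only the numerical values; it says nothing about whether the swapped entities change tokenization or reading difficulty, which is the concern raised under the load-bearing premise.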

If this is right

  • Models can outperform larger ones on specific cultural variants when training data includes relevant regions, as Mistral Saba does on Pakistan-adapted items.
  • Cultural adaptation shifts error distributions, with mathematical reasoning errors at 54.7 percent and calculation errors at 34.5 percent of failures.
  • Current training data distributions limit consistent performance across global contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks that assume cultural neutrality may give overly optimistic estimates of real-world reliability in diverse populations.
  • The replacement method could be applied to other reasoning benchmarks to measure similar cultural dependence.
  • Training data audits focused on geographic and cultural coverage could become a standard evaluation step.

Load-bearing premise

Systematically swapping names, foods, and places does not change linguistic complexity, ambiguity, or problem difficulty in any other way.

What would settle it

Re-running the fourteen models on the original GSM8K set and the six adapted sets and finding no statistically significant accuracy difference would falsify the central claim.

Figures

Figures reproduced from arXiv: 2503.18018 by Aabid Karim, Abdul Karim, Abdul Sattar, Bhoomika Lohana, Jaswinder Singh, Matt Keon.

Figure 1. Cultural Datasets Creation Flow.
Accompanying text (Section 2.1, Cultural Entities Recognition): Initially, we select a representative sample of 200 questions from the 1,319 questions in the GSM8K dataset and manually identify cultural entities through a detailed human evaluation. Subsequently, we manually create symbolic versions of seven randomly chosen questions from this subset, replacing the identified cultural entities with accura…
Figure 3. Accuracy Comparison of GSM8K vs culturally variant versions of GSM8K across various models.
Figure 4. Performance Gap of Models across various culturally adapted GSM8K variants.

Original abstract

We demonstrate that large language models' (LLMs) mathematical reasoning is culturally sensitive: testing 14 models from Anthropic, OpenAI, Google, Meta, DeepSeek, Mistral, and Microsoft across six culturally adapted variants of the GSM8K benchmark, we find accuracy drops ranging from 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B) when math problems are embedded in unfamiliar cultural contexts--even when the underlying mathematical logic remains unchanged. These statistically significant performance reductions (p < 0.01, confirmed through McNemar tests) reveal that mathematical reasoning in LLMs is not culturally neutral. To create these variants for Haiti, Moldova, Pakistan, Solomon Islands, Somalia, and Suriname, we systematically replaced cultural entities (names, foods, places, etc.) in 1,198 GSM8K questions while preserving all mathematical operations and numerical values. Our quantitative error analysis of 18,887 instances reveals that cultural adaptation affects broader reasoning patterns, with mathematical reasoning errors comprising 54.7% and calculation errors 34.5% of failures. Interestingly, cultural familiarity can enhance performance: Mistral Saba outperforms some larger models on Pakistan-adapted problems due to Middle Eastern and South Asian training data exposure. This study underscores the need for more diverse training data to ensure robust LLM performance across global contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that LLMs exhibit culturally sensitive mathematical reasoning. By creating six culturally adapted variants of 1,198 GSM8K questions (Haiti, Moldova, Pakistan, Solomon Islands, Somalia, Suriname) via systematic replacement of cultural entities while preserving all mathematical operations and numerical values, the authors evaluate 14 models and report accuracy drops of 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B). These differences are statistically significant (p < 0.01) per McNemar tests on 18,887 instances; error analysis attributes 54.7% of failures to mathematical reasoning errors and 34.5% to calculation errors. The work concludes that LLM math reasoning is not culturally neutral and highlights benefits from culturally aligned training data.

Significance. If the cultural replacements isolate familiarity without confounding changes to linguistic complexity or tokenization, the results would provide a large-scale empirical demonstration that current LLMs' math capabilities are not culturally neutral, with direct implications for training data curation. The evaluation spans 14 models from multiple providers, uses paired statistical tests, and separates reasoning from calculation errors; the incidental finding that Mistral Saba outperforms some larger models on Pakistan-adapted items due to training-data overlap is a concrete, falsifiable illustration of cultural alignment effects.

major comments (2)
  1. [Variant Creation] The section describing variant creation states that cultural entities were replaced while 'preserving all mathematical operations and numerical values,' yet reports no quantitative checks (token counts per problem, average sentence length, Flesch-Kincaid scores, or embedding-distance comparisons) confirming that linguistic complexity, ambiguity, or surface statistics remained unchanged across the six variants. This is load-bearing for the central claim, because any systematic increase in tokenization difficulty or rarity of proper nouns in the Haiti/Somalia variants could produce the observed 0.3–5.9 % drops independently of cultural familiarity.
  2. [Results and Discussion] The McNemar tests establish that accuracy differs between original and adapted versions, but the paper provides no ablation or control (e.g., random non-cultural replacements or length-matched controls) that would isolate the causal factor as cultural familiarity rather than surface-form changes. Without such isolation, the inference that 'mathematical reasoning in LLMs is not culturally neutral' rests on an unverified assumption.
minor comments (3)
  1. [Abstract] The abstract refers to 'Mistral Saba' as outperforming larger models on Pakistan-adapted problems; this model name does not appear in the list of 14 evaluated models and should be clarified or corrected.
  2. [Error Analysis] The error-type percentages (54.7% reasoning, 34.5% calculation) are presented as aggregates; it is unclear whether they are pooled across all models and variants or computed per condition, which affects interpretability of the claim that cultural adaptation 'affects broader reasoning patterns.'
  3. [Results] Table or figure captions for the per-model accuracy drops should explicitly state the baseline (original GSM8K) against which the percentage-point reductions are measured.
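
The surface-form checks requested in major comment 1 can be sketched as a per-question comparison of simple text statistics. This is a rough stand-in under stated assumptions: it uses a regex word tokenizer rather than any model's actual tokenizer, omits Flesch-Kincaid scores and embedding distances, and the names `surface_stats` and `stat_deltas` are hypothetical.

```python
import re
from statistics import mean


def surface_stats(text):
    """Crude surface statistics for one question."""
    # Split after sentence-final punctuation followed by whitespace,
    # so decimals such as 3.5 do not start a new sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = re.findall(r"\w+", text)
    if not sentences or not tokens:
        raise ValueError("text must contain at least one token")
    return {
        "tokens": len(tokens),
        "sentences": len(sentences),
        "tokens_per_sentence": len(tokens) / len(sentences),
        "mean_token_chars": mean(len(t) for t in tokens),
    }


def stat_deltas(original, variant):
    """Variant minus original for every statistic."""
    a, b = surface_stats(original), surface_stats(variant)
    return {key: b[key] - a[key] for key in a}


# Illustrative pair, not taken from the paper's datasets.
original = "Emily bought 3 bagels in Boston. How many bagels does Emily have?"
variant = "Ayesha bought 3 samosas in Lahore. How many samosas does Ayesha have?"
for key, delta in stat_deltas(original, variant).items():
    print(f"{key}: {delta:+.3f}")
```

Averaging these deltas over all questions in a variant would show whether an adaptation systematically lengthens the text; a subword tokenizer would be needed to test the tokenization concern itself.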

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments highlighting the need for stronger controls on surface-form equivalence and causal isolation. We address each point below and will incorporate the suggested analyses in a revised manuscript.

Point-by-point responses
  1. Referee: [Variant Creation] The section describing variant creation states that cultural entities were replaced while 'preserving all mathematical operations and numerical values,' yet reports no quantitative checks (token counts per problem, average sentence length, Flesch-Kincaid scores, or embedding-distance comparisons) confirming that linguistic complexity, ambiguity, or surface statistics remained unchanged across the six variants. This is load-bearing for the central claim, because any systematic increase in tokenization difficulty or rarity of proper nouns in the Haiti/Somalia variants could produce the observed 0.3–5.9 % drops independently of cultural familiarity.

    Authors: We agree that quantitative verification of linguistic equivalence is important for isolating cultural effects. In the revised manuscript we will add explicit comparisons across all variants, including: (i) mean and distribution of token counts per problem, (ii) average sentence length, (iii) Flesch-Kincaid readability scores, and (iv) cosine distances between sentence embeddings (using a fixed encoder such as all-MiniLM-L6-v2). These statistics will be reported in a new table or appendix to demonstrate that changes are confined to the replaced cultural entities. revision: yes

  2. Referee: [Results and Discussion] The McNemar tests establish that accuracy differs between original and adapted versions, but the paper provides no ablation or control (e.g., random non-cultural replacements or length-matched controls) that would isolate the causal factor as cultural familiarity rather than surface-form changes. Without such isolation, the inference that 'mathematical reasoning in LLMs is not culturally neutral' rests on an unverified assumption.

    Authors: The referee correctly identifies that the current design relies on the assumption that only cultural familiarity is altered. While the replacements were limited to proper nouns, foods, and locations with all numbers and operations unchanged, we acknowledge the absence of explicit controls. In revision we will add an ablation using random non-cultural replacements (e.g., substituting common nouns with frequency-matched but semantically unrelated terms) on a subset of problems and compare performance drops against the cultural variants. This will provide a direct test of whether surface-form changes alone can explain the observed accuracy reductions. We retain the cultural-alignment interpretation as the most parsimonious account given the Mistral Saba result, but will qualify the claim accordingly. revision: yes
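
The comparison this ablation would enable can be sketched as follows: compute each condition's accuracy drop in percentage points against the original GSM8K baseline, then subtract the control's drop from the cultural variant's drop. The data and the names `accuracy` and `drops_vs_baseline` are hypothetical, and the sketch assumes all conditions cover the same questions.

```python
def accuracy(outcomes):
    """Fraction of questions answered correctly."""
    return sum(outcomes) / len(outcomes)


def drops_vs_baseline(baseline, conditions):
    """Percentage-point accuracy drop of each condition vs. the baseline."""
    for name, outcomes in conditions.items():
        if len(outcomes) != len(baseline):
            raise ValueError(f"condition {name!r} is not paired with the baseline")
    base = accuracy(baseline)
    return {name: 100 * (base - accuracy(o)) for name, o in conditions.items()}


# Toy per-question outcomes (not from the paper).
baseline = [True] * 9 + [False] * 1          # original GSM8K questions
conditions = {
    "cultural": [True] * 7 + [False] * 3,    # culturally adapted variant
    "control": [True] * 8 + [False] * 2,     # random non-cultural replacement
}

drops = drops_vs_baseline(baseline, conditions)
excess = drops["cultural"] - drops["control"]
print({name: round(d, 1) for name, d in drops.items()})
print("drop beyond surface-form control (pp):", round(excess, 1))
```

If the control drop matched the cultural drop, surface-form change alone would account for the effect; a clearly larger cultural drop would support the cultural-familiarity reading. Either way, the difference would still need a significance test.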

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with direct testing

full rationale

The paper is an empirical study: it creates six culturally adapted GSM8K variants through entity replacement, evaluates 14 LLMs on them, and reports accuracy drops and error breakdowns from 18,887 instances. The central claims involve no derivations, equations, fitted parameters renamed as predictions, or self-citation chains. The McNemar tests and the error analysis are direct measurements on model outputs, so no result reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that GSM8K measures mathematical reasoning independently of cultural framing and that the chosen replacements isolate cultural familiarity without other confounds.

axioms (1)
  • domain assumption GSM8K benchmark measures mathematical reasoning independent of cultural context.
    The paper uses unmodified GSM8K as the base and assumes adaptations preserve this property.

pith-pipeline@v0.9.0 · 5807 in / 1292 out tokens · 69255 ms · 2026-05-22T22:24:13.802605+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

  2. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems d...

  3. GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

    cs.CL 2026-05 unverdicted novelty 6.0

    GSM-SEM generates reusable, stochastic semantic variants of math reasoning benchmarks that alter underlying facts but preserve answers, producing larger LLM performance drops than prior surface-level variants.
