Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?
Pith reviewed 2026-05-22 22:24 UTC · model grok-4.3
The pith
LLMs show lower accuracy on identical math problems when cultural references are unfamiliar.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Creating culturally localized GSM8K variants by systematic replacement of cultural entities while preserving all mathematical operations and numerical values produces accuracy reductions ranging from 0.3 percent for Claude 3.5 Sonnet to 5.9 percent for LLaMA 3.1-8B, with p less than 0.01 via McNemar tests across 14 models.
What carries the argument
Culturally adapted GSM8K variants, formed by replacing cultural entities in 1,198 questions while keeping all operations and values fixed.
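The replacement step can be pictured with a small sketch. It is an illustration under stated assumptions, not the authors' pipeline: the template, the entity tables, and the helper names `localize` and `numbers_in` are all hypothetical. The check at the end only confirms that the numeric tokens are identical across variants, which is the narrow part of the premise that is easy to verify mechanically.

```python
import re

# Hypothetical template: cultural entities are marked with {placeholders};
# the numbers and the operations live in the fixed text and are never touched.
TEMPLATE = (
    "{person} buys 3 boxes of {food} in {place}. "
    "Each box holds 12 {food}. "
    "How many {food} does {person} have in total?"
)

# Hypothetical entity tables; the paper's actual lists are not reproduced here.
ENTITIES = {
    "original": {"person": "Emily", "food": "bagels", "place": "Boston"},
    "pakistan": {"person": "Ayesha", "food": "samosas", "place": "Lahore"},
}


def localize(template: str, entities: dict[str, str]) -> str:
    """Fill the cultural placeholders and leave everything else unchanged."""
    return template.format(**entities)


def numbers_in(text: str) -> list[str]:
    """Extract numeric tokens in order of appearance."""
    return re.findall(r"\d+(?:\.\d+)?", text)


original = localize(TEMPLATE, ENTITIES["original"])
adapted = localize(TEMPLATE, ENTITIES["pakistan"])

# The adapted problem must carry exactly the same numbers in the same order.
assert numbers_in(original) == numbers_in(adapted)
print(adapted)
```

Matching numbers do not show that difficulty is unchanged; the swapped names can still differ in length, rarity, or tokenization.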
If this is right
- Smaller models can outperform larger ones on specific cultural variants when their training data covers the relevant regions, as Mistral Saba does on Pakistan-adapted items.
- Cultural adaptation shifts error distributions, with mathematical reasoning errors at 54.7 percent and calculation errors at 34.5 percent of failures.
- Current training data distributions limit consistent performance across global contexts.
Where Pith is reading between the lines
- Benchmarks that assume cultural neutrality may give overly optimistic estimates of real-world reliability in diverse populations.
- The replacement method could be applied to other reasoning benchmarks to measure similar cultural dependence.
- Training data audits focused on geographic and cultural coverage could become a standard evaluation step.
Load-bearing premise
Systematically swapping names, foods, and places does not change linguistic complexity, ambiguity, or problem difficulty in any other way.
What would settle it
Re-running the fourteen models on the original GSM8K set and the six adapted sets and finding no statistically significant accuracy difference would falsify the central claim.
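A minimal sketch of the paired test behind that check, assuming per-question correctness flags are available for one model on the original and on one adapted set. It uses the exact binomial form of McNemar's test; whether the paper used the exact or the chi-square form is not stated here, and the function name and toy data are hypothetical.

```python
from math import comb


def mcnemar_exact(original_correct: list[bool],
                  adapted_correct: list[bool]) -> tuple[int, int, float]:
    """Exact two-sided McNemar test on paired correctness flags.

    Returns (b, c, p), where b counts items solved only on the original
    set and c counts items solved only on the adapted set.
    """
    if len(original_correct) != len(adapted_correct):
        raise ValueError("paired lists must have equal length")
    pairs = list(zip(original_correct, adapted_correct))
    b = sum(o and not a for o, a in pairs)
    c = sum(a and not o for o, a in pairs)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: 20 paired items, 9 lost after adaptation, 1 gained, 10 unchanged.
original_flags = [True] * 9 + [False] * 1 + [True] * 10
adapted_flags = [False] * 9 + [True] * 1 + [True] * 10
b, c, p = mcnemar_exact(original_flags, adapted_flags)
print(b, c, p)
```

Only the discordant pairs enter the test, so items a model gets right (or wrong) in both conditions carry no weight.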
Original abstract
We demonstrate that large language models' (LLMs) mathematical reasoning is culturally sensitive: testing 14 models from Anthropic, OpenAI, Google, Meta, DeepSeek, Mistral, and Microsoft across six culturally adapted variants of the GSM8K benchmark, we find accuracy drops ranging from 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B) when math problems are embedded in unfamiliar cultural contexts, even when the underlying mathematical logic remains unchanged. These statistically significant performance reductions (p < 0.01, confirmed through McNemar tests) reveal that mathematical reasoning in LLMs is not culturally neutral. To create these variants for Haiti, Moldova, Pakistan, Solomon Islands, Somalia, and Suriname, we systematically replaced cultural entities (names, foods, places, etc.) in 1,198 GSM8K questions while preserving all mathematical operations and numerical values. Our quantitative error analysis of 18,887 instances reveals that cultural adaptation affects broader reasoning patterns, with mathematical reasoning errors comprising 54.7% and calculation errors 34.5% of failures. Interestingly, cultural familiarity can enhance performance: Mistral Saba outperforms some larger models on Pakistan-adapted problems due to Middle Eastern and South Asian training data exposure. This study underscores the need for more diverse training data to ensure robust LLM performance across global contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit culturally sensitive mathematical reasoning. The authors create six culturally adapted variants of 1,198 GSM8K questions (Haiti, Moldova, Pakistan, Solomon Islands, Somalia, Suriname) by systematically replacing cultural entities while preserving all mathematical operations and numerical values. They evaluate 14 models and report accuracy drops of 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B). These differences are statistically significant (p < 0.01) per McNemar tests. A separate error analysis of 18,887 instances attributes 54.7% of failures to mathematical reasoning errors and 34.5% to calculation errors. The work concludes that LLM math reasoning is not culturally neutral and highlights benefits from culturally aligned training data.
Significance. If the cultural replacements isolate familiarity without confounding changes to linguistic complexity or tokenization, the results would provide a large-scale empirical demonstration that current LLMs' math capabilities are not culturally neutral, with direct implications for training data curation. The evaluation spans 14 models from multiple providers, uses paired statistical tests, and separates reasoning from calculation errors; the incidental finding that Mistral Saba outperforms some larger models on Pakistan-adapted items due to training-data overlap is a concrete, falsifiable illustration of cultural alignment effects.
major comments (2)
- [Variant Creation] The section describing variant creation states that cultural entities were replaced while 'preserving all mathematical operations and numerical values,' yet reports no quantitative checks (token counts per problem, average sentence length, Flesch-Kincaid scores, or embedding-distance comparisons) confirming that linguistic complexity, ambiguity, or surface statistics remained unchanged across the six variants. This is load-bearing for the central claim, because any systematic increase in tokenization difficulty or rarity of proper nouns in the Haiti/Somalia variants could produce the observed 0.3–5.9 % drops independently of cultural familiarity.
- [Results and Discussion] The McNemar tests establish that accuracy differs between original and adapted versions, but the paper provides no ablation or control (e.g., random non-cultural replacements or length-matched controls) that would isolate the causal factor as cultural familiarity rather than surface-form changes. Without such isolation, the inference that 'mathematical reasoning in LLMs is not culturally neutral' rests on an unverified assumption.
minor comments (3)
- [Abstract] The abstract refers to 'Mistral Saba' as outperforming larger models on Pakistan-adapted problems; this model name does not appear in the list of 14 evaluated models and should be clarified or corrected.
- [Error Analysis] The error-type percentages (54.7 % reasoning, 34.5 % calculation) are presented as aggregates; it is unclear whether they are pooled across all models and variants or computed per condition, which affects interpretability of the claim that cultural adaptation 'affects broader reasoning patterns.'
- [Results] Table or figure captions for the per-model accuracy drops should explicitly state the baseline (original GSM8K) against which the percentage-point reductions are measured.
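The pooling question in the Error Analysis comment can be made concrete with a short sketch. The failure records below are invented for illustration, and the `shares` helper is hypothetical; the point is only that a pooled share and the per-condition shares can tell different stories from the same records.

```python
from collections import Counter

# Hypothetical failure records: (model, variant, error_type).
failures = [
    ("model_a", "haiti", "reasoning"),
    ("model_a", "haiti", "reasoning"),
    ("model_a", "haiti", "calculation"),
    ("model_b", "somalia", "calculation"),
    ("model_b", "somalia", "calculation"),
    ("model_b", "somalia", "calculation"),
    ("model_b", "somalia", "reasoning"),
    ("model_b", "somalia", "other"),
]


def shares(records):
    """Fraction of failures per error type within the given records."""
    counts = Counter(error for _, _, error in records)
    total = sum(counts.values())
    return {error: count / total for error, count in counts.items()}


# Pooled: every failure counts once, regardless of model or variant.
pooled = shares(failures)

# Per condition: group by (model, variant) first, then compute shares.
grouped = {}
for record in failures:
    grouped.setdefault(record[:2], []).append(record)
per_condition = {key: shares(rows) for key, rows in grouped.items()}

print(pooled)
print(per_condition)
```

In this toy data calculation errors dominate the pooled view, while reasoning errors dominate for one of the two conditions, which is why the aggregation level needs to be stated alongside the percentages.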
Simulated Author's Rebuttal
We thank the referee for these constructive comments highlighting the need for stronger controls on surface-form equivalence and causal isolation. We address each point below and will incorporate the suggested analyses in a revised manuscript.
Point-by-point responses
- Referee: [Variant Creation] The section describing variant creation states that cultural entities were replaced while 'preserving all mathematical operations and numerical values,' yet reports no quantitative checks (token counts per problem, average sentence length, Flesch-Kincaid scores, or embedding-distance comparisons) confirming that linguistic complexity, ambiguity, or surface statistics remained unchanged across the six variants. This is load-bearing for the central claim, because any systematic increase in tokenization difficulty or rarity of proper nouns in the Haiti/Somalia variants could produce the observed 0.3–5.9 % drops independently of cultural familiarity.
Authors: We agree that quantitative verification of linguistic equivalence is important for isolating cultural effects. In the revised manuscript we will add explicit comparisons across all variants, including: (i) mean and distribution of token counts per problem, (ii) average sentence length, (iii) Flesch-Kincaid readability scores, and (iv) cosine distances between sentence embeddings (using a fixed encoder such as all-MiniLM-L6-v2). These statistics will be reported in a new table or appendix to demonstrate that changes are confined to the replaced cultural entities.
Revision: yes
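The promised surface checks could look roughly like the sketch below. Several simplifications are assumptions made to keep it self-contained: whitespace token counts stand in for a model tokenizer, the syllable counter is a crude vowel-group heuristic, and a bag-of-words cosine distance stands in for the sentence-embedding distance the authors mention. The example problems are hypothetical, and inputs are assumed to be non-empty text without decimal numbers.

```python
import math
import re
from collections import Counter


def words(text: str) -> list[str]:
    return re.findall(r"[A-Za-z']+", text)


def sentences(text: str) -> list[str]:
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def syllables(word: str) -> int:
    # Crude heuristic: count vowel groups, with at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    w, s = words(text), sentences(text)
    syllable_total = sum(syllables(x) for x in w)
    return 0.39 * len(w) / len(s) + 11.8 * syllable_total / len(w) - 15.59


def cosine_distance(a: str, b: str) -> float:
    # Bag-of-words stand-in for an embedding-based distance.
    ca = Counter(x.lower() for x in words(a))
    cb = Counter(x.lower() for x in words(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return 1.0 - dot / norm


def surface_stats(text: str) -> dict[str, float]:
    w, s = words(text), sentences(text)
    return {
        "tokens": len(text.split()),  # whitespace tokens, not model tokens
        "avg_sentence_len": len(w) / len(s),
        "fk_grade": flesch_kincaid_grade(text),
    }


original = ("Emily buys 3 boxes of bagels in Boston. "
            "Each box holds 12 bagels. How many bagels does Emily have?")
adapted = ("Ayesha buys 3 boxes of samosas in Lahore. "
           "Each box holds 12 samosas. How many samosas does Ayesha have?")

print(surface_stats(original))
print(surface_stats(adapted))
print(cosine_distance(original, adapted))
```

Identical whitespace counts can hide large differences in subword tokenization, so a check with each evaluated model's own tokenizer would be the more direct answer to the referee's point.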
- Referee: [Results and Discussion] The McNemar tests establish that accuracy differs between original and adapted versions, but the paper provides no ablation or control (e.g., random non-cultural replacements or length-matched controls) that would isolate the causal factor as cultural familiarity rather than surface-form changes. Without such isolation, the inference that 'mathematical reasoning in LLMs is not culturally neutral' rests on an unverified assumption.
Authors: The referee correctly identifies that the current design relies on the assumption that only cultural familiarity is altered. While the replacements were limited to proper nouns, foods, and locations with all numbers and operations unchanged, we acknowledge the absence of explicit controls. In revision we will add an ablation using random non-cultural replacements (e.g., substituting common nouns with frequency-matched but semantically unrelated terms) on a subset of problems and compare performance drops against the cultural variants. This will provide a direct test of whether surface-form changes alone can explain the observed accuracy reductions. We retain the cultural-alignment interpretation as the most parsimonious account given the Mistral Saba result, but will qualify the claim accordingly.
Revision: yes
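One way to read the proposed ablation is as a comparison of two drops measured on the same items: the drop under cultural replacement and the drop under the non-cultural control. The sketch below assumes paired per-item correctness flags for all three conditions and uses a paired bootstrap for the interval. The function names, the toy data, and the choice of a bootstrap are illustrative assumptions rather than the authors' stated analysis.

```python
import random


def accuracy(flags: list[bool]) -> float:
    return sum(flags) / len(flags)


def excess_drop(original, cultural, control) -> float:
    """Cultural drop minus control drop, both measured against the original."""
    cultural_drop = accuracy(original) - accuracy(cultural)
    control_drop = accuracy(original) - accuracy(control)
    return cultural_drop - control_drop


def bootstrap_ci(original, cultural, control,
                 n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for the excess drop, resampling items in pairs."""
    rng = random.Random(seed)
    n = len(original)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(excess_drop(
            [original[i] for i in idx],
            [cultural[i] for i in idx],
            [control[i] for i in idx],
        ))
    estimates.sort()
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Toy data for 100 paired items (hypothetical).
original = [True] * 90 + [False] * 10
cultural = [True] * 80 + [False] * 20
control = [True] * 88 + [False] * 12

print(excess_drop(original, cultural, control))
print(bootstrap_ci(original, cultural, control))
```

An interval that excludes zero would suggest the cultural variants cost more accuracy than surface-form change alone; an interval that spans zero would leave the referee's objection standing.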
Circularity Check
No circularity: purely empirical evaluation with direct testing
Full rationale
The paper conducts an empirical study by creating six culturally adapted GSM8K variants through entity replacement and evaluating 14 LLMs on them, reporting accuracy drops and an error breakdown over 18,887 instances. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the central claims. The McNemar tests and error analysis are direct measurements on model outputs, so the results do not reduce to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the GSM8K benchmark measures mathematical reasoning independent of cultural context.
Forward citations
Cited by 3 Pith papers
- Robust Reasoning Benchmark: Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs, and sequential solving reveals context pollution in attention mechanisms.
- Robust Reasoning Benchmark: The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems d...
- GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations: GSM-SEM generates reusable, stochastic semantic variants of math reasoning benchmarks that alter underlying facts but preserve answers, producing larger LLM performance drops than prior surface-level variants.
Appendix A: Dataset Creation
1 Prompt for Cultural Entities Recognition
Figure A1: Prompt for Cultural Entities Recognition
Figure A2: Prompt for Recognized Cultural Entities Evaluation
2 Evaluation and Manual Cor...
- The mathematical logic of the questions remained unchanged.
- There were no modifications to numerical values.
During this process, we encountered a small inconsistency. While GPT-4o correctly identified cultural entities and verified them using our 5-shot prompt, it used inconsistent names for the same entity. For example, it correctly identified a Person name in each question and replaced it with a placeholder. But ...