Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Aabid Karim; Abdul Karim; Abdul Sattar; Bhoomika Lohana; Jaswinder Singh; Matt Keon

arxiv: 2503.18018 · v2 · submitted 2025-03-23 · 💻 cs.AI · cs.LG

Pith reviewed 2026-05-22 22:24 UTC · model grok-4.3

keywords large language models · mathematical reasoning · cultural adaptation · GSM8K · model evaluation · cultural sensitivity

The pith

LLMs show lower accuracy on identical math problems when cultural references are unfamiliar.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests fourteen models on six culturally rewritten versions of the GSM8K benchmark, one each for Haiti, Moldova, Pakistan, Solomon Islands, Somalia, and Suriname. Each version swaps in names, foods, and places from that country while leaving every number and operation unchanged. Accuracy falls by between 0.3 and 5.9 percent on the adapted versions, and McNemar tests confirm the drops as statistically significant (p < 0.01). The authors conclude that mathematical reasoning in current LLMs is not culturally neutral and depends in part on exposure during training.

Core claim

Creating culturally localized GSM8K variants by systematic replacement of cultural entities while preserving all mathematical operations and numerical values produces accuracy reductions ranging from 0.3 percent for Claude 3.5 Sonnet to 5.9 percent for LLaMA 3.1-8B, with p less than 0.01 via McNemar tests across 14 models.

What carries the argument

Culturally adapted GSM8K variants, formed by replacing cultural entities in 1,198 questions while keeping all operations and values fixed.
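
The replacement step can be pictured with a small sketch. It assumes a symbolic template with named placeholders; the template text, the entity tables, and the helper names `render` and `numbers` are all invented for illustration and are not taken from the paper's pipeline. The only point is that a variant counts as valid when its numeric content matches the original exactly.

```python
import re
from string import Template

# Hypothetical symbolic template: placeholders stand in for cultural entities.
TEMPLATE = Template(
    "$person bought 12 plates of $food in $city and gave 5 away. "
    "How many plates does $person have left?"
)

# Illustrative entity tables (invented for this sketch, not from the paper).
ENTITIES = {
    "original": {"person": "Emily", "food": "pasta", "city": "Boston"},
    "pakistan": {"person": "Ayesha", "food": "biryani", "city": "Lahore"},
}


def render(variant: str) -> str:
    """Fill the template with the entity table for one variant."""
    return TEMPLATE.substitute(ENTITIES[variant])


def numbers(text: str) -> list[str]:
    """Extract every numeric literal in order of appearance."""
    return re.findall(r"\d+(?:\.\d+)?", text)


original = render("original")
adapted = render("pakistan")

# The adaptation is only acceptable if the numeric content is untouched.
assert numbers(original) == numbers(adapted)
print(adapted)
```

A check like `numbers(original) == numbers(adapted)` guards the numeric values only; it says nothing about whether the surrounding wording became harder to read, which is the concern raised under "Load-bearing premise".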

If this is right

A smaller model can outperform larger ones on a specific cultural variant when its training data covers the relevant region, as Mistral Saba does on Pakistan-adapted items.
Cultural adaptation shifts error distributions, with mathematical reasoning errors at 54.7 percent and calculation errors at 34.5 percent of failures.
Current training data distributions limit consistent performance across global contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks that assume cultural neutrality may give overly optimistic estimates of real-world reliability in diverse populations.
The replacement method could be applied to other reasoning benchmarks to measure similar cultural dependence.
Training data audits focused on geographic and cultural coverage could become a standard evaluation step.

Load-bearing premise

Systematically swapping names, foods, and places does not change linguistic complexity, ambiguity, or problem difficulty in any other way.

What would settle it

Re-running the fourteen models on the original GSM8K set and the six adapted sets and finding no statistically significant accuracy difference would falsify the central claim.
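
As a rough illustration of the paired comparison, here is a minimal sketch of an exact (binomial) McNemar test on per-question correctness flags. The page does not say which variant of the test the authors used, so the exact form is an assumption, and the function name `mcnemar_exact` and the toy data are hypothetical.

```python
from math import comb


def mcnemar_exact(orig_correct, adapted_correct):
    """Two-sided exact McNemar test on paired per-question outcomes."""
    if len(orig_correct) != len(adapted_correct):
        raise ValueError("paired lists must have equal length")
    pairs = list(zip(orig_correct, adapted_correct))
    # b: right on the original, wrong on the adapted item; c: the reverse.
    b = sum(1 for o, a in pairs if o and not a)
    c = sum(1 for o, a in pairs if a and not o)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (invented): 10 questions, 8 flip from correct to wrong.
orig = [True] * 10
adapted = [False] * 8 + [True] * 2
b, c, p_value = mcnemar_exact(orig, adapted)
print(b, c, p_value)
```

Only the discordant pairs enter the test, so a result like this speaks to whether accuracy differs between the two versions, not to why it differs.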

Figures

Figures reproduced from arXiv: 2503.18018 by Aabid Karim, Abdul Karim, Abdul Sattar, Bhoomika Lohana, Jaswinder Singh, Matt Keon.

**Figure 1.** Cultural Datasets Creation Flow. Accompanying text from Section 2.1 (Cultural Entities Recognition): Initially, we select a representative sample of 200 questions from the 1,319 questions in the GSM8K dataset and manually identify cultural entities through a detailed human evaluation. Subsequently, we manually create symbolic versions of seven randomly chosen questions from this subset, replacing the identified cultural entities with accura…

**Figure 3.** Accuracy Comparison of GSM8K vs culturally variant versions of GSM8K across various models

**Figure 4.** Performance Gap of Models across various culturally adapted GSM8K variants

Original abstract

We demonstrate that large language models' (LLMs) mathematical reasoning is culturally sensitive: testing 14 models from Anthropic, OpenAI, Google, Meta, DeepSeek, Mistral, and Microsoft across six culturally adapted variants of the GSM8K benchmark, we find accuracy drops ranging from 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B) when math problems are embedded in unfamiliar cultural contexts--even when the underlying mathematical logic remains unchanged. These statistically significant performance reductions (p < 0.01, confirmed through McNemar tests) reveal that mathematical reasoning in LLMs is not culturally neutral. To create these variants for Haiti, Moldova, Pakistan, Solomon Islands, Somalia, and Suriname, we systematically replaced cultural entities (names, foods, places, etc.) in 1,198 GSM8K questions while preserving all mathematical operations and numerical values. Our quantitative error analysis of 18,887 instances reveals that cultural adaptation affects broader reasoning patterns, with mathematical reasoning errors comprising 54.7% and calculation errors 34.5% of failures. Interestingly, cultural familiarity can enhance performance: Mistral Saba outperforms some larger models on Pakistan-adapted problems due to Middle Eastern and South Asian training data exposure. This study underscores the need for more diverse training data to ensure robust LLM performance across global contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper reports small but significant accuracy drops on culturally adapted GSM8K problems across 14 models, yet provides no checks that the adaptations left linguistic surface features unchanged.

The core finding is that swapping names, foods, and places in GSM8K questions for six lower-resource countries produces accuracy drops between 0.3% and 5.9% on 14 models, with McNemar tests confirming the differences at p < 0.01. The authors also break down errors on nearly 19,000 failures and note that one smaller model performs better on the Pakistan variant, which they tie to training data overlap. That is the new quantitative piece: specific drops and error splits for these particular adaptations are not in the earlier literature they cite.

Referee Report

2 major / 3 minor

Summary. The paper claims that LLMs exhibit culturally sensitive mathematical reasoning. By creating six culturally adapted variants of 1,198 GSM8K questions (Haiti, Moldova, Pakistan, Solomon Islands, Somalia, Suriname) via systematic replacement of cultural entities while preserving all mathematical operations and numerical values, the authors evaluate 14 models and report accuracy drops of 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B). These differences are statistically significant (p < 0.01) per McNemar tests. A separate error analysis of 18,887 instances attributes 54.7% of failures to mathematical reasoning errors and 34.5% to calculation errors. The work concludes that LLM math reasoning is not culturally neutral and highlights benefits from culturally aligned training data.

Significance. If the cultural replacements isolate familiarity without confounding changes to linguistic complexity or tokenization, the results would provide a large-scale empirical demonstration that current LLMs' math capabilities are not culturally neutral, with direct implications for training data curation. The evaluation spans 14 models from multiple providers, uses paired statistical tests, and separates reasoning from calculation errors; the incidental finding that Mistral Saba outperforms some larger models on Pakistan-adapted items due to training-data overlap is a concrete, falsifiable illustration of cultural alignment effects.

major comments (2)

[Variant Creation] The section describing variant creation states that cultural entities were replaced while 'preserving all mathematical operations and numerical values,' yet reports no quantitative checks (token counts per problem, average sentence length, Flesch-Kincaid scores, or embedding-distance comparisons) confirming that linguistic complexity, ambiguity, or surface statistics remained unchanged across the six variants. This is load-bearing for the central claim, because any systematic increase in tokenization difficulty or rarity of proper nouns in the Haiti/Somalia variants could produce the observed 0.3–5.9% drops independently of cultural familiarity.
[Results and Discussion] The McNemar tests establish that accuracy differs between original and adapted versions, but the paper provides no ablation or control (e.g., random non-cultural replacements or length-matched controls) that would isolate the causal factor as cultural familiarity rather than surface-form changes. Without such isolation, the inference that 'mathematical reasoning in LLMs is not culturally neutral' rests on an unverified assumption.
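
Two of the surface checks named in the first major comment, token counts and sentence length, could be sketched as follows. This is a simplification under stated assumptions: tokens are whitespace-separated words rather than any model's subword tokens, sentences are split naively on end punctuation, and `surface_stats` and the sample problems are hypothetical. A real audit would rerun the counts with each evaluated model's own tokenizer.

```python
import re
from statistics import mean


def surface_stats(problems):
    """Return mean whitespace-token count and mean words per sentence."""
    token_counts = [len(p.split()) for p in problems]
    sentence_lengths = []
    for p in problems:
        # Naive sentence split: break after ., ! or ? when followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", p.strip()) if s]
        sentence_lengths.extend(len(s.split()) for s in sentences)
    return {
        "mean_tokens": mean(token_counts),
        "mean_sentence_len": mean(sentence_lengths),
    }


# Toy problems (invented): same numbers, different cultural entities.
original_stats = surface_stats(["Emily has 3 apples. She buys 2 more."])
adapted_stats = surface_stats(["Ayesha has 3 mangoes. She buys 2 more."])
print(original_stats, adapted_stats)
```

Matching word-level statistics would not rule out the tokenization concern, since a rare proper noun can count as one word yet split into several subword tokens.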

minor comments (3)

[Abstract] The abstract refers to 'Mistral Saba' as outperforming larger models on Pakistan-adapted problems; this model name does not appear in the list of 14 evaluated models and should be clarified or corrected.
[Error Analysis] The error-type percentages (54.7% reasoning, 34.5% calculation) are presented as aggregates; it is unclear whether they are pooled across all models and variants or computed per condition, which affects interpretability of the claim that cultural adaptation 'affects broader reasoning patterns.'
[Results] Table or figure captions for the per-model accuracy drops should explicitly state the baseline (original GSM8K) against which the percentage-point reductions are measured.
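
The baseline question in the last minor comment matters because a "drop" can be read two ways. The sketch below, with hypothetical helper names and invented toy data, contrasts a drop in percentage points against the original GSM8K accuracy with a drop relative to that accuracy.

```python
def accuracy(flags):
    """Accuracy in percent from a list of per-question correctness flags."""
    return 100.0 * sum(flags) / len(flags)


def drop_in_points(baseline_flags, variant_flags):
    """Absolute drop in percentage points, measured against the baseline."""
    return accuracy(baseline_flags) - accuracy(variant_flags)


def relative_drop(baseline_flags, variant_flags):
    """Drop expressed as a percentage of the baseline accuracy."""
    base = accuracy(baseline_flags)
    return 100.0 * (base - accuracy(variant_flags)) / base


# Toy data (invented): 9/10 correct on the original, 8/10 on a variant.
baseline = [True] * 9 + [False]
variant = [True] * 8 + [False] * 2
points = drop_in_points(baseline, variant)
relative = relative_drop(baseline, variant)
print(points, relative)
```

With these toy numbers the same change reads as 10 percentage points or roughly 11 percent, which is why a caption should name the convention it uses.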

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments highlighting the need for stronger controls on surface-form equivalence and causal isolation. We address each point below and will incorporate the suggested analyses in a revised manuscript.

Referee: [Variant Creation] The section describing variant creation states that cultural entities were replaced while 'preserving all mathematical operations and numerical values,' yet reports no quantitative checks (token counts per problem, average sentence length, Flesch-Kincaid scores, or embedding-distance comparisons) confirming that linguistic complexity, ambiguity, or surface statistics remained unchanged across the six variants. This is load-bearing for the central claim, because any systematic increase in tokenization difficulty or rarity of proper nouns in the Haiti/Somalia variants could produce the observed 0.3–5.9% drops independently of cultural familiarity.

Authors: We agree that quantitative verification of linguistic equivalence is important for isolating cultural effects. In the revised manuscript we will add explicit comparisons across all variants, including: (i) mean and distribution of token counts per problem, (ii) average sentence length, (iii) Flesch-Kincaid readability scores, and (iv) cosine distances between sentence embeddings (using a fixed encoder such as all-MiniLM-L6-v2). These statistics will be reported in a new table or appendix to demonstrate that changes are confined to the replaced cultural entities. revision: yes
Referee: [Results and Discussion] The McNemar tests establish that accuracy differs between original and adapted versions, but the paper provides no ablation or control (e.g., random non-cultural replacements or length-matched controls) that would isolate the causal factor as cultural familiarity rather than surface-form changes. Without such isolation, the inference that 'mathematical reasoning in LLMs is not culturally neutral' rests on an unverified assumption.

Authors: The referee correctly identifies that the current design relies on the assumption that only cultural familiarity is altered. While the replacements were limited to proper nouns, foods, and locations with all numbers and operations unchanged, we acknowledge the absence of explicit controls. In revision we will add an ablation using random non-cultural replacements (e.g., substituting common nouns with frequency-matched but semantically unrelated terms) on a subset of problems and compare performance drops against the cultural variants. This will provide a direct test of whether surface-form changes alone can explain the observed accuracy reductions. We retain the cultural-alignment interpretation as the most parsimonious account given the Mistral Saba result, but will qualify the claim accordingly. revision: yes
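
The control condition proposed in the second response could start from something like the sketch below, which swaps chosen common nouns for unrelated ones while leaving numbers and word counts alone. Everything here is an assumption made for illustration: `CONTROL_POOL`, `control_variant`, and the sample problem are invented, and the pool is not actually frequency-matched, which the proposed ablation would require.

```python
import random
import re

# Hypothetical pool of unrelated common nouns. A real control would draw
# frequency-matched words from a corpus; this list is only a placeholder.
CONTROL_POOL = ["chair", "bottle", "pencil", "basket"]


def control_variant(problem, target_nouns, seed=0):
    """Replace each target noun with a random unrelated common noun."""
    rng = random.Random(seed)  # seeded so the control set is reproducible
    out = problem
    for noun in target_nouns:
        candidates = [w for w in CONTROL_POOL if w != noun]
        replacement = rng.choice(candidates)
        # Match whole words only and carry over a trailing plural "s".
        pattern = rf"\b{re.escape(noun)}(s?)\b"
        out = re.sub(pattern, lambda m, r=replacement: r + m.group(1), out)
    return out


problem = "Emily has 3 apples and buys 2 more apples."
control = control_variant(problem, ["apple"])
print(control)
```

If models lose as much accuracy on such controls as on the cultural variants, surface-form change alone would be enough to explain the gap; a clearly smaller loss would favor the cultural-familiarity reading.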

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with direct testing

The paper conducts an empirical study by creating six culturally adapted GSM8K variants through entity replacement and evaluating 14 LLMs on them, reporting accuracy drops and error breakdowns from 18,887 instances. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the central claims. The McNemar tests and error analysis are direct measurements against external model outputs, rendering the work self-contained against benchmarks with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that GSM8K measures mathematical reasoning independently of cultural framing and that the chosen replacements isolate cultural familiarity without other confounds.

axioms (1)

Domain assumption: the GSM8K benchmark measures mathematical reasoning independent of cultural context.
The paper uses unmodified GSM8K as the base and assumes adaptations preserve this property.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Robust Reasoning Benchmark
cs.LG · 2026-03 · unverdicted · novelty 7.0

Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

Robust Reasoning Benchmark
cs.LG · 2026-03 · unverdicted · novelty 7.0

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems d...

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
cs.CL · 2026-05 · unverdicted · novelty 6.0

GSM-SEM generates reusable, stochastic semantic variants of math reasoning benchmarks that alter underlying facts but preserve answers, producing larger LLM performance drops than prior surface-level variants.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1] Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, and Yu Kong. Visual Large Language Models for Generalized and Specialized Applications. arXiv preprint arXiv:2501.02765, 2025.
[2] Xiangxiang Chu, Jianlin Su, Bo Zhang, and Chunhua Shen. Visionllama: A unified llama backbone for vision tasks. In European Conference on Computer Vision, pages 1–18. Springer, 2024.
[3] Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, et al. Apple intelligence foundation language models. arXiv preprint arXiv:2407.21075, 2024.
[4] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[5] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.
[6] Krithika Ramesh, Sunayana Sitaram, and Monojit Choudhury. Fairness in language models beyond English: Gaps and challenges. arXiv preprint arXiv:2302.12578, 2023.
[7] Aida Ramezani and Yang Xu. Knowledge of cultural moral norms in large language models, 2023. URL https://arxiv.org/abs/2306.01857
[8] Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change, 2023. URL https://arxiv.org/abs/2206.10498
[9] Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, and Joshua Susskind. When can transformers reason with abstract symbols?, 2024. URL https://arxiv.org/abs/2310.09753
[10] Tarek Naous, Michael J. Ryan, Alan Ritter, and Wei Xu. Having Beer after Prayer? Measuring Cultural Bias in Large Language Models, 2024. URL https://arxiv.org/abs/2305.14456
[11] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, 2024. URL https://arxiv.org/abs/2402.12875
[12] Binghui Peng, Srini Narayanan, and Christos Papadimitriou. On Limitations of the Transformer Architecture, 2024. URL https://arxiv.org/abs/2402.08164
[13, 14] Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor, and Dan Roth. A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners. URL https://arxiv.org/abs/2406.11050
[15] Zihao Li, Yuan Cao, Cheng Gao, Yihan He, Han Liu, Jason M. Klusowski, Jianqing Fan, and Mengdi Wang. One-Layer Transformer Provably Learns One-Nearest Neighbor In Context, 2024. URL https://arxiv.org/abs/2411.10830
[16] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. Large Language Models Can Be Easily Distracted by Irrelevant Context, 2023. URL https://arxiv.org/abs/2302.00093
[17] Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, and Sameer Singh. Impact of Pretraining Term Frequencies on Few-Shot Reasoning, 2022. URL https://arxiv.org/abs/2202.07206
[18] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[19] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR, 2023.
[20] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[21] Anthropic. Introducing claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet. Accessed: 2025-03-22.
[22] Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. Advances in neural information processing systems, 36:36963–36990, 2023.
[23] Thao Anh Dang, Limor Raviv, and Lukas Galke. Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5. arXiv preprint arXiv:2410.11627, 2024.
[24] Jingyi Meng and Simiao Liu. Effects of culture on the balance between mathematics achievement and subjective wellbeing. Frontiers in Psychology, 13:894774, 2022.
[25] Hidetsugu Tajika. Differences in Mathematical Problem-Solving Skills Between Japanese and American Children. PhD thesis, Aichi University of Education, 2004.
[26] S Banerjee, A Agarwal, and S Singla. LLMs Will Always Hallucinate, and We Need to Live with This. arXiv preprint arXiv:2409.05746, 2004.
[27] The mathematical logic of the questions remained unchanged
[28] There were no modifications to numerical values. During this process, we encounter a small inconsistency. While GPT-4o correctly identified cultural entities and verified them using our 5-shot prompt, it used inconsistent names for the same entity. For example, it correctly identified a Person name in each question and replaced it with a placeholder. But ...
