Language Models are Multilingual Chain-of-Thought Reasoners

32 Pith papers cite this work. Polarity classification is still indexing.

abstract

We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment. The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp.

citation-role summary

dataset: 3 · background: 1

representative citing papers

Rethinking the Multilingual Reasoning Gap with Layer Swap

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Fine-tuning on matched native and English-pivoted multilingual reasoning datasets across six languages reduces the native reasoning gap to 1.9-3.5%; layer swap of English mid-layers largely closes the remaining gap while preserving target-language CoT.

Multilingual Safety Alignment via Self-Distillation

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

x1: Learning to Think Adaptively Across Languages and Cultures

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

x1 models adaptively select an advantageous language for reasoning per instance, yielding gains on multilingual math and cultural tasks while showing that scaling does not erase culture-language advantages.

Sensitivity-Positional Co-Localization in GQA Transformers

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU, GPQA, HumanEval+, MATH, MGSM and ARC.

Emergent Abilities of Large Language Models

cs.CL · 2022-06-15 · unverdicted · novelty 6.0

Emergent abilities are capabilities that are present in large language models but absent in smaller ones, and that cannot be predicted by extrapolating the performance of smaller models.
