arxiv: 2601.02996 · v2 · submitted 2026-01-06 · 💻 cs.CL

Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

Pith reviewed 2026-05-16 17:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords: latent reasoning, multilingual reasoning, chain-of-thought, large reasoning models, representational analysis, cross-lingual consistency, English-centered pathway, truncation strategy

The pith

Large reasoning models form predictions internally in a way that aligns with English even when reasoning in other languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large reasoning models perform latent reasoning in languages other than English by tracking when the correct answer appears in hidden states as partial chain-of-thought traces are provided. It finds clear signs of latent reasoning across 11 languages, but the strength varies with language resources and task difficulty. Representational analyses show that the step-by-step evolution of these internal predictions stays highly consistent across languages and tracks the English pattern closely. A reader would care because this points to a shared internal mechanism rather than fully independent language-specific reasoning systems, which could explain why performance gaps persist in multilingual settings.

Core claim

Using a truncation-based strategy that supplies only partial reasoning traces, the authors observe that the correct answer frequently emerges in hidden states before the full textual chain-of-thought is completed. This occurs across the 11 languages studied, though more reliably in high-resource languages and on easier benchmarks. Representational similarity measures further show that the trajectory of internal predictions evolves in a highly consistent manner across languages and aligns closely with the English trajectory, indicating an English-centered latent reasoning pathway.

What carries the argument

Truncation-based probing that measures the stepwise emergence of correct answers from hidden states during partial chain-of-thought generation, paired with representational analyses of prediction evolution across languages.
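
The probing idea can be illustrated with a minimal sketch. It assumes a trace is already split into a list of step strings and that the gold answer is a plain number string; `predict` is a hypothetical stand-in for forcing the model to answer from a partial trace, and none of these names come from the paper. High accuracy paired with a low gold-in-trace rate at the same truncation ratio is the signal the paper reads as latent reasoning.

```python
import re


def truncate_trace(steps, ratio):
    """Keep the first `ratio` fraction of reasoning steps (rounded down)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return steps[: int(len(steps) * ratio)]


def gold_in_trace(steps, gold):
    """True if the gold answer appears as a standalone number in the visible steps."""
    # Reject matches embedded in a longer number such as 180, 1.18 or 18.5.
    pattern = re.compile(r"(?<![\d.])" + re.escape(gold) + r"(?!\.?\d)")
    return any(pattern.search(step) for step in steps)


def latent_signal(examples, ratio, predict):
    """Return (accuracy, gold-in-trace rate) at one truncation ratio.

    `predict(question, partial_steps)` is a hypothetical callable that
    returns the model's answer string given only the partial trace.
    """
    correct = leaked = 0
    for ex in examples:
        partial = truncate_trace(ex["steps"], ratio)
        correct += int(predict(ex["question"], partial) == ex["gold"])
        leaked += int(gold_in_trace(partial, ex["gold"]))
    n = len(examples)
    return correct / n, leaked / n
```

Sweeping `ratio` from 0 to 1 and plotting both returned values would give curves of the kind described in the figure captions, under the assumptions stated above.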

If this is right

  • Latent reasoning exists in multiple languages but emerges more reliably in resource-rich ones.
  • The internal mechanism is shared and English-aligned rather than built from separate language-specific processes.
  • Harder benchmarks exhibit weaker latent reasoning, implying it scales with task complexity.
  • Gaps in low-resource performance stem from weaker expression of the shared pathway rather than fundamentally different internal computations.
  • Strengthening multilingual reasoning may require targeted improvements to the consistency of this English-centered latent process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Truly language-agnostic reasoning may require new training objectives that reduce reliance on English-dominated internal states.
  • Forcing more explicit non-English reasoning steps could potentially shift or diversify the latent pathway.
  • If the English alignment persists across larger model scales, it may reflect a structural outcome of current pretraining distributions.
  • Testing the same truncation method on additional languages or entirely different model families would clarify how general the centering pattern is.

Load-bearing premise

The truncation-based strategy accurately isolates latent reasoning formation without being confounded by language-specific tokenization differences or training data imbalances across the 11 languages.

What would settle it

An experiment that controls for tokenization length and shows that internal prediction trajectories still diverge markedly across languages, or a result where low-resource languages exhibit no measurable alignment with English hidden-state patterns on the same tasks.
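
The comparison at the heart of such a test can be sketched as follows. This is an illustrative outline only: it assumes each language's trajectory is a list of equally long hidden-state vectors already aligned step by step, and the function names and the `"en"` reference key are hypothetical. It does not itself control for tokenization length; that alignment would have to happen before the trajectories are passed in.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def trajectory_similarity(traj_a, traj_b):
    """Mean cosine similarity over aligned steps of two hidden-state trajectories."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of steps")
    return sum(cosine(a, b) for a, b in zip(traj_a, traj_b)) / len(traj_a)


def english_vs_others(trajs, lang, ref="en"):
    """Return (similarity to the reference, mean similarity to all other languages)."""
    others = [name for name in trajs if name not in (lang, ref)]
    if not others:
        raise ValueError("need at least one language besides lang and ref")
    to_ref = trajectory_similarity(trajs[lang], trajs[ref])
    to_others = sum(
        trajectory_similarity(trajs[lang], trajs[o]) for o in others
    ) / len(others)
    return to_ref, to_others
```

If the first value stays clearly above the second for most languages after length-controlled alignment, that would be consistent with the English-centered reading; the reverse pattern would count against it.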

Figures

Figures reproduced from arXiv: 2601.02996 by Hinrich Schütze, Michael A. Hedderich, Raoyuan Zhao, Yihong Liu.

Figure 1: Pass@k accuracy (k = 1, 5, 10) and gold-in-trace rate under reasoning-trace truncation for R1-Qwen-32B. High accuracy with a low gold-in-trace rate indicates latent reasoning. The model shows strong evidence of latent reasoning in high-resource languages (e.g., English) on MGSM, but it is less detectable on Multilingual AIME.
Figure 2: Causal decomposition of newly correct pre…
Figure 3: Layer-wise rank of the gold answer obtained via logit lens across languages on MGSM (left three panels)…
Figure 4: Aggregated cosine similarity between hidden states in each language and English (reference), averaged…
Figure 5: Comparison of cosine similarity with English…
Figure 6: Pass@k accuracy (k = 1, 5, 10) and gold-in-trace rate under reasoning-trace truncation for R1-Qwen-7B on MGSM. The model shows stronger latent reasoning in high-resource languages (e.g., English).
Figure 7: Pass@k accuracy (k = 1, 5, 10) and gold-in-trace rate under reasoning-trace truncation for R1-Qwen-14B on MGSM. The model shows stronger latent reasoning in high-resource languages (e.g., English).
Figure 8: Pass@k accuracy (k = 1, 5, 10) and gold-in-trace rate under reasoning-trace truncation for R1-Qwen-32B on MGSM. The model shows stronger latent reasoning in high-resource languages (e.g., English).
Figure 9: Pass@k accuracy (k = 1, 5, 10) and gold-in-trace rate under reasoning-trace truncation for R1-Qwen-7B on Multilingual AIME. Latent reasoning is less pronounced compared to MGSM.
Figure 10: Pass@k accuracy (k = 1, 5, 10) and gold-in-trace rate under reasoning-trace truncation for R1-Qwen-14B on Multilingual AIME. Latent reasoning is less pronounced compared to MGSM.
Figure 11: Pass@k accuracy (k = 1, 5, 10) and gold-in-trace rate under reasoning-trace truncation for R1-Qwen-32B on Multilingual AIME. Latent reasoning is less pronounced compared to MGSM.
Figure 12: Causal decomposition of newly correct predictions across truncation intervals on…
Figure 13: Causal decomposition of newly correct predictions across truncation intervals on…
Figure 14: Causal decomposition of newly correct predictions across truncation intervals on…
Figure 15: Causal decomposition of newly correct predictions across truncation intervals on…
Figure 16: Causal decomposition of newly correct predictions across truncation intervals on…
Figure 17: Causal decomposition of newly correct predictions across truncation intervals on…
Figure 18: Aggregated cosine similarity on MGSM between hidden states in each language and English (reference), averaged over both reasoning steps and layers. High-resource languages show consistently higher similarity to English, suggesting convergence toward an English-centered latent reasoning pathway.
Figure 19: Aggregated cosine similarity on Multilingual AIME between hidden states in each language and English (reference), averaged over both reasoning steps and layers. High-resource languages show consistently higher similarity to English, suggesting convergence toward an English-centered latent reasoning pathway.
Figure 20: Comparison of cosine similarity with English versus average similarity with other languages, shown…
Figure 21: Comparison of cosine similarity with English versus average similarity with other languages, shown…
Figure 22: Comparison of cosine similarity with English versus average similarity with other languages, shown…
Figure 23: Prompt used to generate meaning-preserving paraphrases using…
Figure 24: Prompt used to evaluate the solvability of original and counterfactual MGSM questions using…
Figure 25: Language-specific prompt templates (containing the explicit language instruction) used for controlling…
Figure 26: Language-specific prompt-hacking prefixes…
Figure 27: Language-specific answer-elicitation pre…
Original abstract

Large reasoning models (LRMs) achieve strong performance on mathematical reasoning tasks, often attributed to their capability to generate explicit chain-of-thought (CoT) explanations. However, recent work shows that LRMs often arrive at the correct answer before completing these textual reasoning steps, indicating the presence of latent reasoning -- internal, non-verbal computation encoded in hidden states. While this phenomenon has been explored in English, its multilingual behavior remains largely unknown. In this paper, we conduct a systematic investigation of multilingual latent reasoning in LRMs across 11 languages. Using a truncation-based strategy, we examine how the correct answer emerges as the model is given only partial reasoning traces, allowing us to measure stepwise latent prediction formation. Our results reveal clear evidence of multilingual latent reasoning, though unevenly: strong in resource-rich languages, weaker in low-resource ones, and broadly less observable on harder benchmarks. To understand whether these differences reflect distinct internal mechanisms, we further perform representational analyses. Despite surface-level disparities, we find that the internal evolution of predictions is highly consistent across languages and broadly aligns with English -- a pattern suggesting an English-centered latent reasoning pathway.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that large reasoning models (LRMs) exhibit multilingual latent reasoning across 11 languages. Using a truncation-based strategy on partial chain-of-thought traces, it measures the emergence of correct answers in hidden states and finds stronger evidence in resource-rich languages, weaker in low-resource ones, and reduced observability on harder benchmarks. Representational analyses indicate that internal prediction evolution is highly consistent across languages and aligns with English, suggesting an English-centered latent reasoning pathway despite surface-level disparities.

Significance. If the central findings hold after addressing alignment issues, the work would be significant for extending latent reasoning research beyond English and highlighting potential language biases in internal model computations. The truncation approach combined with representational similarity analyses offers a useful empirical probe for non-verbal reasoning processes, with implications for multilingual model design and evaluation.

major comments (1)
  1. [Truncation-based strategy (methods and results sections)] The approach of truncating at fixed reasoning-step counts does not account for large cross-lingual differences in subword tokenization efficiency. The same nominal step index therefore corresponds to unequal amounts of semantic content and hidden-state computation (e.g., English vs. low-resource languages), which risks confounding the reported consistency of internal prediction evolution and the claim of an English-centered pathway.
minor comments (1)
  1. [Abstract and experimental description] No details are provided on statistical tests, controls for tokenization effects or training-data imbalances, or confidence intervals around the consistency measures, making it difficult to assess the robustness of the cross-language alignment claims.
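
One common way to supply the missing intervals is a percentile bootstrap over per-example scores. The sketch below is a generic illustration, not a procedure taken from the paper; `values` is assumed to hold one consistency score per example (for instance a per-problem cosine similarity with English), and the default resample count and seed are arbitrary choices.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Reporting such an interval per language would let a reader judge whether a gap between, say, a high-resource and a low-resource language exceeds resampling noise.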

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on our truncation-based strategy. We address the concern point by point below and outline the revisions we will make.

  1. Referee: [Truncation-based strategy (methods and results sections)] The approach of truncating at fixed reasoning-step counts does not account for large cross-lingual differences in subword tokenization efficiency. The same nominal step index therefore corresponds to unequal amounts of semantic content and hidden-state computation (e.g., English vs. low-resource languages), which risks confounding the reported consistency of internal prediction evolution and the claim of an English-centered pathway.

    Authors: We appreciate the referee highlighting this methodological consideration. Our truncation is performed at the boundaries of the discrete reasoning steps explicitly generated in the model's chain-of-thought output; these steps are defined by the model's own segmentation of the reasoning process rather than by token count. We acknowledge that subword tokenization efficiency differs across languages, so the same step index may involve varying numbers of tokens and hidden-state updates. Even so, we still observe highly consistent internal prediction evolution across languages (including alignment with English), which suggests that the latent reasoning dynamics are not primarily driven by these surface-level tokenization disparities. To address the potential confound directly, we will add a supplementary experiment in the revision that re-runs the truncation analysis using token-count normalization (i.e., truncating at equivalent cumulative token positions across languages) and compares the resulting consistency metrics to the original step-based results. This will allow readers to assess whether the English-centered pathway claim holds under token-equivalent conditions.

    revision: yes
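
The difference between the two truncation schemes can be made concrete with a small sketch. It reflects one possible reading of "token-count normalization" (cutting at the same fraction of total tokens); the function names are hypothetical, and whitespace splitting stands in for a real subword tokenizer purely for illustration.

```python
def tokens_at_step_cut(steps, k, tokenize=str.split):
    """Number of tokens the model sees when the trace is cut after k steps."""
    return sum(len(tokenize(step)) for step in steps[:k])


def truncate_at_token_fraction(steps, fraction, tokenize=str.split):
    """Cut a trace at a fraction of its total token count, ignoring step boundaries.

    `tokenize` is a placeholder; a real experiment would pass the model's
    own subword tokenizer instead of whitespace splitting.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    tokens = [tok for step in steps for tok in tokenize(step)]
    return tokens[: int(len(tokens) * fraction)]
```

Two traces with the same number of steps can expose very different token counts at the same step index, which is the referee's concern; cutting at a shared token fraction removes that particular mismatch, at the cost of sometimes ending mid-step.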

Circularity Check

0 steps flagged

No significant circularity: empirical measurements of latent reasoning formation

The paper reports direct empirical measurements via truncation of partial CoT traces and subsequent representational analyses across languages. No equations, fitted parameters, or self-citations are invoked as load-bearing premises that reduce the central claims to tautologies or inputs by construction. The truncation strategy and consistency findings are presented as observable outcomes that could be falsified by the data; they do not redefine or presuppose the reported patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; full paper would likely list additional assumptions about model internals and language sampling.

axioms (1)
  • domain assumption: Truncation of reasoning traces reveals the timing of latent answer formation without introducing artifacts from incomplete context.
    Central to the measurement strategy described in the abstract.

pith-pipeline@v0.9.0 · 5507 in / 1117 out tokens · 26631 ms · 2026-05-16T17:20:47.852763+00:00 · methodology


