
arxiv: 2605.31220 · v2 · pith:4H274LXBnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI · cs.LG

Shared Doubt: Zero-Shot Cross-Lingual Confidence Estimation for Language Models

Pith reviewed 2026-06-28 22:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords confidence estimation · zero-shot cross-lingual · language models · linear probe · intermediate representations · multilingual LLMs · answer correctness · open-ended QA

The pith

A linear probe trained on a single source language can predict answer reliability in other languages without retraining or target-language data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether confidence signals in large language models transfer across languages. It trains a linear probe on intermediate representations from one language to predict whether an answer is correct, then applies it zero-shot to other languages. This matters because most confidence methods either stay English-only or demand new training data per language. The probe transfers to typologically diverse languages, although the abstract notes that zero-shot performance depends on similarity to the source language. The predictive features concentrate in middle layers, which suggests a shared confidence subspace. The result is a practical baseline that needs no language-specific adaptation.

Core claim

Multilingual LLMs encode shared, language-transferable confidence features in their intermediate representations. A lightweight linear probe trained to predict answer correctness directly from these representations on one language generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and ablations show that confidence features concentrate in middle layers across languages, indicating a shared confidence subspace. The probe matches or exceeds other popular confidence estimation methods while requiring no retraining.

What carries the argument

Linear probe on intermediate hidden states that predicts answer correctness, with learned per-layer weights that highlight middle-layer contributions.
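
To make the mechanism concrete, here is a minimal sketch of a layer-weighted linear probe. It is not the authors' implementation: this page does not give the exact architecture, so the softmax mixing of layers, the plain gradient-descent training loop, the class name `LayerWeightedProbe`, and the synthetic "source" and "target" data (a shifted copy standing in for another language) are all assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class LayerWeightedProbe:
    """Hypothetical probe: logistic regression over a softmax-weighted mix of layers."""

    def __init__(self, num_layers, dim):
        self.alpha = [0.0] * num_layers  # layer logits, softmaxed into layer weights
        self.w = [0.0] * dim             # shared linear readout
        self.b = 0.0

    def layer_weights(self):
        return softmax(self.alpha)

    def predict_proba(self, hidden):
        # hidden: one vector of length dim per layer
        a = self.layer_weights()
        mix = sum(al * dot(self.w, h) for al, h in zip(a, hidden))
        return sigmoid(mix + self.b)

    def fit(self, X, y, lr=0.5, epochs=300):
        n, L, d = len(X), len(self.alpha), len(self.w)
        for _ in range(epochs):
            a = self.layer_weights()
            g_alpha, g_w, g_b = [0.0] * L, [0.0] * d, 0.0
            for hidden, label in zip(X, y):
                scores = [dot(self.w, h) for h in hidden]
                mix = sum(al * s for al, s in zip(a, scores))
                err = sigmoid(mix + self.b) - label  # d(cross-entropy)/d(logit)
                for k in range(L):
                    g_alpha[k] += err * a[k] * (scores[k] - mix)
                    for j in range(d):
                        g_w[j] += err * a[k] * hidden[k][j]
                g_b += err
            self.alpha = [v - lr * g / n for v, g in zip(self.alpha, g_alpha)]
            self.w = [v - lr * g / n for v, g in zip(self.w, g_w)]
            self.b -= lr * g_b / n

NUM_LAYERS, DIM, MID = 5, 4, 2

def make_split(rng, n, offset):
    # Synthetic stand-in: correctness is encoded only in dimension 0 of the middle layer.
    # `offset` shifts the other dimensions to mimic a language-specific direction.
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        hidden = [[rng.gauss(0.0, 0.3) + (offset if j > 0 else 0.0) for j in range(DIM)]
                  for _ in range(NUM_LAYERS)]
        hidden[MID][0] += 1.0 if label == 1 else -1.0
        X.append(hidden)
        y.append(label)
    return X, y

rng = random.Random(0)
X_src, y_src = make_split(rng, 200, offset=0.0)  # "source language"
X_tgt, y_tgt = make_split(rng, 200, offset=0.3)  # "target language", never trained on

probe = LayerWeightedProbe(NUM_LAYERS, DIM)
probe.fit(X_src, y_src)
weights = probe.layer_weights()
target_acc = sum((probe.predict_proba(h) >= 0.5) == (label == 1)
                 for h, label in zip(X_tgt, y_tgt)) / len(y_tgt)
print("layer weights:", [round(v, 3) for v in weights])
print("zero-shot accuracy on shifted split:", round(target_acc, 3))
```

In this toy setting the learned layer weights play the role the page attributes to them: they show which layer the probe relies on. With real models, the per-layer vectors would come from the LLM's hidden states at a chosen token position instead of `make_split`.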

If this is right

  • Confidence estimation becomes feasible for languages lacking labeled data or resources for retraining.
  • Deployment of LLMs in multilingual settings gains a lightweight, reusable reliability signal.
  • Model analysis can focus on middle layers to study where reliability information is represented.
  • Zero-shot transfer reduces the need for parallel confidence datasets across languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar probes might transfer other properties such as factuality or uncertainty beyond binary correctness.
  • The middle-layer concentration could be tested in monolingual models to check if the subspace is inherently multilingual.
  • Performance dependence on language similarity suggests pairing source and target languages by typological distance for best results.
  • The method could be extended to generation tasks other than open-ended QA without new supervision.

Load-bearing premise

Multilingual models contain confidence features in intermediate layers that remain consistent enough across languages for a probe trained in one language to extract them in others without adaptation.

What would settle it

The probe achieving near-chance accuracy on a language family distant from the training language, or ablations showing no consistent middle-layer concentration of predictive features across models.
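
The figure captions on this page report AUROC and ECE, so "near-chance" can be read against those metrics: an AUROC of 0.5 means the confidence scores do not rank correct answers above incorrect ones. The sketch below shows one common way to compute both quantities. The function names, the pairwise AUROC formulation, and the equal-width binning for ECE are assumptions of this example, not details taken from the paper.

```python
def auroc(scores, labels):
    """Probability that a correct answer gets a higher score than an incorrect one.

    Ties count as half a win. Quadratic in the number of samples, which is
    acceptable for a sketch.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def expected_calibration_error(confidences, labels, num_bins=10):
    """Weighted mean gap between average confidence and accuracy per equal-width bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, l in zip(confidences, labels):
        idx = min(int(c * num_bins), num_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, l))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(l for _, l in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Perfect ranking versus an uninformative scorer.
perfect = auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
chance = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
# Overconfident scorer: claims 0.9 but is right half of the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
print(perfect, chance, overconfident)
```

A falsification test in the spirit of this section would compare the probe's per-language AUROC against 0.5, ideally with an uncertainty estimate over seeds.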

Figures

Figures reproduced from arXiv: 2605.31220 by Athina Kyriakou, Dennis Ulmer, Ivan Titov.

Figure 1: (Left) PCA scatter plots of (in-)correct sam…
Figure 2: Organization of latent representations in Qwen 3 8B on Global-MMLU, using the last input token.
Figure 3: Cross-lingual generalization of the probe trained on French (fr*) for Qwen 3 8B. Each panel compares the…
Figure 4: Spearman's ρ between layer importance and per-layer ablation impact for last token answer for Qwen 3 8B. (Top): weight-based; (Bottom): representation-based. (a) Layer weights for last query token. (b) Layer weights for last answer token.
Figure 5: Distribution of learned layer weights from probes trained with different random seeds on the hidden…
Figure 6: Impact of layer ablation on zero-shot AUROC and ECE (avg. over all languages) for Qwen3 8B. Bars…
Figure 7: Ablation effect of layers on (a) AUROC and (b) ECE by few-shot (source language) and zero-shot results…
Figure 8: Cross-lingual generalization of the probe trained on French (fr*) for Llama 3.1 8B. Each panel compares…
Figure 9: Distribution of learned layer weights from probes trained with different random seeds on the hidden states…
Figure 10: Impact of layer ablation on zero-shot AUROC and ECE (avg. over all languages) for Qwen3 8B. Bars…
Figure 11: Impact of layer ablation on zero-shot AUROC and ECE (avg. over all languages) for Llama 3.1 8B. Bars…
Figure 12: Comparison of the cross-lingual to the uniform probe for Qwen 3 8B (both probes trained in fr).
Figure 13: Comparison of the cross-lingual to the uniform probe for Llama 3.1 8B (both probes trained in fr).
Figure 14: System prompt used by the LLM-as-a-judge to assess stem answerability on a 1-10 scale.
Figure 15: All system prompts used for evaluation, grouped by type (P…
Figure 16: System prompt for LLM judge used to assess answer correctness. All prompts cover 6 languages:…
Figure 17: Prompts used for P(True) (Kadavath et al., 2022). All prompts cover 6 languages: English (en), French (fr), Spanish (es), Polish (pl), Russian (ru), and Japanese (ja).
Figure 18: Prompts to ask for verbalized confidence. All prompts cover 6 languages: English (en), French (fr),…
Original abstract

Confidence estimation (CE), i.e., quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features in open-ended question answering. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that multilingual LLMs encode shared, language-transferable confidence features in intermediate representations for open-ended QA. A lightweight linear probe trained monolingually on one language's representations predicts answer correctness and generalizes zero-shot to typologically diverse unseen languages without any target-language supervision or adaptation. Ablations and learned layer weights indicate that these confidence features concentrate in middle layers across languages, forming a shared subspace; zero-shot performance varies with source-target similarity but the probe serves as a strong baseline compared to other CE methods.

Significance. If the empirical results hold, the work provides a simple, parameter-light method for cross-lingual confidence estimation that avoids retraining or target supervision, directly addressing the multilingual gap in CE research. The layer-concentration finding and ablations offer evidence for a shared confidence subspace in multilingual models, which could inform future probing and interpretability studies. The zero-shot generalization across diverse languages is a notable strength if supported by the full experimental details.

minor comments (3)
  1. §4 (Experiments): clarify the exact set of languages used for training vs. zero-shot testing and report the typological diversity metrics (e.g., language family, script, resource level) to allow readers to assess the strength of the generalization claim.
  2. Table 2 or equivalent results table: include standard deviations or confidence intervals over multiple runs/seeds for the zero-shot accuracies, as single-point estimates make it hard to judge whether the probe reliably outperforms the compared CE baselines.
  3. §5.2 (Layer analysis): the statement that features 'concentrate in middle layers' would benefit from a quantitative definition (e.g., threshold on normalized weights or entropy of the weight distribution) rather than relying solely on visualization.
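
One way to act on the third comment is sketched below: summarize a vector of learned layer weights by its normalized entropy and by the share of weight falling in the middle third of the layers. Both function names, the "middle third" definition, and the example weight vectors are hypothetical choices for illustration; the paper may define concentration differently.

```python
import math

def normalized_entropy(weights):
    """Shannon entropy of the normalized weights, scaled so that uniform weights give 1."""
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

def middle_third_mass(weights):
    """Fraction of total weight assigned to the middle third of the layers."""
    n = len(weights)
    lo, hi = n // 3, n - n // 3  # integer bounds avoid floating-point rounding
    return sum(weights[lo:hi]) / sum(weights)

# Illustrative weight vectors for a 6-layer model (not taken from the paper).
uniform = [1.0] * 6
peaked = [0.02, 0.03, 0.45, 0.40, 0.05, 0.05]

for name, w in [("uniform", uniform), ("peaked", peaked)]:
    print(name, round(normalized_entropy(w), 3), round(middle_third_mass(w), 3))
```

Under these definitions, a statement such as "features concentrate in middle layers" could be reported as a middle-third mass well above the uniform value of roughly one third, together with a normalized entropy clearly below 1, each averaged over seeds.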

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the work, the detailed summary, and the recommendation for minor revision. No specific major comments appear in the report, so we have no points requiring point-by-point response or manuscript changes at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical study that trains a linear probe on intermediate LLM representations in one language and evaluates zero-shot transfer to other languages. The abstract and summary describe experimental results, layer-weight analysis, and ablations without any equations, derivations, or self-citations that reduce the reported generalization to a fitted quantity by construction. The central claim rests on observed performance differences across languages and layers rather than on any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of language-agnostic confidence signals in LLM activations; the abstract provides no theoretical derivation or external validation for this assumption beyond the empirical probe results.

free parameters (1)
  • linear probe weights
    Weights are fitted on data from a single source language to predict correctness; their values are not reported in the abstract.
axioms (1)
  • domain assumption Intermediate representations of multilingual LLMs contain confidence features that are transferable across languages
    Invoked to justify zero-shot generalization of the probe without target-language supervision.


