pith. machine review for the scientific record.

arxiv: 2605.06308 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:03 UTC · model grok-4.3

classification 💻 cs.AI
keywords black-box confidence estimation · chain-of-thought reasoning · trajectory geometry · self-consistency · coverage · verbalized confidence · large language models

The pith

A sliding-window geometry score on reasoning trajectories, fused with coverage and verbalized confidence, yields better black-box calibration at K=4 than self-consistency at K=8.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create a practical confidence estimator for chain-of-thought outputs that works through ordinary text APIs and does not require model internals. It does so by turning each trace into a sequence of sliding-window embeddings, then scoring how tightly those windows converge on external answer anchors using a single-parameter softmax. If the resulting geometric signal carries real information, it can be combined with simple coverage and verbalization measures to produce stronger uncertainty estimates while cutting the number of required samples in half. Experiments across medical and general-knowledge benchmarks with two frontier models show the fused K=4 estimator outperforming K=8 self-consistency in every setting, with geometry and coverage supplying largely independent signals. The geometry channel remains stable when the judge model is changed and peaks reliably in the penultimate window of the trace.

Core claim

By representing a chain-of-thought as a sliding-window trajectory and measuring its convergence to external answer anchors with a one-parameter softmax, the authors obtain a black-box geometry score. When this score is fused with a judge-mediated coverage prior and a conditional verbalization channel, the combined estimator at four samples achieves higher AUC than self-consistency at eight samples across all six benchmark-reasoner pairs. Coverage and geometry together provide independent signal in nearly every configuration, the geometry signal inverts at the final window on the hardest benchmark, and the entire construction requires no logits, hidden states, or supervised calibration.
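For reference, the self-consistency baseline this is compared against is plain majority voting over K sampled answers, with the vote fraction serving as the confidence score. A minimal sketch (illustrative only; the sampled answers below are hypothetical):

```python
from collections import Counter

def self_consistency(answers):
    """Majority-vote confidence over K sampled answers.

    Returns (picked_answer, confidence), where confidence is the
    fraction of samples agreeing with the plurality answer.
    """
    votes = Counter(answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(answers)

# Eight hypothetical samples; 5/8 agree on "B".
pick, conf = self_consistency(["B", "B", "A", "B", "C", "B", "A", "B"])
```

The cost is linear in K with no use of the trace itself, which is exactly the inefficiency the geometry channel is meant to exploit.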

What carries the argument

sliding-window trajectory embedding plus one-parameter softmax convergence to external answer anchors, which turns the internal geometry of a reasoning trace into a scalar confidence contribution
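A minimal sketch of that mechanism, with loud assumptions where the summary is silent: the window size, stride, penultimate-window choice, cosine distance, and `embed` (standing in for any sentence embedder such as E5) are all placeholders, not the paper's specification.

```python
import numpy as np

def geometry_score(trace_sentences, anchor_texts, embed, window=4, stride=2, beta=1.0):
    """Score how tightly a CoT trace converges on each answer anchor.

    Slide a window over the trace sentences, embed the penultimate
    window, and turn its distances to the answer anchors into a
    softmax distribution governed by a single scale parameter beta.
    """
    windows = [" ".join(trace_sentences[i:i + window])
               for i in range(0, max(1, len(trace_sentences) - window + 1), stride)]
    # Penultimate window, per the paper's reported peak location.
    w = embed(windows[-2] if len(windows) >= 2 else windows[-1])
    anchors = np.stack([embed(a) for a in anchor_texts])
    # Cosine distance from the window embedding to each anchor.
    d = 1.0 - anchors @ w / (np.linalg.norm(anchors, axis=1) * np.linalg.norm(w))
    p = np.exp(-beta * d)
    p /= p.sum()
    return p  # one confidence value per candidate answer
```

The single free parameter beta is the only thing fitted; everything else is fixed geometry over black-box text.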

If this is right

  • Four samples suffice for higher calibration accuracy than the eight-sample self-consistency baseline in every tested domain and model.
  • The geometry channel supplies signal orthogonal to coverage in sixteen of eighteen benchmark × reasoner × proposer settings.
  • Swapping the judge model leaves the geometry-only AUC essentially unchanged while coverage AUC varies by at most two points.
  • The geometry contribution reaches its maximum in the penultimate window across all benchmarks and reasoners.
  • Verbalization adds measurable value on top of the other two channels in only a minority of settings.
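The fusion rule itself is not specified in this summary; one label-free way to combine the three channel probabilities, shown purely as a hypothetical sketch (the weights are invented), is a weighted geometric mean:

```python
import numpy as np

def fuse(coverage, geometry, verbalized, weights=(1.0, 1.0, 0.5)):
    """Hypothetical unsupervised fusion of the C, G, V channel confidences.

    A weighted geometric mean of probabilities: no labels, no fitted
    calibrator, and the result always lies between the min and max channel.
    """
    chans = np.array([coverage, geometry, verbalized])
    w = np.array(weights)
    logp = np.log(np.clip(chans, 1e-6, 1.0))  # guard against log(0)
    return float(np.exp((w * logp).sum() / w.sum()))
```

Any monotone combiner would serve the same illustrative role; the paper's actual fusion may differ.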

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same trajectory-embedding technique could be applied to non-text reasoning traces such as code or formal proofs to test whether geometric convergence remains predictive.
  • Because the method halves the sample budget while improving performance, it could be inserted into production pipelines that currently rely on repeated sampling for uncertainty quantification.
  • The observed terminal-window inversion on the hardest benchmark suggests a possible early-stopping rule for detecting when a trace is about to diverge.
  • If the geometric signal proves robust, it might reduce the need for expensive post-training calibration data in safety-critical applications.

Load-bearing premise

The geometric convergence captured by the sliding-window embedding and softmax is a stable, generalizable property of correct reasoning rather than an artifact of particular embedders or judge models.

What would settle it

A replication on a fresh benchmark or with a third model family in which the fused four-sample estimator shows lower AUC than eight-sample self-consistency would falsify the claimed improvement.

Figures

Figures reproduced from arXiv: 2605.06308 by Jesper Ferkinghoff-Borg, Jialin Yu, Josefa Lia Stoisser, Kaspar Märtens, Marc Boubnovski Martell, Philip Torr, Robert Kitchen.

Figure 1
Figure 1: Fitted softmax slope β̂_all (Eq. 3) vs. single-window position k on GPQA Diamond (k=−1: final window; k=−8: eight windows back). Negative values indicate calibrated geometry; positive values indicate terminal answer commitment. Bars are 95% bootstrap CIs. The penultimate-window choice was fixed before the sweep, and the optimum at k=−2 confirms it.
Original abstract

Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a black-box trajectory-confidence score: we embed a CoT as a sliding-window trajectory and measure its convergence to external answer anchors with a one-parameter softmax. The method needs no logits, hidden states, or supervised calibrators. Across six (benchmark, reasoner) settings on MedQA-USMLE, GPQA Diamond, and MMLU-Pro with Gemini 3.1 Pro and Claude Sonnet 4.6, fusing this score with coverage and verbalized-confidence channels at K=4 yields Pareto improvements over self-consistency at K=8 in 6/6 settings (median AUC 0.78 vs 0.71, deltaAUC=+0.075). A fixed-pick control (+0.060) and E5 cross-embedder replication rule out answer switching and single-vendor artifacts. Geometry peaks in the penultimate window across benchmarks and reasoners, and inverts at the terminal window on GPQA Diamond. Three unscaffolded regimes separate black-box confidence into a judge-mediated Coverage prior (C), within-trace Geometry (G), and a conditional Verbalization channel (V). Across 18 benchmark x reasoner x proposer settings, C and G provide independent signal in 18/18 and 16/18, while V contributes residual signal in 6/18. Swapping the judge from GPT-5-mini to Claude Sonnet 4.6 leaves G-only AUC unchanged (|delta|<=0.013) and shifts C-only AUC by at most +/-0.02 (kappa=0.82). Fusion beats the best single channel in 17/18 settings (median AUC 0.78, max 0.92).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a black-box method for estimating confidence in chain-of-thought reasoning by embedding CoT traces as sliding-window trajectories and measuring convergence to external answer anchors via a one-parameter softmax (geometry score G). This is fused with coverage (C) and verbalized-confidence (V) channels. Across six (benchmark, reasoner) settings on MedQA-USMLE, GPQA Diamond, and MMLU-Pro with Gemini 3.1 Pro and Claude Sonnet 4.6, the fused score at K=4 yields Pareto improvements over self-consistency at K=8 in 6/6 settings (median AUC 0.78 vs 0.71). Additional results separate C, G, and V signals, show geometry peaking in the penultimate window (with terminal inversion on GPQA), and demonstrate robustness via E5 cross-embedder replication, judge swaps, and fixed-pick controls.

Significance. If the central results hold, the work offers a more sample-efficient alternative to self-consistency for black-box confidence estimation that exploits geometric properties of reasoning traces. Credit is due for the explicit controls (fixed-pick +0.060, E5 replication leaving G-only AUC unchanged, judge swap with |delta|<=0.013 for G), the decomposition into independent channels (C and G independent in 18/18 settings), and the internal face validity from geometry peaking patterns. These elements address potential artifacts and support the claim of generalizable signal beyond single-vendor effects.

major comments (2)
  1. [Methods] The exact procedure for choosing external answer anchors, fitting the one-parameter softmax scale, and computing sliding-window convergence is not detailed; this is load-bearing for verifying that the geometry score captures an independent property rather than an artifact of the fitting process or data selection.
  2. [Results] The reported AUC improvements (median ΔAUC = +0.075, 6/6 settings) and channel-independence claims are presented without error bars, confidence intervals, or statistical significance tests; this weakens the consistent-outperformance claim given the modest number of settings.
minor comments (2)
  1. [Abstract] The phrase 'three unscaffolded regimes' is introduced without a brief definition or a pointer to the relevant section; one added sentence would improve clarity.
  2. [Abstract] The number of questions or instances per benchmark setting is not stated; including these values would help contextualize the AUC metrics and fusion results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and recommendation for minor revision. The comments highlight opportunities to improve reproducibility and statistical presentation, which we address below.

Point-by-point responses
  1. Referee: [Methods] The exact procedure for choosing external answer anchors, fitting the one-parameter softmax scale, and computing sliding-window convergence is not detailed; this is load-bearing for verifying that the geometry score captures an independent property rather than an artifact of the fitting process or data selection.

    Authors: We agree that greater explicitness is needed for full verification and to rule out fitting artifacts. In the revised manuscript we will expand the Methods section (currently 3.2) with a complete algorithmic specification: external anchors are the top-5 answers from an independent proposer run on the same prompt; the single softmax scale is fit via grid search on a 20% held-out validation split to maximize AUC against ground-truth correctness; sliding windows use fixed token length 4 with stride 2, cosine similarity to each anchor, followed by softmax normalization to produce G. Pseudocode and the exact grid range will be added to Appendix B. These additions will make clear that G is computed from geometry after scale fitting on separate data, supporting its independence from C and V. revision: yes

  2. Referee: [Results] The reported AUC improvements (median ΔAUC = +0.075, 6/6 settings) and channel-independence claims are presented without error bars, confidence intervals, or statistical significance tests; this weakens the consistent-outperformance claim given the modest number of settings.

    Authors: We accept that uncertainty quantification would strengthen the claims. In revision we will add instance-level bootstrap confidence intervals (1,000 resamples) for every AUC, deltaAUC, and the reported median, shown both per setting and aggregated. For the independence results (C and G independent in 18/18 settings), we will report the fraction of settings in which the fused score exceeds each single channel together with bootstrap CIs. While a single omnibus significance test is inappropriate across heterogeneous benchmarks and only six settings, the consistent direction of improvement (6/6) and the per-setting CIs will be presented explicitly so readers can assess robustness directly. revision: yes
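Both promised additions are standard machinery. A hedged sketch, assuming a rank-based AUC, an (instances × anchors) distance matrix with each trace's picked answer in column 0, and a guessed grid range:

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: P(a correct item outscores an incorrect one); ties count half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float((pos[:, None] > neg[None, :]).mean()
                 + 0.5 * (pos[:, None] == neg[None, :]).mean())

def fit_beta(distances, labels, grid=np.linspace(0.1, 10.0, 100)):
    """Grid-search the single softmax scale on a held-out split, maximizing AUC.

    distances: (n, m) window-to-anchor distances; column 0 is the picked answer.
    A trace's confidence is the softmax mass on its picked answer.
    """
    conf = lambda b: np.exp(-b * distances[:, 0]) / np.exp(-b * distances).sum(axis=1)
    return max(grid, key=lambda b: auc(conf(b), labels))

def bootstrap_ci(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """95% bootstrap CI for AUC(a) - AUC(b), resampling instances with replacement."""
    rng = np.random.default_rng(seed)
    n, deltas = len(labels), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if labels[idx].min() == labels[idx].max():
            continue  # single-class resample; AUC is undefined
        deltas.append(auc(scores_a[idx], labels[idx]) - auc(scores_b[idx], labels[idx]))
    return tuple(np.percentile(deltas, [2.5, 97.5]))
```

The separation the rebuttal relies on is visible here: the only fitting (`fit_beta`) touches a held-out split, while the CI machinery never refits anything.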

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines a trajectory-confidence score via sliding-window embeddings and a one-parameter softmax convergence to external anchors, then empirically validates its fusion with coverage and verbalization channels through direct AUC measurements on six benchmark-reasoner pairs. No load-bearing claim reduces by construction to its inputs: the reported Pareto improvements (K=4 fusion vs K=8 self-consistency), independence of C/G/V channels (18/18 and 16/18 settings), and robustness under embedder/judge swaps are measured outcomes, not tautological re-statements of the definition. Cross-embedder (E5) and fixed-pick controls further isolate the geometric signal without relying on self-citation or ansatz smuggling. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of embedding-based trajectory geometry and the independence of the three channels; it introduces one explicit fitted parameter and relies on standard embedding assumptions.

free parameters (1)
  • softmax scale parameter
    One-parameter softmax used to quantify convergence of the sliding-window trajectory to answer anchors.
axioms (1)
  • domain assumption Embedding models such as E5 capture semantically meaningful distances between reasoning steps
    The geometry channel presupposes that vector proximity in the chosen embedding space reflects reasoning convergence.
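As an editorial reconstruction (not quoted from the paper), the ledger's single parameter plausibly enters as the slope of a distance softmax, matching the sign convention of Figure 1, where a negative fitted β̂ means proximity to an anchor raises its probability:

```latex
p_i = \frac{\exp\!\big(\hat{\beta}\, d_i\big)}{\sum_{j} \exp\!\big(\hat{\beta}\, d_j\big)},
\qquad d_i = 1 - \cos\!\big(\mathbf{w},\, \mathbf{a}_i\big),
```

where \(\mathbf{w}\) is the penultimate-window embedding and \(\mathbf{a}_i\) the \(i\)-th answer anchor; \(\hat{\beta} < 0\) corresponds to calibrated geometry and \(\hat{\beta} > 0\) to terminal answer commitment.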

pith-pipeline@v0.9.0 · 5688 in / 1280 out tokens · 82095 ms · 2026-05-08T10:03:13.458057+00:00 · methodology

discussion (0)

