
arxiv: 2606.22179 · v1 · pith:SNDYARQInew · submitted 2026-06-20 · 💻 cs.CL

The Score Granularity Gap in Black-Box LLM Classification: A Comparative Study of Confidence Constructions

Pith reviewed 2026-06-26 11:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM confidence · selective prediction · score granularity · verbalized confidence · black-box classification · confidence calibration · multi-query aggregation · threshold resolution

The pith

Single-shot verbalized confidence from LLMs ranks cases well but supplies only a handful of distinct values for setting risk thresholds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the resolution at which confidence scores from black-box LLMs can be thresholded for selective prediction, where uncertain cases are routed to humans. It compares seven constructions of confidence scores, including verbalized numbers, token probabilities, and multi-query aggregations, across 25 model-dataset pairs. The central finding is that converting a single verbalized confidence to a class probability yields surprisingly strong ranking of examples yet restricts the operator to only a few coarse thresholds regardless of ranking quality. Multi-query methods widen the set of usable thresholds for weaker models but can narrow effective ranking for stronger ones, producing concrete deployment trade-offs between granularity, cost, and performance.

Core claim

The score granularity gap measures how many distinct thresholdable values a confidence construction actually supplies once mapped to class probabilities. Single-shot verbalized confidence, after correct conversion, ranks cases effectively across the tested setups but collapses to only a handful of distinct values. Token-probability constructions and multi-query aggregations produce more distinct values at different inference costs, with the latter improving weak models while sometimes harming already-strong ones. These patterns hold across the nine LLMs and three benchmarks examined.

What carries the argument

The score granularity gap: the effective number of distinct, usable threshold points a confidence score supplies after conversion to class probabilities.
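
As a rough illustration of the quantity described above, granularity can be read as the count of distinct score values a construction emits on an evaluation set. The sketch below is an assumption about how one might measure it, not the paper's code; the function name `granularity`, the rounding tolerance, and the toy score lists are all hypothetical.

```python
from typing import Iterable


def granularity(scores: Iterable[float], decimals: int = 6) -> int:
    """Count distinct score values, rounding first to absorb float noise.

    `decimals` is an illustrative tolerance, not a value taken from the paper.
    """
    return len({round(float(s), decimals) for s in scores})


# Toy data (made up): a verbalized score that reuses a few round numbers,
# and an aggregated score that takes a different value on every example.
verb = [0.95, 0.95, 0.9, 0.8, 0.95, 0.05, 0.9]
agg = [0.912, 0.874, 0.661, 0.903, 0.154, 0.497, 0.730]

print(granularity(verb), granularity(agg))
```

On the toy lists the verbalized score offers four thresholdable values and the aggregated score offers seven, which mirrors the pattern the paper reports without reproducing any of its numbers.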

If this is right

  • Multi-query aggregation widens granularity for weaker models but can reduce ranking quality for stronger models.
  • Token-probability scores supply finer threshold resolution than single verbalized numbers.
  • Operators must trade off the number of available thresholds against ranking performance and inference cost when choosing a construction.
  • Proper conversion of verbalized confidence to probabilities is required to realize its ranking strength despite low granularity.
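
The last point can be made concrete with a small sketch. The page does not give the paper's exact conversion rule, so the following is only one plausible reading for a binary task: the model states confidence in its own predicted label, and that number is flipped when the predicted label is the negative class. The function name and the `positive_label` parameter are hypothetical.

```python
def verbalized_to_class_prob(
    predicted_label: str,
    confidence: float,
    positive_label: str = "yes",
) -> float:
    """Map (predicted label, stated confidence) to P(positive class).

    Assumes a binary task and a confidence in [0, 1] that refers to the
    predicted label, not to the positive class.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if predicted_label.strip().lower() == positive_label:
        return confidence
    # The model is confident in the negative class, so flip the score.
    return 1.0 - confidence


print(verbalized_to_class_prob("Yes", 0.9))  # high probability of the positive class
print(verbalized_to_class_prob("No", 0.9))   # low probability of the positive class
```

Skipping the flip would score a confident "No" as a confident "Yes", which would damage ranking without changing the number of distinct values.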

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that combine verbalized scores with token probabilities could achieve both strong ranking and finer control without always incurring multi-query cost.
  • The handful-of-values pattern may limit selective prediction on tasks requiring many risk levels even if ranking remains good.
  • Deployment pipelines could pre-compute the distinct values for each construction to decide thresholds in advance rather than assuming continuous scores.
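
The pre-computation idea in the last bullet might look like the sketch below: enumerate every distinct confidence value as a candidate threshold and record the coverage and risk it yields. This is an illustrative helper under stated assumptions, not a procedure from the paper; `operating_points` and the toy data are hypothetical, and `confidence` here means confidence in the predicted label.

```python
def operating_points(confidence, correct):
    """Return (threshold, coverage, risk) for each distinct confidence value.

    An example is accepted (automated) when its confidence is >= threshold.
    coverage = fraction of examples accepted
    risk     = error rate among accepted examples
    """
    pairs = list(zip(confidence, correct))
    points = []
    for t in sorted(set(confidence), reverse=True):
        accepted = [ok for c, ok in pairs if c >= t]
        coverage = len(accepted) / len(pairs)
        risk = 1.0 - sum(accepted) / len(accepted)
        points.append((t, coverage, risk))
    return points


# Toy data (made up): eight examples, four distinct confidence values.
conf = [0.95, 0.95, 0.9, 0.9, 0.8, 0.8, 0.6, 0.6]
hit = [1, 1, 1, 0, 1, 0, 0, 1]

for t, cov, risk in operating_points(conf, hit):
    print(f"threshold={t:.2f} coverage={cov:.2f} risk={risk:.3f}")
```

With four distinct values there are exactly four rows, so an operator who wants a risk level between two rows has no threshold that delivers it.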

Load-bearing premise

The 25 model-dataset pairs and seven chosen confidence constructions represent the space of black-box LLM classification deployments that use selective prediction.

What would settle it

An experiment on additional LLMs or benchmarks in which single-shot verbalized confidence produces many more than a handful of distinct values after conversion to probabilities would falsify the claimed coarseness.

Figures

Figures reproduced from arXiv: 2606.22179 by Ao Sun, Jiaxing Geng, Tian Sun.

Figure 1
The confidence constructions we compare. Single-shot (top, 1 call): the model’s verbalized confidence is converted to a class probability (Verb), or first-token Yes/No log-probabilities are used (logprob). Multi-signal (bottom, m=10 calls): the model answers m interpretable sub-tasks or m paraphrases of the task; the m scalar scores form a feature vector that is aggregated (simple average or learned logist…
Figure 2
The score granularity gap (MNLI, GPT-4o-mini). (a) Verb produces only 4 distinct score values despite strong ranking (PR-AUC .963). (b) Sub-cal produces 174 distinct values at comparable ranking (.966). (c) Verb’s coarse values yield a staircase risk–coverage curve with only 4 usable operating points; Sub-cal yields a near-continuous one. (d) Across all 25 pairs, Verb scores cluster at low granularity (1–…
Figure 3
Verb accuracy vs. PR-AUC gain from aggregation (n=22 pairs; 3 GPT-5-nano pairs excluded, G(Verb)=1). Gains are negatively correlated with model strength: aggregation helps weak models and can hurt strong ones. Colors: BoolQ (red), MNLI (blue), PubMedQA (green). “aggregation improves confidence”: aggregation reliably buys more operating points (finer control), and it improves ranking only for weak models, …
Figure 4
Risk–coverage curves for MNLI (left) and BoolQ (right). Verbalized confidence (dashed, staircase-shaped) vs. calibrated sub-task aggregation (solid, smooth). Labels recovered from the extracted plot: panels Llama-70B, Gemini-Flash, Claude-Haiku; axes Predicted and Actual; series Verb and Sub-cal.
Figure 5
Reliability diagrams for MNLI (left) and BoolQ (right). Calibrated sub-task aggregation generally tracks the diagonal more closely than verbalized confidence.
Figure 6
Aggregation comparison (average vs. logistic regression × sub-task vs. multi-prompt). On MNLI, simple averaging collapses to near chance due to polarity conflicts among sub-task signals. Labels recovered from the extracted plot: features semantic_entail, quantifier_mismatch, constituent, relative_clause, conjunction_present, lexical_overlap, subsequence, subject_swap, negation_flip, passive_active; panel Llama-70B (PR-AUC=0.954) …
Figure 7
Leave-one-out ablation on MNLI: semantic_entail is the most impactful single feature for most strong models.
Figure 8
ROC curves for MNLI (left) and BoolQ (right) across confidence constructions.
Original abstract

Large language models (LLMs) are increasingly deployed as black-box classifiers in pipelines that automate confident decisions and route uncertain ones to human review. Such selective prediction needs a confidence score that an operator can threshold at a chosen risk level. Prior work asks whether LLM confidence is well calibrated or well ranked; we ask a complementary, deployment-oriented question that has been largely overlooked: at what resolution can the score be thresholded? We call the answer the score granularity gap. Through a controlled comparison of seven ways to build a confidence score, from a single verbalized number, to token probabilities, to querying the model many times and combining the answers, across 25 model-dataset pairs (9 LLMs, 3 benchmarks), we find that single-shot verbalized confidence, once correctly converted to a class probability, ranks cases surprisingly well, yet takes only a handful of distinct values. It therefore offers an operator only a few coarse thresholds, no matter how well it ranks. We show which constructions widen this gap, at what inference cost, and with what effect on ranking, notably that multi-query aggregation helps weak models but can degrade already-strong ones. We translate these trade-offs into concrete deployment guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a controlled empirical comparison of seven confidence constructions (single-shot verbalized, token probabilities, multi-query aggregation, etc.) for black-box LLM classification on 25 model-dataset pairs (9 LLMs, 3 benchmarks). It claims that single-shot verbalized confidence, after conversion to class probability, ranks cases well yet supplies only a handful of distinct values and thus only coarse thresholds for selective prediction; multi-query methods widen granularity at varying inference cost and can improve weak models while degrading strong ones.

Significance. If the granularity-gap finding is robust, the work supplies a deployment-oriented insight that complements calibration and ranking studies, with direct measurements across multiple constructions and a translation into concrete guidance on cost-granularity-ranking trade-offs. The purely empirical design with no fitted parameters or self-referential predictions is a methodological strength.

major comments (2)
  1. [Experimental setup] Experimental setup (likely §3 or §4): no selection criteria, ablation on model scale/family, or sensitivity analysis are provided for the 9 LLMs and 3 benchmarks. The central claim that verbalized confidence yields only a handful of distinct values (and therefore coarse thresholds) is load-bearing on these setups being representative; without such checks the observed gap could be an artifact of the chosen pool rather than a general property of black-box verbalized scores.
  2. [Methods] Methods (likely §3.2–3.3): the exact conversion procedure from raw verbalized strings to class probabilities, the statistical tests for ranking quality, error-bar computation, and any data-exclusion rules are not described at the level needed to reproduce or verify the ranking-versus-granularity results.
minor comments (2)
  1. [Figures] Figure captions and axis labels should explicitly state the exact metric (e.g., AUC, number of distinct values, or threshold count) and whether error bars represent standard deviation across seeds or datasets.
  2. [Section 3] A short table summarizing the seven constructions (input format, output format, inference cost) would improve readability before the main results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point by point below, indicating where the manuscript will be revised to improve clarity and robustness.

Point-by-point responses
  1. Referee: [Experimental setup] Experimental setup (likely §3 or §4): no selection criteria, ablation on model scale/family, or sensitivity analysis are provided for the 9 LLMs and 3 benchmarks. The central claim that verbalized confidence yields only a handful of distinct values (and therefore coarse thresholds) is load-bearing on these setups being representative; without such checks the observed gap could be an artifact of the chosen pool rather than a general property of black-box verbalized scores.

    Authors: The nine LLMs were selected to cover distinct families (GPT, Llama, Mistral, and others) and scales (7B to >100B parameters), while the three benchmarks are standard classification tasks used in prior LLM evaluation work. We acknowledge that explicit selection criteria and sensitivity checks were not stated in the submitted version. We will add a dedicated paragraph in Section 3 describing the rationale for model and dataset diversity and include an appendix reporting results on a held-out model scale and an additional benchmark to verify that the granularity gap persists. revision: yes

  2. Referee: [Methods] Methods (likely §3.2–3.3): the exact conversion procedure from raw verbalized strings to class probabilities, the statistical tests for ranking quality, error-bar computation, and any data-exclusion rules are not described at the level needed to reproduce or verify the ranking-versus-granularity results.

    Authors: We agree that the current description is insufficient for full reproducibility. We will expand Sections 3.2–3.3 with (i) the precise string-to-probability mapping (including rules for non-numeric or out-of-range verbalizations), (ii) the ranking metric and associated statistical test, (iii) the bootstrap procedure used for error bars, and (iv) explicit data-exclusion criteria. Pseudocode for the conversion and evaluation pipeline will be added to the appendix. revision: yes
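
The string-to-probability mapping promised in item (i) is not specified on this page. As a purely illustrative sketch of what such rules could look like, the helper below extracts the first number from a raw reply, treats percent-style values as percentages, and returns `None` for non-numeric or out-of-range verbalizations. Every rule here is an assumption, including the choice to discard rather than clip bad values; `parse_confidence` is a hypothetical name.

```python
import re
from typing import Optional

# First signed integer or decimal in the reply.
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_confidence(raw: str) -> Optional[float]:
    """Parse a raw verbalized confidence into [0, 1], or None if unusable.

    Assumed rules (not from the paper):
      - no number in the reply        -> None
      - a '%' sign or a value above 1 -> interpret as a percentage
      - result outside [0, 1]         -> None (discarded, not clipped)
    A bare "1" is ambiguous (1% vs. 100%); this sketch reads it as 1.0.
    """
    match = _NUMBER.search(raw)
    if match is None:
        return None
    value = float(match.group())
    if "%" in raw or value > 1.0:
        value /= 100.0
    if not 0.0 <= value <= 1.0:
        return None
    return value


for reply in ["85%", "0.9", "Confidence: 90", "very sure", "150%"]:
    print(repr(reply), "->", parse_confidence(reply))
```

Whether unusable replies are dropped, clipped, or re-queried changes which examples enter the evaluation, which is why the referee asks for the exclusion rules to be stated alongside the mapping.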

Circularity Check

0 steps flagged

No circularity: purely empirical comparative measurements

Full rationale

The paper performs direct empirical comparisons of seven confidence constructions across 25 model-dataset pairs. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. All reported findings (ranking quality, number of distinct values, granularity) are measurements taken from the experiments themselves, with no reduction to inputs by construction. The representativeness concern is an external-validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmarking study. No free parameters are fitted to produce the central claims. The only domain assumption is representativeness of the chosen models and benchmarks.

axioms (1)
  • domain assumption: The 25 model-dataset pairs and seven constructions are representative of black-box LLM selective prediction deployments.
    Generalization from the specific 9 LLMs and 3 benchmarks to broader deployment practice.


