pith. machine review for the scientific record.

arxiv: 2605.02038 · v1 · submitted 2026-05-03 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links


What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

Jayita Chatterjee, Ranit Karmakar


Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords language models · benchmarking · prompt sensitivity · calibration · reliability · evaluation metrics · chain-of-thought

The pith

Reliability conclusions for language models hinge on the evaluation pipeline as much as on the model itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that single-prompt accuracy benchmarks overlook important reliability failures in language models. By testing multiple models across several benchmarks and prompt variants while tracking accuracy, calibration, parse rates, and robustness to perturbations, it shows how evaluation choices alter apparent performance. For example, an evaluator mismatch can slash reported accuracy by over 70 percent even when the model is capable. This matters because practitioners rely on benchmarks to judge whether models are trustworthy for real tasks. The findings argue for explicitly reporting these factors alongside any reliability claim.

Core claim

Switching from single-prompt to multi-variant evaluation shows that reliability metrics depend heavily on the pipeline. Changing the definition of expected calibration error shifts per-cell scores by a mean absolute 0.149. Pairing chain-of-thought prompts with a first-character evaluator on ARC-Challenge drops accuracy by 72-88 percent, a loss attributable to evaluator failure and largely recoverable by evaluator-side repairs. Verbal confidence on MMLU-Pro exceeds both actual accuracy and token probabilities for all models, and verbal parse rates can collapse for a single model on a single variant. Prompt-perturbation spread correlates weakly and inconsistently with model size, ranging from negative to positive values across benchmarks.
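The verbal-confidence and parse-rate claims can be made concrete. The parser below is an illustrative stand-in, not the authors' implementation; the strictness of a pattern like this is one way a parse rate can collapse when a prompt variant changes a model's phrasing.

```python
import re

def parse_verbal_confidence(text):
    """Extract a stated confidence like 'Confidence: 90%'; None if unparseable."""
    m = re.search(r"confidence:\s*(\d{1,3})\s*%", text, re.IGNORECASE)
    return int(m.group(1)) / 100 if m else None

print(parse_verbal_confidence("Answer: B. Confidence: 90%"))  # 0.9
print(parse_verbal_confidence("I'm fairly sure it's B."))     # None
```

A row whose confidence parses to `None` counts against the parse rate even when the answer itself is correct.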

What carries the argument

The multi-variant audit framework that applies five prompt variants to each model-dataset pair and computes accuracy alongside token-probability calibration, verbal-confidence calibration, verbal parse rate, and prompt-perturbation spread.
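A minimal sketch of such an audit grid, with hypothetical names (`run_model`, `score_cell`) and toy stand-ins in place of real inference, not the paper's code:

```python
from itertools import product

PROMPT_VARIANTS = ["baseline", "paraphrase", "reorder", "fewshot", "cot"]

def audit(models, datasets, run_model, score_cell):
    """One metrics dict per (model, dataset, variant) cell."""
    return {
        (m, d, v): score_cell(run_model(m, d, v))
        for m, d, v in product(models, datasets, PROMPT_VARIANTS)
    }

# Toy stand-ins: each row is (prediction, gold, token_probability).
run = lambda m, d, v: [("A", "A", 0.9), ("B", "C", 0.6)]
score = lambda rows: {"accuracy": sum(p == g for p, g, _ in rows) / len(rows)}

grid = audit(["m1", "m2"], ["arc"], run, score)
print(len(grid))  # 2 models x 1 dataset x 5 variants = 10 cells
```

In the paper's setup, `score_cell` would also emit the two calibration scores, the verbal parse rate, and the inputs to prompt-perturbation spread for that cell.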

If this is right

  • Changing calibration definitions materially changes reliability assessments.
  • Evaluator logic in prompts can cause large apparent accuracy drops that are not model faults.
  • Verbal confidence signals are often miscalibrated relative to token probabilities and actual accuracy.
  • Model size does not reliably predict robustness to prompt changes.
  • Explicit reporting of calibration, parseability, and robustness is needed for valid reliability claims.
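The first bullet can be illustrated with the two confidence definitions presumably at stake (a plausible reading of "raw" vs. "label-set-normalised", not the paper's exact code); a full ECE would then bin these confidences against correctness.

```python
def raw_confidence(label_probs, pred):
    """Raw: the probability the model assigned to the predicted label token."""
    return label_probs[pred]

def normalised_confidence(label_probs, pred):
    """Label-set-normalised: renormalise over the valid answer labels first."""
    return label_probs[pred] / sum(label_probs.values())

# Hypothetical token probabilities; 40% of the mass fell on off-label tokens.
probs = {"A": 0.40, "B": 0.10, "C": 0.05, "D": 0.05}
print(raw_confidence(probs, "A"))                   # 0.4
print(round(normalised_confidence(probs, "A"), 3))  # 0.667
```

The same prediction lands in different confidence bins under the two definitions, which is how a definitional switch alone can move per-cell ECE.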

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model rankings based on single-prompt tests may not hold under varied real-world prompts.
  • Evaluation protocols could incorporate automatic checks for parse rates and calibration to catch hidden issues.
  • Future work might explore whether these pipeline dependencies affect larger models differently.
  • Practitioners should test models with multiple variants before deployment decisions.
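Multi-variant testing is cheap to operationalise. One plausible definition of prompt-perturbation spread (the paper's exact formula may differ) is the accuracy range across variants for a fixed model and dataset, here with made-up numbers:

```python
def perturbation_spread(variant_accuracies):
    """Range of accuracy across prompt variants for one model-dataset pair."""
    vals = variant_accuracies.values()
    return max(vals) - min(vals)

accs = {"baseline": 0.81, "paraphrase": 0.78, "reorder": 0.74,
        "fewshot": 0.83, "cot": 0.69}
print(round(perturbation_spread(accs), 2))  # 0.14
```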

Load-bearing premise

The selected five benchmarks and five prompt variants along with the chosen metrics sufficiently capture the reliability failures relevant in practice.

What would settle it

Consistent reliability conclusions from single-prompt accuracy alone, across a broad set of additional benchmarks, models, and prompt variants, would falsify the claimed dependence on the evaluation pipeline.

Figures

Figures reproduced from arXiv: 2605.02038 by Jayita Chatterjee, Ranit Karmakar.

Figure 1. ARC-Challenge accuracy collapse under chain-of-thought elicitation… view at source ↗
Figure 2. Verbal confidence systematically overstates both accuracy and token-probability confidence on MMLU-Pro. view at source ↗
Figure 3. Prompt-perturbation spread on SST-2 vs. model parameter count for 10 instruct models. Spread does not… view at source ↗
Figure 4. MMLU-Pro reliability diagrams, one per primary model… view at source ↗
read the original abstract

Single-prompt accuracy is the dominant way to benchmark language models, but it can miss reliability failures that matter. We evaluate a 15-model open-weight corpus, with the main reliability analyses focused on 10 instruct models across five classification and reasoning benchmarks under five prompt variants each, measuring accuracy, token-probability calibration, verbal-confidence calibration, verbal parse rate, and prompt-perturbation spread for every (model × dataset × variant) cell. We find three broad results. First, evaluation design can materially change the conclusion. Switching the token-probability Expected Calibration Error (ECE) from a raw to a label-set-normalised definition changes per-cell calibration by a mean absolute 0.149. More strikingly, pairing a chain-of-thought prompt with a first-character evaluator on ARC-Challenge reduces apparent accuracy by 72-88% across all five primary models; two independent repair procedures recover 93.8% and 102.7% of the lost performance, indicating an evaluator-side rather than model-side failure. Second, confidence signals are fragile. On MMLU-Pro, every primary model verbally reports confidence substantially above both its accuracy and its token-probability confidence on the same rows, and verbal parse rate can collapse for a single model on a single prompt variant. Third, prompt robustness does not track parameter count reliably. Across 10 instruct models, the correlation between model size and prompt-perturbation spread ranges from -0.244 to 0.474 across benchmarks. Taken together, these results show that reliability conclusions for small language models depend not only on the model being evaluated, but also on the evaluation pipeline used to measure it. We argue that calibration definitions, evaluator logic, verbal parseability, and prompt robustness should be reported explicitly when making reliability claims.
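The 72-88% ARC-Challenge collapse is easy to reconstruct in miniature. The snippet below is an illustrative reconstruction, not the paper's evaluator: a first-character rule scores a chain-of-thought response wrong because it opens with reasoning text, while a regex extraction (one plausible form of a repair procedure) recovers the answer.

```python
import re

response = ("Let's think step by step. The choice that matches the question "
            "is option B, so the answer is B.")
gold = "B"

first_char = response.strip()[0].upper()             # 'L' -> scored incorrect
m = re.search(r"answer is\s*\(?([A-D])\)?", response)
repaired = m.group(1) if m else None                 # 'B' -> scored correct

print(first_char == gold, repaired == gold)  # False True
```

The model's answer never changed; only the evaluator did, which is the sense in which the loss is evaluator-side.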

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims that single-prompt accuracy is insufficient for assessing language model reliability because it can miss failures that depend on the evaluation pipeline. The authors audit 10 instruct models from a 15-model open-weight corpus on five benchmarks using five prompt variants each. They report metrics including accuracy, token-probability calibration, verbal-confidence calibration, verbal parse rate, and prompt-perturbation spread. Their results show that changing the ECE definition alters calibration by a mean absolute 0.149, that CoT prompts with first-character evaluators cause large accuracy drops (72-88%) on ARC-Challenge due to evaluator issues (recoverable by repairs), that verbal confidence is overconfident on MMLU-Pro, and that size-robustness correlations vary widely (-0.244 to 0.474). They conclude that reliability conclusions depend on both the model and the pipeline.

Significance. This work is significant because it provides concrete, quantified evidence that common benchmarking practices can lead to incorrect reliability assessments. The specific examples of pipeline sensitivity, including the large accuracy drops attributable to evaluator logic rather than the model, and the fragility of verbal confidence, offer clear illustrations of the problem. These findings, if they hold under scrutiny, could prompt the community to adopt more comprehensive evaluation standards that include multiple variants and explicit reporting of calibration and robustness metrics.

minor comments (4)
  1. The abstract reports numerical results (e.g., 0.149 MAE shift, 72-88% accuracy drop) without referencing the corresponding tables, figures, or sections in the main text where methods, statistical tests, and derivations of these values are detailed.
  2. Exact prompt texts for the five variants, full evaluator logic (including the first-character and repair procedures), and any code or raw data for the (model x dataset x variant) cells should be included in an appendix to support verification and replication of the reported metrics.
  3. The correlation range (-0.244 to 0.474) between model size and prompt-perturbation spread would benefit from accompanying p-values, confidence intervals, or per-benchmark breakdowns to strengthen the claim that robustness does not track parameter count reliably.
  4. Clarify the exact composition of the 15-model open-weight corpus relative to the 10 instruct models used for the primary reliability analyses, including any selection criteria.
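A per-benchmark correlation of the kind minor comment 3 asks to see reported is a short computation; the sketch below uses invented (size, spread) pairs, not the paper's data.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented (parameter count in billions, perturbation spread) pairs.
sizes = [0.5, 1.5, 3.0, 7.0, 14.0]
spread = [0.12, 0.09, 0.15, 0.08, 0.11]
print(round(pearson_r(sizes, spread), 3))
```

With only 10 models per benchmark, bootstrap confidence intervals on this coefficient would make the "weak and inconsistent" claim much easier to assess.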

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the recognition of its significance, and the recommendation for minor revision. The referee's description accurately reflects our core claims regarding the sensitivity of reliability assessments to prompt variants, evaluator choices, and calibration definitions.

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurements

full rationale

The paper conducts an empirical audit of language model reliability across 15 models, five benchmarks, and five prompt variants, directly measuring accuracy, token-probability calibration (ECE), verbal-confidence calibration, parse rates, and perturbation spread for each (model × dataset × variant) combination. No derivations, equations, fitted parameters, or self-citations are used to generate the reported quantities; all results are computed from raw model outputs and evaluator logic applied to the data. The central claim—that reliability conclusions depend on the evaluation pipeline—is substantiated by concrete, quantified differences (e.g., ECE normalization shift of 0.149, accuracy drops of 72-88% recoverable by evaluator repairs) rather than by any internal reduction or ansatz. This is a standard empirical study with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical audit study with no free parameters, no new postulated entities, and reliance on standard domain assumptions about benchmark validity and metric definitions.

axioms (1)
  • domain assumption The five selected benchmarks and five prompt variants adequately sample the space of reliability-relevant behaviors.
    Generalization from the observed cells to broader reliability conclusions rests on this untested representativeness claim.

pith-pipeline@v0.9.0 · 5624 in / 1283 out tokens · 25480 ms · 2026-05-08T19:25:35.354127+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

14 extracted references · 13 canonical work pages · 10 internal anchors

  1. [1]

    Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905.

  2. [2]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. SmolLM2: When smol goes big -- data-centric training of a small language model. arXiv preprint arXiv:2502.02737.

  3. [3]

    The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  5. [5]

    Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

  6. [6]

    Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

  7. [7]

    Teaching models to express their uncertainty in words

Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.

  8. [8]

    Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.

  9. [9]

    Gemma 3 Technical Report

Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.

  10. [10]

    Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empiri...

  11. [11]

    Do Prompt-Based Models Really Understand the Meaning of Their Prompts?

    Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300-2344.

  12. [12]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.

  13. [13]

    Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  14. [14]

    Reasoning prompt variants (extracted prompt text, not a bibliographic work)

    Placeholders: {input} = question with lettered choice list; {label_list} = comma-separated labels; {fewshot_examples} = rendered few-shot block. • surface_paraphrase (baseline): "Answer the following multiple-choice question. {input} Answer with only the letter of the correct option. Answer:" • instruction_reorder (instruction placed ...
    Reasoning prompt variants(placeholders: {input}= question with lettered choice list; {label_list}= comma- separated labels;{fewshot_examples}=rendered few-shot block): •surface_paraphrase (baseline): “Answer the following multiple-choice question. {input} Answer with only the letter of the correct option. Answer:” •instruction_reorder (instruction placed ...