Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment
Pith reviewed 2026-05-14 21:09 UTC · model grok-4.3
The pith
Small language models can route student scoring tasks to larger models using their verbalized numerical confidence, approaching large-model accuracy (kappa 0.802 vs. 0.819) at 76 percent lower cost and 61 percent lower latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Verbalized confidence serves as an effective routing signal in cascade scoring systems when small language models produce sufficiently varied confidence values; the best such cascades reach kappa 0.802 versus 0.819 for the large model alone, at 76 percent lower cost and 61 percent lower latency. Confidence discrimination varies sharply across small models, with the strongest reaching AUROC 0.857 and the weakest producing a near-degenerate distribution. Lower confidence also aligns with items where human annotators disagreed or took longer to score.
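The kappa figures above are Cohen's chance-corrected agreement between model scores and expert scores. For reference, a generic sketch of the computation (not the paper's code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label
    sequences (e.g. model scores vs. expert scores). Undefined when
    both sequences are the same single label (expected agreement 1)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa of 0.802 vs. 0.819 therefore compares two chance-corrected agreement rates against the same expert labels, not raw accuracies.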
What carries the argument
Verbalized numerical confidence as a routing signal that decides whether a small language model handles a scoring task or escalates it to a larger model.
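As an illustration of this mechanism, a minimal Python sketch — `small_model` and `large_model` are hypothetical stand-ins for the actual LM calls, and the verbalized 0-100 confidence is assumed normalized to [0, 1]:

```python
from typing import Callable, Tuple

def cascade_score(
    response: str,
    small_model: Callable[[str], Tuple[int, float]],  # returns (score, confidence in [0, 1])
    large_model: Callable[[str], int],
    threshold: float = 0.8,
) -> Tuple[int, str]:
    """Route one scoring task: keep the small model's answer when its
    verbalized confidence clears the threshold, else escalate."""
    score, confidence = small_model(response)
    if confidence >= threshold:
        return score, "small"
    return large_model(response), "large"
```

Raising `threshold` escalates more items (higher accuracy, higher cost); lowering it keeps more items on the small model.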
If this is right
- Small language models with strong confidence variance enable practitioners to move along a cost-accuracy frontier by adjusting the escalation threshold.
- Small language models whose confidence is nearly constant cannot produce cascades that close the accuracy gap no matter what threshold is chosen.
- Confidence values track human scoring difficulty, so lower-confidence items are also the ones that take annotators longer and produce more disagreement.
- Cascades built from the strongest small models incur no statistically detectable kappa loss relative to always using the large model.
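The cost-accuracy frontier in the first point can be traced by sweeping the escalation threshold; a hedged sketch, assuming per-item correctness labels for both models are available on an evaluation set:

```python
def frontier(confidences, small_correct, large_correct, thresholds):
    """For each candidate threshold, report (threshold, escalation rate,
    cascade accuracy): small-model answers below the threshold are
    replaced by the large model's answer."""
    n = len(confidences)
    points = []
    for t in thresholds:
        correct = sum(
            lc if c < t else sc
            for c, sc, lc in zip(confidences, small_correct, large_correct)
        )
        escalated = sum(c < t for c in confidences)
        points.append((t, escalated / n, correct / n))
    return points
```

Escalation rate is a direct proxy for cost and latency, so each tuple is one point on the frontier the paper describes.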
Where Pith is reading between the lines
- The same routing logic could be tested on other text-based judgment tasks such as content moderation or clinical note review where cost and latency constraints are similar.
- Improving confidence calibration in small models would directly widen the set of tasks for which cheap cascades become viable.
- Production systems could monitor the variance of confidence scores on incoming data as a quick diagnostic for whether a given small model remains useful for routing.
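The variance diagnostic in the last point is cheap to implement; a sketch, where the `min_std` floor is a hypothetical operating choice rather than a value from the paper:

```python
import statistics

def confidence_degenerate(confidences, min_std=0.05):
    """Flag a near-degenerate confidence distribution: if the standard
    deviation of recent verbalized confidences falls below the floor,
    the model is emitting almost the same value everywhere and cannot
    support threshold-based routing."""
    return statistics.pstdev(confidences) < min_std
```

Run over a sliding window of recent traffic, this gives an early warning before cascade accuracy visibly degrades.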
Load-bearing premise
The verbalized numerical confidence produced by small language models is a stable signal that reliably tracks actual correctness across different model families and student response data.
What would settle it
A new collection of student responses in which small-model confidence shows no correlation with actual scoring errors or with human annotator disagreement and scoring time.
Original abstract
Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third -- whose confidence was near-degenerate -- could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.
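The AUROC figures in the abstract measure how well confidence ranks correct predictions above incorrect ones. It can be computed directly from the Mann-Whitney formulation (a generic sketch, not the authors' code):

```python
def auroc(confidences, correct):
    """AUROC of confidence as a correctness discriminator, via the
    Mann-Whitney U statistic: the probability that a randomly chosen
    correct prediction carries higher confidence than an incorrect
    one, with ties counting half."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUROC of 0.5 is the chance level a near-degenerate confidence distribution collapses toward; 0.857 indicates strong ranking of errors below successes.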
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates verbalized numerical confidence from small language models as a routing signal in cascade scoring systems for educational assessment. On 2,100 expert-scored decisions from student-AI math conversations, it tests model pairs (GPT-5.4, Claude 4.5+, Gemini 3.1) and reports that the best small LM achieves AUROC 0.857 for confidence discrimination; cascades using strong discriminators reach kappa 0.802 (vs. 0.819 for the large LM alone) at 76% lower cost and 61% lower latency, with no statistically detectable accuracy loss, while weak discriminators cannot close the gap.
Significance. If the central empirical result holds under proper validation, the work demonstrates a practical route to cost- and latency-efficient automated scoring by exploiting small-LM confidence variance. The use of real expert annotations, concrete AUROC/kappa/cost metrics, and the observation that confidence tracks human scoring difficulty are strengths. However, the headline claim of retained accuracy at reduced cost is load-bearing on the threshold-selection procedure, which is not detailed in the provided abstract and risks optimistic bias if performed on the full evaluation set.
major comments (2)
- [Evaluation / Results] The procedure for selecting the confidence threshold (or the number of thresholds tested) is not described. If the threshold that yields kappa 0.802 with no detectable loss was chosen by searching over the same 2,100 expert-scored decisions used for final reporting, rather than via nested cross-validation or a held-out validation set, the reported retention of accuracy is likely inflated by selection bias. The statistical test for 'no detectable loss' must also account for multiple comparisons or data reuse.
- [Methods] The exact data splits, model prompting templates for eliciting verbalized confidence, and the definition of 'no statistically detectable kappa loss' (including the test statistic and power) are not specified. These details are required to assess whether the AUROC 0.857 and kappa values generalize beyond the particular 2,100 decisions.
minor comments (2)
- [Abstract] Abstract: the phrase 'the two small LMs with meaningful confidence variance' should be replaced by the specific model names or identifiers for clarity.
- [Results] The paper should report the exact number of candidate thresholds examined and whether any correction for multiple testing was applied when claiming 'no detectable loss'.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of methodological transparency. We address each major comment below and have revised the manuscript to provide the requested details on threshold selection, data splits, prompting, and statistical definitions.
Point-by-point responses
- Referee: [Evaluation / Results] The procedure for selecting the confidence threshold (or the number of thresholds tested) is not described. If the threshold that yields kappa 0.802 with no detectable loss was chosen by searching over the same 2,100 expert-scored decisions used for final reporting, rather than via nested cross-validation or a held-out validation set, the reported retention of accuracy is likely inflated by selection bias. The statistical test for 'no detectable loss' must also account for multiple comparisons or data reuse.
Authors: We agree the original description was insufficient and could raise concerns about selection bias. In the revised manuscript we now specify that threshold selection was performed via nested cross-validation: an outer 5-fold CV loop for final reporting, with an inner loop on each training partition used to select the threshold maximizing kappa subject to no significant loss versus the large model alone. We have also updated the statistical procedure to a paired bootstrap test with Bonferroni correction across the three candidate thresholds, confirming no detectable loss (adjusted p > 0.05). These changes eliminate the risk of optimistic bias from data reuse. revision: yes
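The procedure the rebuttal describes — an outer CV loop for reporting, with an inner selection of the threshold on each training partition — can be sketched as follows. This is a simplified reconstruction (accuracy in place of kappa, no inner significance test), not the authors' implementation:

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices deterministically and split into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv_cascade(conf, small_ok, large_ok, thresholds, k=5):
    """Outer k-fold CV for reporting; on each outer training partition,
    pick the threshold with the best cascade accuracy, then apply it
    once to the held-out fold. Returns per-fold (threshold, accuracy)."""
    def acc(ix, t):
        return sum(large_ok[i] if conf[i] < t else small_ok[i] for i in ix) / len(ix)
    folds = kfold_indices(len(conf), k)
    results = []
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        best = max(thresholds, key=lambda t: acc(train, t))
        results.append((best, acc(test, best)))
    return results
```

The key property is that the threshold reported for each fold is never chosen using that fold's items, which is what removes the data-reuse bias the referee raised.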
- Referee: [Methods] The exact data splits, model prompting templates for eliciting verbalized confidence, and the definition of 'no statistically detectable kappa loss' (including the test statistic and power) are not specified. These details are required to assess whether the AUROC 0.857 and kappa values generalize beyond the particular 2,100 decisions.
Authors: We have expanded the Methods section and added a new appendix. Data splits are now stated as a 70/30 train/test partition with 5-fold cross-validation performed only on the training portion for threshold tuning. Full prompting templates for verbalized confidence (including the exact instruction to output a numerical score from 0-100) are reproduced verbatim. The definition of no statistically detectable kappa loss is clarified as a McNemar test on paired predictions with a pre-specified power analysis (85% power to detect a kappa difference of 0.03 at alpha = 0.05). These additions allow direct assessment of generalizability. revision: yes
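The McNemar test the authors specify operates on paired per-item correctness. A minimal sketch of the continuity-corrected statistic (the significance decision would compare it against a chi-square threshold, omitted here):

```python
def mcnemar_statistic(small_ok, large_ok):
    """Continuity-corrected McNemar chi-square on paired correctness:
    b = items only the large model got right, c = items only the
    small/cascade path got right. Large values indicate a detectable
    accuracy difference between the paired systems."""
    b = sum(1 for s, l in zip(small_ok, large_ok) if not s and l)
    c = sum(1 for s, l in zip(small_ok, large_ok) if s and not l)
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Only the discordant pairs enter the statistic, which is why the test is well suited to comparing two scorers on the same items.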
Circularity Check
No circularity: purely empirical evaluation against external annotations
Full rationale
The paper reports an empirical study that evaluates cascade routing performance by comparing small-LM verbalized confidence against 2,100 independently expert-scored decisions. No equations, fitted parameters, or self-citation chains are used to derive the headline kappa or cost figures; thresholds and AUROC values are computed directly from the held-out human labels. The analysis contains no self-definitional steps, no renaming of known results, and no load-bearing reliance on prior author work that would reduce the central claims to the inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- confidence threshold
axioms (1)
- Domain assumption: verbalized confidence from small LMs correlates with actual correctness and human scoring difficulty.
Reference graph
Works this paper leans on
- [1] OpenAI. API Pricing. https://developers.openai.com/api/docs/pricing, 2026. Accessed 2026-03-22.
- [2] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. In International Conference on Learning Representations (ICLR), 2024. doi:10.48550/arXiv.2306.13063. arXiv:2306.13063.
- [3] Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, and Chang Xu. Revisiting Uncertainty Estimation and Calibration of Large Language Models. 2025. doi:10.48550/arXiv.2505.23854. arXiv:2505.23854.
- [4] Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A. Lamb, Jialin Yu, Philip H. S. Torr, and Chang Xu. Can Large Language Models Express Uncertainty Like Human? 2025. doi:10.48550/arXiv.2509.24202. arXiv:2509.24202.
- [5] Scott Frohn, Tyler Burleigh, and Jing Chen. Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design. In Artificial Intelligence in Education, volume VI of Lecture Notes in Artificial Inte…
- [7] Chaitanya Ramineni and David M. Williamson. Automated Essay Scoring: Psychometric Guidelines and Practices. Assessing Writing, 18(1):25–39, 2013. doi:10.1016/j.asw.2012.10.004.
- [8] Peter W. Foltz, Lynn A. Streeter, Karen E. Lochbaum, and Thomas K. Landauer. Automated Scoring of Essays with the Intelligent Essay Assessor, pages 68–88. Routledge. doi:10.4324/9780203122761.
- [10] Hiroaki Funayama, Shota Sasaki, Yuichiroh Matsubayashi, Tomoya Mizumoto, Jun Suzuki, Masato Mita, and Kentaro Inui. Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 237–243. Association for C…
- [11] Hiroaki Funayama, Tasuku Sato, Yuichiroh Matsubayashi, Tomoya Mizumoto, Jun Suzuki, and Kentaro Inui. Balancing Cost and Quality: An Exploration of Human-in-the-Loop Frameworks for Automated Short Answer Scoring. In International Conference on Artificial Intelligence in Education, pages 465–476. Springer, 2022. doi:10.48550/arXiv.2206.08288. arXiv:2206.08288.
- [12] Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, and Qi Fu. Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, pages 293–305, 2025. doi:10.1145/3706468.3706507.
- [13] Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A Survey of Confidence Estimation and Calibration in Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 6577–6595, 2024. doi:10.18653/v1/2024.naacl-long.366.
- [14] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. 2023. doi:10.48550/arXiv.2305.05176. arXiv:2305.05176.
- [15] Yu-Neng Chuang, Leisheng Yu, Guanchu Wang, Lizhe Zhang, Zirui Liu, Xuanting Cai, Yang Sui, Vladimir Braverman, and Xia Hu. Confident or Seek Stronger: Exploring Uncertainty-Based On-Device LLM Routing. 2025. doi:10.48550/arXiv.2502.04428. arXiv:2502.04428.
- [16] Tyler Burleigh, Jing Chen, and Kristen DiCerbo. Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers, pages 61–68. National Council on Measurement in Education (NCME), 2025. ISBN 979-8-218-…
- [17] Seyma Yildirim-Erbasli and Okan Bulut. Innovating Assessment with Conversational Agents: A Technology-Enhanced Approach to Formative Assessments. In 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), pages 331–335, 2023. doi:10.1109/ICALT58122.2023.00103.
- [18] Joseph L. Fleiss. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin, 76(5):378–382, 1971. doi:10.1037/h0031619.
- [19] J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1):159–174, 1977. doi:10.2307/2529310.
- [20] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, 2nd edition, 1988. ISBN 978-0-8058-0283-2.
- [21] Daniel Yang, Yao-Hung Hubert Tsai, and Makoto Yamada. On Verbalized Confidence Scores for LLMs. 2024. doi:10.48550/arXiv.2412.14737. arXiv:2412.14737.
- [22] Changye Xu, Bingbing Wen, Bohan Han, Robert Wolfe, Lucy Lu Wang, and Bill Howe. Do Language Models Mirror Human Confidence? In Findings of the Association for Computational Linguistics: ACL 2025, 2025. doi:10.18653/v1/2025.findings-acl.1316. arXiv:2506.00582.
- [23] David W. Hosmer, Stanley Lemeshow, and Rodney X. Sturdivant. Assessing the Fit of the Model, pages 153–… John Wiley & Sons, 2013. ISBN 978-1-118-54838-… doi:10.1002/9781118548387.ch5.
- [26] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of PMLR, pages 1321–1330, 2017.
- [27] Jacob Cohen. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46, 1960. doi:10.1177/001316446002000104.
- [28] Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, and Sanjiv Kumar. Language Model Cascades: Token-Level Uncertainty and Beyond. 2024. doi:10.48550/arXiv.2404.10136. arXiv:2404.10136.
- [29] Michael J. Zellinger and Matt Thomson. Efficiently Deploying LLMs with Controlled Risk. 2024. doi:10.48550/arXiv.2410.02173. arXiv:2410.02173.
- [30] Morris H. DeGroot and Stephen E. Fienberg. The Comparison and Evaluation of Forecasters. The Statistician, 32(1/2):12–22, 1983. doi:10.2307/2987588.