pith. sign in

arxiv: 2606.00251 · v1 · pith:OFHYFHOPnew · submitted 2026-05-29 · 💻 cs.AI

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Pith reviewed 2026-06-28 22:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords capability self-assessmentreinforcement learninglarge language modelsself-knowledgepolicy learningmodel limitationsdelegation
0
0 comments X

The pith

Reinforcement learning trains large language models to recognize their own limits and delegate unsolvable tasks while keeping original capabilities intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern large language models consistently attempt problems beyond their competence. The paper frames the missing skill as Capability Self-Assessment and casts it as a policy that decides whether to solve or hand off a query. Reinforcement learning produces this policy more successfully than supervised fine-tuning, avoids damaging the underlying model, and transfers to unseen task distributions. The resulting behavior supports practical decisions such as routing work between local and cloud models and selecting useful training examples.

Core claim

Large language models overestimate their competence across families and scales. Treating self-assessment as a policy-learning problem, reinforcement learning instills accurate capability recognition, outperforms supervised fine-tuning, and leaves the model's original abilities undamaged, whereas supervised fine-tuning degrades the very capabilities being assessed. The acquired behavior generalizes out of distribution and improves inference-time delegation and training-data selection.

What carries the argument

Reinforcement learning applied to a policy that outputs a solve-or-delegate decision for each query, optimized to improve self-assessment accuracy without capability loss.

If this is right

  • At inference time, models can route queries they cannot solve to stronger external systems instead of producing incorrect answers.
  • During continued training, self-assessment scores can be used to prioritize examples that the model currently cannot handle.
  • The learned policy transfers to new task families without retraining, indicating that self-assessment is a general trait rather than task-specific memorization.
  • Supervised fine-tuning on self-assessment labels harms downstream performance, so any future alignment method that uses labeled self-knowledge must avoid this degradation path.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same reinforcement-learning approach succeeds on agentic tasks that involve tool use or multi-step planning, self-assessment could become a standard safety layer before external actions are taken.
  • The contrast between reinforcement learning and supervised fine-tuning suggests that capability self-assessment may be more naturally acquired through outcome-based feedback than through direct imitation of human judgments.
  • Because the behavior generalizes out of distribution, one could test whether a single reinforcement-learning run on a narrow domain produces useful self-assessment on entirely different modalities such as code or mathematics.

Load-bearing premise

The measured preservation of original capabilities and the superiority of reinforcement learning over supervised fine-tuning depend on evaluation protocols free of post-hoc data exclusions or test-set choices that favor the reinforcement-learning condition.

What would settle it

On a held-out collection of tasks drawn from the same distribution as the training problems, measure whether the reinforcement-learning model still attempts a higher fraction of unsolvable queries than a supervised-fine-tuning model or the base model.

Figures

Figures reproduced from arXiv: 2606.00251 by Haoyan Yang, Heng Huang, Jiawei Zhou, Reza Shirkavand, Shangqian Gao, Yukai Jin.

Figure 1
Figure 1. Figure 1: Current models lack CSA, across model families and sizes, evaluated on a combined math [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our framework for instilling CSA into a language model. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qwen3-4B trained over three epochs on Science. We track problem-solving accuracy on the original task, and M-F1 computed against two label sets: original labels probed from the base model and used during training, and re-probed labels freshly generated by the current checkpoint at each epoch. Left: SFTLabel. Right: RLVR. SFTLabel memorizes a fixed rule; RLVR learns CSA. As shown in [PITH_FULL_IMAGE:figure… view at source ↗
Figure 5
Figure 5. Figure 5: Ground-truth and predicted SELF-SOLVE rate of RLVR-trained Qwen3 models on Math across datasets of increasing difficulty [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of RLVR with SFT variants on Math. Left: decision accuracy, i.e., correct SELF-SOLVE/DELEGATE rate, and calibration error, i.e., gap between predicted and true SELF￾SOLVE rates. Right: Qwen3-4B case study com￾paring each method’s ground-truth, predicted, and correctness distributions [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 7
Figure 7. Figure 7: Number of samples per model size on Math whose 16 rollouts during DFW yield both SELF-SOLVE and DELEGATE responses [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Full reasoning traces for the math case study: vanilla Qwen3-4B vs. RLVR-trained. [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Case study on a science question: vanilla Qwen3-4B vs. RLVR-trained. [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Illustration of self-routing. Upon receiving a user query x, a local model equipped with CSA assesses whether the query falls within its reliable solving capability and decides between two actions: SELF-SOLVE, where the local model answers directly, or DELEGATE, where the query is routed to a cloud model for inference. Training Dataset Model with CSA query 𝑥 Delegate 1 query 𝑥2 query 𝑥𝑛 … Self-solve Deleg… view at source ↗
Figure 12
Figure 12. Figure 12: Overview of CSA-based data selection. A CSA-equipped model scores a fresh candidate [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗
read the original abstract

The ability to recognize one's own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent systems. Yet we show that modern large language models systematically lack this ability: across diverse model families and scales, they overestimate their competence and attempt queries they cannot solve. We refer to this ability as Capability Self-Assessment (CSA) and formulate it as a policy-learning problem, aiming to improve self-assessment while preserving the model's original capabilities. Our results show that reinforcement learning teaches CSA effectively, significantly outperforming supervised fine-tuning while preserving original capabilities. In contrast, supervised fine-tuning severely degrades the capabilities the model is meant to assess. Moreover, learned self-assessment behavior generalizes well out of distribution, suggesting that CSA is a transferable model trait. Finally, CSA is practically useful: it improves local-cloud decision making at inference time and provides a signal for targeted data selection during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Capability Self-Assessment (CSA) as the ability of LLMs to recognize their own limitations and decide whether to solve a query or delegate. It shows that current models systematically overestimate competence across families and scales. CSA is formulated as a policy-learning problem; experiments indicate that reinforcement learning trains effective CSA, significantly outperforming supervised fine-tuning while preserving original capabilities, whereas SFT degrades the very capabilities being assessed. The learned behavior generalizes out-of-distribution, and CSA improves local-cloud routing at inference and supplies a signal for targeted data selection during training.

Significance. If the comparative results and preservation claims are robust, the work would be significant for reliable LLM deployment: it supplies a concrete mechanism for models to know when to abstain or delegate, directly addressing overconfidence. The observation that RL preserves capabilities while SFT harms them is a useful distinction for alignment and post-training methods. OOD generalization and the two practical use-cases (inference routing, data selection) broaden the potential impact.

major comments (2)
  1. [Evaluation / results section] Evaluation / results section: the central claim that RL significantly outperforms SFT while preserving original capabilities (and that SFT severely degrades them) rests on the test-set construction and any filtering steps. The manuscript must explicitly state whether queries were excluded, re-weighted, or selected post-training on the basis of model performance, and must report the exact protocol used to measure capability preservation (including dataset sizes, number of runs, and statistical tests).
  2. [Results section] Results section: the claim of strong out-of-distribution generalization for the learned CSA policy requires a precise definition of the OOD distribution, the quantitative metrics used to measure it, and a comparison against the in-distribution baseline; without these details the generalization statement cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: high-level outcome statements are given without any numerical values, dataset sizes, or controls; moving at least one concrete metric (e.g., accuracy delta or capability retention percentage) into the abstract would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to provide the requested clarifications on experimental protocols.

read point-by-point responses
  1. Referee: [Evaluation / results section] Evaluation / results section: the central claim that RL significantly outperforms SFT while preserving original capabilities (and that SFT severely degrades them) rests on the test-set construction and any filtering steps. The manuscript must explicitly state whether queries were excluded, re-weighted, or selected post-training on the basis of model performance, and must report the exact protocol used to measure capability preservation (including dataset sizes, number of runs, and statistical tests).

    Authors: We agree that the test-set construction and capability preservation protocol require more explicit description. The revised manuscript now states clearly that no queries were excluded, re-weighted, or selected post-training on the basis of model performance. We have added the exact protocol details, including dataset sizes, number of runs, and statistical tests used to measure preservation of original capabilities. revision: yes

  2. Referee: [Results section] Results section: the claim of strong out-of-distribution generalization for the learned CSA policy requires a precise definition of the OOD distribution, the quantitative metrics used to measure it, and a comparison against the in-distribution baseline; without these details the generalization statement cannot be assessed.

    Authors: We agree that the OOD generalization claim needs a more precise definition and supporting details for assessment. The revised manuscript now includes an explicit definition of the OOD distribution, the quantitative metrics employed, and a direct comparison to the in-distribution baseline. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical comparisons with no derivations or self-referential reductions

full rationale

The paper formulates CSA as a policy-learning problem and reports empirical results comparing reinforcement learning to supervised fine-tuning on capability preservation and out-of-distribution generalization. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims rest on experimental outcomes rather than any chain that reduces by construction to its own inputs, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no free parameters, axioms, or invented entities; all claims rest on unspecified experimental comparisons.

pith-pipeline@v0.9.1-grok · 5690 in / 993 out tokens · 23039 ms · 2026-06-28T22:24:11.606982+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 21 canonical work pages · 11 internal anchors

  1. [1]

    M., Longpre, S., Lambert, N., Wang, X., Muennighoff, N., Hou, B., Pan, L., Jeong, H., et al

    Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on data selection for language models.arXiv preprint arXiv:2402.16827, 2024

  2. [2]

    Conformal prediction: A gentle introduction

    Anastasios N Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023

  3. [3]

    The internal state of an llm knows when it’s lying

    Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, 2023

  4. [4]

    Barkan, Sidney Black, and Oliver Sourbut

    Casey O. Barkan, Sidney Black, and Oliver Sourbut. Do large language models know what they are capable of? InThe Fourteenth International Conference on Learning Representations, 2026

  5. [5]

    Classification with a reject option using a hinge loss

    Peter L Bartlett and Marten H Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(8), 2008

  6. [6]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th annual international conference on machine learning, pages 41–48, 2009

  7. [7]

    Conformal prediction for natural language processing: A survey.Transactions of the Association for Computational Linguistics, 12:1497–1516, 2024

    Margarida Campos, António Farinhas, Chrysoula Zerva, Mário AT Figueiredo, and André FT Martins. Conformal prediction for natural language processing: A survey.Transactions of the Association for Computational Linguistics, 12:1497–1516, 2024

  8. [8]

    No Answer Needed: Predicting LLM Answer Accuracy from QQuestion-Only Linear Probes.arXiv preprint arXiv:2509.10625, 2025

    Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, and Lorenzo Pacchiardi. No answer needed: Predicting llm answer accuracy from question-only linear probes.arXiv preprint arXiv:2509.10625, 2025

  9. [9]

    Finetuning language models to emit linguistic expressions of uncertainty, 2024

    Arslan Chaudhry, Sridhar Thiagarajan, and Dilan Gorur. Finetuning language models to emit linguistic expressions of uncertainty, 2024

  10. [10]

    Large language model validity via enhanced conformal prediction methods.Advances in Neural Information Processing Systems, 37:114812–114842, 2024

    John J Cherian, Isaac Gibbs, and Emmanuel J Candès. Large language model validity via enhanced conformal prediction methods.Advances in Neural Information Processing Systems, 37:114812–114842, 2024

  11. [11]

    Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.arXiv preprint arXiv:2502.11028, 2025

    Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.arXiv preprint arXiv:2502.11028, 2025

  12. [12]

    On optimum recognition error and reject tradeoff.IEEE Transactions on information theory, 16(1):41–46, 2003

    C Chow. On optimum recognition error and reject tradeoff.IEEE Transactions on information theory, 16(1):41–46, 2003

  13. [13]

    Training verifiers to solve math word problems, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

  14. [14]

    Anytool: Self-reflective, hierarchical agents for large-scale api calls.arXiv preprint arXiv:2402.04253, 2024

    Yu Du, Fangyun Wei, and Hongyang Zhang. Anytool: Self-reflective, hierarchical agents for large-scale api calls.arXiv preprint arXiv:2402.04253, 2024

  15. [15]

    On the foundations of noise-free selective classification.Journal of Machine Learning Research, 11(5), 2010

    Ran El-Yaniv et al. On the foundations of noise-free selective classification.Journal of Machine Learning Research, 11(5), 2010

  16. [16]

    Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024. 10

  17. [17]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

  18. [18]

    Selectivenet: A deep neural network with an integrated reject option

    Yonatan Geifman and Ran El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. InInternational conference on machine learning, pages 2151–2159. PMLR, 2019

  19. [19]

    Conformal prediction with condi- tional guarantees.Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4):1100–1126, 2025

    Isaac Gibbs, John J Cherian, and Emmanuel J Candès. Conformal prediction with condi- tional guarantees.Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4):1100–1126, 2025

  20. [20]

    Strictly proper scoring rules, prediction, and estimation

    Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007

  21. [21]

    Gemini 3.1 Pro model card

    Google DeepMind. Gemini 3.1 Pro model card. Technical report, Google DeepMind, 2 2026

  22. [22]

    The llama 3 herd of models, 2024

    Aaron Grattafiori et al. The llama 3 herd of models, 2024

  23. [23]

    Overconfidence is key: Verbalized uncertainty evaluation in large language and vision-language models

    Tobias Groot and Matias Valdenegro-Toro. Overconfidence is key: Verbalized uncertainty evaluation in large language and vision-language models. InProceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), pages 145–171, 2024

  24. [24]

    Data selection via optimal control for language models.arXiv preprint arXiv:2410.07064, 2024

    Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, and Minlie Huang. Data selection via optimal control for language models.arXiv preprint arXiv:2410.07064, 2024

  25. [25]

    On the evaluation of neural selective prediction methods for natural language processing

    Zhengyao Gu and Mark Hopkins. On the evaluation of neural selective prediction methods for natural language processing. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7888–7899, 2023

  26. [26]

    MASH: Modeling Abstention via Selective Help-Seeking

    Mustafa Omer Gul, Claire Cardie, and Tanya Goyal. Pay-per-search models are abstention models.arXiv preprint arXiv:2510.01152, 2025

  27. [27]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. InInternational conference on machine learning, pages 1321–1330. PMLR, 2017

  28. [28]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  29. [29]

    Uncertainty distillation: Teaching language models to express semantic confidence, 2025

    Sophia Hager, David Mueller, Kevin Duh, and Nicholas Andrews. Uncertainty distillation: Teaching language models to express semantic confidence, 2025

  30. [30]

    Simple factuality probes detect hallucinations in long-form natural language generation

    Jiatong Han, Neil Band, Muhammed Razzak, Jannik Kossen, Tim GJ Rudner, and Yarin Gal. Simple factuality probes detect hallucinations in long-form natural language generation. Findings of the Association for Computational Linguistics: EMNLP, pages 16209–16226, 2025

  31. [31]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

  32. [32]

    Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128, 2023

    Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128, 2023

  33. [33]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

  34. [34]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022. 11

  35. [35]

    Large language models must be taught to know what they don’t know.Advances in Neural Information Processing Systems, 37:85932–85972, 2024

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew G Wilson. Large language models must be taught to know what they don’t know.Advances in Neural Information Processing Systems, 37:85932–85972, 2024

  36. [36]

    Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

    Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in llms.arXiv preprint arXiv:2406.15927, 2024

  37. [37]

    Conformal prediction with large language models for multi-choice question answering.arXiv preprint arXiv:2305.18404, 2023

    Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering.arXiv preprint arXiv:2305.18404, 2023

  38. [38]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023

  39. [39]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

  40. [40]

    Selective generation for control- lable language models.Advances in Neural Information Processing Systems, 37:50494–50527, 2024

    Minjae Lee, Kyungmin Kim, Taesoo Kim, and Sangdon Park. Selective generation for control- lable language models.Advances in Neural Information Processing Systems, 37:50494–50527, 2024

  41. [41]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

  42. [42]

    Deep gamblers: Learning to abstain with portfolio theory.Advances in Neural Information Processing Systems, 32, 2019

    Ziyin Liu, Zhikang Wang, Paul Pu Liang, Russ R Salakhutdinov, Louis-Philippe Morency, and Masahito Ueda. Deep gamblers: Learning to abstain with portfolio theory.Advances in Neural Information Processing Systems, 32, 2019

  43. [43]

    Predict responsibly: Improving fairness and accuracy by learning to defer, 2018

    David Madras, Toniann Pitassi, and Richard Zemel. Predict responsibly: Improving fairness and accuracy by learning to defer, 2018

  44. [44]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 9004–9017, 2023

  45. [45]

    Reducing conversational agents’ overconfidence through linguistic calibration.Transactions of the Association for Computational Linguistics, 10:857–872, 2022

    Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration.Transactions of the Association for Computational Linguistics, 10:857–872, 2022

  46. [46]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332, 2021

  47. [47]

    2 olmo 2 furious, 2025

    Team OLMo et al. 2 olmo 2 furious, 2025

  48. [48]

    GPT-5 system card

    OpenAI. GPT-5 system card. Technical report, OpenAI, 8 2025

  49. [49]

    Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. InConference on health, inference, and learning, pages 248–260. PMLR, 2022

  50. [50]

    Conformal language modeling.arXiv preprint arXiv:2306.10193, 2023

    Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S Jaakkola, and Regina Barzilay. Conformal language modeling.arXiv preprint arXiv:2306.10193, 2023

  51. [51]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020. 12

  52. [52]

    Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

  53. [53]

    A tutorial on conformal prediction.Journal of machine learning research, 9(3), 2008

    Glenn Shafer and Vladimir V ovk. A tutorial on conformal prediction.Journal of machine learning research, 9(3), 2008

  54. [54]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  55. [55]

    Cost-aware contrastive routing for LLMs

    Reza Shirkavand, Shangqian Gao, Peiran Yu, and Heng Huang. Cost-aware contrastive routing for LLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  56. [56]

    The curious case of hallucinatory (un) answerability: Finding truths in the hidden states of over-confident large language models

    Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, and Shauli Ravfogel. The curious case of hallucinatory (un) answerability: Finding truths in the hidden states of over-confident large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3607–3625, 2023

  57. [57]

    Api is enough: Conformal prediction for large language models without logit-access

    Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. Api is enough: Conformal prediction for large language models without logit-access. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 979–995, 2024

  58. [58]

    arXiv:2505.02151

    Fengfei Sun, Ningke Li, Kailong Wang, and Lorenz Goette. Large language models are overconfident and amplify human bias.arXiv preprint arXiv:2505.02151, 2025

  59. [59]

    Gemma 3 technical report, 2025

    Gemma Team et al. Gemma 3 technical report, 2025

  60. [60]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages ...

  61. [61]

    Aime problem set 1983-2024, 2023

    Hemish Veeraboina. Aime problem set 1983-2024, 2023

  62. [62]

    Springer, 2005

    Vladimir V ovk, Alexander Gammerman, and Glenn Shafer.Algorithmic learning in a random world. Springer, 2005

  63. [63]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

  64. [64]

    Know your limits: A survey of abstention in large language models.Transactions of the Association for Computational Linguistics, 13:529–556, 2025

    Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. Know your limits: A survey of abstention in large language models.Transactions of the Association for Computational Linguistics, 13:529–556, 2025

  65. [65]

    Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36:69798–69818, 2023

    Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36:69798–69818, 2023

  66. [66]

    The art of abstention: Selective prediction and error regularization for natural language processing

    Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. The art of abstention: Selective prediction and error regularization for natural language processing. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1040–1051, 2021

  67. [67]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.arXiv preprint arXiv:2306.13063, 2023. 13

  68. [68]

    On the Tool Manipulation Capability of Open -source Large Language Models , May 2023

    Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the tool manipulation capability of open-source large language models.arXiv preprint arXiv:2305.16504, 2023

  69. [69]

    Qwen3 technical report, 2025

    An Yang et al. Qwen3 technical report, 2025

  70. [70]

    On Verbalized Confidence Scores for LLMs

    Daniel Yang, Yao-Hung Hubert Tsai, and Makoto Yamada. On verbalized confidence scores for llms.arXiv preprint arXiv:2412.14737, 2024

  71. [71]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  72. [72]

    Do large language models know what they don’t know? InFindings of the association for Computational Linguistics: ACL 2023, pages 8653–8665, 2023

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuan-Jing Huang. Do large language models know what they don’t know? InFindings of the association for Computational Linguistics: ACL 2023, pages 8653–8665, 2023

  73. [73]

    Dapo: An open-source llm reinforcement learning system at scale, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  74. [74]

    learning with rejection

    Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘i don’t know’. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7113...