Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty
Pith reviewed 2026-05-10 05:39 UTC · model grok-4.3
The pith
Even top LLMs cannot reliably distinguish ambiguous inputs from their own knowledge limits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current frontier LLMs do not reliably attribute uncertainty to either data ambiguity or model limitations, as measured by explicit classification performance on UA-Bench; high answer accuracy does not necessarily imply strong uncertainty attribution ability.
What carries the argument
UA-Bench, a benchmark of over 3,500 questions drawn from six datasets, which requires models to explicitly classify each instance of uncertainty as either data uncertainty (ambiguous input) or model uncertainty (capability limitation).
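As a concrete picture of what "explicit classification" means here, a minimal scoring loop might look like the sketch below; the prompt wording, label names, and the `query_model` helper are assumptions for illustration, not the paper's released harness.

```python
# Hypothetical sketch of UA-Bench-style attribution scoring; the prompt
# wording, label names, and `query_model` helper are illustrative
# assumptions, not the authors' released code.
from collections import Counter

LABELS = {"data", "model"}  # data uncertainty vs. model uncertainty

PROMPT = (
    "Decide why the following question is uncertain and reply with one word:\n"
    "'data' if the question itself is ambiguous or underspecified,\n"
    "'model' if the question is clear but may exceed your capability.\n\n"
    "Question: {question}\nLabel:"
)

def attribution_accuracy(items, query_model):
    """items: list of {'question': str, 'gold': 'data' | 'model'};
    query_model: callable mapping a prompt string to a completion string."""
    correct, confusion = 0, Counter()
    for item in items:
        reply = query_model(PROMPT.format(question=item["question"])).strip().lower()
        pred = reply.split()[0].strip("'\".,") if reply else "invalid"
        if pred not in LABELS:
            pred = "invalid"  # malformed replies count as attribution errors
        confusion[(item["gold"], pred)] += 1
        correct += int(pred == item["gold"])
    return correct / len(items), confusion
```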
If this is right
- Generic refusal phrases are insufficient because they do not trigger the right downstream action.
- Targeted training can raise uncertainty attribution without reducing answer accuracy.
- Reliable tool use or clarification requests depend on accurate uncertainty type identification.
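The downstream-action point can be made concrete with a small dispatch sketch; the action names and the confidence gate are hypothetical design choices, not a pipeline the paper prescribes.

```python
# Illustrative router from attribution label to downstream action.
# Action names and the confidence threshold are hypothetical.
def route(label: str, answer_confidence: float, threshold: float = 0.8) -> str:
    if answer_confidence >= threshold:
        return "answer_directly"
    if label == "data":   # ambiguous input: the user can resolve it
        return "ask_clarifying_question"
    if label == "model":  # capability gap: external knowledge can resolve it
        return "invoke_retrieval_or_tool"
    return "abstain"      # generic fallback when attribution itself fails
```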
Where Pith is reading between the lines
- The same attribution gap may appear in other self-evaluation settings such as confidence calibration or error detection.
- Applying the synthesis-plus-RL recipe to larger models or additional tasks could produce similar gains.
- Better attribution would allow safer agent designs that decide when to query humans versus external sources.
Load-bearing premise
The benchmark questions and their labels correctly isolate cases of input ambiguity from cases of missing model capability without systematic construction bias.
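One way this premise could be audited, echoing the referee's call below for human-expert validation, is agreement between independent annotators on the data-versus-model labels; a minimal Cohen's kappa sketch (not an analysis the page reports):

```python
# Hypothetical label-reliability audit: Cohen's kappa between two
# independent annotators over the same UA-Bench items.
from sklearn.metrics import cohen_kappa_score

def label_agreement(annotator_a, annotator_b):
    """annotator_a, annotator_b: parallel lists of 'data'/'model' labels.
    Kappa near 1 means the labels are reproducible; kappa near 0 means
    they are little better than chance, undermining the premise."""
    return cohen_kappa_score(annotator_a, annotator_b)
```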
What would settle it
A replication in which the same 18 models achieve classification accuracy well above chance when labeling uncertainty types on UA-Bench would undermine the central finding.
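"Well above chance" is directly checkable: for a two-way labeling task, a one-sided binomial test against the chance baseline suffices. A sketch assuming a 50% baseline (UA-Bench's actual class balance may shift this):

```python
# One-sided binomial test for above-chance attribution accuracy.
# p0 = 0.5 assumes balanced classes; adjust to the benchmark's actual
# label distribution.
from scipy.stats import binomtest

def above_chance(n_correct: int, n_items: int, p0: float = 0.5, alpha: float = 0.01):
    result = binomtest(n_correct, n_items, p=p0, alternative="greater")
    return result.pvalue < alpha, result.pvalue

# e.g. 1,900 correct labels out of 3,500 items:
significant, p_value = above_chance(1900, 3500)
```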
read the original abstract
Reliable Large Language Models (LLMs) should abstain when confidence is insufficient. However, prior studies often treat refusal as a generic "I don't know", failing to distinguish input-level ambiguity (data uncertainty) from capability limitations (model uncertainty). This lack of distinction limits downstream action decisions like requesting clarification or invoking external tools. In this work, we introduce UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge-intensive and reasoning-intensive tasks, designed to evaluate explicit uncertainty attribution. An evaluation of 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between data uncertainty and model uncertainty, and that high answer accuracy does not necessarily imply strong uncertainty attribution ability. To narrow this gap, we propose a lightweight data synthesis and reinforcement learning strategy. Experiments on both Qwen3-4B-Instruct-2507 and Qwen3-8B in thinking mode show that the proposed method improves uncertainty attribution while preserving answer accuracy. Our code and data are publicly available now.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge- and reasoning-intensive tasks, to evaluate LLMs' explicit attribution of uncertainty to either input ambiguity (data uncertainty) or capability limitations (model uncertainty). Evaluation across 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between these uncertainty types and that high answer accuracy does not necessarily imply strong attribution ability. The authors further propose a lightweight data synthesis plus reinforcement learning strategy that improves attribution performance on Qwen3-4B-Instruct-2507 and Qwen3-8B (thinking mode) while preserving answer accuracy, with code and data released publicly.
Significance. If the results hold, the work is significant because distinguishing data from model uncertainty enables more precise downstream actions (clarification requests versus tool invocation), addressing a practical limitation in reliable LLM deployment. The multi-model empirical evaluation and the proposed RL improvement provide concrete evidence of both the gap and a mitigation path. Public release of code and data is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- §3 (UA-Bench construction and labeling): Attribution of questions to 'model uncertainty' assumes that each of the 18 LLMs possesses the requisite capability for the input while the question remains unambiguous; however, the manuscript provides no independent verification of these capability boundaries (e.g., via training-data inspection, capability probes, or human-expert validation). This assumption is load-bearing for the central claim that models fail to discriminate, because systematic mislabeling would confound attribution failures with benchmark errors.
- §4.2 (RL experiments on Qwen3 variants): The reported gains in uncertainty attribution lack accompanying statistical significance tests, confidence intervals, or ablation controls for the data-synthesis step. Without these, it is unclear whether the observed improvements are robust or could be explained by variance in the evaluation set.
minor comments (2)
- §2 (Related work): The discussion of prior uncertainty benchmarks could more explicitly contrast UA-Bench's explicit attribution requirement with earlier refusal-only evaluations.
- Figure 1 and Table 1: The uncertainty-type taxonomy diagram and dataset statistics table would benefit from an additional column or annotation clarifying how each source dataset contributes to data- versus model-uncertainty labels.
Simulated Author's Rebuttal
Thank you for your thorough and constructive review. We appreciate the opportunity to clarify key aspects of our work and strengthen the manuscript. We address each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
Referee: §3 (UA-Bench construction and labeling): Attribution of questions to 'model uncertainty' assumes that each of the 18 LLMs possesses the requisite capability for the input while the question remains unambiguous; however, the manuscript provides no independent verification of these capability boundaries (e.g., via training-data inspection, capability probes, or human-expert validation). This assumption is load-bearing for the central claim that models fail to discriminate, because systematic mislabeling would confound attribution failures with benchmark errors.
Authors: We thank the referee for this important observation. UA-Bench labels 'model uncertainty' questions by selecting items from established datasets (e.g., GSM8K, TriviaQA) whose ground-truth answers are unambiguous and fall within the knowledge and reasoning scope of frontier LLMs; 'data uncertainty' items are drawn from sources containing inherent ambiguity. While we did not conduct per-model training-data audits or additional capability probes across all 18 models, preliminary accuracy checks on the selected questions showed that frontier models achieve non-trivial performance, supporting the view that the items are generally within capability. In the revised manuscript we will (1) expand the description of the labeling protocol with explicit selection criteria, (2) report per-model accuracy on UA-Bench to empirically corroborate the capability assumptions, and (3) add a limitations paragraph discussing the absence of exhaustive per-model verification. These additions will make the assumption more transparent without altering the core empirical finding that attribution failures are widespread.
revision: partial
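The per-model accuracy check the rebuttal describes could take roughly this form; `answer_and_grade` is an assumed helper (one LLM call plus exact-match grading), not released code:

```python
# Sketch of the capability sanity check: items labeled 'model
# uncertainty' should be answerable in principle, so per-model answer
# accuracy on them should be non-trivial. `answer_and_grade` is assumed.
def capability_check(items, models, answer_and_grade, floor: float = 0.5):
    """Return models whose accuracy on 'model uncertainty' items falls
    below `floor`, flagging where the capability assumption may fail."""
    probe = [it for it in items if it["gold"] == "model"]
    flagged = {}
    for name, model in models.items():
        accuracy = sum(answer_and_grade(model, it) for it in probe) / len(probe)
        if accuracy < floor:
            flagged[name] = accuracy
    return flagged
```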
Referee: §4.2 (RL experiments on Qwen3 variants): The reported gains in uncertainty attribution lack accompanying statistical significance tests, confidence intervals, or ablation controls for the data-synthesis step. Without these, it is unclear whether the observed improvements are robust or could be explained by variance in the evaluation set.
Authors: We agree that additional statistical rigor and ablations are needed. The current results show consistent gains on two Qwen3 variants, but formal tests and controls for the synthesis component were omitted. In the revision we will add (1) bootstrap confidence intervals around all reported metrics, (2) paired statistical significance tests (e.g., McNemar's test) comparing baseline and RL-augmented models, and (3) ablation experiments that separate the contribution of the data-synthesis stage from that of RL fine-tuning alone. These analyses will be included in §4.2 and the appendix to demonstrate robustness.
revision: yes
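The promised analyses are standard and easy to preview; a sketch of McNemar's paired test on per-item correctness and a percentile-bootstrap confidence interval on accuracy (the data layout here is an assumption):

```python
# McNemar's exact test on paired per-item correctness, plus a
# percentile bootstrap CI on attribution accuracy.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_mcnemar(base_correct, rl_correct):
    """base_correct, rl_correct: boolean arrays over the same items."""
    base_correct, rl_correct = np.asarray(base_correct), np.asarray(rl_correct)
    b = int(np.sum(base_correct & ~rl_correct))  # baseline right, RL wrong
    c = int(np.sum(~base_correct & rl_correct))  # baseline wrong, RL right
    # only the discordant off-diagonal cells matter to the test
    return mcnemar([[0, b], [c, 0]], exact=True).pvalue

def bootstrap_ci(correct, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for mean accuracy over boolean `correct`."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    accs = rng.choice(correct, size=(n_boot, len(correct)), replace=True).mean(axis=1)
    return np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```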
Circularity Check
No significant circularity in the empirical benchmark and evaluation
full rationale
The paper constructs UA-Bench by sampling over 3,500 questions from six existing external datasets, performs direct empirical evaluations of 18 frontier LLMs on uncertainty attribution tasks, and describes a separate lightweight synthesis-plus-RL improvement method tested on Qwen models. No load-bearing claims reduce by construction to self-referential fits, renamings, or self-citation chains; results are grounded in observable model outputs on the held-out benchmark rather than definitions or predictions that presuppose the target quantities.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL training hyperparameters
axioms (1)
- Domain assumption: synthetic data and human- or model-generated labels for data versus model uncertainty are sufficiently accurate and unbiased for training and evaluation.