Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty
Pith reviewed 2026-05-10 05:39 UTC · model grok-4.3
The pith
Even top LLMs cannot reliably distinguish ambiguous inputs from their own knowledge limits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current frontier LLMs do not reliably attribute uncertainty to either data ambiguity or model limitations, as measured by explicit classification performance on UA-Bench; high answer accuracy does not necessarily imply strong uncertainty attribution ability.
What carries the argument
UA-Bench, a benchmark of over 3,500 questions drawn from six datasets, which requires models to explicitly classify each instance of uncertainty as either data uncertainty (ambiguous input) or model uncertainty (capability limitation).
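As a concrete picture of what "explicit classification" means here, a minimal scoring loop might look like the sketch below; the prompt wording, label names, and the `query_model` helper are assumptions for illustration, not the paper's released harness.

```python
# Hypothetical sketch of UA-Bench-style attribution scoring; the prompt
# wording, label names, and `query_model` helper are illustrative
# assumptions, not the authors' released code.
from collections import Counter

LABELS = {"data", "model"}  # data uncertainty vs. model uncertainty

PROMPT = (
    "Decide why the following question is uncertain and reply with one word:\n"
    "'data' if the question itself is ambiguous or underspecified,\n"
    "'model' if the question is clear but may exceed your capability.\n\n"
    "Question: {question}\nLabel:"
)

def attribution_accuracy(items, query_model):
    """items: list of {'question': str, 'gold': 'data' | 'model'};
    query_model: callable mapping a prompt string to a completion string."""
    correct, confusion = 0, Counter()
    for item in items:
        reply = query_model(PROMPT.format(question=item["question"])).strip().lower()
        pred = reply.split()[0].strip("'\".,") if reply else "invalid"
        if pred not in LABELS:
            pred = "invalid"  # malformed replies count as attribution errors
        confusion[(item["gold"], pred)] += 1
        correct += int(pred == item["gold"])
    return correct / len(items), confusion
```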
If this is right
- Generic refusal phrases are insufficient because they do not trigger the right downstream action.
- Targeted training can raise uncertainty attribution without reducing answer accuracy.
- Reliable tool use or clarification requests depend on accurate uncertainty type identification.
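The downstream-action point can be made concrete with a small dispatch sketch; the action names and the confidence gate are hypothetical design choices, not a pipeline the paper prescribes.

```python
# Illustrative router from attribution label to downstream action.
# Action names and the confidence threshold are hypothetical.
def route(label: str, answer_confidence: float, threshold: float = 0.8) -> str:
    if answer_confidence >= threshold:
        return "answer_directly"
    if label == "data":   # ambiguous input: the user can resolve it
        return "ask_clarifying_question"
    if label == "model":  # capability gap: external knowledge can resolve it
        return "invoke_retrieval_or_tool"
    return "abstain"      # generic fallback when attribution itself fails
```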
Where Pith is reading between the lines
- The same attribution gap may appear in other self-evaluation settings such as confidence calibration or error detection.
- Applying the synthesis-plus-RL recipe to larger models or additional tasks could produce similar gains.
- Better attribution would allow safer agent designs that decide when to query humans versus external sources.
Load-bearing premise
The benchmark questions and their labels correctly isolate cases of input ambiguity from cases of missing model capability without systematic construction bias.
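One way this premise could be audited, echoing the referee's call below for human-expert validation, is agreement between independent annotators on the data-versus-model labels; a minimal Cohen's kappa sketch (not an analysis the page reports):

```python
# Hypothetical label-reliability audit: Cohen's kappa between two
# independent annotators over the same UA-Bench items.
from sklearn.metrics import cohen_kappa_score

def label_agreement(annotator_a, annotator_b):
    """annotator_a, annotator_b: parallel lists of 'data'/'model' labels.
    Kappa near 1 means the labels are reproducible; kappa near 0 means
    they are little better than chance, undermining the premise."""
    return cohen_kappa_score(annotator_a, annotator_b)
```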
What would settle it
A replication in which the same 18 models achieve classification accuracy well above chance when labeling uncertainty types on UA-Bench would undermine the central finding.
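"Well above chance" is directly checkable: for a two-way labeling task, a one-sided binomial test against the chance baseline suffices. A sketch assuming a 50% baseline (UA-Bench's actual class balance may shift this):

```python
# One-sided binomial test for above-chance attribution accuracy.
# p0 = 0.5 assumes balanced classes; adjust to the benchmark's actual
# label distribution.
from scipy.stats import binomtest

def above_chance(n_correct: int, n_items: int, p0: float = 0.5, alpha: float = 0.01):
    result = binomtest(n_correct, n_items, p=p0, alternative="greater")
    return result.pvalue < alpha, result.pvalue

# e.g. 1,900 correct labels out of 3,500 items:
significant, p_value = above_chance(1900, 3500)
```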
read the original abstract
Reliable Large Language Models (LLMs) should abstain when confidence is insufficient. However, prior studies often treat refusal as a generic "I don't know", failing to distinguish input-level ambiguity (data uncertainty) from capability limitations (model uncertainty). This lack of distinction limits downstream action decisions like requesting clarification or invoking external tools. In this work, we introduce UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge-intensive and reasoning-intensive tasks, designed to evaluate explicit uncertainty attribution. An evaluation of 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between data uncertainty and model uncertainty, and that high answer accuracy does not necessarily imply strong uncertainty attribution ability. To narrow this gap, we propose a lightweight data synthesis and reinforcement learning strategy. Experiments on both Qwen3-4B-Instruct-2507 and Qwen3-8B in thinking mode show that the proposed method improves uncertainty attribution while preserving answer accuracy. Our code and data are publicly available now.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge- and reasoning-intensive tasks, to evaluate LLMs' explicit attribution of uncertainty to either input ambiguity (data uncertainty) or capability limitations (model uncertainty). Evaluation across 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between these uncertainty types and that high answer accuracy does not necessarily imply strong attribution ability. The authors further propose a lightweight data synthesis plus reinforcement learning strategy that improves attribution performance on Qwen3-4B-Instruct-2507 and Qwen3-8B (thinking mode) while preserving answer accuracy, with code and data released publicly.
Significance. If the results hold, the work is significant because distinguishing data from model uncertainty enables more precise downstream actions (clarification requests versus tool invocation), addressing a practical limitation in reliable LLM deployment. The multi-model empirical evaluation and the proposed RL improvement provide concrete evidence of both the gap and a mitigation path. Public release of code and data is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- §3 (UA-Bench construction and labeling): Attribution of questions to 'model uncertainty' assumes that each of the 18 LLMs possesses the requisite capability for the input while the question remains unambiguous; however, the manuscript provides no independent verification of these capability boundaries (e.g., via training-data inspection, capability probes, or human-expert validation). This assumption is load-bearing for the central claim that models fail to discriminate, because systematic mislabeling would confound attribution failures with benchmark errors.
- §4.2 (RL experiments on Qwen3 variants): The reported gains in uncertainty attribution lack accompanying statistical significance tests, confidence intervals, or ablation controls for the data-synthesis step. Without these, it is unclear whether the observed improvements are robust or could be explained by variance in the evaluation set.
minor comments (2)
- §2 (Related work): The discussion of prior uncertainty benchmarks could more explicitly contrast UA-Bench's explicit attribution requirement with earlier refusal-only evaluations.
- Figure 1 and Table 1: The uncertainty-type taxonomy diagram and dataset statistics table would benefit from an additional column or annotation clarifying how each source dataset contributes to data- versus model-uncertainty labels.
Simulated Author's Rebuttal
Thank you for your thorough and constructive review. We appreciate the opportunity to clarify key aspects of our work and strengthen the manuscript. We address each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
Referee: §3 (UA-Bench construction and labeling): Attribution of questions to 'model uncertainty' assumes that each of the 18 LLMs possesses the requisite capability for the input while the question remains unambiguous; however, the manuscript provides no independent verification of these capability boundaries (e.g., via training-data inspection, capability probes, or human-expert validation). This assumption is load-bearing for the central claim that models fail to discriminate, because systematic mislabeling would confound attribution failures with benchmark errors.
Authors: We thank the referee for this important observation. UA-Bench labels 'model uncertainty' questions by selecting items from established datasets (e.g., GSM8K, TriviaQA) whose ground-truth answers are unambiguous and fall within the knowledge and reasoning scope of frontier LLMs; 'data uncertainty' items are drawn from sources containing inherent ambiguity. While we did not conduct per-model training-data audits or additional capability probes across all 18 models, preliminary accuracy checks on the selected questions showed that frontier models achieve non-trivial performance, supporting the view that the items are generally within capability. In the revised manuscript we will (1) expand the description of the labeling protocol with explicit selection criteria, (2) report per-model accuracy on UA-Bench to empirically corroborate the capability assumptions, and (3) add a limitations paragraph discussing the absence of exhaustive per-model verification. These additions will make the assumption more transparent without altering the core empirical finding that attribution failures are widespread.
revision: partial
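The per-model accuracy check the rebuttal describes could take roughly this form; `answer_and_grade` is an assumed helper (one LLM call plus exact-match grading), not released code:

```python
# Sketch of the capability sanity check: items labeled 'model
# uncertainty' should be answerable in principle, so per-model answer
# accuracy on them should be non-trivial. `answer_and_grade` is assumed.
def capability_check(items, models, answer_and_grade, floor: float = 0.5):
    """Return models whose accuracy on 'model uncertainty' items falls
    below `floor`, flagging where the capability assumption may fail."""
    probe = [it for it in items if it["gold"] == "model"]
    flagged = {}
    for name, model in models.items():
        accuracy = sum(answer_and_grade(model, it) for it in probe) / len(probe)
        if accuracy < floor:
            flagged[name] = accuracy
    return flagged
```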
Referee: §4.2 (RL experiments on Qwen3 variants): The reported gains in uncertainty attribution lack accompanying statistical significance tests, confidence intervals, or ablation controls for the data-synthesis step. Without these, it is unclear whether the observed improvements are robust or could be explained by variance in the evaluation set.
Authors: We agree that additional statistical rigor and ablations are needed. The current results show consistent gains on two Qwen3 variants, but formal tests and controls for the synthesis component were omitted. In the revision we will add (1) bootstrap confidence intervals around all reported metrics, (2) paired statistical significance tests (e.g., McNemar's test) comparing baseline and RL-augmented models, and (3) ablation experiments that separate the contribution of the data-synthesis stage from that of RL fine-tuning alone. These analyses will be included in §4.2 and the appendix to demonstrate robustness.
revision: yes
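The promised analyses are standard and easy to preview; a sketch of McNemar's paired test on per-item correctness and a percentile-bootstrap confidence interval on accuracy (the data layout here is an assumption):

```python
# McNemar's exact test on paired per-item correctness, plus a
# percentile bootstrap CI on attribution accuracy.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_mcnemar(base_correct, rl_correct):
    """base_correct, rl_correct: boolean arrays over the same items."""
    base_correct, rl_correct = np.asarray(base_correct), np.asarray(rl_correct)
    b = int(np.sum(base_correct & ~rl_correct))  # baseline right, RL wrong
    c = int(np.sum(~base_correct & rl_correct))  # baseline wrong, RL right
    # only the discordant off-diagonal cells matter to the test
    return mcnemar([[0, b], [c, 0]], exact=True).pvalue

def bootstrap_ci(correct, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for mean accuracy over boolean `correct`."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    accs = rng.choice(correct, size=(n_boot, len(correct)), replace=True).mean(axis=1)
    return np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```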
Circularity Check
No significant circularity in the empirical benchmark and evaluation
full rationale
The paper constructs UA-Bench by sampling over 3,500 questions from six existing external datasets, performs direct empirical evaluations of 18 frontier LLMs on uncertainty attribution tasks, and describes a separate lightweight synthesis-plus-RL improvement method tested on Qwen models. No load-bearing claims reduce by construction to self-referential fits, renamings, or self-citation chains; results are grounded in observable model outputs on the held-out benchmark rather than definitions or predictions that presuppose the target quantities.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL training hyperparameters
axioms (1)
- Domain assumption: synthetic data and human- or model-generated labels for data versus model uncertainty are sufficiently accurate and unbiased for training and evaluation.