Partial Evidence Bench is a deterministic benchmark that measures agent correctness, completeness awareness, gap-report quality, and unsafe overclaiming in authorization-constrained evidence environments.
Do LLMs know when to NOT answer? Investigating abstention abilities of large language models. arXiv preprint arXiv:2407.16221
5 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (5)
verdicts: UNVERDICTED (5)
roles: background (1)
polarities: background (1)
representative citing papers
LLM agents overcommit on non-complete tasks at a rate of 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.
Retrieved clauses yield macro-F1 within 0.02 of gold clauses (wide CI) despite 7% rank-1 exact match, indicating exact-match recall underestimates policy utility in this agent benchmark.
citing papers explorer
- BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
  BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
- When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents
  Retrieved clauses yield macro-F1 within 0.02 of gold clauses (wide CI) despite 7% rank-1 exact match, indicating exact-match recall underestimates policy utility in this agent benchmark.