Do LLMs know when to not answer? Investigating abstention abilities of large language models. arXiv preprint arXiv:2407.16221

Madhusudhan, Nishanth, et al. · 2024 · arXiv 2407.16221

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background (1)

citation-polarity summary: background (1)

representative citing papers

Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems

cs.AI · 2026-05-06 · unverdicted · novelty 7.0

Partial Evidence Bench is a deterministic benchmark that measures agent correctness, completeness awareness, gap-report quality, and unsafe overclaiming in authorization-constrained evidence environments.

Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

Causal Evidence that Language Models use Confidence to Drive Behavior

cs.LG · 2026-03-23 · unverdicted · novelty 6.0

Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

cs.CL · 2026-06-22 · unverdicted · novelty 4.0

Retrieved clauses yield macro-F1 within 0.02 of gold clauses (wide CI) despite 7% rank-1 exact match, indicating exact-match recall underestimates policy utility in this agent benchmark.

citing papers explorer

Showing 2 of 2 citing papers after filters.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · none · ref 35

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

cs.CL · 2026-06-22 · unverdicted · none · ref 19

Retrieved clauses yield macro-F1 within 0.02 of gold clauses (wide CI) despite 7% rank-1 exact match, indicating exact-match recall underestimates policy utility in this agent benchmark.
