LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.
hub Canonical reference
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
Empowering large language models to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This leads to a growing need to explore the untapped area of black-box approaches for LLM uncertainty estimation. To better break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks-confidence calibration and failure prediction-across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely-used LLMs including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: 1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. 2) As model capability scales up, both calibration and failure prediction performance improve. 3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies can help mitigate this overconfidence from various perspectives. 4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 to 0.605 in AUROC. Despite these advancements, none of these techniques consistently outperform others, and all investigated methods struggle in challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs.
hub tools
citation-role summary
citation-polarity summary
roles
background 7representative citing papers
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
MRI-Eval benchmark shows frontier LLMs scoring 93-97% on MRI MCQs but falling to 37-61% on stem-only questions, with GE scanner operations as the weakest category for all models.
First-token normalized entropy (phi_first) from one greedy decode reaches mean AUROC 0.820 for hallucination detection, matching or exceeding semantic self-consistency (0.793) and surface self-consistency (0.791) across three 7-8B models and two benchmarks.
LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.
MIRROR benchmark shows LLMs universally fail at compositional self-prediction and cannot translate partial self-knowledge into better agentic actions, with external metacognitive control reducing confident failures by ~70-76%.
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
CyberCorrect applies cybernetic control theory to LLM self-correction, reporting 79.8% accuracy on a new 440-task benchmark with 6.2-point gains and 41% less over-correction.
Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.
Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.
A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.
LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from added context.
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.
Twin-Pass Chain-of-Thought Ensembling cuts Expected Calibration Error by up to 88% in Gemma-3 models on TeleQnA, ORANBench, and srsRANBench.
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
CLSGen is a dual-head LLM fine-tuning framework that enables joint probabilistic classification and verbalized explanation generation without catastrophic forgetting of generative capabilities.
Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.
citing papers explorer
-
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
-
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
-
Enhancing Confidence Estimation in Telco LLMs via Twin-Pass CoT-Ensembling
Twin-Pass Chain-of-Thought Ensembling cuts Expected Calibration Error by up to 88% in Gemma-3 models on TeleQnA, ORANBench, and srsRANBench.
-
Calibration-Aware Policy Optimization for Reasoning LLMs
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
-
Causal Evidence that Language Models use Confidence to Drive Behavior
Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.
-
Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.
-
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
- MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination
- Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards