A Survey of Confidence Estimation and Calibration in Large Language Models

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, Iryna Gurevych · 2024 · DOI 10.18653/v1/2024.naacl-long.366

13 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLMs

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

PCHI uses a frozen probe to detect likely wrong-but-confident LLM responses and conditionally intervenes on attention heads during confidence generation, converting 82.2% of wrong high-confidence outputs to low confidence while damaging only 5.1% of correct ones, and lowering expected calibration error (ECE) from 21.9% to 9.2%.

Code Is More Than Text: Uncertainty Estimation for Code Generation

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Three code-specific uncertainty axes (lexical, algorithmic, functional) yield an ensemble that raises average AUROC from 0.696 to 0.776 across five code LLMs, with one single-pass signal matching multi-pass baselines at lower cost.

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Introduces the DOSEBENCH benchmark and shows that four LLMs often fail at rolling 24-hour dose calculations and constraint adherence in OTC dosing decisions despite appearing confident.

Quantifying Faithful Confidence Expression in Large Reasoning Models

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

A new framework quantifies faithful confidence expression in large reasoning models by comparing linguistic decisiveness to token probabilities, hidden states, and response consistency, revealing it as a persistent challenge.

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

LLMs correct only 34.8% of zero-shot annotation errors via prompting, and Definition-Specific Familiarity correlates positively with performance (partial r = +0.41) while memorization metrics do not.

Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

BAG prompts LLMs to reason over K sampled responses for strategy selection in multi-turn ambiguous QA, improving accuracy and faithfulness to uncertainty over baselines across six models.

Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models

cs.CL · 2025-02-20 · unverdicted · novelty 6.0

Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.

An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

cs.AI · 2026-04-27 · unverdicted · novelty 5.0

A thermodynamic-inspired information-geometric framework defines a composite LLM stability score that outperforms a utility-entropy baseline by 0.0299 on average across 80 observations, with gains increasing at higher entropy.

Efficient Test-Time Scaling via Temporal Reasoning Aggregation

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.

Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

cs.CY · 2026-03-29 · unverdicted · novelty 4.0

Verbalized confidence from small LMs enables cost-effective cascade routing for automated educational scoring, matching large-model accuracy at 76% lower cost when discrimination is strong.

Improving the Distributional Alignment of LLMs using Supervision

cs.CL · 2025-07-01 · unverdicted · novelty 4.0

Simple supervision improves LLM distributional alignment with diverse population groups on three datasets, with evaluation across multiple models and prompts providing a benchmark.

ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

cs.AI · 2026-05-19 · 2 refs

When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

cs.CR · 2026-05-01
