Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Lorenz Kuhn, Yarin Gal, Sebastian Farquhar · 2023 · cs.CL · arXiv 2302.09664

Canonical reference. 86% of citing Pith papers cite this work as background.

77 Pith papers citing it

abstract

We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of "semantic equivalence" -- different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy -- an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.
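
The idea in the abstract can be illustrated with a minimal sketch: sample several answers, group those that mean the same thing, and take the entropy over the groups rather than over the raw strings. The abstract does not say how equivalence is decided, so `equivalent` is a placeholder callback, and `toy_equivalent` is a hypothetical stand-in that only ignores case and trailing punctuation. The function names, the greedy clustering, and the normalization over the sampled answers are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_meaning(
    answers: Sequence[str],
    equivalent: Callable[[str, str], bool],
) -> List[List[int]]:
    """Greedily group answer indices whose texts are judged equivalent."""
    clusters: List[List[int]] = []
    for i, answer in enumerate(answers):
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if equivalent(answers[cluster[0]], answer):
                cluster.append(i)
                break
        else:
            # No existing cluster matched, so this answer starts a new one.
            clusters.append([i])
    return clusters


def semantic_entropy(
    answers: Sequence[str],
    log_probs: Sequence[float],
    equivalent: Callable[[str, str], bool],
) -> float:
    """Entropy over meaning clusters (assumption: probabilities are
    renormalized over the sampled answers)."""
    clusters = cluster_by_meaning(answers, equivalent)
    # Subtract the max log-probability before exponentiating for stability.
    max_lp = max(log_probs)
    weights = [math.exp(lp - max_lp) for lp in log_probs]
    total = sum(weights)
    # Probability mass of a cluster is the sum over its member answers.
    masses = [sum(weights[i] for i in c) / total for c in clusters]
    return -sum(p * math.log(p) for p in masses if p > 0)


def toy_equivalent(a: str, b: str) -> bool:
    # Hypothetical stand-in for a real semantic-equivalence check.
    return a.lower().strip(" .") == b.lower().strip(" .")


answers = ["Paris", "paris.", "Rome"]
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]

semantic = semantic_entropy(answers, log_probs, toy_equivalent)
# Treating every distinct string as its own meaning gives the naive entropy.
naive = semantic_entropy(answers, log_probs, lambda a, b: a == b)
print(f"semantic entropy: {semantic:.4f}, naive entropy: {naive:.4f}")
```

In this toy case the two spellings of "Paris" merge into one cluster, so the semantic entropy comes out lower than the entropy over raw strings. A practical equivalence check would need an actual model of meaning in place of `toy_equivalent`.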

citation-role summary

background: 6 · baseline: 1

citation-polarity summary

background: 6 · baseline: 1

representative citing papers

An Empirical Study of Security Calibration in Large Language Models for Code

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

cs.MA · 2026-06-25 · unverdicted · novelty 7.0

Models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via grounded Laplacian, and gives a supermodular greedy rule for corrector placement; experiments on five models confirm dose-delay oscillations.

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

SPOT-E uses entropy shaping on answer predictions with low-entropy anchors to optimize visual spotlights at test time via GRPO for better VLM performance on evidence-intensive tasks.

MortarBench: Evaluating Mortgage Loan Origination Agents

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

MortarBench benchmark shows LLMs achieve ≤77.1% accuracy on loan origination; CRIT calibration raises accuracy to 80.5% and reduces bias.

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

cs.RO · 2026-06-07 · unverdicted · novelty 7.0

ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.

Remember with Confidence: Uncertainty Quantification for Spatio-temporal Memory with Probabilistic Guarantees

cs.CV · 2026-06-06 · unverdicted · novelty 7.0

Introduces object-level semantic uncertainty for VLM memory, the UQ-DAAAM refinement system, and probabilistic guarantees that selected high-quality views reduce uncertainty more effectively.

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.

Before and After Temperature: A Distributional View of Creative LLM Generation

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

A per-token feature from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank at Spearman rho 0.918 vs LLM judges and 0.870 vs humans, outperforming perplexity, entropy, top-1 margin, and compression baselines.

Not All Uncertainty Is Equal: How Uncertainty Granularity Shapes Human Verification in LLM-Assisted Decision Making

cs.HC · 2026-05-27 · unverdicted · novelty 7.0

A between-subjects experiment (N=192) finds that token-level uncertainty increases agreement with LLM answers while relation-level uncertainty reduces external verification in medical decision tasks.

Proper Scoring Rules for Agentic Uncertainty Quantification

cs.AI · 2026-05-23 · unverdicted · novelty 7.0

Introduces Trajectory Proper Score (TPS) as a strictly proper family of trajectory-level scoring rules that elicits the complete prefix-conditioned success probability process.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

Inducing Artificial Uncertainty in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.

Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

cs.CL · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.

Task-Aware Calibration: Provably Optimal Decoding in LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

Active Testing of Large Language Models via Approximate Neyman Allocation

cs.AI · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings versus uniform sampling.

Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-monotonic updates that affect acquisition and regret.

Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

cs.AI · 2026-04-20 · conditional · novelty 7.0

GROVE visualizes distributions of language model generations as overlapping paths through a text graph, with user studies showing that graph summaries aid structural judgments like diversity assessment while raw outputs remain better for details.

Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.

Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

CALIBER elicits and supervises pre-reasoning confidence with prompt-level success probability and post-reasoning confidence with answer-level correctness, cutting ECE by 52.5% on BigMathDigits for a 7B model while remaining competitive on accuracy.

The Score Granularity Gap in Black-Box LLM Classification: A Comparative Study of Confidence Constructions

cs.CL · 2026-06-20 · unverdicted · novelty 6.0

Comparative evaluation of seven confidence constructions across 25 LLM-dataset pairs reveals that verbalized scores provide good ranking but coarse granularity for thresholding, while multi-query aggregation helps weak models but can harm strong ones.

Right Knowledge, Wrong Answer: Test-Time Steering for Temporal Fact Conflicts in Open-Weight Language Models

cs.LG · 2026-06-18 · unverdicted · novelty 6.0

Temporal Attractor Steering resolves 29-57% of parametric temporal conflicts in open-weight LLMs while preserving 85-99% accuracy on non-conflict queries.

citing papers explorer

Showing 15 of 15 citing papers after filters.

MortarBench: Evaluating Mortgage Loan Origination Agents

cs.LG · 2026-06-17 · unverdicted · none · ref 7 · internal anchor

MortarBench benchmark shows LLMs achieve ≤77.1% accuracy on loan origination; CRIT calibration raises accuracy to 80.5% and reduces bias.

Task-Aware Calibration: Provably Optimal Decoding in LLMs

cs.LG · 2026-05-11 · unverdicted · none · ref 24 · internal anchor

Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

cs.LG · 2026-04-21 · unverdicted · none · ref 43 · internal anchor

Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

Right Knowledge, Wrong Answer: Test-Time Steering for Temporal Fact Conflicts in Open-Weight Language Models

cs.LG · 2026-06-18 · unverdicted · none · ref 59 · internal anchor

Temporal Attractor Steering resolves 29-57% of parametric temporal conflicts in open-weight LLMs while preserving 85-99% accuracy on non-conflict queries.

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

cs.LG · 2026-06-02 · unverdicted · none · ref 63 · internal anchor

RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.

Reading Calibrated Uncertainty from Language Model Trajectories

cs.LG · 2026-05-19 · unverdicted · none · ref 14 · internal anchor

Geometric features from per-layer MLP update trajectories fed to a sparse linear probe outperform maximum softmax probability for uncertainty quantification under selective abstention, with gains up to 21 AURC points.

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

cs.LG · 2026-01-29 · unverdicted · none · ref 18 · 2 links · internal anchor

ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.

Neuron-Aware Active Few-Shot Learning for LLMs

cs.LG · 2026-07-02 · unverdicted · none · ref 28 · internal anchor

NeuFS selects active few-shot samples for LLMs by representing samples via neuron activation patterns and applying a dual-criteria strategy of diversity and neuron consensus to identify informative examples.

Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation

cs.LG · 2026-07-02 · unverdicted · none · ref 24 · internal anchor

DALorRA applies variational Bayesian sparse masking to LoRA ranks to calibrate LLM uncertainty while preserving accuracy.

Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

cs.LG · 2026-05-30 · unverdicted · none · ref 13 · internal anchor

Linear probes on mid-layer hidden states in quantized LLMs detect hallucinations at 0.904-1.000 AUROC, exceeding sampling baselines and showing consistent layer bands across model families.

R2V Agent: Teaching SLMs When to Ask for Help

cs.LG · 2026-05-15 · unverdicted · none · ref 6 · internal anchor

R2V-Agent combines an SLM policy trained via BC and DPO with a step-level risk-calibrated router using Brier scores and CVaR to escalate to LLM only on high residual failure risk, improving success-cost tradeoffs on HumanEval+, TextWorld, and TerminalBench.

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

cs.LG · 2026-05-06 · unverdicted · none · ref 15 · internal anchor

UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.

LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy

cs.LG · 2026-05-05 · unverdicted · none · ref 10 · 2 links · internal anchor

ACSE estimates LLM uncertainty via adaptive semantic entropy clustering with conformal prediction guarantees, reporting higher AUROC than token entropy baselines on datasets like TriviaQA.

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

cs.LG · 2026-06-06 · unverdicted · none · ref 25 · internal anchor

ConSteer-RL adds a confidence-aware reward derived from per-token probabilities to GRPO-based RLVR and reports 2.3-4% average gains over baselines across model scales.

Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

cs.LG · 2026-04-13 · unverdicted · none · ref 14 · internal anchor

HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.
