Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.
hub Canonical reference
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of "semantic equivalence" -- different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy -- an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via grounded Laplacian, and gives a supermodular greedy rule for corrector placement; experiments on five models confirm dose-delay oscillations.
Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.
A per-token feature from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank at Spearman rho 0.918 vs LLM judges and 0.870 vs humans, outperforming perplexity, entropy, top-1 margin, and compression baselines.
A between-subjects experiment (N=192) finds that token-level uncertainty increases agreement with LLM answers while relation-level uncertainty reduces external verification in medical decision tasks.
Introduces Trajectory Proper Score (TPS) as a strictly proper family of trajectory-level scoring rules that elicits the complete prefix-conditioned success probability process.
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings versus uniform sampling.
LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-monotonic updates that affect acquisition and regret.
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
GROVE visualizes distributions of language model generations as overlapping paths through a text graph, with user studies showing that graph summaries aid structural judgments like diversity assessment while raw outputs remain better for details.
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.
A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.
Comparative evaluation of seven confidence constructions across 25 LLM-dataset pairs reveals that verbalized scores provide good ranking but coarse granularity for thresholding, while multi-query aggregation helps weak models but can harm strong ones.
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.
TriLens detects hallucinations via per-layer entropy trajectories of logit-lens readouts from three internal modules across LLMs and QA benchmarks.
Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.
Geometric features from per-layer MLP update trajectories fed to a sparse linear probe outperform maximum softmax probability for uncertainty quantification under selective abstention, with gains up to 21 AURC points.
BalanceRAG uses sequential graphical testing on a 2D lattice of threshold pairs to certify safe operating points that meet target risk levels in cascaded RAG while increasing coverage.
citing papers explorer
-
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
-
Task-Aware Calibration: Provably Optimal Decoding in LLMs
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
-
Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations
GROVE visualizes distributions of language model generations as overlapping paths through a text graph, with user studies showing that graph summaries aid structural judgments like diversity assessment while raw outputs remain better for details.
-
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
-
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
-
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.