Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.
hub
arXiv preprint arXiv:2305.19187 , year=
27 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
MATU quantifies uncertainty in LLM multi-agent systems by turning reasoning trajectories into embedding matrices, stacking runs into a tensor, and decomposing it to separate sources of variability.
Global calibration metrics like ECE are confounded by accuracy; the proposed ACE framework with three accuracy-controlled views shows many prior calibration advantages weaken or reverse.
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.
Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.
Geometric features from per-layer MLP update trajectories fed to a sparse linear probe outperform maximum softmax probability for uncertainty quantification under selective abstention, with gains up to 21 AURC points.
BalanceRAG uses sequential graphical testing on a 2D lattice of threshold pairs to certify safe operating points that meet target risk levels in cascaded RAG while increasing coverage.
Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.
DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.
LUCID detects hallucinations in LLM-KG reasoning by extracting node/edge features from attention and semantics then integrating them with KG structure in a GNN, achieving SOTA on nine new benchmark datasets versus 15 baselines.
Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
CoUR uses LLMs for efficient RL reward design through uncertainty quantification and similarity selection, achieving better performance and lower evaluation costs on IsaacGym and Bidexterous Manipulation benchmarks.
Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.
ARS shapes reasoning trace representations by clustering states that produce consistent answers and separating those that produce inconsistent ones via latent perturbations, improving plug-and-play hallucination detection without human annotations.
The method aggregates multiple hallucination evaluation scores via conformal p-values to enable calibrated detection with controlled false alarm rates across LLMs and datasets.
A regression model using attention features and recurrent uncertainty scores improves selective generation in LLMs over unsupervised and supervised baselines on ten datasets and three models.
Disentangling input ambiguity from uncertainty quantification improves error prediction for LLMs on QA tasks, yielding over 10 PRR point gains across models and datasets.
HawkesLLM pairs a multivariate Hawkes process with language models to model temporal influence cascades in agentic text simulation and reports improved late-stage semantic alignment on a GDELT news case study under limited memory.
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
RDS computes total L1 distance of N sampled embeddings from their empirical centroid on the unit hypersphere to measure semantic variability, with a probability-weighted variant that outperforms nine baselines on hallucination detection across four QA datasets and four LLMs.
citing papers explorer
No citing papers match the current filters.