Measuring short-form factuality in large language models
2 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
- years: 2026 (2)
- verdicts: UNVERDICTED (2)
- roles: dataset (1)
- polarities: use dataset (1)
citing papers explorer
- A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
  Sem-ECE is an asymptotically unbiased calibration error estimator for open-ended QA that uses semantic sampling of answers to derive confidence from class frequencies, with two variants that diverge on hard questions.
- Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
  XpertBench provides 1,346 rubric-scored expert tasks showing leading LLMs achieve a maximum ~66% success rate and ~55% mean score across domains.