The paper introduces a finite-calibration regime map and Finite-Calibration Panel Selection selector, finding scalar aggregation wins on most real benchmark-budget combinations while joint tables help when interactions are present.
Under review
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
VERDI derives three structural confidence signals from decomposed LLM verification traces and calibrates them with Platt-scaled logistic regression to achieve AUROC 0.72-0.91 on GPT models and 0.56-0.70 on Qwen models where log-probabilities fail.
Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
citing papers explorer
-
A Finite-Calibration Regime Map for LLM Judge Panels
The paper introduces a finite-calibration regime map and Finite-Calibration Panel Selection selector, finding scalar aggregation wins on most real benchmark-budget combinations while joint tables help when interactions are present.