QA-calibration of language model confidence scores. arXiv preprint arXiv:2410.06615.
3 papers cite this work.
Representative citing papers:
- Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
  Calibrating the full set of LLM judges with labeled data halves calibration error relative to top-5 accuracy selection on RewardBench2 and outperforms it on four benchmarks.
- Calibrating Model-Based Evaluation Metrics for Summarization
  A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
- Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
  A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.
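The common thread across these papers is reducing calibration error, i.e. the gap between a model's stated confidence and its empirical accuracy. As a point of reference, a minimal sketch of expected calibration error (ECE), the standard binned metric; the 10-bin equal-width scheme here is a generic illustration, not a detail taken from any of the papers above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |confidence - accuracy| gap over equal-width bins.

    confidences: list of predicted probabilities in [0, 1]
    correct: list of 0/1 outcomes, same length
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lo, hi]; the first bin also includes 0.
        idx = [i for i, c in enumerate(confidences)
               if (c > lo or (b == 0 and c == lo)) and c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        # Each bin contributes its gap, weighted by its share of samples.
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Toy example: high-confidence answers that are right, one low-confidence
# answer that is wrong -- a small residual gap remains in each bin.
print(expected_calibration_error([0.95, 0.95, 0.05], [1, 1, 0]))
```

Calibration methods like those in the cited papers (binned recalibration, judge-ensemble calibration, progressive confidence estimation) aim to drive this quantity toward zero on held-out labeled data.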