A judge-aware ranking framework for evaluating large language models without ground truth. arXiv preprint arXiv:2601.21817
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: stat.ME (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers: 2
Citing papers
- Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
  Calibrating the full set of LLM judges with labeled data halves calibration error relative to top-5 accuracy selection on RewardBench2 and outperforms it on four benchmarks.
- Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence
  HJA ranking treats consensus ranking, judge sensitivity, and residual disagreement as distinct inferential targets, providing identifiability conditions and an anchored alternating algorithm that yield better recovery and uncertainty calibration than pooled baselines on synthetic and real data.