A judge-aware ranking framework for evaluating large language models without ground truth. arXiv preprint arXiv:2601.21817

11 Mingyuan Xu, Xinzi Tan, Jiawei Wu, Doudou Zhou · 2026 · arXiv 2601.21817

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

stat.ME · 2026-05-10 · unverdicted · novelty 6.0

Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms it on four benchmarks.

Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence

stat.ME · 2026-05-06 · unverdicted · novelty 6.0

HJA ranking separates consensus ranking, judge sensitivity, and residual disagreement as distinct inferential targets with identifiability conditions and an anchored alternating algorithm, yielding better recovery and uncertainty calibration than pooled baselines on synthetic and real data.

citing papers explorer

Showing 2 of 2 citing papers.

Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges · stat.ME · 2026-05-10 · unverdicted · none · ref 13
Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence · stat.ME · 2026-05-06 · unverdicted · none · ref 17
