SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Ali Emami; Hassan Sajjad; Sher Badshah

arxiv: 2602.13110 · v3 · pith:BV2ATAP7new · submitted 2026-02-13 · 💻 cs.CL · cs.AI

Sher Badshah, Ali Emami, Hassan Sajjad

classification: cs.CL, cs.AI

keywords: scope, pairwise, evaluation, under, alpha, conformal, judging, judgments

Large language models (LLMs) are increasingly used as scalable judges in pairwise evaluation, but they remain prone to miscalibration and biases. We propose SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework that calibrates an acceptance threshold so that, under exchangeability, the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To supply SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions and converts the order-averaged preference probability into an entropy-based score. Across various pairwise judging benchmarks, BPE outperforms standard confidence proxies in calibration and discrimination, while SCOPE consistently satisfies the target risk bound (empirical FDR $\approx 0.097$ to $0.099$ at $\alpha = 0.10$) and retains substantial coverage. Compared to vanilla baselines, SCOPE accepts up to $2.4\times$ more judgments under the same risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
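The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the exact BPE formula nor the calibration rule, so everything below is an assumption made for illustration. In particular, the function names, the use of binary entropy in bits, and the conservative `(errors + 1) / (accepted + 1)` risk estimate are hypothetical choices, and the sketch makes no claim to reproduce the paper's guarantee.

```python
import math


def bidirectional_preference_entropy(p_a_first: float, p_a_second: float) -> float:
    """Entropy of the order-averaged preference probability (assumed form).

    p_a_first:  judge's probability that response A wins when A is shown first.
    p_a_second: judge's probability that response A wins when A is shown second.
    Returns binary entropy in bits: 0 means confident, 1 means maximally unsure.
    """
    p = 0.5 * (p_a_first + p_a_second)  # average over both presentation orders
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a degenerate distribution has zero entropy
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def calibrate_threshold(scores, errors, alpha):
    """Pick an acceptance threshold on a labeled calibration set.

    scores: uncertainty score per calibration item (lower = more confident).
    errors: 1 if the judge's verdict was wrong for that item, else 0.
    alpha:  target error rate among accepted (non-abstained) judgments.

    Returns the largest score t such that the conservative estimate
    (errors among items with score <= t, plus 1) / (accepted count + 1)
    is at most alpha, or None if no threshold qualifies.
    This rule is an illustrative assumption, not the paper's procedure.
    """
    pairs = sorted(zip(scores, errors))
    best = None
    n_err = 0
    for k, (score, err) in enumerate(pairs, start=1):
        n_err += err
        # Evaluate only at the last item of a run of tied scores, so the
        # accepted set {score <= t} is counted in full.
        if k < len(pairs) and pairs[k][0] == score:
            continue
        if (n_err + 1) / (k + 1) <= alpha:
            best = score
    return best


def accept(score, threshold):
    """Accept a judgment only if a threshold exists and the score is within it."""
    return threshold is not None and score <= threshold
```

Under these assumptions, a judge that prefers whichever response is shown first (for example `p_a_first = 0.9`, `p_a_second = 0.1`) averages to 0.5 and receives the maximal entropy of 1 bit, so a purely position-driven preference is scored as uncertain rather than confident.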

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CARE: A Conformal Safety Layer for Medical Summarization
cs.CL · 2026-06 · unverdicted · novelty 6.0

CARE applies conformal risk control to deliver distribution-free guarantees bounding hallucination probability and omission fraction in medical summarization while reducing flagged sentences.