Bias and Uncertainty in LLM-as-a-Judge Estimation
Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3
The pith
Sharing calibration across models in LLM-as-a-Judge evaluations can reverse the apparent winner with high apparent confidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-as-a-Judge evaluation that relies on the naive estimator, i.e., averaging raw judge outputs, is systematically biased. Bias-corrected estimators remain unreliable for model comparisons when calibration is shared across models, producing estimates that can point in the wrong direction with high apparent confidence. Analytical bias expressions, simulations over judge quality J and cross-model calibration instability ΔJ, and an MMLU-Pro case study with an observed sign reversal establish this failure mode and motivate reporting J and ΔJ as reliability diagnostics.
What carries the argument
Analytical expressions and simulations for bias and uncertainty in terms of judge quality J and cross-model calibration instability ΔJ, which quantify distortion in shared-calibration comparisons.
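For concreteness, here is a plausible shape for this machinery, assuming a Rogan–Gladen-style misclassification correction (standard in this literature; the review does not reproduce the paper's exact expressions, so read this as a sketch rather than the authors' derivation). With true accuracy $\theta$, judge sensitivity $s_e$, and judge specificity $s_p$:

$$
\mathbb{E}[\hat{\theta}_{\text{naive}}] = \theta\, s_e + (1-\theta)(1-s_p),
\qquad
\hat{\theta}_{\text{corrected}} = \frac{\hat{\theta}_{\text{naive}} + \hat{s}_p - 1}{\hat{s}_e + \hat{s}_p - 1}.
$$

The naive estimator is biased whenever $(s_e, s_p) \neq (1,1)$. A shared-calibration comparison plugs a single estimated pair $(\hat{s}_e, \hat{s}_p)$ into both models' corrections; if the judge's true error rates differ between the two models (nonzero $\Delta J$), the correction is wrong for at least one side, and that error propagates into the estimated difference, which is how its sign can flip.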
If this is right
- The naive estimator from raw judge outputs is systematically biased.
- Corrected estimators require both high judge quality and stable calibration across compared models to avoid distortion.
- Shared calibration, though practical, risks estimates that reverse the actual performance order (see the simulation sketch after this list).
- Reporting J and ΔJ lets users assess when LaaJ results are likely to be invalid.
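To make the reversal risk concrete, here is a minimal simulation sketch. The judge error model, accuracy values, and sample size are illustrative choices, not the paper's settings, and the correction used is the Rogan–Gladen form assumed above:

```python
import numpy as np

rng = np.random.default_rng(0)

def judge_pass_rate(true_acc, sens, spec, n):
    """Observed judge pass rate: correct answers pass with probability
    `sens`; incorrect answers falsely pass with probability 1 - `spec`."""
    correct = rng.random(n) < true_acc
    return np.where(correct, rng.random(n) < sens, rng.random(n) > spec).mean()

def rg_corrected(naive, sens_hat, spec_hat):
    """Rogan-Gladen-style correction with (possibly miscalibrated) judge rates."""
    return (naive + spec_hat - 1) / (sens_hat + spec_hat - 1)

n = 5000
acc_a, acc_b = 0.62, 0.58      # model A is truly better
sens_a, spec_a = 0.80, 0.90    # the judge is harsher on A's outputs ...
sens_b, spec_b = 0.95, 0.90    # ... than on B's (nonzero Delta J)

naive_a = judge_pass_rate(acc_a, sens_a, spec_a, n)
naive_b = judge_pass_rate(acc_b, sens_b, spec_b, n)

# Shared calibration: one (sens, spec) pair for both corrections,
# here matching model B only.
est_a = rg_corrected(naive_a, sens_b, spec_b)
est_b = rg_corrected(naive_b, sens_b, spec_b)

print(f"true gap A-B: {acc_a - acc_b:+.3f}")  # +0.040: A is better
print(f"est. gap A-B: {est_a - est_b:+.3f}")  # about -0.07 in expectation
```

Under these numbers the corrected shared-calibration gap comes out negative even though model A is truly better, which is exactly the reversal described above; calibrating per model would recover the correct sign.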
Where Pith is reading between the lines
- Separate calibration sets per model may be necessary despite higher collection cost to prevent sign reversal.
- The same shared-calibration risk could appear in any pairwise comparison that reuses a single judge calibration set.
- The J and ΔJ diagnostics could be computed automatically in evaluation pipelines to warn users before results are trusted (a minimal sketch follows this list).
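A minimal sketch of such a pipeline check, assuming J is measured as agreement with gold labels on per-model calibration sets; the function names and threshold values are illustrative placeholders, not the paper's definitions:

```python
import numpy as np

def judge_quality(judge_verdicts, gold_labels):
    """J proxy: fraction of calibration items where the judge's 0/1
    verdict matches the gold label."""
    return float(np.mean(np.asarray(judge_verdicts) == np.asarray(gold_labels)))

def laaj_reliability_check(judge_a, gold_a, judge_b, gold_b,
                           j_floor=0.90, delta_j_ceiling=0.05):
    """Warn when a shared-calibration comparison looks risky. The cutoffs
    are placeholders; usable thresholds would come from the paper's
    J / Delta-J sweeps, not from this sketch."""
    j_a = judge_quality(judge_a, gold_a)
    j_b = judge_quality(judge_b, gold_b)
    delta_j = abs(j_a - j_b)
    if min(j_a, j_b) < j_floor or delta_j > delta_j_ceiling:
        print(f"WARNING: J_A={j_a:.3f}, J_B={j_b:.3f}, dJ={delta_j:.3f} "
              "-> corrected comparison may be unreliable")
    return j_a, j_b, delta_j
```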
Load-bearing premise
The analytical bias expressions and simulation results generalize to real LLM judges and evaluation tasks beyond the MMLU-Pro case study examined.
What would settle it
A controlled experiment on additional benchmarks with known true model accuracies in which the shared-calibration corrected estimate matches the true ordering even when measured ΔJ is large.
Original abstract
LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliability depends critically on judge quality and, for model comparisons, on calibration stability. Sharing calibration across compared models is practically attractive but can introduce severe bias, including cases where the comparison estimate points in the wrong direction with high apparent confidence. We study these failure modes through analytical results, simulations over judge quality ($J$) and cross-model calibration instability ($\Delta J$), and a real-data MMLU-Pro case study with sign reversal. We propose $J$ and $\Delta J$ as diagnostics for when corrected estimates, especially shared-calibration comparisons, are likely unreliable, and provide reporting guidance for LaaJ evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes biases in LLM-as-a-Judge (LaaJ) evaluations for model performance assessment. It argues that bias-corrected estimators can still produce severe bias when calibration is shared across compared models, including cases of sign-reversed comparison estimates with high apparent confidence. The authors derive analytical bias expressions, perform simulations sweeping judge quality J and cross-model calibration instability ΔJ, demonstrate sign reversal on a real MMLU-Pro dataset, and propose J and ΔJ as diagnostics with reporting guidance for LaaJ evaluations.
Significance. If the central findings hold, the work is significant for the rapidly growing use of LLM judges in scalable ML evaluation, as it identifies a practical and previously under-emphasized failure mode in calibration sharing that can invert conclusions. Strengths include the closed-form analytical derivations, systematic simulation sweeps, and the real-data existence proof of reversal, which together provide both theoretical insight and a concrete cautionary example. The proposed diagnostics offer a constructive path forward for practitioners.
Major comments (3)
- [§3] §3 (Analytical derivations): The closed-form bias expressions for shared-calibration comparisons rest on specific assumptions about judge error distributions and the parametric form of ΔJ. These assumptions are not shown to be robust to common real-LLM phenomena such as heavy-tailed errors, position biases, or task-dependent calibration shifts, which directly affects whether the quantitative severity predictions generalize beyond the MMLU-Pro example.
- [§5] §5 (MMLU-Pro case study): The sign-reversal demonstration is valuable as an existence proof, yet the section provides limited detail on the number of model pairs, statistical power, or controls for other confounding factors in the judge outputs. This makes it difficult to evaluate how representative the observed bias magnitude and confidence levels are for the broader claim.
- [§4] §4 (Simulations and diagnostics): While the sweeps over J and ΔJ are comprehensive, the manuscript does not fully specify how practitioners would estimate these quantities from real judge outputs on new tasks. Without this mapping, the proposed diagnostics remain difficult to apply, weakening their utility for the recommended reporting guidance.
Minor comments (2)
- [Abstract] Notation for J and ΔJ is introduced clearly in the body but could benefit from a brief intuitive definition when first mentioned in the abstract.
- [Figures in §4] Simulation figure captions should explicitly restate the assumed judge error model and parameter ranges to improve standalone readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below and indicate the revisions we will make to the manuscript.
Point-by-point responses
Referee: [§3] §3 (Analytical derivations): The closed-form bias expressions for shared-calibration comparisons rest on specific assumptions about judge error distributions and the parametric form of ΔJ. These assumptions are not shown to be robust to common real-LLM phenomena such as heavy-tailed errors, position biases, or task-dependent calibration shifts, which directly affects whether the quantitative severity predictions generalize beyond the MMLU-Pro example.
Authors: We agree that the closed-form derivations rely on specific assumptions regarding judge error distributions and the parametric form of ΔJ. The MMLU-Pro case study provides an empirical illustration under real judge outputs that are likely to exhibit some of these phenomena. To address the concern directly, we will add a dedicated limitations subsection discussing the sensitivity of the bias expressions to heavy-tailed errors, position biases, and task-dependent shifts, including qualitative analysis of how violations would affect the sign-reversal predictions. We will also include a small set of additional simulation results under alternative error distributions where feasible within space constraints. This is a partial revision, as the core analytical results remain valid under the stated modeling assumptions. (Revision: partial)
Referee: [§5] §5 (MMLU-Pro case study): The sign-reversal demonstration is valuable as an existence proof, yet the section provides limited detail on the number of model pairs, statistical power, or controls for other confounding factors in the judge outputs. This makes it difficult to evaluate how representative the observed bias magnitude and confidence levels are for the broader claim.
Authors: We thank the referee for highlighting this. In the revised manuscript we will expand §5 to report the exact number of model pairs evaluated, the statistical power calculations or confidence intervals used for the comparisons, and the controls applied for confounding factors such as prompt formatting variations and position biases in the judge outputs. These additions will better situate the observed reversal magnitudes and apparent confidence levels within the broader claim. (Revision: yes)
Referee: [§4] §4 (Simulations and diagnostics): While the sweeps over J and ΔJ are comprehensive, the manuscript does not fully specify how practitioners would estimate these quantities from real judge outputs on new tasks. Without this mapping, the proposed diagnostics remain difficult to apply, weakening their utility for the recommended reporting guidance.
Authors: We acknowledge this practical gap. We will revise the diagnostics section to include explicit, step-by-step procedures for estimating J (judge quality) and ΔJ (cross-model calibration instability) from real judge outputs on new tasks. These will be based on held-out calibration sets or cross-validation approaches using the same judge model, together with guidance on sample sizes needed for stable estimates. This will directly support the recommended reporting practices. (Revision: yes)
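For orientation: if $\hat{J}$ is estimated as a simple agreement rate over $m$ labeled calibration items (an assumption of this sketch, not the authors' stated procedure), the promised sample-size guidance would plausibly follow from the binomial standard error,

$$
\operatorname{se}(\hat{J}) \approx \sqrt{\frac{J(1-J)}{m}},
$$

so resolving a calibration gap of size $\Delta J$ between two models requires on the order of $J(1-J)/(\Delta J)^2$ items per calibration set, times the usual constant for the desired confidence level.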
Circularity Check
No circularity: derivations are independent analytical results under stated assumptions
Full rationale
The paper derives closed-form bias expressions from explicit assumptions on judge output distributions and calibration instability ΔJ, then validates via parameter sweeps on synthetic judges and one external MMLU-Pro case study. No self-definitional steps, no fitted parameters renamed as predictions, and no load-bearing self-citations appear in the provided abstract or derivation outline. The central claims about bias severity and sign reversal are direct consequences of the stated error model rather than reductions to the inputs by construction.