Pith · machine review for the scientific record

arxiv: 2410.02736 · v2 · submitted 2024-10-03 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 19:56 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM-as-a-Judge · bias quantification · evaluation reliability · automated framework · language models · prejudice detection · CALM

The pith

LLM-as-a-Judge systems carry 12 measurable biases that automated tests can isolate and that persist in specific tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that LLM-as-a-Judge, already used for benchmarks and training rewards, is undermined by 12 distinct biases that reduce its reliability. It introduces the CALM framework, which applies automated, principle-guided modifications to inputs in order to quantify each bias separately across popular language models. Experiments show that while overall performance is strong, certain tasks still display significant biases, implying that LLM-as-a-Judge requires further refinement before it can be trusted without reservation. A sympathetic reader would care because biased judges can distort evaluation scores and training signals throughout AI development pipelines.

Core claim

The paper claims that its CALM framework systematically quantifies 12 potential biases in LLM-as-a-Judge through automated and principle-guided input modifications, with empirical results across multiple models indicating that significant biases persist in certain specific tasks even when overall performance remains commendable.

What carries the argument

The CALM framework, which isolates and measures each of the 12 biases by applying automated principle-guided modifications to evaluation inputs.
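
To make the carrying mechanism concrete, here is a minimal sketch, assuming a pairwise-comparison judge, of how one principle-guided modification might be scored: swap the order of the two answers to probe position bias and measure how often the verdict flips. The `judge` callable and the flip-rate metric are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of scoring one bias (position bias) via a principle-guided
# modification: swap answer order and count verdict flips. The judge()
# callable and the flip-rate metric are assumptions for illustration.
from typing import Callable, List, Tuple

def position_bias_score(
    judge: Callable[[str, str, str], str],  # judge(question, first, second) -> "A" or "B"
    pairs: List[Tuple[str, str, str]],      # (question, answer_a, answer_b)
) -> float:
    """Fraction of pairs whose verdict flips when the answer order is swapped.

    0.0 means the judge is order-invariant; higher values indicate stronger
    position bias.
    """
    flips = 0
    for question, ans_a, ans_b in pairs:
        original = judge(question, ans_a, ans_b)   # answer_a shown first
        swapped = judge(question, ans_b, ans_a)    # answer_b shown first
        # A consistent judge picks the same underlying answer in both orders:
        # "A" in the original run corresponds to "B" after the swap.
        consistent = (original, swapped) in {("A", "B"), ("B", "A")}
        flips += 0 if consistent else 1
    return flips / max(len(pairs), 1)
```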

Load-bearing premise

That automated principle-guided modifications can cleanly isolate each bias without introducing new confounding effects or missing interactions between biases.

What would settle it

If repeating the CALM measurements on the same model and inputs, but with a different set of guiding principles, yielded substantially different bias scores, that would indicate the isolation procedure does not reliably separate the biases.
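
A hedged sketch of that test, under the assumption that a bias measurement can be parameterized by the wording of its guiding principle: run the same measurement twice with alternative phrasings and compare the scores. The `measure_bias` function, the principle strings, and the tolerance are hypothetical placeholders, not part of CALM as published.

```python
# Hypothetical robustness check: the same bias measurement under two
# alternative phrasings of the guiding principle. measure_bias() and the
# tolerance threshold are placeholders, not CALM's published interface.
from typing import Callable, Sequence

def principle_sensitivity(
    measure_bias: Callable[..., float],  # measure_bias(model, inputs, principle=...) -> score
    model: object,
    inputs: Sequence,
    principle_a: str,
    principle_b: str,
    tolerance: float = 0.05,
) -> bool:
    """True if two principle phrasings yield similar bias scores.

    A large gap would suggest the isolation procedure tracks the wording of
    the principle rather than the bias itself.
    """
    score_a = measure_bias(model, inputs, principle=principle_a)
    score_b = measure_bias(model, inputs, principle=principle_b)
    return abs(score_a - score_b) <= tolerance
```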

read the original abstract

LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and served as supervised rewards in model training. However, despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework-CALM-which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge by using automated and principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper identifies 12 key biases in LLM-as-a-Judge and proposes CALM, an automated bias quantification framework that uses principle-guided modifications to measure each bias. Experiments across multiple popular LLMs show that advanced models achieve commendable overall performance yet retain significant biases on specific tasks. The paper concludes there remains room for improvement in reliability and offers suggestions for cautious application.

Significance. If the modifications in CALM can be shown to isolate individual biases without confounding, the work would be significant for the many benchmarks and training pipelines that rely on LLM judges, by supplying a systematic diagnostic that could guide mitigation and increase trust in automated evaluation.

major comments (1)
  1. [CALM framework and experimental results] The central claim that CALM quantifies 12 distinct biases rests on the assumption that each automated principle-guided modification affects only its target bias dimension. No ablation on modification prompts, cross-bias correlation analysis, or human validation of isolated effects is described, so interactions or model-induced artifacts cannot be ruled out; this directly undermines interpretability of the per-bias scores and the headline conclusion that biases persist in specific tasks.
minor comments (1)
  1. [Abstract and Experiments] The abstract states results cover 'multiple popular language models' but provides no model names, sizes, or prompting details; these should be listed explicitly in the experimental setup section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the isolation of biases within the CALM framework. We respond to the major comment below.

read point-by-point responses
  1. Referee: The central claim that CALM quantifies 12 distinct biases rests on the assumption that each automated principle-guided modification affects only its target bias dimension. No ablation on modification prompts, cross-bias correlation analysis, or human validation of isolated effects is described, so interactions or model-induced artifacts cannot be ruled out; this directly undermines interpretability of the per-bias scores and the headline conclusion that biases persist in specific tasks.

    Authors: We agree that empirical confirmation of isolated effects is essential for the interpretability of the per-bias scores. Each modification in CALM is constructed from explicit, bias-specific principles that alter only the targeted dimension (e.g., swapping option order for positional bias while holding content fixed). This principle-guided design aims to minimize confounding by construction. Nevertheless, the current manuscript does not include ablations on the modification prompts, cross-bias correlation matrices, or human validation of the isolated effects. To address this directly, we will add (i) an ablation study varying prompt phrasing for a subset of biases, (ii) pairwise correlation analysis across all 12 bias scores, and (iii) a small-scale human study verifying that the modifications produce the intended isolated changes. These additions will appear in the revised manuscript and appendix. revision: yes
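
As a sketch of what the promised cross-bias correlation analysis could look like, assuming per-sample scores are available for each of the 12 biases: compute the pairwise correlation matrix and flag bias pairs whose scores move together. The score array, bias names, and the 0.5 threshold are assumptions for illustration, not figures from the paper.

```python
# Sketch of a cross-bias correlation analysis: given per-sample scores for
# each of the 12 biases, compute the pairwise Pearson correlation matrix and
# flag pairs that move together. All inputs here are illustrative assumptions.
import numpy as np

def cross_bias_correlation(scores: np.ndarray) -> np.ndarray:
    """scores: shape (n_samples, 12), one column per bias score."""
    return np.corrcoef(scores, rowvar=False)  # shape (12, 12)

def flagged_pairs(corr: np.ndarray, names: list, threshold: float = 0.5):
    """Bias pairs whose absolute correlation exceeds the threshold."""
    return [
        (names[i], names[j], float(corr[i, j]))
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if abs(corr[i, j]) > threshold
    ]
```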

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical study proposing the CALM framework to quantify 12 biases in LLM-as-a-Judge via automated principle-guided modifications. No equations, derivations, or self-citations appear in the provided abstract or context that reduce any central claim to its own inputs by construction. The methodology and results are presented as independent experimental outputs across multiple models, with no fitted parameters renamed as predictions or uniqueness theorems imported from prior self-work. This matches the default expectation for non-circular empirical papers; the reader's score of 2.0 and skeptic concerns address potential confounding in bias isolation (a validity issue) rather than any load-bearing step that collapses to self-definition or self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that biases can be isolated through automated modifications and that the 12 listed biases are the key ones to measure.

axioms (1)
  • domain assumption Biases in LLM-as-a-Judge can be isolated and quantified via automated principle-guided input modifications
    This is the core premise of the CALM framework described in the abstract.
invented entities (1)
  • CALM framework · no independent evidence
    purpose: Automated bias quantification for LLM-as-a-Judge
    New method introduced by the paper; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5510 in / 1156 out tokens · 43673 ms · 2026-05-15T19:56:39.492700+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  2. How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

    cs.AI 2026-04 unverdicted novelty 7.0

    A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

  3. Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

    cs.CL 2026-03 conditional novelty 7.0

    Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.

  4. Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    A new MTMM-geometric framework unifies LLM evaluation metrics into three latent dimensions to separate method variance from true capabilities.

  5. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

  6. Optimal Transport for LLM Reward Modeling from Noisy Preference

    cs.LG 2026-05 unverdicted novelty 6.0

    SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...

  7. When AI reviews science: Can we trust the referee?

    cs.AI 2026-04 unverdicted novelty 6.0

    AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...

  8. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

    cs.AI 2026-04 unverdicted novelty 6.0

    AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 pe...

  9. Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

    cs.SE 2026-04 unverdicted novelty 6.0

    LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.

  10. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

    cs.AI 2026-04 unverdicted novelty 6.0

    Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

  11. Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

    cs.CY 2026-04 unverdicted novelty 6.0

    Community members from the UK blind community, Kerala, and Tamil Nadu helped define what counts as culturally appropriate depictions of artifacts, and the authors tested whether those definitions can be turned into re...

  12. Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation

    cs.CL 2026-05 unverdicted novelty 5.0

    A systematization of knowledge unifies nine LLM metrics into three orthogonal latent dimensions via an MTMM-geometric framework to improve construct validity in evaluation.

  13. TRUST: A Framework for Decentralized AI Service v.0.1

    cs.AI 2026-04 unverdicted novelty 5.0

    TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...

  14. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  15. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

  16. Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

    cs.AI 2026-05 unverdicted novelty 4.0

    Reliable AI needs structured Knowledge Objects to externalize and enable human validation of implicit knowledge that current methods cannot verify.

  17. Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

    cs.SE 2026-04 unverdicted novelty 4.0

    LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.

  18. LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection

    cs.AI 2026-04 unverdicted novelty 4.0

    An LLM produces consistent categorical judgments and appropriate confidence declines when evaluating powerline segmentation quality under controlled visual degradations, suggesting it can serve as a reliable watchdog.

  19. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 18 Pith papers
