Quantifying and Mitigating Self-Preference Bias of LLM Judges
Pith reviewed 2026-05-15 06:54 UTC · model grok-4.3
The pith
LLM judges show self-preference bias uncorrelated with capability, but a multi-dimensional strategy reduces it by 31.5 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-preference bias is a directional deviation where LLMs favor their own generated responses during evaluation. The paper's fully automated framework constructs equal-quality response pairs to quantify bias propensity separately from discriminability without human gold standards. Empirical results across 20 mainstream LLMs show that advanced capabilities are often uncorrelated or negatively correlated with low bias. A structured multi-dimensional evaluation strategy grounded in cognitive load decomposition mitigates this bias by 31.5 percent on average.
What carries the argument
Equal-quality response pairs that statistically disentangle bias propensity from genuine discriminability, combined with the cognitive load decomposition that structures evaluations into multiple independent dimensions.
If this is right
- Bias in LLM-as-a-Judge systems can be quantified at scale without human annotations.
- Mitigation via multi-dimensional evaluation can improve trustworthiness in model alignment and leaderboard construction.
- Advanced model capabilities alone do not solve evaluative fairness.
- Real-world quality control systems using LLM judges can apply the strategy to reduce systematic favoritism.
Where Pith is reading between the lines
- Training methods that improve capability may need separate adjustments to avoid reinforcing self-referential judgments.
- The framework could extend to measuring other directional biases in automated evaluation pipelines.
- Deployments in content moderation or feedback loops might gain reliability by adopting the cognitive load approach.
Load-bearing premise
The constructed pairs of responses have truly negligible quality differences so that any preference can be attributed to bias rather than actual differences.
What would settle it
Human raters detect consistent quality differences in the constructed pairs, or the proposed mitigation strategy shows no reduction in measured bias when true preferences are independently verified.
Figures
read the original abstract
LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework to quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5\% on average.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a fully automated framework for quantifying and mitigating Self-Preference Bias (SPB) in LLM-as-a-Judge evaluations. It constructs equal-quality response pairs with negligible quality differences to statistically disentangle bias propensity from discriminability without human gold standards. Through analysis of 20 mainstream LLMs, it finds that advanced capabilities are often uncorrelated or negatively correlated with low SPB. It proposes a structured multi-dimensional evaluation strategy based on cognitive load decomposition that reduces SPB by 31.5% on average.
Significance. If the automated pair construction is valid, this provides a scalable annotation-free method to measure and mitigate bias in LLM judges, which has broad implications for model alignment and leaderboard construction. The empirical results across 20 models are a strength, as is the attempt at a parameter-free statistical separation.
major comments (2)
- [§3] §3: The automated procedure for constructing equal-quality pairs lacks any independent verification (e.g., human ratings or cross-model consistency checks) that residual quality differences are negligible. This assumption is load-bearing for the statistical disentanglement of SPB from discriminability and for the reported correlations across models.
- [Abstract and results section] Abstract and results section: The 31.5% average reduction is presented without details on the exact statistical tests, confidence intervals, or controls for pair construction variability, making it impossible to assess whether the mitigation gain is robust.
minor comments (2)
- [Early sections] The formal definition of SPB would benefit from an explicit equation in the early sections to clarify how bias propensity is isolated from discriminability.
- [Figures and tables] Figure captions and table legends should explicitly state the number of pairs per model and any filtering criteria applied.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our automated framework. We address each major comment point by point below.
read point-by-point responses
-
Referee: [§3] §3: The automated procedure for constructing equal-quality pairs lacks any independent verification (e.g., human ratings or cross-model consistency checks) that residual quality differences are negligible. This assumption is load-bearing for the statistical disentanglement of SPB from discriminability and for the reported correlations across models.
Authors: We acknowledge the centrality of this assumption but maintain that the framework's design enables statistical disentanglement without external verification. Equal-quality pairs are generated from identical prompts under tightly controlled sampling parameters, ensuring negligible quality variance by construction; the statistical separation then isolates SPB as the directional preference observed in these pairs while discriminability is measured on deliberately unequal pairs. Human ratings would undermine the automated, annotation-free goal. In revision we will add cross-model consistency checks (e.g., verifying that quality orderings remain stable when pairs are re-evaluated by held-out models) and report the resulting agreement statistics in §3. revision: partial
-
Referee: [Abstract and results section] Abstract and results section: The 31.5% average reduction is presented without details on the exact statistical tests, confidence intervals, or controls for pair construction variability, making it impossible to assess whether the mitigation gain is robust.
Authors: We agree that additional statistical rigor is required. In the revised manuscript we will specify the exact tests (paired t-tests with Bonferroni correction across the 20 models), report 95% confidence intervals around the 31.5% mean reduction, and include controls for pair-construction variability via bootstrap resampling of the pair-generation process and sensitivity analysis over prompt templates. These details will appear in the results section and be summarized in the abstract. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical framework for constructing equal-quality response pairs and a cognitive-load-based mitigation strategy, with the 31.5% SPB reduction reported as an observed experimental outcome across 20 LLMs rather than a quantity derived by construction from fitted parameters or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that reduce the central claims to inputs by tautology; the pair-construction procedure is a methodological choice whose validity is external to the reported statistics, and no self-citation load-bearing steps appear in the provided text. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Equal-quality response pairs can be constructed with negligible quality differences
invented entities (1)
-
Self-Preference Bias (SPB)
no independent evidence
Forward citations
Cited by 2 Pith papers
-
Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill
Controlled ablation finds Popperian code-generation skill adds no separable correctness benefit over labels-only scaffold; gains track structure not content.
-
Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion
Corpus-Grounded Feature Diffusion generates synthetic Traditional Chinese IEP training data from 25 seeds to fine-tune a 7B model that reaches BERTScore F1 0.779 on a 10-sample hold-out, beating several larger zero-sh...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.