pith. machine review for the scientific record.

arxiv: 2410.21819 · v2 · submitted 2024-10-29 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Self-Preference Bias in LLM-as-a-Judge

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 14:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-preference bias · LLM-as-a-judge · perplexity · automated evaluation · dialogue systems · GPT-4 · bias measurement

The pith

LLMs as judges give higher scores to low-perplexity outputs than human evaluators do, even for non-self-generated text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a quantitative metric to measure self-preference bias in LLM evaluators. Experiments with GPT-4 show it rates its own outputs more favorably than human judges would. The authors link this bias to perplexity, finding that LLMs boost scores for lower-perplexity texts more than humans do, whether or not the text is self-generated. This suggests the bias originates from a preference for familiar, predictable language rather than from self-recognition alone. The result matters because it affects how we trust automated evaluations of dialogue systems.

Core claim

LLMs exhibit self-preference bias by assigning higher evaluations to outputs with lower perplexity than human evaluators, and this pattern holds regardless of whether the outputs were self-generated. The bias therefore arises because LLMs prefer texts that are more familiar to them, as measured by perplexity.

What carries the argument

Quantitative metric for self-preference bias and analysis of its correlation with output perplexity.
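The page does not reproduce the paper's formula (the referee flags this below), so the following is only an illustrative sketch of the two quantities the argument rests on: a self-preference gap measured against human judges, and the score-perplexity relationship. All names here are assumptions, not the authors' notation.

```python
# Illustrative sketch only: the paper's metric is not reproduced on this page,
# so these definitions are assumptions, not the authors' formulas.
import numpy as np
from scipy.stats import pearsonr

def self_preference_gap(judge_self, judge_other, human_self, human_other):
    """Excess preference the judge shows for its own outputs, relative to humans.

    Each argument is an array of scores over the same prompts; "self" means the
    judge model generated the response, "other" means a different model did.
    """
    judge_gap = np.mean(judge_self) - np.mean(judge_other)
    human_gap = np.mean(human_self) - np.mean(human_other)
    return judge_gap - human_gap  # > 0 suggests self-preference bias

def score_perplexity_relation(scores, perplexities):
    """Correlation between scores and log-perplexity; the paper's key comparison
    is whether this is more strongly negative for the LLM judge than for humans."""
    return pearsonr(np.log(perplexities), scores)
```

A positive gap together with a steeper negative score-perplexity relation for the judge than for humans is the pattern the abstract reports.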

If this is right

  • LLM evaluators promote styles and policies intrinsic to the models.
  • Automated evaluation of dialogue systems risks systematic skew toward familiar text.
  • The bias is driven by perplexity preference rather than explicit self-recognition.
  • The new metric enables quantitative tracking of this effect across models and tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Evaluations could be adjusted by normalizing for perplexity to better match human judgments (one possible correction is sketched after this list).
  • The same mechanism may appear in other applications where LLMs assess text quality.
  • Training LLMs on more diverse perplexity levels might lessen the bias in judging.
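As a rough illustration of the first bullet above, one hypothetical correction, not proposed in the paper, is to residualize judge scores against log-perplexity before comparing systems:

```python
# Hypothetical adjustment, not from the paper: strip the linear component of
# judge scores explained by log-perplexity, so remaining differences between
# systems are less driven by "familiarity".
import numpy as np

def perplexity_adjusted_scores(scores, perplexities):
    x = np.log(np.asarray(perplexities, dtype=float))
    y = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # fit score ~ log-perplexity
    residual = y - (slope * x + intercept)   # what perplexity cannot explain
    return residual + y.mean()               # re-centered on the original scale
```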

Load-bearing premise

The introduced metric isolates self-preference bias from confounding factors in LLM judgments.

What would settle it

If LLM and human evaluators raised their scores for low-perplexity outputs to the same degree, the claim that LLMs exhibit a distinct bias tied to perplexity would be falsified.
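One way to operationalize that test, assuming paired LLM-judge and human scores for the same outputs, is to check whether the score-versus-log-perplexity slope differs between the two kinds of evaluator. The setup below is a sketch under those assumptions, not the paper's analysis.

```python
# Sketch of one way to run the falsification test described above, assuming
# paired LLM-judge and human scores for the same outputs; not the paper's code.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def slope_difference_test(perplexities, llm_scores, human_scores):
    """Tests whether the score vs. log-perplexity slope differs between judges.

    A near-zero, non-significant interaction would mean LLM and human
    evaluations respond to perplexity at the same rate, undercutting the claim.
    """
    n = len(perplexities)
    df = pd.DataFrame({
        "log_ppl": np.tile(np.log(perplexities), 2),
        "score": np.concatenate([llm_scores, human_scores]),
        "is_llm": np.repeat([1, 0], n),
    })
    fit = smf.ols("score ~ log_ppl * is_llm", data=df).fit()
    return fit.params["log_ppl:is_llm"], fit.pvalues["log_ppl:is_llm"]
```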

read the original abstract

Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a novel quantitative metric for measuring self-preference bias in LLM-as-a-judge setups for dialogue evaluation. It reports experimental results showing that GPT-4 exhibits significant self-preference bias and hypothesizes that the bias stems from LLMs favoring lower-perplexity (more familiar) outputs. Analysis indicates LLMs assign higher scores to low-perplexity texts than human evaluators do, even for non-self-generated outputs, concluding that perplexity is the essence of the bias.

Significance. If the metric validly isolates self-preference bias and the perplexity correlation proves causal rather than confounded by quality detection differences, the work would supply a useful quantitative tool for diagnosing and addressing biases in automated evaluation, with direct implications for reliable LLM judges. The GPT-4 experiments provide a concrete empirical anchor, but stronger isolation of the familiarity mechanism would be needed to elevate the contribution beyond correlational observation.

major comments (2)
  1. [Experimental Analysis] Experimental Analysis section: the correlation between LLM scores and lower perplexity does not include a controlled experiment holding semantic content, human-rated quality, and output length fixed while varying only perplexity; without this, the claim that perplexity is the 'essence' of self-preference bias cannot be distinguished from the alternative that LLMs and humans simply differ in how they detect fluency or coherence.
  2. [Metric Definition] Metric Definition: the novel quantitative metric for self-preference bias is introduced without an explicit formula or pseudocode; it is therefore impossible to verify whether the metric definition itself incorporates perplexity or model-familiarity terms, which would render the reported relationship partly definitional rather than independently discovered.
minor comments (2)
  1. [Abstract] Abstract: the statistical significance levels and exact sample sizes for the GPT-4 self-preference results should be stated explicitly rather than described only qualitatively as 'significant'.
  2. [Methods] Notation: the paper should clarify whether perplexity is computed with the same model family used as judge or with a separate reference model, as this choice affects the interpretation of 'familiarity'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below, providing our responses and indicating planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Experimental Analysis] Experimental Analysis section: the correlation between LLM scores and lower perplexity does not include a controlled experiment holding semantic content, human-rated quality, and output length fixed while varying only perplexity; without this, the claim that perplexity is the 'essence' of self-preference bias cannot be distinguished from the alternative that LLMs and humans simply differ in how they detect fluency or coherence.

    Authors: We appreciate this point and agree that a fully controlled experiment isolating perplexity (while holding semantic content, human-rated quality, and length fixed) would provide stronger causal evidence. Our current results show that the preference for lower-perplexity text persists for non-self-generated outputs and diverges from human judgments, which supports familiarity as a contributing factor rather than pure self-preference. However, we acknowledge the limitation in distinguishing this from differences in fluency detection. In the revision, we will add a dedicated limitations subsection, tone down the phrasing from 'essence' to 'a primary contributing factor,' and include supplementary analyses using length-matched and semantically similar output pairs to better control for confounds. revision: partial

  2. Referee: [Metric Definition] Metric Definition: the novel quantitative metric for self-preference bias is introduced without an explicit formula or pseudocode; it is therefore impossible to verify whether the metric definition itself incorporates perplexity or model-familiarity terms, which would render the reported relationship partly definitional rather than independently discovered.

    Authors: We thank the referee for highlighting this omission. The metric is defined independently as the average score difference between self-generated and cross-generated outputs under matched conditions, without any perplexity or familiarity terms. We will include the full mathematical definition and pseudocode in the revised Metric Definition section to enable verification and ensure the reported perplexity correlation is an independent empirical finding. revision: yes
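Taking the rebuttal's stated definition at face value, a minimal sketch of the metric (the average score difference between self- and cross-generated outputs under matched conditions) could look like the following; the names are illustrative, not the paper's notation.

```python
# Minimal sketch of the definition stated in the rebuttal: the average score
# difference between self- and cross-generated outputs under matched
# conditions. Variable names are illustrative, not the paper's notation.
from statistics import mean

def self_preference_metric(paired_scores):
    """paired_scores: iterable of (score_self, score_cross) tuples, one pair
    per prompt, both judged by the same LLM under identical instructions."""
    return mean(s_self - s_cross for s_self, s_cross in paired_scores)

# A positive value means the judge scores its own outputs higher, e.g.
# self_preference_metric([(8, 7), (9, 9), (7, 5)]) == 1.0
```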

Circularity Check

0 steps flagged

No significant circularity; metric and perplexity correlation are independently measured

full rationale

The paper introduces a novel quantitative metric for self-preference bias as a distinct contribution, then separately hypothesizes that lower perplexity drives the bias and reports an empirical correlation between LLM evaluations and output perplexity (observed even for non-self-generated text). No equation or definition in the provided text shows the bias metric being constructed from perplexity terms, nor is any 'prediction' obtained by fitting a parameter to the same data used for the target claim. The central finding is an observed divergence between LLM and human scoring that correlates with perplexity; this is presented as an empirical result rather than a definitional identity. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to force the conclusion. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions about perplexity as a proxy for familiarity and on the validity of the new bias metric; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (1)
  • domain assumption: Perplexity computed by the LLM is a valid measure of how familiar or preferred a text is to that model.
    Invoked when linking lower perplexity to higher LLM evaluations.
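For readers unfamiliar with the quantity this axiom leans on: perplexity under a causal language model is typically the exponential of the mean token-level cross-entropy. The sketch below uses Hugging Face transformers with a placeholder model; which model the paper uses to score perplexity is exactly the referee's second minor point, so treat the choice here as an assumption.

```python
# Sketch of how perplexity under a causal LM is typically computed with
# Hugging Face transformers. "gpt2" is a placeholder; the paper's actual
# scoring model is not specified on this page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model_name: str = "gpt2") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean token-level
        # cross-entropy; perplexity is its exponential.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()
```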

pith-pipeline@v0.9.0 · 5501 in / 1167 out tokens · 23844 ms · 2026-05-15T14:41:39.613451+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

  2. MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

    cs.CL 2026-05 unverdicted novelty 7.0

    MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under pertu...

  3. Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs

    cs.CL 2026-05 unverdicted novelty 7.0

    DESG uses dynamic graphs of decoupled clinical states and asymmetric geometry to evaluate therapeutic dialogue quality, reaching 0.9353 macro-F1 on a 600-window held-out test set and outperforming LLM judges and text ...

  4. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  5. Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

    cs.CL 2026-04 unverdicted novelty 7.0

    Personalized LLM judges conditioned on an individual evaluator's scoring history align more closely with that evaluator than aggregate judges trained on mixed histories.

  6. LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software

    cs.CR 2026-04 unverdicted novelty 7.0

    Creates LogicDS with 122 logical vulnerabilities and LogicEval framework to evaluate repair techniques, finding failures mainly from prompt sensitivity, lost code context, and poor patch localization.

  7. How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

    cs.AI 2026-04 unverdicted novelty 7.0

    A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

  8. Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

  9. CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

    cs.CL 2026-03 unverdicted novelty 7.0

    CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.

  10. Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

    cs.CV 2026-05 unverdicted novelty 6.0

    Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.

  11. When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...

  12. Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

    cs.CL 2026-05 unverdicted novelty 6.0

    Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

  13. StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall

    cs.CL 2026-04 unverdicted novelty 6.0

    StratMem-Bench reveals that state-of-the-art LLMs distinguish required from irrelevant memories effectively but struggle to integrate supportive memories in character conversations.

  14. Learning to Control Summaries with Score Ranking

    cs.CL 2026-04 unverdicted novelty 6.0

    A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.

  15. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

    cs.AI 2026-04 unverdicted novelty 6.0

    Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

  16. LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

  17. Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

    cs.CL 2026-04 conditional novelty 5.0

    Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text...

  18. Towards Self-Improving Error Diagnosis in Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 5.0

    ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with veri...

  19. Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Repeated generations of exercise prescriptions by an LLM showed high semantic consistency but notable variability in quantitative details such as exercise intensity.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 19 Pith papers · 3 internal anchors

  1. Daniel Deutsch, Rotem Dror, and Dan Roth. On the limitations of reference-free evaluations of generated text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10960–10977, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.753. URL https://aclanthology.org/2022.emnlp-main.753.

  2. Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations. URL https://arxiv.org/abs/2404.13076.

  3. Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Yang Wang. Pride and prejudice: LLM amplifies self-bias in self-refinement. URL https://arxiv.org/abs/2402.11436.

  4. Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. Large language models are inconsistent and biased evaluators. URL https://arxiv.org/abs/2405.01724.

  5. Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.704. URL https://aclanthology.org/2020.acl-main.704.

  6. Andrea Sottana, Bin Liang, Kai Zou, and Zheng Yuan. Evaluation metrics in the era of GPT-4: Reliably evaluating large language models on sequence to sequence tasks. In The 2023 Conference on Empirical Methods in Natural Language Processing. URL https://openreview.net/forum?id=SyEwsV52Dk.

  7. Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: Evaluating safeguards in LLMs. In Findings of the Association for Computational Linguistics: EACL 2024, pages 896–911, St. Julian's, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-eacl.61.

  8. Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics, 9:1408–1424, 2021. doi:10.1162/tacl_a_00434. URL https://aclanthology.org/2021.tacl-1.84.

  9. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, et al. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.

  10. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems. URL https://openreview.net/forum?id=1hLFLNu4uy.

  11. Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following. URL https://openreview.net/forum?id=magEgFpK1y.

  12. Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13–18. doi:10.1109/ICDMW.2009.83.

  13. Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3323–3331, Red Hook, NY, USA.

  14. GPT-4 Technical Report. URL https://arxiv.org/abs/2303.08774.

  15. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

  16. Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning.

  17. Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/.

  18. Jonathan Tow. StableLM Alpha v2 models.

  19. Llama 2: Open Foundation and Fine-Tuned Chat Models. URL https://arxiv.org/abs/2307.09288.

  20. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al. The Llama 3 herd of models. URL https://arxiv.org/abs/2407.21783.