pith. machine review for the scientific record.

arxiv: 2410.21819 · v2 · submitted 2024-10-29 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Self-Preference Bias in LLM-as-a-Judge

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 14:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-preference bias · LLM-as-a-judge · perplexity · automated evaluation · dialogue systems · GPT-4 · bias measurement

The pith

LLMs as judges give higher scores to low-perplexity outputs than human evaluators do, even for non-self-generated text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a quantitative metric to measure self-preference bias in LLM evaluators. Experiments with GPT-4 show it rates its own outputs more favorably than human judges would. The authors link this bias to perplexity, finding that LLMs boost scores for lower-perplexity texts more than humans do, whether or not the text is self-generated. This suggests the bias originates from a preference for familiar, predictable language rather than from self-recognition alone. The result matters because it affects how we trust automated evaluations of dialogue systems.

Core claim

LLMs exhibit self-preference bias by assigning higher evaluations to outputs with lower perplexity than human evaluators, and this pattern holds regardless of whether the outputs were self-generated. The bias therefore arises because LLMs prefer texts that are more familiar to them, as measured by perplexity.

What carries the argument

Quantitative metric for self-preference bias and analysis of its correlation with output perplexity.
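The page does not reproduce the paper's formula (the referee flags this below), so the following is only an illustrative sketch of the two quantities the argument rests on: a self-preference gap measured against human judges, and the score-perplexity relationship. All names here are assumptions, not the authors' notation.

```python
# Illustrative sketch only: the paper's metric is not reproduced on this page,
# so these definitions are assumptions, not the authors' formulas.
import numpy as np
from scipy.stats import pearsonr

def self_preference_gap(judge_self, judge_other, human_self, human_other):
    """Excess preference the judge shows for its own outputs, relative to humans.

    Each argument is an array of scores over the same prompts; "self" means the
    judge model generated the response, "other" means a different model did.
    """
    judge_gap = np.mean(judge_self) - np.mean(judge_other)
    human_gap = np.mean(human_self) - np.mean(human_other)
    return judge_gap - human_gap  # > 0 suggests self-preference bias

def score_perplexity_relation(scores, perplexities):
    """Correlation between scores and log-perplexity; the paper's key comparison
    is whether this is more strongly negative for the LLM judge than for humans."""
    return pearsonr(np.log(perplexities), scores)
```

A positive gap together with a steeper negative score-perplexity relation for the judge than for humans is the pattern the abstract reports.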

If this is right

  • LLM evaluators promote styles and policies intrinsic to the models.
  • Automated evaluation of dialogue systems risks systematic skew toward familiar text.
  • The bias is driven by perplexity preference rather than explicit self-recognition.
  • The new metric enables quantitative tracking of this effect across models and tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Evaluations could be adjusted by normalizing for perplexity to better match human judgments (one possible correction is sketched after this list).
  • The same mechanism may appear in other applications where LLMs assess text quality.
  • Training LLMs on more diverse perplexity levels might lessen the bias in judging.
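As a rough illustration of the first bullet above, one hypothetical correction, not proposed in the paper, is to residualize judge scores against log-perplexity before comparing systems:

```python
# Hypothetical adjustment, not from the paper: strip the linear component of
# judge scores explained by log-perplexity, so remaining differences between
# systems are less driven by "familiarity".
import numpy as np

def perplexity_adjusted_scores(scores, perplexities):
    x = np.log(np.asarray(perplexities, dtype=float))
    y = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # fit score ~ log-perplexity
    residual = y - (slope * x + intercept)   # what perplexity cannot explain
    return residual + y.mean()               # re-centered on the original scale
```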

Load-bearing premise

The introduced metric isolates self-preference bias from confounding factors in LLM judgments.

What would settle it

If LLM and human evaluators raised their scores for low-perplexity outputs to the same degree, the claim that LLMs exhibit a distinct bias tied to perplexity would be falsified.
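One way to operationalize that test, assuming paired LLM-judge and human scores for the same outputs, is to check whether the score-versus-log-perplexity slope differs between the two kinds of evaluator. The setup below is a sketch under those assumptions, not the paper's analysis.

```python
# Sketch of one way to run the falsification test described above, assuming
# paired LLM-judge and human scores for the same outputs; not the paper's code.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def slope_difference_test(perplexities, llm_scores, human_scores):
    """Tests whether the score vs. log-perplexity slope differs between judges.

    A near-zero, non-significant interaction would mean LLM and human
    evaluations respond to perplexity at the same rate, undercutting the claim.
    """
    n = len(perplexities)
    df = pd.DataFrame({
        "log_ppl": np.tile(np.log(perplexities), 2),
        "score": np.concatenate([llm_scores, human_scores]),
        "is_llm": np.repeat([1, 0], n),
    })
    fit = smf.ols("score ~ log_ppl * is_llm", data=df).fit()
    return fit.params["log_ppl:is_llm"], fit.pvalues["log_ppl:is_llm"]
```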

read the original abstract

Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a novel quantitative metric for measuring self-preference bias in LLM-as-a-judge setups for dialogue evaluation. It reports experimental results showing that GPT-4 exhibits significant self-preference bias and hypothesizes that the bias stems from LLMs favoring lower-perplexity (more familiar) outputs. Analysis indicates LLMs assign higher scores to low-perplexity texts than human evaluators do, even for non-self-generated outputs, concluding that perplexity is the essence of the bias.

Significance. If the metric validly isolates self-preference bias and the perplexity correlation proves causal rather than confounded by quality detection differences, the work would supply a useful quantitative tool for diagnosing and addressing biases in automated evaluation, with direct implications for reliable LLM judges. The GPT-4 experiments provide a concrete empirical anchor, but stronger isolation of the familiarity mechanism would be needed to elevate the contribution beyond correlational observation.

major comments (2)
  1. [Experimental Analysis] Experimental Analysis section: the correlation between LLM scores and lower perplexity does not include a controlled experiment holding semantic content, human-rated quality, and output length fixed while varying only perplexity; without this, the claim that perplexity is the 'essence' of self-preference bias cannot be distinguished from the alternative that LLMs and humans simply differ in how they detect fluency or coherence.
  2. [Metric Definition] Metric Definition: the novel quantitative metric for self-preference bias is introduced without an explicit formula or pseudocode; it is therefore impossible to verify whether the metric definition itself incorporates perplexity or model-familiarity terms, which would render the reported relationship partly definitional rather than independently discovered.
minor comments (2)
  1. [Abstract] Abstract: the statistical significance levels and exact sample sizes for the GPT-4 self-preference results should be stated explicitly rather than described only qualitatively as 'significant'.
  2. [Methods] Notation: the paper should clarify whether perplexity is computed with the same model family used as judge or with a separate reference model, as this choice affects the interpretation of 'familiarity'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below, providing our responses and indicating planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Experimental Analysis] Experimental Analysis section: the correlation between LLM scores and lower perplexity does not include a controlled experiment holding semantic content, human-rated quality, and output length fixed while varying only perplexity; without this, the claim that perplexity is the 'essence' of self-preference bias cannot be distinguished from the alternative that LLMs and humans simply differ in how they detect fluency or coherence.

    Authors: We appreciate this point and agree that a fully controlled experiment isolating perplexity (while holding semantic content, human-rated quality, and length fixed) would provide stronger causal evidence. Our current results show that the preference for lower-perplexity text persists for non-self-generated outputs and diverges from human judgments, which supports familiarity as a contributing factor rather than pure self-preference. However, we acknowledge the limitation in distinguishing this from differences in fluency detection. In the revision, we will add a dedicated limitations subsection, tone down the phrasing from 'essence' to 'a primary contributing factor,' and include supplementary analyses using length-matched and semantically similar output pairs to better control for confounds. revision: partial

  2. Referee: [Metric Definition] Metric Definition: the novel quantitative metric for self-preference bias is introduced without an explicit formula or pseudocode; it is therefore impossible to verify whether the metric definition itself incorporates perplexity or model-familiarity terms, which would render the reported relationship partly definitional rather than independently discovered.

    Authors: We thank the referee for highlighting this omission. The metric is defined independently as the average score difference between self-generated and cross-generated outputs under matched conditions, without any perplexity or familiarity terms. We will include the full mathematical definition and pseudocode in the revised Metric Definition section to enable verification and ensure the reported perplexity correlation is an independent empirical finding. revision: yes
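Taking the rebuttal's stated definition at face value, a minimal sketch of the metric (the average score difference between self- and cross-generated outputs under matched conditions) could look like the following; the names are illustrative, not the paper's notation.

```python
# Minimal sketch of the definition stated in the rebuttal: the average score
# difference between self- and cross-generated outputs under matched
# conditions. Variable names are illustrative, not the paper's notation.
from statistics import mean

def self_preference_metric(paired_scores):
    """paired_scores: iterable of (score_self, score_cross) tuples, one pair
    per prompt, both judged by the same LLM under identical instructions."""
    return mean(s_self - s_cross for s_self, s_cross in paired_scores)

# A positive value means the judge scores its own outputs higher, e.g.
# self_preference_metric([(8, 7), (9, 9), (7, 5)]) == 1.0
```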

Circularity Check

0 steps flagged

No significant circularity; metric and perplexity correlation are independently measured

full rationale

The paper introduces a novel quantitative metric for self-preference bias as a distinct contribution, then separately hypothesizes that lower perplexity drives the bias and reports an empirical correlation between LLM evaluations and output perplexity (observed even for non-self-generated text). No equation or definition in the provided text shows the bias metric being constructed from perplexity terms, nor is any 'prediction' obtained by fitting a parameter to the same data used for the target claim. The central finding is an observed divergence between LLM and human scoring that correlates with perplexity; this is presented as an empirical result rather than a definitional identity. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to force the conclusion. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions about perplexity as a proxy for familiarity and on the validity of the new bias metric; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (1)
  • domain assumption: Perplexity computed by the LLM is a valid measure of how familiar or preferred a text is to that model.
    Invoked when linking lower perplexity to higher LLM evaluations.
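For readers unfamiliar with the quantity this axiom leans on: perplexity under a causal language model is typically the exponential of the mean token-level cross-entropy. The sketch below uses Hugging Face transformers with a placeholder model; which model the paper uses to score perplexity is exactly the referee's second minor point, so treat the choice here as an assumption.

```python
# Sketch of how perplexity under a causal LM is typically computed with
# Hugging Face transformers. "gpt2" is a placeholder; the paper's actual
# scoring model is not specified on this page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model_name: str = "gpt2") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean token-level
        # cross-entropy; perplexity is its exponential.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()
```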

pith-pipeline@v0.9.0 · 5501 in / 1167 out tokens · 23844 ms · 2026-05-15T14:41:39.613451+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

  2. MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

    cs.CL 2026-05 unverdicted novelty 7.0

    MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under pertu...

  3. Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs

    cs.CL 2026-05 unverdicted novelty 7.0

    DESG uses dynamic graphs of decoupled clinical states and asymmetric geometry to evaluate therapeutic dialogue quality, reaching 0.9353 macro-F1 on a 600-window held-out test set and outperforming LLM judges and text ...

  4. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  5. Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

    cs.CL 2026-04 unverdicted novelty 7.0

    Personalized LLM judges conditioned on an individual evaluator's scoring history align more closely with that evaluator than aggregate judges trained on mixed histories.

  6. LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software

    cs.CR 2026-04 unverdicted novelty 7.0

    Creates LogicDS with 122 logical vulnerabilities and LogicEval framework to evaluate repair techniques, finding failures mainly from prompt sensitivity, lost code context, and poor patch localization.

  7. How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

    cs.AI 2026-04 unverdicted novelty 7.0

    A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

  8. Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

  9. CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

    cs.CL 2026-03 unverdicted novelty 7.0

    CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.

  10. Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

    cs.CV 2026-05 unverdicted novelty 6.0

    Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.

  11. When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...

  12. Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

    cs.CL 2026-05 unverdicted novelty 6.0

    Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

  13. StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall

    cs.CL 2026-04 unverdicted novelty 6.0

    StratMem-Bench reveals that state-of-the-art LLMs distinguish required from irrelevant memories effectively but struggle to integrate supportive memories in character conversations.

  14. Learning to Control Summaries with Score Ranking

    cs.CL 2026-04 unverdicted novelty 6.0

    A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.

  15. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

    cs.AI 2026-04 unverdicted novelty 6.0

    Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

  16. LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

  17. Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

    cs.CL 2026-04 conditional novelty 5.0

    Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text...

  18. Towards Self-Improving Error Diagnosis in Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 5.0

    ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with veri...

  19. Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Repeated generations of exercise prescriptions by an LLM showed high semantic consistency but notable variability in quantitative details such as exercise intensity.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 19 Pith papers · 3 internal anchors

  1. Daniel Deutsch, Rotem Dror, and Dan Roth. On the limitations of reference-free evaluations of generated text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10960–10977, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.753. URL https://aclanthology.org/2022.emnlp-main.753.

  2. Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations. URL https://arxiv.org/abs/2404.13076.

  3. Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Yang Wang. Pride and prejudice: LLM amplifies self-bias in self-refinement. URL https://arxiv.org/abs/2402.11436.

  4. Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. Large language models are inconsistent and biased evaluators. URL https://arxiv.org/abs/2405.01724.

  5. Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.704. URL https://aclanthology.org/2020.acl-main.704.

  6. Andrea Sottana, Bin Liang, Kai Zou, and Zheng Yuan. Evaluation metrics in the era of GPT-4: Reliably evaluating large language models on sequence to sequence tasks. In The 2023 Conference on Empirical Methods in Natural Language Processing. URL https://openreview.net/forum?id=SyEwsV52Dk.

  7. Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: Evaluating safeguards in LLMs. In Findings of the Association for Computational Linguistics: EACL 2024, pages 896–911, St. Julian's, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-eacl.61.

  8. Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics, 9:1408–1424, 2021. doi:10.1162/tacl_a_00434. URL https://aclanthology.org/2021.tacl-1.84.

  9. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, et al. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.

  10. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems. URL https://openreview.net/forum?id=1hLFLNu4uy.

  11. Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following. URL https://openreview.net/forum?id=magEgFpK1y.

  12. Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13–18. doi:10.1109/ICDMW.2009.83.

  13. Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3323–3331, Red Hook, NY, USA.

  14. GPT-4 Technical Report. URL https://arxiv.org/abs/2303.08774.

  15. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

  16. Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning.

  17. Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/.

  18. Jonathan Tow. StableLM Alpha v2 models.

  19. Llama 2: Open Foundation and Fine-Tuned Chat Models. URL https://arxiv.org/abs/2307.09288.

  20. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al. The Llama 3 herd of models. URL https://arxiv.org/abs/2407.21783.