MLLM-as-a-Judge Exhibits Model Preference Bias
Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3
The pith
MLLM judges exhibit self-preference bias toward their own outputs and those from related model families.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Representative MLLMs tend to exhibit self-preference bias when acting as judges, with mutual preference bias within particular model families potentially driven by reused connectors and overlapping instruction-tuning resources; these biases can be quantified via Philautia-Eval and mitigated by an ensemble of MLLMs.
What carries the argument
Philautia-Eval, a method that disentangles model preference tendencies from genuine differences in generation quality using large-scale paired evaluations.
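One way to picture what "disentangling" could mean is a two-way additive decomposition of a judge-by-generator score matrix. The sketch below illustrates that assumption only; it is not the paper's procedure, whose normalization or regression step the referee report describes as unspecified. The function name `preference_residuals` and the toy scores are hypothetical.

```python
from statistics import mean

def preference_residuals(scores):
    """Residuals of an additive judge + generator model (illustrative only).

    scores[j][g] is the mean score judge j assigns to generator g.
    The row mean absorbs judge leniency and the column mean absorbs
    generator quality as seen by all judges; what remains is the
    judge-specific preference for that generator.
    """
    n_j, n_g = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_mean = [mean(row) for row in scores]
    col_mean = [mean(scores[j][g] for j in range(n_j)) for g in range(n_g)]
    return [[scores[j][g] - row_mean[j] - col_mean[g] + grand
             for g in range(n_g)] for j in range(n_j)]

# Hypothetical toy matrix: judge i and generator i are the same model.
scores = [
    [80.0, 70.0, 60.0],
    [72.0, 78.0, 62.0],
    [70.0, 68.0, 69.0],
]
res = preference_residuals(scores)
self_pref = [res[i][i] for i in range(3)]  # diagonal = self-preference estimate
print(self_pref)
```

A known weakness of this simple form is that each column mean includes the judge's own score, so part of any self-preference leaks into the quality estimate; a leave-one-judge-out column mean would be one way to reduce that.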
If this is right
- Single-MLLM judge benchmarks may systematically distort performance comparisons between models.
- Model families sharing training components show correlated biases in automatic evaluations.
- Ensemble judges like Pomms can serve as a practical way to reduce bias in evaluation pipelines.
- Evaluation protocols relying on MLLM judges require explicit checks for model-specific preferences.
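The ensemble idea in the list above can be sketched as a plain average over judges. The source describes Pomms only as "a simple ensemble of MLLMs", so the mean aggregation, the `exclude` option, and the judge names here are all assumptions made for illustration.

```python
from statistics import mean

def ensemble_score(judge_scores, exclude=None):
    """Average per-judge scores for one caption.

    judge_scores maps a judge name to its 0-100 score.
    exclude optionally names a judge to leave out, for example the model
    that generated the caption (a hypothetical safeguard, not from the paper).
    """
    kept = [s for name, s in judge_scores.items() if name != exclude]
    if not kept:
        raise ValueError("no judges left to aggregate")
    return mean(kept)

# Hypothetical scores for a caption written by judge_a's underlying model.
votes = {"judge_a": 90.0, "judge_b": 70.0, "judge_c": 74.0}
print(ensemble_score(votes))
print(ensemble_score(votes, exclude="judge_a"))
```

Averaging dilutes any single judge's preference rather than removing it, which is why an explicit per-model check remains useful even with an ensemble.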
Where Pith is reading between the lines
- Analogous self-preference effects are likely present when using LLMs as judges in text-only settings.
- Developers could reduce downstream bias by diversifying connectors and instruction data across models.
- Extending Philautia-Eval to other modalities or tasks would test whether the bias pattern generalizes.
Load-bearing premise
Philautia-Eval successfully disentangles model preference tendencies from genuine differences in generation quality without introducing new artifacts.
What would settle it
An experiment in which generation quality is first verified as equal across models by humans or independent metrics, followed by a check of whether Philautia-Eval still detects preference biases.
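The analysis step of such an experiment could be as simple as an exact sign test over quality-matched pairs. The sketch below assumes that a pair list of (score for own caption, score for another model's caption) is already available and restricted to pairs verified as equal in quality; the function names and that data layout are hypothetical.

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact sign test p-value; ties must be dropped beforehand."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def self_preference_check(pairs):
    """Count how often the judge scores its own caption higher.

    pairs: iterable of (score_for_own_caption, score_for_other_caption),
    limited to pairs whose quality was independently verified as equal.
    """
    wins = sum(1 for own, other in pairs if own > other)
    losses = sum(1 for own, other in pairs if own < other)
    return wins, losses, sign_test_p(wins, losses)
```

If quality really is matched, a judge without self-preference should win and lose about equally often, so a small p-value with more wins than losses would point to bias rather than quality.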
Original abstract
Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Philautia-Eval, a method to quantify model-specific preference bias in MLLM-as-a-Judge by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs from 12 MLLMs, it reports self-preference bias in representative models and mutual preference bias within model families, potentially attributable to reused connectors and overlapping instruction-tuning data. It further introduces Pomms, a simple ensemble of MLLMs that mitigates the measured bias while preserving evaluation performance.
Significance. If the disentangling procedure in Philautia-Eval is robust, the work identifies a practically important limitation in the growing use of MLLMs for automatic multimodal evaluation, which could otherwise distort model rankings and benchmark-driven research. The scale of the study (1.29M pairs across 12 models) and the proposed mitigation via ensemble provide concrete, actionable contributions. The findings on family-wise bias also open avenues for understanding training-data overlap effects in multimodal models.
major comments (3)
- §3 (Philautia-Eval): The central claim that the method successfully disentangles preference bias from genuine quality differences rests on an unspecified normalization or regression step. No explicit equations, pseudocode, or ablation on residual correlation with judge training data are provided, leaving open the possibility that measured self-preference is partly an artifact of shared generation/scoring pipelines.
- §4.2 (Results on 1.29M pairs): The reported self-preference and family-wise mutual bias figures lack accompanying statistical controls (e.g., permutation tests, multiple-comparison correction across 12 models, or an independent quality oracle) that would confirm the bias is not driven by unaccounted confounders in caption generation.
- §5 (Causal interpretation): The statement that mutual bias is 'potentially driven by reused connectors and overlapping instruction-tuning resources' is presented without any supporting analysis (data-overlap metrics, connector ablation, or controlled fine-tuning experiments), weakening the explanatory claim even if the bias measurement itself holds.
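The statistical controls requested for §4.2 can be sketched with a permutation test followed by a Holm correction across judges. This is a generic illustration, not the authors' analysis: it assumes each judge has a list of quality-adjusted scores for its own captions and another for other models' captions, and the function names are hypothetical.

```python
import random
from statistics import mean

def permutation_p(own, other, n_perm=2000, seed=0):
    """One-sided permutation p-value for mean(own) > mean(other)."""
    rng = random.Random(seed)
    observed = mean(own) - mean(other)
    pooled = list(own) + list(other)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any link between label and score
        diff = mean(pooled[:len(own)]) - mean(pooled[len(own):])
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running)
    return adjusted
```

With 12 judges, one would compute one `permutation_p` per judge and pass the 12 values through `holm` before declaring any individual self-preference significant.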
minor comments (2)
- Abstract and §2: The names 'Philautia-Eval' and 'Pomms' are introduced without expansion or motivation, which reduces immediate readability for readers unfamiliar with the Greek root or acronym.
- Figure 2 or equivalent bias heatmap: Error bars or confidence intervals are missing from the per-model bias scores, making it difficult to judge the reliability of the reported differences.
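The missing intervals could be produced with a percentile bootstrap over per-image bias values. The sketch below is a generic recipe under that assumption; the paper's bias score definition is not given in the source, so `values` stands in for whatever per-item quantity the heatmap averages, and `bootstrap_ci` is a hypothetical name.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-image bias values for one judge-generator cell.
cell = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(bootstrap_ci(cell))
```

Resampling at the image level rather than the caption-score level would be the safer choice if several captions share one image, since those scores are unlikely to be independent.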
Simulated Author's Rebuttal
We thank the referee for the constructive comments and suggestions. We provide point-by-point responses to the major comments below, indicating where revisions will be made to the manuscript.
- Referee: §3 (Philautia-Eval): The central claim that the method successfully disentangles preference bias from genuine quality differences rests on an unspecified normalization or regression step. No explicit equations, pseudocode, or ablation on residual correlation with judge training data are provided, leaving open the possibility that measured self-preference is partly an artifact of shared generation/scoring pipelines.
  Authors: We agree that the disentangling procedure requires more explicit documentation. In the revised version, we will add the full set of equations describing the normalization and regression steps used in Philautia-Eval, include pseudocode for the algorithm, and perform an ablation analysis to check for residual correlations with the training data of the judge models. This will address concerns about potential artifacts from shared pipelines.
  revision: yes
- Referee: §4.2 (Results on 1.29M pairs): The reported self-preference and family-wise mutual bias figures lack accompanying statistical controls (e.g., permutation tests, multiple-comparison correction across 12 models, or an independent quality oracle) that would confirm the bias is not driven by unaccounted confounders in caption generation.
  Authors: We thank the referee for this valuable suggestion. We will enhance §4.2 by adding permutation tests to validate the significance of the bias measurements and apply appropriate multiple-comparison corrections for the 12 models. While we do not have an independent quality oracle in the current study, the large scale of the 1.29M caption-score pairs helps control for confounders; we will explicitly discuss this in the revision and note it as a limitation.
  revision: partial
- Referee: §5 (Causal interpretation): The statement that mutual bias is 'potentially driven by reused connectors and overlapping instruction-tuning resources' is presented without any supporting analysis (data-overlap metrics, connector ablation, or controlled fine-tuning experiments), weakening the explanatory claim even if the bias measurement itself holds.
  Authors: We recognize that the explanatory claim is not supported by direct analysis. In the revision, we will modify the language in §5 to present this as a hypothesis rather than a firm attribution, and we will include a discussion on how future work could use data-overlap metrics or ablations to investigate this. The core bias measurements remain valid independently of this interpretation.
  revision: yes
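A data-overlap metric of the kind mentioned in the last response could start from something as simple as Jaccard similarity between the sets of instruction-tuning datasets each model reports. The sketch below assumes such sets can be compiled from model cards; the dataset names are placeholders, not real resources.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Placeholder dataset names for two hypothetical models.
model_x = {"dataset_1", "dataset_2", "dataset_3"}
model_y = {"dataset_2", "dataset_3", "dataset_4"}
print(jaccard(model_x, model_y))
```

Correlating such pairwise overlaps with the measured mutual bias would test the hypothesis only observationally; the connector ablations the referee lists would still be needed for a causal reading.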
Circularity Check
No circularity: empirical bias measurement via new disentangling method on collected data
The paper proposes Philautia-Eval as a new framework to quantify model-specific preference bias by disentangling it from generation quality differences, then applies it to an independently collected dataset of 1.29M caption-score pairs across 12 MLLMs. The self-preference and family-wise mutual bias findings are presented as direct experimental observations from this evaluation, with an additional ensemble method (Pomms) introduced to mitigate observed bias. No equations, fitted parameters, or self-citations are described that would reduce the bias quantification or central claims to tautological inputs by construction. The derivation chain consists of data collection followed by application of the proposed disentangling procedure, which is external to the measured outputs and does not invoke prior author work as a uniqueness theorem or ansatz. This is a standard empirical study without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
invented entities (2)
- Philautia-Eval: no independent evidence
- Pomms: no independent evidence
MLLM-as-a-Judge Exhibits Model Preference Bias
Shuitsu Koyama⋆, Yuiga Wada⋆, Daichi Yashima, and Komei Sugiura
Keio University, Japan
{koyamashu3, yuiga, ydaichi1207, komei.sugiura}@keio.jp
A Details of Experimental Setup
A.1 Generators and Evaluato...
Prompt steps (with reference captions):
1. Carefully observe the provided image to understand its main content
2. Read the reference captions carefully to identify the important information they highlight
3. Compare the generated caption to both the reference captions and the visual content of the image
4. Assess how well the generated caption covers the main points of the visual content and the reference captions, and how much irrelevant or redundant information it contains
5. Assign an integer score from 0 to 100, considering both the alignment with the image and the inclusion of key points from the references. Please remember the score.
Reference captions: {{Reference}}
Image is attached
Generated captions: {{Caption}}
Response Format: You should first give a detailed reason for your score, ending with a sentence like this: T...
Prompt steps (without reference captions):
1. Carefully observe the image provided
2. Identify the main points of the visual content in the image
3. Assess how well the generated caption covers the main points of the visual content, and how much irrelevant or redundant information it contains
4. Assign an integer score from 0 to 100, please remember it.
Generated captions: {{Caption}}
Response Format: You should first give detailed reason for your score, and ending with sentence like this: The final score is ${{score}}$. Note that the score should be an integer from 0 to 100, and should be wrapped in the dollar signs ($)
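Because the response format above asks for the score wrapped in dollar signs, a judge's free-text answer can be reduced to a number with a small parser. This is a minimal sketch, assuming responses follow that format; `render` and `extract_score` are hypothetical helper names, and the paper's own parsing code is not shown in the source.

```python
import re

# Matches an integer of up to three digits wrapped in dollar signs, e.g. $85$.
SCORE_RE = re.compile(r"\$\s*(\d{1,3})\s*\$")

def render(template, caption, reference=None):
    """Fill the {{Caption}} and optional {{Reference}} placeholders."""
    text = template.replace("{{Caption}}", caption)
    if reference is not None:
        text = text.replace("{{Reference}}", reference)
    return text

def extract_score(response):
    """Return the last $-wrapped integer in 0..100, or None if absent."""
    for match in reversed(SCORE_RE.findall(response)):
        value = int(match)
        if 0 <= value <= 100:
            return value
    return None

reply = "The caption covers the main objects. The final score is $85$."
print(extract_score(reply))
```

Returning `None` instead of a default score keeps malformed responses visible, so they can be counted or retried rather than silently shifting a judge's average.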