MLLM-as-a-Judge Exhibits Model Preference Bias

Daichi Yashima; Komei Sugiura; Shuitsu Koyama; Yuiga Wada

arxiv: 2604.11589 · v1 · submitted 2026-04-13 · 💻 cs.CV

Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

keywords MLLM-as-a-Judge · preference bias · self-preference · automatic evaluation · multimodal models · model families · ensemble evaluation

The pith

MLLM judges exhibit self-preference bias toward their own outputs and those from related model families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Philautia-Eval to quantify how much MLLM-as-a-Judge methods favor text generated by specific models. It separates these preference tendencies from actual differences in generation quality across 1.29 million caption-score pairs from 12 models. Results show clear self-preference bias, plus mutual biases within model families that may arise from shared connectors and instruction-tuning data. A simple ensemble called Pomms reduces the bias while preserving evaluation performance.

Core claim

Representative MLLMs tend to exhibit self-preference bias when acting as judges, with mutual preference bias within particular model families potentially driven by reused connectors and overlapping instruction-tuning resources; these biases can be quantified via Philautia-Eval and mitigated by an ensemble of MLLMs.

What carries the argument

Philautia-Eval, a method that disentangles model preference tendencies from genuine differences in generation quality using large-scale paired evaluations.
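The source does not spell out how Philautia-Eval separates preference from quality, so the following is only an illustrative stand-in and not the paper's procedure. It assumes a square matrix of mean scores (evaluator by generator, same model order on both axes) and removes each evaluator's overall leniency and each generator's overall quality by double-centering. What remains on the diagonal is a rough, relative indicator of self-preference. The function name `preference_residuals` and the toy numbers are hypothetical.

```python
from statistics import mean


def preference_residuals(scores):
    """Double-center an evaluator-by-generator matrix of mean scores.

    scores[e][g] is the mean score evaluator e gave to captions from
    generator g. Subtracting the row mean removes evaluator leniency,
    subtracting the column mean removes generator quality, and adding
    the grand mean back keeps the residuals centered on zero.
    This is an assumed stand-in, not the paper's Philautia-Eval.
    """
    n = len(scores)
    row_means = [mean(row) for row in scores]
    col_means = [mean(scores[e][g] for e in range(n)) for g in range(n)]
    grand_mean = mean(row_means)
    return [
        [scores[e][g] - row_means[e] - col_means[g] + grand_mean for g in range(n)]
        for e in range(n)
    ]


# Toy data: quality 70/60/50 per generator, leniency 0/+5/-5 per evaluator,
# plus a bonus of 6 points whenever a model scores its own captions.
toy_scores = [
    [76, 60, 50],
    [75, 71, 55],
    [65, 55, 51],
]
residuals = preference_residuals(toy_scores)
self_preference = [residuals[i][i] for i in range(len(residuals))]
```

In this toy case the additive quality and leniency terms cancel exactly, and only the self-scoring bonus survives on the diagonal (shrunk by the centering, since residuals are relative to the row and column averages).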

If this is right

Single-MLLM judge benchmarks may systematically distort performance comparisons between models.
Model families sharing training components show correlated biases in automatic evaluations.
Ensemble judges like Pomms can serve as a practical way to reduce bias in evaluation pipelines.
Evaluation protocols relying on MLLM judges require explicit checks for model-specific preferences.
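As a rough picture of what an ensemble judge involves, the sketch below averages the scores several judges give to one caption. The source describes Pomms only as a simple ensemble of MLLMs, so the plain mean, the optional `exclude` argument (dropping the judge that generated the caption), and the judge names are all assumptions made for illustration.

```python
from statistics import mean


def ensemble_score(judge_scores, exclude=None):
    """Average per-judge scores for a single caption.

    judge_scores maps a judge name to the score it assigned.
    exclude optionally names one judge to leave out, for example the
    model that produced the caption (an assumed variant, not something
    the source attributes to Pomms).
    """
    kept = [score for name, score in judge_scores.items() if name != exclude]
    if not kept:
        raise ValueError("no judges left to average")
    return mean(kept)


# Hypothetical scores for a caption generated by "judge_a".
scores = {"judge_a": 90, "judge_b": 60, "judge_c": 60}
plain = ensemble_score(scores)
without_author = ensemble_score(scores, exclude="judge_a")
```

Averaging dilutes any single judge's favoritism instead of removing it, which is one reason to check an ensemble against the same bias measure used for individual judges.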

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Analogous self-preference effects are likely present when using LLMs as judges in text-only settings.
Developers could reduce downstream bias by diversifying connectors and instruction data across models.
Extending Philautia-Eval to other modalities or tasks would test whether the bias pattern generalizes.

Load-bearing premise

Philautia-Eval successfully disentangles model preference tendencies from genuine differences in generation quality without introducing new artifacts.

What would settle it

An experiment in which generation quality is first verified as equal across models, by humans or by independent metrics, and Philautia-Eval is then run to check whether it still detects preference bias.

Figures

Figures reproduced from arXiv: 2604.11589 by Daichi Yashima, Komei Sugiura, Shuitsu Koyama, Yuiga Wada.

**Figure 2.** An example of self-preference bias in MLLM-as-a-Judge.

**Figure 4.** Visualization of Φ̃ in the (i) reference-based and (ii) reference-free settings. All philautia scores (diagonal items) were greater than zero, indicating the presence of self-preference bias within the MLLMs used in our experiments.

The dataset includes 45,000 human-written captions that are used as references. These captions were curated from the nocaps dataset and have a vocabulary size of 11,404 words, …

**Figure 5.** Example of self-preference bias. The bar chart shows the scores given to a caption generated by Gemini-2.5-Pro. Gemini-2.5-Pro exceptionally gave high scores to its own generations compared with the other Evaluators. The symbol ♦ represents the mean value of the scores by each Evaluator. Red text within ŷ_g highlights hallucination.

**Figure 6.** Visualization of preference bias within model families.

Original abstract

Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MLLM judges show self-preference and family biases on a large scale, but the disentangling step in Philautia-Eval leaves room for artifacts that could inflate the measured effect.

The core finding is that representative MLLMs rate their own outputs higher than others and show similar favoritism toward models in the same family. The authors collect 1.29 million caption-score pairs from 12 models, introduce Philautia-Eval to separate preference tendencies from quality differences, and report that a simple ensemble called Pomms reduces the bias without hurting overall scores. They tie the family effect to shared connectors and overlapping instruction-tuning data.

That scale of data and the family-wise pattern are the clearest new pieces here. Prior work on LLM judges noted self-preference, but the multimodal extension with this volume and the mutual-family observation add something concrete. The large paired dataset is also a practical contribution that others could reuse.

The main soft spot sits in the disentangling procedure. Without seeing the exact normalization or matching steps, it is hard to rule out residual correlation between the judge's training data and the quality signals it is supposed to ignore. If that separation is incomplete, the bias numbers could partly reflect unaccounted quality differences rather than pure preference. The causal story about connectors and tuning resources is plausible but remains speculative; the paper does not test it directly. The Pomms ensemble is straightforward and works in their setup, yet it receives limited analysis on whether it generalizes or trades off other properties.

This paper is aimed at researchers who build or rely on MLLM-based automatic evaluation for vision-language tasks. Anyone running benchmarks with these judges should see the warning flag. It is worth a serious referee because the problem it flags matters for benchmark validity, even though the measurement method needs tighter validation and the mitigation is preliminary. I would send it for review with requests for more detail on the disentangling controls and additional checks on the family-bias claim.

Referee Report

3 major / 2 minor

Summary. The paper proposes Philautia-Eval, a method to quantify model-specific preference bias in MLLM-as-a-Judge by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs from 12 MLLMs, it reports self-preference bias in representative models and mutual preference bias within model families, potentially attributable to reused connectors and overlapping instruction-tuning data. It further introduces Pomms, a simple ensemble of MLLMs that mitigates the measured bias while preserving evaluation performance.

Significance. If the disentangling procedure in Philautia-Eval is robust, the work identifies a practically important limitation in the growing use of MLLMs for automatic multimodal evaluation, which could otherwise distort model rankings and benchmark-driven research. The scale of the study (1.29M pairs across 12 models) and the proposed mitigation via ensemble provide concrete, actionable contributions. The findings on family-wise bias also open avenues for understanding training-data overlap effects in multimodal models.

major comments (3)

§3 (Philautia-Eval): The central claim that the method successfully disentangles preference bias from genuine quality differences rests on an unspecified normalization or regression step. No explicit equations, pseudocode, or ablation on residual correlation with judge training data are provided, leaving open the possibility that measured self-preference is partly an artifact of shared generation/scoring pipelines.
§4.2 (Results on 1.29M pairs): The reported self-preference and family-wise mutual bias figures lack accompanying statistical controls (e.g., permutation tests, multiple-comparison correction across 12 models, or independent quality oracle) that would confirm the bias is not driven by unaccounted confounders in caption generation.
§5 (Causal interpretation): The statement that mutual bias is 'potentially driven by reused connectors and overlapping instruction-tuning resources' is presented without any supporting analysis (data-overlap metrics, connector ablation, or controlled fine-tuning experiments), weakening the explanatory claim even if the bias measurement itself holds.
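One way the permutation test requested for §4.2 might look is sketched below. It is not taken from the paper. It assumes a square evaluator-by-generator matrix of bias scores and asks how often a random re-pairing of evaluators with generators yields a mean "diagonal" at least as large as the observed one. The function names are hypothetical; exhaustive enumeration is only practical for small matrices, so `num_samples` switches to random sampling (12 models would give 12! pairings).

```python
import itertools
import random


def diagonal_mean(matrix, perm):
    """Mean of matrix[e][perm[e]], i.e. the diagonal under a re-pairing."""
    n = len(matrix)
    return sum(matrix[e][perm[e]] for e in range(n)) / n


def self_preference_p_value(matrix, num_samples=None, seed=0):
    """Permutation test on the mean diagonal of a square score matrix.

    Null hypothesis (an assumption of this sketch): the pairing between
    evaluators and generators is exchangeable, so the true diagonal is
    no larger than a randomly chosen pairing.
    Returns (observed statistic, one-sided p-value).
    """
    n = len(matrix)
    identity = tuple(range(n))
    observed = diagonal_mean(matrix, identity)
    if num_samples is None:
        # Exact test: enumerate every pairing (feasible only for small n).
        perms = list(itertools.permutations(range(n)))
    else:
        rng = random.Random(seed)
        perms = [tuple(rng.sample(range(n), n)) for _ in range(num_samples)]
        perms.append(identity)  # count the observed pairing so p is never 0
    hits = sum(diagonal_mean(matrix, p) >= observed - 1e-12 for p in perms)
    return observed, hits / len(perms)


# Toy residual matrix with a positive diagonal (hypothetical values).
toy = [
    [4, -2, -2],
    [-2, 4, -2],
    [-2, -2, 4],
]
observed, p_value = self_preference_p_value(toy)
```

With three models the smallest attainable p-value is 1/6, so a matrix-level test like this is coarse; resampling at the level of individual caption-score pairs would give finer resolution, and per-model tests would still need a multiple-comparison correction.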

minor comments (2)

Abstract and §2: The names 'Philautia-Eval' and 'Pomms' are introduced without expansion or motivation, which reduces immediate readability for readers unfamiliar with the Greek root or acronym.
Figure 2 or equivalent bias heatmap: Error bars or confidence intervals are missing from the per-model bias scores, making it difficult to judge the reliability of the reported differences.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and suggestions. We provide point-by-point responses to the major comments below, indicating where revisions will be made to the manuscript.

Referee: §3 (Philautia-Eval): The central claim that the method successfully disentangles preference bias from genuine quality differences rests on an unspecified normalization or regression step. No explicit equations, pseudocode, or ablation on residual correlation with judge training data are provided, leaving open the possibility that measured self-preference is partly an artifact of shared generation/scoring pipelines.

Authors: We agree that the disentangling procedure requires more explicit documentation. In the revised version, we will add the full set of equations describing the normalization and regression steps used in Philautia-Eval, include pseudocode for the algorithm, and perform an ablation analysis to check for residual correlations with the training data of the judge models. This will address concerns about potential artifacts from shared pipelines.

Revision: yes

Referee: §4.2 (Results on 1.29M pairs): The reported self-preference and family-wise mutual bias figures lack accompanying statistical controls (e.g., permutation tests, multiple-comparison correction across 12 models, or independent quality oracle) that would confirm the bias is not driven by unaccounted confounders in caption generation.

Authors: We thank the referee for this valuable suggestion. We will enhance §4.2 by adding permutation tests to validate the significance of the bias measurements and apply appropriate multiple-comparison corrections for the 12 models. While we do not have an independent quality oracle in the current study, the large scale of the 1.29M caption-score pairs helps control for confounders; we will explicitly discuss this in the revision and note it as a limitation.

Revision: partial

Referee: §5 (Causal interpretation): The statement that mutual bias is 'potentially driven by reused connectors and overlapping instruction-tuning resources' is presented without any supporting analysis (data-overlap metrics, connector ablation, or controlled fine-tuning experiments), weakening the explanatory claim even if the bias measurement itself holds.

Authors: We recognize that the explanatory claim is not supported by direct analysis. In the revision, we will modify the language in §5 to present this as a hypothesis rather than a firm attribution, and we will include a discussion on how future work could use data-overlap metrics or ablations to investigate this. The core bias measurements remain valid independently of this interpretation.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical bias measurement via new disentangling method on collected data

full rationale

The paper proposes Philautia-Eval as a new framework to quantify model-specific preference bias by disentangling it from generation quality differences, then applies it to an independently collected dataset of 1.29M caption-score pairs across 12 MLLMs. The self-preference and family-wise mutual bias findings are presented as direct experimental observations from this evaluation, with an additional ensemble method (Pomms) introduced to mitigate observed bias. No equations, fitted parameters, or self-citations are described that would reduce the bias quantification or central claims to tautological inputs by construction. The derivation chain consists of data collection followed by application of the proposed disentangling procedure, which is external to the measured outputs and does not invoke prior author work as a uniqueness theorem or ansatz. This is a standard empirical study without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The paper introduces two new named constructs (Philautia-Eval and Pomms) whose validity rests on the unstated details of the disentangling procedure. No free parameters or mathematical axioms are mentioned in the abstract.

invented entities (2)

Philautia-Eval · no independent evidence
purpose: Quantify model-specific preference bias by disentangling preference from generation quality
New method proposed in the paper

Pomms · no independent evidence
purpose: Ensemble of MLLMs to mitigate model preference bias
New mitigation approach introduced in the paper

[1] [1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Abdin, M., et al.: Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 (2024)

work page internal anchor Pith review arXiv 2024

[2] [2]

In: EMNLP

Adilazuarda, M., Mukherjee, S., Lavania, P., Singh, S., Aji, A., O’Neill, J., Modi, A., Choudhury, M.: Towards Measuring and Modeling “Culture” in LLMs: A Sur- vey. In: EMNLP. pp. 15763–15784 (2024)

work page 2024

[3] [3]

From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge

Aditya, S., Yang, Y., Baral, C., Aloimonos, Y.: From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge. arXiv:1511.03292 (2015)

work page Pith review arXiv 2015

[4] [4]

In: ICCV

Agrawal, H., Desai, K., et al.: nocaps: Novel Object Captioning at Scale. In: ICCV. pp. 8948–8957 (2019)

work page 2019

[5] [5]

In: ICLR (2024)

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. In: ICLR (2024)

work page 2024

[6] [6]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., et al.: Qwen2.5-VL Technical Report. arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

In: EMNLP

Chan, D., Petryk, S., et al.: CLAIR: Evaluating Image Captions with Large Lan- guage Models. In: EMNLP. pp. 13638–13646 (2023)

work page 2023

[8] [8]

In: ICML

Chen, D., Chen, R., Zhang, S., Wang, Y., Liu, Y., Zhou, H., Zhang, Q., Wan, Y., Zhou, P., Sun, L.: MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-language Benchmark. In: ICML. vol. 235, pp. 6562–6595 (2024)

work page 2024

[9] Chen, G., Chen, S., Liu, Z., Jiang, F., Wang, B.: Humans or LLMs as the Judge? A Study on Judgement Bias. In: EMNLP. pp. 8301–8327 (2024)

[10] Chen, W., Wei, Z., Zhu, X., Feng, S., Meng, Y.: Do LLM Evaluators Prefer Themselves for a Reason? arXiv:2504.03846 (2025)

[11] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., et al.: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv:2412.05271 (2024)

[12] Chen, Z., Wang, H., Zhang, X., Hu, E., Lin, Y.: Beyond the Surface: Measuring Self-Preference in LLM Judgments. In: EMNLP. pp. 1653–1672 (2025)

[13] Comanici, G., Bieber, E., et al.: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next-Generation Agentic Capabilities. arXiv:2507.06261 (2025)

[14] Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J., Salehi, M., et al.: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. In: CVPR. pp. 91–104 (2025)

[15] Fu, J., Ng, S., Jiang, Z., et al.: GPTScore: Evaluate as You Desire. In: NAACL. pp. 6556–6576 (2024)

[16] Gallegos, I., Rossi, R., Barrow, J., Tanjim, M., Kim, S., Dernoncourt, F., Yu, T., Zhang, R., et al.: Bias and Fairness in Large Language Models: A Survey. Computational Linguistics 50(3), 1097–1179 (2024)

[17] Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al.: A Survey on LLM-as-a-Judge. The Innovation (2024)

[18] Hirano, S., Wada, Y., Matsuda, K., Otsuki, S., Sugiura, K.: LLM-Free Image Captioning Evaluation in Reference-Flexible Settings. In: AAAI (2026)

[19] Hodosh, M., et al.: Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. JAIR 47, 853–899 (2013)

[20] Hu, Z., Song, L., Zhang, J., Xiao, Z., Chen, Z., Xiong, H.: Explaining Length Bias in LLM-Based Preference Evaluations. In: ICLR Workshop (2025)

[21] Hurst, A., Lerer, A., Goucher, A., Perelman, A., Ramesh, A., et al.: GPT-4o System Card. arXiv:2410.21276 (2024)

[22] Inoue, N., Goto, K., Oi, M., Gruszka, M., Ukai, M., Hirose, T., Sekikawa, Y.: DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning. In: AAAI (2026)

[23] Jin, Y., Li, J., Gu, T., Liu, Y., Zhao, B., Lai, J., Gan, Z., Wang, Y., Wang, C., Tan, X., et al.: Efficient Multimodal Large Language Models: A Survey. Visual Intelligence 3(1), 27 (2025)

[24] Kim, H., Kim, S., Jeong, J., Cho, Y., Cho, S.: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations. In: ACL. pp. 26642–26657 (2025)

[25] Krasin, I., Duerig, T., Alldrin, N., Veit, A., Abu-El-Haija, S., Belongie, S., Cai, D., Feng, Z., Ferrari, V., Gomes, V.: OpenImages: A Public Dataset for Large-Scale Multi-Label and Multi-Class Image Classification (2016)

[26] Laurito, W., Davis, B., et al.: AI–AI Bias: Large Language Models Favor Communications Generated by Large Language Models. PNAS 122(31), e2415697122 (2025)

[27] Lee, Y., Park, L., Kang, M.: FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. In: ACL. pp. 3732–3746 (2024)

[28] Lee, Y., et al.: CheckEval: A Reliable LLM-as-a-Judge Framework for Evaluating Text Generation Using Checklists. arXiv:2403.18771 (2024)

[29] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy Visual Task Transfer. TMLR (2024)

[30] Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., Liu, Y.: LLMs-as-Judges: A Comprehensive Survey on LLM-Based Evaluation Methods. arXiv:2412.05579 (2024)

[31] Li, Z., Chen, G., Liu, S., Wang, S., VS, V., Ji, Y., Lan, S., Zhang, H., et al.: Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models. arXiv:2501.14818 (2025)

[32] Li, Z., Wu, X., Du, H., Liu, F., Nghiem, H., Shi, G.: A Survey of State-of-the-Art Large Vision Language Models: Alignment, Benchmarks, Evaluations, and Challenges. arXiv:2501.02189 (2025)

[33] Lin, T., Maire, M., Belongie, S., Bourdev, L., Girshick, R., et al.: Microsoft COCO: Common Objects in Context. In: ECCV. pp. 740–755 (2014)

[34] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, J.: LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge (2024)

[35] Liu, H., et al.: Improved Baselines with Visual Instruction Tuning. In: CVPR. pp. 26296–26306 (2024)

[36] Liu, Y., Iter, D., Xu, Y., Wang, S., et al.: G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In: EMNLP. pp. 2511–2522 (2023)

[37] Liu, Y., Moosavi, S., Lin, C.: LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores. In: ACL. pp. 12688–12701 (2024)

[38] Matsuda, K., Wada, Y., Hirano, S., Otsuki, S., Sugiura, K.: VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions. In: EMNLP. pp. 8680–8696 (2025)

[39] Matsuda, K., et al.: DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning. In: ACCV. pp. 3570–3586 (2024)

[40] Mordor Intelligence: Large Language Model (LLM) Market Size & Share Analysis (2026), https://www.mordorintelligence.com/industry-reports/large-language-model-llm-market

[41] Nangia, N., et al.: CrowS-pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In: EMNLP. pp. 1953–1967 (2020)

[42] Ohi, M., et al.: Likelihood-based Mitigation of Evaluation Bias in Large Language Models. In: ACL Findings. pp. 3237–3245 (2024)

[43] Panickssery, A., Bowman, S., Feng, S.: LLM Evaluators Recognize and Favor Their Own Generations. In: NeurIPS. vol. 37, pp. 68772–68802 (2024)

[44] Ranjan, R., Gupta, S., Singh, S.: A Comprehensive Survey of Bias in LLMs: Current Landscape and Future Directions. arXiv:2409.16430 (2024)

[45] Sarto, S., Barraco, M., et al.: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In: CVPR. pp. 6914–6924 (2023)

[46] Sarto, S., Moratelli, N., et al.: Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training. In: IJCV (2025)

[47] Shen, S., Logeswaran, L., Lee, M., Lee, H., Poria, S., Mihalcea, R.: Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense. In: NAACL. pp. 5668–5680 (2024)

[48] Shi, L., Ma, C., Liang, W., Diao, X., et al.: Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. In: AACL. pp. 292–314 (2025)

[49] Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al.: Gemma 3 Technical Report. arXiv:2503.19786 (2025)

[50] Tong, T., He, S., et al.: G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o. In: AAAI. vol. 39, pp. 7419–7427 (2025)

[51] Wada, Y., Kanta, K., et al.: Polos: Multimodal Metric Learning from Human Feedback for Image Captioning. In: CVPR. pp. 13559–13568 (2024)

[52] Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Kong, L., Liu, Q., Liu, T., et al.: Large Language Models Are Not Fair Evaluators. In: ACL. pp. 9440–9450 (2024)

[53] Wataoka, K., Takahashi, T., Ri, R.: Self-Preference Bias in LLM-as-a-Judge. In: NeurIPS Workshop (2024)

[54] Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding. arXiv:2412.10302 (2024)

[55] Xu, W., Zhu, G., Zhao, X., et al.: Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement. In: ACL. pp. 15474–15492 (2024)

[56] Yao, Z., et al.: HiFi-Score: Fine-Grained Image Description Evaluation with Hierarchical Parsing Graphs. In: ECCV. pp. 441–458 (2024)

[57] Yin, S., Fu, C., Zhao, S., Li, K., et al.: A Survey on Multimodal Large Language Models. National Science Review 11(12) (2024)

[58] Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In: NeurIPS. vol. 36, pp. 46595–46623 (2023)

[59] Zhou, C., et al.: From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models. arXiv:2509.25373 (2025)

[60] Zhu, L., Wang, X., Wang, X.: JudgeLM: Fine-Tuned Large Language Models Are Scalable Judges. In: ICLR (2025)

MLLM-as-a-Judge Exhibits Model Preference Bias

Shuitsu Koyama⋆, Yuiga Wada⋆, Daichi Yashima, and Komei Sugiura
Keio University, Japan
{koyamashu3, yuiga, ydaichi1207, komei.sugiura}@keio.jp

A Details of Experimental Setup

A.1 Generators and Evaluato...

1. Carefully observe the provided image to understand its main content.
2. Read the reference captions carefully to identify the important information they highlight.
3. Compare the generated caption to both the reference captions and the visual content of the image.
4. Assess how well the generated caption covers the main points of the visual content and the reference captions, and how much irrelevant or redundant information it contains.
5. Assign an integer score from 0 to 100, considering both the alignment with the image and the inclusion of key points from the references. Please remember the score.

Reference captions: {{Reference}}
Image is attached
Generated captions: {{Caption}}

Response Format: You should first give a detailed reason for your score, ending with a sentence like this: T...

1. Carefully observe the image provided.
2. Identify the main points of the visual content in the image.
3. Assess how well the generated caption covers the main points of the visual content, and how much irrelevant or redundant information it contains.
4. Assign an integer score from 0 to 100, please remember it.

Generated captions: {{Caption}}

Response Format: You should first give detailed reason for your score, and ending with sentence like this: The final score is ${{score}}$. Note that the score should be an integer from 0 to 100, and should be wrapped in the dollar signs ($).
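The response format above asks the judge to close with a sentence of the form "The final score is $...$", with an integer from 0 to 100 wrapped in dollar signs. A minimal sketch of how such a response might be parsed is given below. The function name `extract_score`, the regular expression, and the choice to take the last match and reject out-of-range values are all assumptions made for illustration; the source does not describe the authors' parsing code.

```python
import re
from typing import Optional

# Assumed pattern: the fixed phrase followed by an integer wrapped in dollar signs.
# Optional braces are tolerated in case a judge echoes the template braces literally.
_SCORE_RE = re.compile(r"The final score is \$\{*(\d{1,3})\}*\$", re.IGNORECASE)


def extract_score(response: str) -> Optional[int]:
    """Return the last in-range score found in a judge response, or None."""
    matches = _SCORE_RE.findall(response)
    if not matches:
        return None  # the judge did not follow the requested format
    score = int(matches[-1])  # use the closing sentence if the phrase repeats
    # The prompt restricts scores to integers from 0 to 100.
    return score if 0 <= score <= 100 else None


if __name__ == "__main__":
    demo = "The caption covers the main objects. The final score is $85$."
    print(extract_score(demo))
```

Returning `None` rather than raising keeps malformed responses visible to the caller, which can then decide whether to retry or discard the sample; that policy is likewise an assumption of this sketch.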