Large-scale study of 21 LLM-as-a-Judge models shows exact-match agreement overstates reliability, rankings shift across benchmarks, and high consistency can mask position bias.
How reliable is multilingual LLM -as-a-judge? In Findings of the Association for Computational Linguistics: EMNLP 2025 , pp.\ 11040--11053, Suzhou, China
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 4years
2026 4verdicts
UNVERDICTED 4representative citing papers
A new evaluation framework shows that even the best tested LLM only reliably adjusts response complexity in the intended direction 46% of the time across 98 scientific queries.
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
Modifying nationality and language parameters in English-centric personas for mental health dialogues introduces clinical inconsistencies across languages and causes LLM judges to perform inaccurately on non-English depression severity assessments.
citing papers explorer
-
Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias
Large-scale study of 21 LLM-as-a-Judge models shows exact-match agreement overstates reliability, rankings shift across benchmarks, and high consistency can mask position bias.
-
Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses
A new evaluation framework shows that even the best tested LLM only reliably adjusts response complexity in the intended direction 46% of the time across 98 scientific queries.
-
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
-
Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language
Modifying nationality and language parameters in English-centric personas for mental health dialogues introduces clinical inconsistencies across languages and causes LLM judges to perform inaccurately on non-English depression severity assessments.