BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories
Pith reviewed 2026-06-26 11:04 UTC · model grok-4.3
The pith
LLM judges show reliability drops in lower-resource languages that raw accuracy understates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BabelJudge measures position bias, verbosity bias, order inconsistency, and cross-lingual degradation on any judge model. It does so by generating pairwise items whose gold labels are known by construction, through controlled degradation of high-quality references. On Qwen2.5-7B-Instruct-4bit the bias-penalised reliability score drops from 0.714 in Hindi to 0.550 in Swahili, while raw accuracy falls from 0.835 to 0.660 and stays well above the reliability score in both languages. Swahili order consistency collapses to 0.480. The framework further supports nine trajectory-level perturbations and reports tool accuracy, hallucination detection rate, and trajectory-length bias.
What carries the argument
Gold-labelling by degradation, which creates pairwise comparison items with known-correct labels by applying controlled perturbations to high-quality reference responses.
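A minimal sketch of how gold-labelling by degradation could look in Python. The names `PairwiseItem`, `drop_last_sentence`, and `make_item` are hypothetical and are not taken from the BabelJudge package. The sentence-dropping perturbation is only a stand-in for whatever controlled degradations the paper actually uses.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class PairwiseItem:
    prompt: str
    response_a: str
    response_b: str
    gold: str  # "A" or "B": the slot holding the unperturbed reference


def drop_last_sentence(text: str) -> str:
    """Hypothetical perturbation: remove the final sentence of the reference."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to degrade")
    return ". ".join(sentences[:-1]) + "."


def make_item(
    prompt: str,
    reference: str,
    perturb: Callable[[str], str],
    rng: random.Random,
) -> PairwiseItem:
    """Pair a reference with its degraded copy; the gold label follows from the slot."""
    degraded = perturb(reference)
    if degraded == reference:
        raise ValueError("perturbation left the reference unchanged")
    # Randomise the slot so the gold label is not correlated with position.
    if rng.random() < 0.5:
        return PairwiseItem(prompt, reference, degraded, "A")
    return PairwiseItem(prompt, degraded, reference, "B")
```

Because the label is fixed by which slot receives the reference, no annotator is needed. Whether the degraded copy is truly worse is exactly the assumption the rest of this review examines.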
If this is right
- Raw accuracy alone masks large reliability gaps across languages.
- Order inconsistency can render verdicts near-random under simple slot swaps in some languages.
- The benchmark works on any judge model and requires no human preference data.
- Trajectory perturbations enable measurement of tool-use accuracy and hallucination detection in agent evaluations.
- A composite score that penalizes detected biases gives a stricter assessment than accuracy.
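The points above can be made concrete with a small sketch that scores a judge from two passes over the same items, the second with the slots swapped. This is an illustration under stated assumptions, not the paper's definitions: the function name `judge_metrics` is hypothetical, and the `penalised` score (accuracy multiplied by order consistency) is a placeholder, since the paper's composite formula is not given in this review.

```python
from typing import Dict, Sequence


def judge_metrics(
    gold_slots: Sequence[str],
    first_pass: Sequence[str],
    swapped_pass: Sequence[str],
) -> Dict[str, float]:
    """
    gold_slots[i]   : slot ("A" or "B") holding the better response in pass 1
    first_pass[i]   : slot the judge chose in pass 1
    swapped_pass[i] : slot the judge chose after the two responses swapped slots
    """
    n = len(gold_slots)
    if n == 0 or not (n == len(first_pass) == len(swapped_pass)):
        raise ValueError("inputs must be non-empty and of equal length")

    accuracy = sum(v == g for v, g in zip(first_pass, gold_slots)) / n

    # After a swap, a consistent judge must name the *other* slot,
    # because the response it preferred has moved there.
    order_consistency = sum(
        v1 != v2 for v1, v2 in zip(first_pass, swapped_pass)
    ) / n

    # Share of all verdicts that went to slot A, centred on 0.5.
    a_votes = sum(v == "A" for v in first_pass) + sum(v == "A" for v in swapped_pass)
    position_bias = a_votes / (2 * n) - 0.5

    # ASSUMPTION: illustrative penalty only, not BabelJudge's composite score.
    penalised = accuracy * order_consistency

    return {
        "accuracy": accuracy,
        "order_consistency": order_consistency,
        "position_bias": position_bias,
        "penalised": penalised,
    }
```

In this sketch a judge that always answers "A" scores 0.0 on order consistency and 0.5 on position bias, even though its pass-1 accuracy can sit near 0.5 on balanced items, which is the kind of failure accuracy alone would not reveal.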
Where Pith is reading between the lines
- Judge developers may need language-specific calibration data to close the observed reliability gaps.
- The automatic labelling technique could be reused to build evaluation sets for other subjective tasks such as summarization or reasoning quality.
- Similar order and position effects may appear when the same judges assess multi-step agent trajectories.
- Widespread adoption of the released package would let practitioners routinely audit judges before deployment.
Load-bearing premise
Perturbations applied to high-quality references produce items whose true preference label is known by construction.
What would settle it
If human raters systematically disagree with the assumed gold labels on a held-out sample of perturbed pairs, the automatic labelling method would be invalidated.
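A check of this kind could be sketched as a per-class agreement rate between human labels and the constructed gold labels. The record layout, the function names, and the 0.8 threshold below are all assumptions made for illustration; the paper does not specify a validation protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (perturbation_class, constructed_gold_label, human_label)
Record = Tuple[str, str, str]


def agreement_by_class(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of items per perturbation class where the human label matches gold."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for cls, gold, human in records:
        totals[cls] += 1
        hits[cls] += int(gold == human)
    return {cls: hits[cls] / totals[cls] for cls in totals}


def suspect_classes(rates: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Classes whose agreement falls below an (assumed) acceptance threshold."""
    return sorted(cls for cls, rate in rates.items() if rate < threshold)
```

Breaking agreement down by perturbation class, and ideally by language as well, matters because a single weak class could distort the headline scores while the pooled agreement still looks acceptable.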
Original abstract
LLM-as-a-judge has become the dominant approach to scalable evaluation in NLP pipelines, yet judges themselves carry systematic biases that raw accuracy hides: they favor responses placed in slot A (position bias), they prefer longer responses regardless of quality (verbosity bias), and their reliability degrades sharply in lower-resource languages. We introduce BabelJudge, an open-source benchmark and reliability audit framework that measures all four failure modes -- position bias, verbosity bias, order inconsistency, and cross-lingual degradation -- on any judge model, without requiring human preference labels. The key insight is gold-labelling by degradation: starting from a high-quality reference response and applying a controlled perturbation yields a pairwise item whose gold label is known by construction, eliminating annotation cost. We evaluate Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili and find that our composite bias-penalised reliability score drops from 0.714 in Hindi to 0.550 in Swahili, a gap that raw accuracy (0.835 vs. 0.660) understates. Swahili order consistency collapses to 0.480, meaning judge verdicts are near-random under slot-order swaps -- a failure mode invisible to accuracy alone. We further extend the framework to agentic evaluation via nine trajectory-level perturbations (argument corruption, tool swaps, hallucinated calls, missing steps) and three new metrics: tool accuracy, hallucination detection rate, and trajectory-length bias. BabelJudge is released as a Python package supporting 11 judge backends. Code: https://github.com/Shreyaskc/BabelJudge
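The four trajectory perturbations named in the abstract can be sketched as pure functions over a list of tool calls. The step schema (`{"tool": ..., "args": {...}}`) and all function names are hypothetical; the abstract does not describe how BabelJudge represents trajectories, nor what the remaining five perturbations are.

```python
import copy
from typing import Any, Dict, List

Step = Dict[str, Any]  # assumed schema: {"tool": str, "args": dict}


def corrupt_argument(traj: List[Step], index: int, key: str, value: Any) -> List[Step]:
    """Argument corruption: overwrite one argument of one call."""
    out = copy.deepcopy(traj)
    if key not in out[index]["args"]:
        raise KeyError(f"step {index} has no argument {key!r}")
    out[index]["args"][key] = value
    return out


def swap_tool(traj: List[Step], index: int, new_tool: str) -> List[Step]:
    """Tool swap: keep the arguments but call a different tool."""
    out = copy.deepcopy(traj)
    out[index]["tool"] = new_tool
    return out


def insert_hallucinated_call(traj: List[Step], index: int, call: Step) -> List[Step]:
    """Hallucinated call: insert a step that the task never required."""
    out = copy.deepcopy(traj)
    out.insert(index, copy.deepcopy(call))
    return out


def drop_step(traj: List[Step], index: int) -> List[Step]:
    """Missing step: delete one call from the trajectory."""
    out = copy.deepcopy(traj)
    del out[index]
    return out
```

Each function returns a modified copy and leaves the reference trajectory untouched, so the original and the perturbed version can be paired directly as a known-label comparison item.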
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BabelJudge, an open-source benchmark and audit framework for LLM-as-a-judge models that measures position bias, verbosity bias, order inconsistency, and cross-lingual degradation without human preference labels. The core method constructs gold labels by applying controlled perturbations to high-quality reference responses. It evaluates Qwen2.5-7B-Instruct-4bit on English, Hindi, Arabic, and Swahili, reporting a composite bias-penalised reliability drop from 0.714 (Hindi) to 0.550 (Swahili) and order consistency of 0.480 in Swahili; the framework is extended to agent trajectories via nine perturbations (e.g., argument corruption, tool swaps) with new metrics for tool accuracy, hallucination detection, and trajectory-length bias. A Python package supporting 11 judge backends is released.
Significance. If the perturbation-based gold labelling is valid, the work supplies a scalable, annotation-free method to expose failure modes that raw accuracy conceals, especially in low-resource languages and agentic settings. The open-source release and multi-backend support constitute a concrete contribution to reproducibility.
major comments (2)
- [Abstract] Abstract (method description): The central claim that 'starting from a high-quality reference response and applying a controlled perturbation yields a pairwise item whose gold label is known by construction' is load-bearing for every reported metric (bias-penalised reliability, order consistency, tool accuracy, hallucination detection). No independent validation—such as human agreement rates confirming that each perturbation class produces an objectively inferior response—is described, and this assumption is not shown to hold invariantly across the four languages.
- [Abstract] Abstract (results): The reported gap between raw accuracy (0.835 Hindi vs. 0.660 Swahili) and the composite score (0.714 vs. 0.550), together with the Swahili order-consistency collapse to 0.480, rests entirely on the correctness of the constructed labels. If any perturbation class fails to degrade quality consistently, these numerical claims and the conclusion that accuracy 'understates' the problem are undermined.
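For readers weighing the "understates" claim, the arithmetic behind the reported figures can be reproduced in a few lines. The four scores are copied from the abstract; the helper `drops` is a hypothetical name, and the choice to show both absolute and relative declines is this sketch's, not the paper's.

```python
# Scores as reported in the abstract for Qwen2.5-7B-Instruct-4bit.
scores = {
    "accuracy": {"hindi": 0.835, "swahili": 0.660},
    "reliability": {"hindi": 0.714, "swahili": 0.550},
}


def drops(high: float, low: float) -> tuple:
    """Return (absolute drop, drop relative to the higher score)."""
    absolute = high - low
    return absolute, absolute / high


for name, by_lang in scores.items():
    absolute, relative = drops(by_lang["hindi"], by_lang["swahili"])
    print(f"{name}: absolute drop {absolute:.3f}, relative drop {relative:.1%}")
```

Comparing the two printed lines shows how much of the claim depends on whether the gap is read in absolute or relative terms, which is worth stating explicitly alongside the label-validity concern.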
Simulated Author's Rebuttal
We thank the referee for the constructive comments regarding the foundational assumption of our perturbation-based gold labeling. We address each major comment below and commit to revisions that strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] Abstract (method description): The central claim that 'starting from a high-quality reference response and applying a controlled perturbation yields a pairwise item whose gold label is known by construction' is load-bearing for every reported metric (bias-penalised reliability, order consistency, tool accuracy, hallucination detection). No independent validation—such as human agreement rates confirming that each perturbation class produces an objectively inferior response—is described, and this assumption is not shown to hold invariantly across the four languages.
Authors: The perturbations are constructed to introduce unambiguous degradations (e.g., factual errors, coherence breaks, or incorrect tool calls) that are intended to be objectively inferior by design. We acknowledge, however, that the manuscript does not report independent human validation of these labels. In the revised manuscript we will add a targeted human evaluation on a stratified sample of perturbed pairs across English, Hindi, Arabic, and Swahili, reporting inter-annotator agreement with the constructed gold labels. revision: yes
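The stratified sample promised in this response could be drawn along the following lines. This is a sketch under assumptions: the item fields (`lang`, `cls`), the function name `stratified_sample`, and the fixed per-stratum quota are illustrative choices, not details from the rebuttal.

```python
import random
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List


def stratified_sample(
    items: List[Dict[str, Any]],
    key: Callable[[Dict[str, Any]], Hashable],
    per_stratum: int,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Draw up to `per_stratum` items from every stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    strata: Dict[Hashable, List[Dict[str, Any]]] = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample: List[Dict[str, Any]] = []
    for name in sorted(strata):  # deterministic stratum order
        pool = strata[name]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample
```

Stratifying on the pair (language, perturbation class) ensures that every combination receives human labels, so a weak perturbation in one language cannot hide behind strong agreement elsewhere.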
Referee: [Abstract] Abstract (results): The reported gap between raw accuracy (0.835 Hindi vs. 0.660 Swahili) and the composite score (0.714 vs. 0.550), together with the Swahili order-consistency collapse to 0.480, rests entirely on the correctness of the constructed labels. If any perturbation class fails to degrade quality consistently, these numerical claims and the conclusion that accuracy 'understates' the problem are undermined.
Authors: The reported metrics and the claim that raw accuracy understates reliability issues are indeed contingent on the perturbations consistently producing lower-quality responses. The human validation study described above will provide direct empirical support for this assumption across languages, allowing us to verify that the observed gaps and the Swahili order-consistency drop reflect genuine judge limitations rather than label artifacts. revision: yes
Circularity Check
No circularity: synthetic labels by explicit construction, no fitted predictions or self-citation chains
Full rationale
The paper's core method explicitly constructs pairwise gold labels via controlled perturbations on reference responses, stating the label is 'known by construction.' This is a deliberate synthetic benchmark design rather than a derivation, prediction, or fitted result that reduces to its inputs. No equations, parameters, or self-citations are described that would create self-definitional or load-bearing circularity. Reported metrics (bias-penalised reliability, order consistency) are direct evaluations against these transparently synthetic labels. The framework is self-contained as a measurement tool without claiming external first-principles derivations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Controlled perturbations applied to high-quality reference responses create pairwise items with known gold labels by construction.