BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

Shreyas KC

arXiv: 2606.22329 · v1 · pith:DURIBKVQ · submitted 2026-06-21 · cs.CL · cs.AI

Pith reviewed 2026-06-26 11:04 UTC · model grok-4.3

keywords: LLM-as-a-judge, bias evaluation, cross-lingual reliability, order inconsistency, agent trajectories, position bias, verbosity bias

The pith

LLM judges show reliability drops in lower-resource languages that raw accuracy understates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

BabelJudge supplies a benchmark and audit framework to quantify four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The approach generates gold-labelled pairs automatically by taking high-quality reference responses and applying controlled perturbations, so no human preference annotations are required. When tested on Qwen2.5-7B-Instruct-4bit, the composite bias-penalised reliability score falls from 0.714 in Hindi to 0.550 in Swahili, and Swahili order consistency reaches only 0.480. The same machinery extends to agent trajectories through nine targeted perturbations and three new metrics: tool accuracy, hallucination detection rate, and trajectory-length bias. Because LLM judges now dominate scalable evaluation, these hidden failures can distort results, especially outside high-resource languages.

Core claim

BabelJudge measures position bias, verbosity bias, order inconsistency, and cross-lingual degradation on any judge model by generating pairwise items whose gold labels are known by construction through controlled degradation of high-quality references. On Qwen2.5-7B-Instruct-4bit the bias-penalised reliability score drops from 0.714 in Hindi to 0.550 in Swahili, while raw accuracy falls from 0.835 to 0.660; Swahili order consistency collapses to 0.480. The framework further supports nine trajectory-level perturbations and reports tool accuracy, hallucination detection rate, and trajectory-length bias.
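To make the size of the reported gap concrete, the short sketch below computes the absolute and relative Hindi-to-Swahili drops for both scores. The four numbers are the ones reported in the abstract; the choice of "absolute" and "relative" drop as the comparison is an assumption of this sketch, since this page does not say how the paper itself quantifies "understates".

```python
# Figures as reported in the abstract for Qwen2.5-7B-Instruct-4bit.
reported = {
    "reliability": {"hindi": 0.714, "swahili": 0.550},
    "raw_accuracy": {"hindi": 0.835, "swahili": 0.660},
}


def drops(scores):
    """Return (absolute drop, relative drop) going from Hindi to Swahili."""
    absolute = scores["hindi"] - scores["swahili"]
    return absolute, absolute / scores["hindi"]


# Illustrative comparison only; not a metric defined by the paper.
summary = {name: drops(scores) for name, scores in reported.items()}
for name, (absolute, relative) in summary.items():
    print(f"{name}: absolute drop {absolute:.3f}, relative drop {relative:.1%}")
```

Under these definitions the two absolute drops are close (0.164 for reliability, 0.175 for raw accuracy), and the reliability score shows the larger decline only in relative terms (roughly 23% against roughly 21%).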

What carries the argument

Gold-labelling by degradation, which creates pairwise comparison items with known-correct labels by applying controlled perturbations to high-quality reference responses.
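A minimal sketch of gold-labelling by degradation, assuming a single toy perturbation. The `PairItem` class, `drop_last_sentence`, and `make_item` are hypothetical names introduced for illustration and are not taken from the BabelJudge package; the paper's actual perturbation set is not listed on this page.

```python
import random
from dataclasses import dataclass


@dataclass
class PairItem:
    prompt: str
    response_a: str
    response_b: str
    gold: str  # "A" or "B": the slot that holds the unperturbed reference


def drop_last_sentence(text: str) -> str:
    """Hypothetical perturbation: remove the final sentence of the reference."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to perturb")
    return ". ".join(sentences[:-1]) + "."


def make_item(prompt: str, reference: str, rng: random.Random) -> PairItem:
    """Pair the reference with its degraded copy; the gold label follows from the slot."""
    degraded = drop_last_sentence(reference)
    # Randomise the slot so the gold label is not tied to a fixed position.
    if rng.random() < 0.5:
        return PairItem(prompt, reference, degraded, gold="A")
    return PairItem(prompt, degraded, reference, gold="B")


reference_text = (
    "Plants absorb light. They convert water and carbon dioxide into glucose. "
    "Oxygen is released."
)
item = make_item("How does photosynthesis work?", reference_text, random.Random(0))
print(item.gold, "|", item.response_a, "|", item.response_b)
```

The label is "known by construction" only to the extent that the perturbation really makes the response worse, which is the premise the rest of this page examines.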

If this is right

Raw accuracy alone masks large reliability gaps across languages.
Order inconsistency can render verdicts near-random under simple slot swaps in some languages.
The benchmark works on any judge model and requires no human preference data.
Trajectory perturbations enable measurement of tool-use accuracy and hallucination detection in agent evaluations.
A composite score that penalizes detected biases gives a stricter assessment than accuracy.
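The points above about slot swaps can be illustrated with a toy audit that judges every pair in both slot orders. This is a sketch under assumptions: the metric definitions (`order_consistency`, `raw_accuracy`, `slot_a_rate`) are plausible readings of the terms used on this page, not the paper's formulas, and both judges are hypothetical stand-ins for a real model call.

```python
from typing import Callable, Dict, List, Tuple

# A judge takes (prompt, slot_a, slot_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def first_slot_judge(prompt: str, slot_a: str, slot_b: str) -> str:
    """Toy judge with maximal position bias: always prefers slot A."""
    return "A"


def longer_judge(prompt: str, slot_a: str, slot_b: str) -> str:
    """Toy judge that prefers the longer response, whichever slot it is in."""
    return "A" if len(slot_a) >= len(slot_b) else "B"


def audit(judge: Judge, items: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """items are (prompt, better, worse); each pair is judged in both slot orders."""
    consistent = correct = picked_a = 0
    for prompt, better, worse in items:
        v1 = judge(prompt, better, worse)  # better response in slot A
        v2 = judge(prompt, worse, better)  # better response in slot B
        win1 = better if v1 == "A" else worse
        win2 = worse if v2 == "A" else better
        consistent += win1 == win2  # same response wins after the swap
        correct += (win1 == better) + (win2 == better)
        picked_a += (v1 == "A") + (v2 == "A")
    n = len(items)
    return {
        "order_consistency": consistent / n,
        "raw_accuracy": correct / (2 * n),
        "slot_a_rate": picked_a / (2 * n),
    }


items = [
    ("q1", "A detailed, correct answer.", "Wrong."),
    ("q2", "Paris is the capital of France.", "Lyon."),
]
biased = audit(first_slot_judge, items)
robust = audit(longer_judge, items)
print(biased)
print(robust)
```

The always-A judge scores 0.5 on raw accuracy but 0.0 on order consistency, which shows how the two numbers can come apart. Under this definition a judge that answers at random would be order-consistent about half the time, which is one way to read the abstract's description of 0.480 as near-random.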

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Judge developers may need language-specific calibration data to close the observed reliability gaps.
The automatic labelling technique could be reused to build evaluation sets for other subjective tasks such as summarization or reasoning quality.
Similar order and position effects may appear when the same judges assess multi-step agent trajectories.
Widespread adoption of the released package would let practitioners routinely audit judges before deployment.

Load-bearing premise

Perturbations applied to high-quality references produce items whose true preference label is known by construction.

What would settle it

If human raters systematically disagree with the assumed gold labels on a held-out sample of perturbed pairs, the automatic labelling method would be invalidated.
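Such a check could be tabulated as an agreement rate per language and perturbation class. The sketch below assumes a simple record layout, and the annotations in it are made-up toy values used only to exercise the function; they are not results from the paper or from any real annotation study.

```python
from collections import defaultdict

# Toy, invented records: (language, perturbation class, gold label, human label).
annotations = [
    ("sw", "factual_error", "A", "A"),
    ("sw", "factual_error", "B", "B"),
    ("sw", "truncation", "A", "B"),
    ("sw", "truncation", "B", "B"),
    ("hi", "factual_error", "A", "A"),
    ("hi", "truncation", "B", "B"),
]


def agreement_by_group(rows):
    """Fraction of items where the human label matches the constructed gold label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for language, perturbation, gold, human in rows:
        key = (language, perturbation)
        totals[key] += 1
        hits[key] += gold == human
    return {key: hits[key] / totals[key] for key in totals}


rates = agreement_by_group(annotations)
for key, rate in sorted(rates.items()):
    print(key, f"{rate:.2f}")
```

Grouping by both language and perturbation class matters because the premise could hold for one class (say, injected factual errors) and fail for another in a single language, and a pooled agreement rate would hide that.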

Figures

Figures reproduced from arXiv: 2606.22329 by Shreyas KC.

**Figure 2.** Reliability score vs. raw accuracy for Qwen2.5-7B-Instruct-4bit across four languages. Raw …

**Figure 3.** Per-metric bias breakdown across four languages. Colours indicate severity: green = …

**Figure 4.** Reliability radar charts per language. Each axis represents a normalised reliability …

Original abstract

LLM-as-a-judge has become the dominant approach to scalable evaluation in NLP pipelines, yet judges themselves carry systematic biases that raw accuracy hides: they favor responses placed in slot A (position bias), they prefer longer responses regardless of quality (verbosity bias), and their reliability degrades sharply in lower-resource languages. We introduce BabelJudge, an open-source benchmark and reliability audit framework that measures all four failure modes -- position bias, verbosity bias, order inconsistency, and cross-lingual degradation -- on any judge model, without requiring human preference labels. The key insight is gold-labelling by degradation: starting from a high-quality reference response and applying a controlled perturbation yields a pairwise item whose gold label is known by construction, eliminating annotation cost. We evaluate Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili and find that our composite bias-penalised reliability score drops from 0.714 in Hindi to 0.550 in Swahili, a gap that raw accuracy (0.835 vs. 0.660) understates. Swahili order consistency collapses to 0.480, meaning judge verdicts are near-random under slot-order swaps -- a failure mode invisible to accuracy alone. We further extend the framework to agentic evaluation via nine trajectory-level perturbations (argument corruption, tool swaps, hallucinated calls, missing steps) and three new metrics: tool accuracy, hallucination detection rate, and trajectory-length bias. BabelJudge is released as a Python package supporting 11 judge backends. Code: https://github.com/Shreyaskc/BabelJudge
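The four trajectory perturbations the abstract names (argument corruption, tool swaps, hallucinated calls, missing steps) can be sketched on a list of tool calls as follows. This is an illustration under assumptions: the trajectory schema, the tool names, and the function names are all hypothetical, and the remaining five perturbations are not described on this page.

```python
import copy
from typing import Any, Dict, List

Step = Dict[str, Any]  # assumed shape: {"tool": str, "args": dict}

# Hypothetical reference trajectory with invented tool names.
reference: List[Step] = [
    {"tool": "search_flights", "args": {"origin": "NBO", "dest": "DEL"}},
    {"tool": "book_flight", "args": {"flight_id": "F123"}},
    {"tool": "send_confirmation", "args": {"channel": "email"}},
]


def swap_tool(traj: List[Step], index: int, wrong_tool: str) -> List[Step]:
    """Tool swap: replace the tool name at one step."""
    out = copy.deepcopy(traj)
    out[index]["tool"] = wrong_tool
    return out


def drop_step(traj: List[Step], index: int) -> List[Step]:
    """Missing step: delete one call from the trajectory."""
    out = copy.deepcopy(traj)
    del out[index]
    return out


def insert_hallucinated_call(traj: List[Step], index: int, fake_tool: str) -> List[Step]:
    """Hallucinated call: insert a call to a tool that does not exist."""
    out = copy.deepcopy(traj)
    out.insert(index, {"tool": fake_tool, "args": {}})
    return out


def corrupt_argument(traj: List[Step], index: int, key: str, bad_value: Any) -> List[Step]:
    """Argument corruption: overwrite one argument value."""
    out = copy.deepcopy(traj)
    out[index]["args"][key] = bad_value
    return out


perturbed = {
    "tool_swap": swap_tool(reference, 1, "cancel_flight"),
    "missing_step": drop_step(reference, 1),
    "hallucinated_call": insert_hallucinated_call(reference, 2, "teleport_user"),
    "argument_corruption": corrupt_argument(reference, 0, "dest", "???"),
}
for name, traj in perturbed.items():
    print(name, [step["tool"] for step in traj])
```

Each function deep-copies the reference first, so the unperturbed trajectory stays intact and can be paired with any perturbed copy as the presumed-better item.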

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BabelJudge introduces a perturbation method to create gold labels for auditing LLM judges without annotations, but the approach depends on an unverified assumption that degradations always produce objectively worse responses.

The main thing to know is that this paper offers a label-free way to test LLM judges for position bias, verbosity bias, order inconsistency, and language degradation by starting with good responses and applying controlled changes to create known-better pairs. They extend the same idea to agent trajectories with nine perturbation types like tool swaps and hallucinated calls, and they release code that works with eleven judge backends.

What the paper does well is release usable software and demonstrate that their composite score shows larger drops across languages than raw accuracy does. The reported numbers on Swahili order consistency falling to 0.480 illustrate why simple metrics can hide problems. Testing four languages and adding trajectory-level metrics is a reasonable expansion of the audit idea.

The soft spot is the load-bearing premise. The method assumes every perturbation strictly degrades quality in an objective, language-invariant way, so the original is always better by construction. If that does not hold, especially in lower-resource languages like Swahili where quality signals may differ, the gold labels become unreliable and all the downstream reliability scores rest on shaky ground. The abstract gives no indication that the author ran human checks to confirm people agree the perturbed versions are worse.

This is for researchers and engineers who build or rely on automated evaluation pipelines for LLMs and agents, particularly those working across languages. A reader who needs a practical benchmark to measure hidden judge failures would get value from the framework and the reported gaps.

It deserves a serious referee because the problem is real and the scalable approach addresses a clear need, even though the label construction step needs stronger support.

Referee Report

2 major / 0 minor

Summary. The paper introduces BabelJudge, an open-source benchmark and audit framework for LLM-as-a-judge models that measures position bias, verbosity bias, order inconsistency, and cross-lingual degradation without human preference labels. The core method constructs gold labels by applying controlled perturbations to high-quality reference responses. It evaluates Qwen2.5-7B-Instruct-4bit on English, Hindi, Arabic, and Swahili, reporting a composite bias-penalised reliability drop from 0.714 (Hindi) to 0.550 (Swahili) and order consistency of 0.480 in Swahili; the framework is extended to agent trajectories via nine perturbations (e.g., argument corruption, tool swaps) with new metrics for tool accuracy, hallucination detection, and trajectory-length bias. A Python package supporting 11 judge backends is released.

Significance. If the perturbation-based gold labelling is valid, the work supplies a scalable, annotation-free method to expose failure modes that raw accuracy conceals, especially in low-resource languages and agentic settings. The open-source release and multi-backend support constitute a concrete contribution to reproducibility.

major comments (2)

[Abstract] Abstract (method description): The central claim that 'starting from a high-quality reference response and applying a controlled perturbation yields a pairwise item whose gold label is known by construction' is load-bearing for every reported metric (bias-penalised reliability, order consistency, tool accuracy, hallucination detection). No independent validation—such as human agreement rates confirming that each perturbation class produces an objectively inferior response—is described, and this assumption is not shown to hold invariantly across the four languages.
[Abstract] Abstract (results): The reported gap between raw accuracy (0.835 Hindi vs. 0.660 Swahili) and the composite score (0.714 vs. 0.550), together with the Swahili order-consistency collapse to 0.480, rests entirely on the correctness of the constructed labels. If any perturbation class fails to degrade quality consistently, these numerical claims and the conclusion that accuracy 'understates' the problem are undermined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments regarding the foundational assumption of our perturbation-based gold labeling. We address each major comment below and commit to revisions that strengthen the manuscript.

Referee: [Abstract] Abstract (method description): The central claim that 'starting from a high-quality reference response and applying a controlled perturbation yields a pairwise item whose gold label is known by construction' is load-bearing for every reported metric (bias-penalised reliability, order consistency, tool accuracy, hallucination detection). No independent validation—such as human agreement rates confirming that each perturbation class produces an objectively inferior response—is described, and this assumption is not shown to hold invariantly across the four languages.

Authors: The perturbations are constructed to introduce unambiguous degradations (e.g., factual errors, coherence breaks, or incorrect tool calls) that are intended to be objectively inferior by design. We acknowledge, however, that the manuscript does not report independent human validation of these labels. In the revised manuscript we will add a targeted human evaluation on a stratified sample of perturbed pairs across English, Hindi, Arabic, and Swahili, reporting inter-annotator agreement with the constructed gold labels.
Revision: yes

Referee: [Abstract] Abstract (results): The reported gap between raw accuracy (0.835 Hindi vs. 0.660 Swahili) and the composite score (0.714 vs. 0.550), together with the Swahili order-consistency collapse to 0.480, rests entirely on the correctness of the constructed labels. If any perturbation class fails to degrade quality consistently, these numerical claims and the conclusion that accuracy 'understates' the problem are undermined.

Authors: The reported metrics and the claim that raw accuracy understates reliability issues are indeed contingent on the perturbations consistently producing lower-quality responses. The human validation study described above will provide direct empirical support for this assumption across languages, allowing us to verify that the observed gaps and the Swahili order-consistency drop reflect genuine judge limitations rather than label artifacts.
Revision: yes

Circularity Check

0 steps flagged

No circularity: synthetic labels by explicit construction, no fitted predictions or self-citation chains

The paper's core method explicitly constructs pairwise gold labels via controlled perturbations on reference responses, stating the label is 'known by construction.' This is a deliberate synthetic benchmark design rather than a derivation, prediction, or fitted result that reduces to its inputs. No equations, parameters, or self-citations are described that would create self-definitional or load-bearing circularity. Reported metrics (bias-penalised reliability, order consistency) are direct evaluations against these transparently synthetic labels. The framework is self-contained as a measurement tool without claiming external first-principles derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that controlled perturbations reliably produce lower-quality responses whose relative quality is known by construction.

axioms (1)

domain assumption: Controlled perturbations applied to high-quality reference responses create pairwise items with known gold labels by construction.
This assumption enables the entire label-free evaluation framework described in the abstract.

Reference graph

Works this paper leans on

18 extracted references · 9 canonical work pages · 2 internal anchors

[1] Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Cao, Sunny Shen, Isha Ashok, Gullal Bhatt, et al. MEGA: Multilingual evaluation of generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, 2023.

[2] Apple MLX Team. MLX: An array framework for Apple silicon, 2023. URL https://github.com/ml-explore/mlx

[3] Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback. In Advances in Neural Information Processing Systems, volume 36, 2024.

[4] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as you desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024.

[5] Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. Evaluating NLP models via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3382–3387, 2020.

[6] Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, et al. Evaluating large language models: A comprehensive survey. arXiv preprint arXiv:2310.19736, 2023.

[7] Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. METaL: Multilingual evaluation of trustworthy LLMs. arXiv preprint arXiv:2407.03470, 2024.

[8] Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, and Jaewoo Kang. Look at the first sentence: Position bias in question answering. arXiv preprint arXiv:2004.14602, 2020.

[9] Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. Calibrating LLM-based evaluator. arXiv preprint arXiv:2309.13308, 2023.

[10] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744, 2022.

[11] Arjun Panickssery, Samuel R Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076, 2024.

[12] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.

[13] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, 2020.

[14] Keita Saito, Akifumi Sugawa, Hiroaki Ouchi, and Taro Watanabe. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076, 2023.

[15] Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F Slightam, Gullal S Bhatt, Huang Kaitao, et al. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619, 2024.

[16] Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in RLHF. arXiv preprint arXiv:2310.03716, 2023.

[17] Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not robust multiple choice selectors. In International Conference on Learning Representations, 2024.

[18] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023.
