Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

Mina Remeli; Moritz Hardt

arXiv: 2606.09409 · v1 · pith:6AESUCV3new · submitted 2026-06-08 · 💻 cs.AI · cs.CL · cs.LG


Pith reviewed 2026-06-27 16:30 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords pairwise comparisons · Elo rankings · model evaluation · generative models · accuracy correlation · judge bias · style bias · echo effect


The pith

Pairwise comparisons with Elo produce model rankings that match accuracy rankings at Spearman correlation above 0.9.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper converts five established benchmarks into open-ended generative tasks so that both accuracy and pairwise judgments can be measured on the same outputs. It reports that Elo rankings derived from pairwise comparisons align closely with the accuracy orderings. The alignment remains strong even when the judge model is weak, a setting where direct scoring of individual answers performs noticeably worse. Style and judge bias show only small effects on the final rankings, while repetition of the final answer influences preferences on pairs that are tied in correctness.

Core claim

By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.

What carries the argument

Elo rating aggregation applied to pairwise comparisons of model-generated answers on converted benchmarks
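As a rough illustration of what Elo aggregation over pairwise judgments involves, the sketch below applies the standard online Elo update to a list of (preferred, other) outcomes. The K-factor of 32, the starting rating of 1000, and the toy judgments are assumptions made for illustration; the paper's exact aggregation settings are not stated on this page, and the figure captions refer to Bradley-Terry scores.

```python
from collections import defaultdict

def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Online Elo over (winner, loser) pairs; k and base are assumed defaults."""
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        # Expected score of the winner under the logistic Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        # Zero-sum update: the winner gains what the loser gives up.
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Hypothetical judgments: each tuple is (preferred model, other model).
comparisons = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
ratings = elo_ratings(comparisons)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Online Elo depends on the order in which comparisons arrive, which is one reason an order-independent fit such as Bradley-Terry is often preferred when all judgments are available up front.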

If this is right

Elo rankings from pairwise comparisons can serve as a reliable proxy for accuracy orderings when ground truth is unavailable.
Pairwise methods remain effective even when the judging model is weak, unlike direct per-answer scoring.
Style and judge biases exert limited distortion on overall model rankings in practice.
On correctness-tied pairs, controlling for echo repetition can reduce one source of preference variation.
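The echo intervention and the split into correctness-tied and discriminative pairs can be sketched as follows. The Figure 7 caption describes echo as appending the question-answer sequence three times; the exact text layout used here (question and final answer on separate lines) and the helper names are assumptions.

```python
def add_echo(question, answer_text, final_answer, repeats=3):
    """Append the question-answer sequence `repeats` times (layout is assumed)."""
    echo = f"\n{question}\n{final_answer}"
    return answer_text + echo * repeats

def is_discriminative(correct_a, correct_b):
    """A pair is discriminative when exactly one of the two answers is correct."""
    return correct_a != correct_b

# Hypothetical example answer.
question = "What is 2 + 2?"
original = "Adding the numbers gives 4."
echoed = add_echo(question, original, "4")

print(echoed)
print(is_discriminative(True, False))  # one correct, one incorrect
print(is_discriminative(True, True))   # tied in correctness
```

Comparing judge preferences on `original` versus `echoed` within tied pairs is the kind of controlled comparison that isolates echo from correctness.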

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The validation approach could be repeated on entirely new open-ended tasks that never had ground-truth answers to begin with.
Model developers might prefer pairwise collection over direct scoring when only a weak judge model is available.
The echo finding suggests that standardizing output format across compared answers could further stabilize judgments.

Load-bearing premise

Converting the five benchmarks into free-form generative evaluations preserves the original accuracy signal without introducing new artifacts that would artificially inflate the observed correlation.

What would settle it

Applying the same benchmark conversion and pairwise evaluation protocol to a sixth held-out benchmark and obtaining a Spearman correlation below 0.8 between Elo and accuracy rankings would falsify the reported level of agreement.
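The check described above reduces to computing a Spearman correlation between two score lists and comparing it to a threshold. The sketch below implements Spearman as the Pearson correlation of average ranks in plain Python; the five accuracy and Elo values are made up for illustration and are not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for five models on a held-out benchmark.
accuracy = [0.82, 0.74, 0.61, 0.55, 0.30]
elo = [1210.0, 1150.0, 1165.0, 1020.0, 940.0]
rho = spearman(accuracy, elo)
falsified = rho < 0.8  # threshold taken from the paragraph above
print(rho, falsified)
```

With only a handful of models, a single swapped pair moves the correlation noticeably, so the threshold is best read alongside the number of ranked models.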

Figures

Figures reproduced from arXiv: 2606.09409 by Mina Remeli, Moritz Hardt.

**Figure 1.** High correlation between accuracy and Bradley-Terry scores. Each point represents a model evaluated on a specific benchmark: we plot accuracy on the x axis and the Bradley-Terry score on the y axis. Bradley-Terry estimates each model’s ‘strength’ by aggregating over preferences between two candidate answers (pairwise comparisons). We collected pairwise comparisons using gpt-oss-20b. We observe both high …

**Figure 2.** From benchmarks to rankings. For each benchmark we obtain an Elo-style ranking and compare it to the accuracy based ranking. Accuracies are obtained by grading answers based on the ground truth answer (top part of the figure), while Elo-like scores are obtained by aggregating over pairwise comparisons across answer pairs (bottom part of the figure). For multiple choice benchmarks (such as MMLU Pro) we remo…

**Figure 3.** Alignment with accuracy based ranking. On the left side we have models ranked by their accuracy while on the right we rank them by their Bradley-Terry score. Distance between ranks is measured using Kendall’s distance (KD). We see consistent alignment across different types of benchmarks: multiple choice (MMLU Pro and GPQA Diamond [GPQA-D]), LLM graded (GSM8K) and multitask (Big Bench Hard [BBH]). Judge mo…

**Figure 4.** Direct judge doubles the rank distance on SimpleQA. We plot the rank alignment on SimpleQA for both Bradley-Terry and the direct judge baseline. Using gpt-oss-20b as a judge directly doubles the model pairs that would need to be swapped to recover the accuracy based ranking.

**Figure 5.** Bradley-Terry significantly outperforms baseline in the weak model setting. We measure the correlation to the accuracy-based ranking on SimpleQA. We vary our evaluator between the “strongest model” on the benchmark (o3, 59.3% acc) and the “weakest model” on the benchmark (gpt-oss-20b, 4.9% acc). While the direct judge is better in the strong model setting, Bradley-Terry clearly dominates the baseline in th…

**Figure 6.** Correcting for biases can lead to modest improvements. Original rank correlation (without bias correction) is red. m stands for number of ranked models. Error bars are 95% confidence intervals on 100 bootstrap samples. Judge model: gpt-oss-20b.

**Figure 7.** Echo as a causal driver of judge preference. We plot the probability of the judge model choosing answer A over B. The intervention is adding echo to one answer by appending the question–answer sequence three times. Echo is a strong driver of judge preference on nondiscriminative pairs (where both answers are either correct or incorrect). This effect disappears on discriminative pairs. Judge model: o3. Err…

**Figure 8.** Correlation between accuracy and Bradley-Terry score using other models for collecting …

**Figure 9.** Bradley-Terry vs. direct judge on SimpleQA. We vary our evaluator between the “strongest model” on the benchmark (o3, 59.3% acc), a “middle model” (gpt-oss-120b, 8.8% acc) and the “weakest model” on the benchmark (gpt-oss-20b, 4.9% acc). While the two methods are comparable using strong models, Bradley-Terry clearly dominates the baseline in the weaker model setting.

**Figure 10.** Correcting for the most common type of judge biases. Original rank correlation (without …

**Figure 11.** Results after filtering pairwise comparisons. We test two settings: in one, we keep non-discriminative pairs (both answers correct or both incorrect). This models the case where one of the distinguishing features is the appearance of the answer. Nevertheless, we still get surprisingly high rank correlation (Fig 11a). In the other, we keep discriminative pairs (one answer correct, one incorrect), which imp…
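The two quantities that recur in these captions, Bradley-Terry scores and Kendall's distance between rankings, can be sketched in a few lines. The fit below uses the classic minorization-maximization (MM) update for Bradley-Terry, which is one common way to estimate the strengths and not necessarily the paper's procedure; the win counts are invented, and the sketch assumes every model wins at least one comparison.

```python
from itertools import combinations

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times model i was preferred over model j.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [v / scale for v in new_p]  # normalize so strengths sum to 1
    return p

def kendall_distance(order_a, order_b):
    """Count model pairs that the two rankings order differently."""
    pos_b = {m: r for r, m in enumerate(order_b)}
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])

# Hypothetical win counts for models A, B, C (10 comparisons per pair).
names = ["A", "B", "C"]
wins = [[0, 7, 9],
        [3, 0, 6],
        [1, 4, 0]]
strengths = bradley_terry(wins)
bt_order = [names[i] for i in sorted(range(3), key=lambda i: -strengths[i])]
accuracy_order = ["A", "C", "B"]  # made-up accuracy ranking for comparison
print(bt_order, kendall_distance(accuracy_order, bt_order))
```

Kendall's distance here is the number of pairwise swaps needed to turn one ranking into the other, which matches the "model pairs that would need to be swapped" reading in the Figure 4 caption.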

Original abstract

Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Pairwise Elo tracks accuracy rankings closely on these converted benchmarks, but the conversion itself is the unverified step that carries the result.


The key point is that Elo rankings from pairwise judgments line up with ground-truth accuracy at Spearman above 0.9 on five benchmarks turned into open-ended tasks, and they hold up better than direct scoring when the judge is weak.

The paper converts the benchmarks, runs the comparisons, and reports that style and bias effects stay small even though most pairs are both correct or both incorrect. It also flags repetition after the final answer as a driver of preference on those ties. That gives a concrete empirical check on whether pairwise methods are mostly noise or actually recover the right order.

The conversion step is the soft spot. The claim rests on the generative versions preserving the original accuracy signal. If the new format changes what counts as correct or introduces fresh error types, the high correlation could be an artifact of how the tasks were rewritten rather than a property of the aggregation method. The abstract gives no verification that relative model orderings survived the change, so that assumption needs direct evidence.

This is for people who run or rely on LLM evaluations at scale. The result is straightforward enough that a referee can check the conversion details and the statistical tests without much trouble. It deserves peer review because the question is practical and the setup is falsifiable, even if the conversion needs more scrutiny before the correlation can be taken at face value.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that converting five well-known benchmarks into free-form generative evaluations yields Elo rankings from pairwise comparisons that achieve Spearman correlation above 0.9 with ground-truth accuracy rankings. It further asserts that this pairwise approach substantially outperforms direct evaluation (especially with weak judges), that style and judge bias exert only minor effects on rankings despite most judgments occurring on pairs where both answers are correct or incorrect, and that repetition after the final answer ('echo') is a causal driver of judge preference on such pairs.

Significance. If the conversion procedure faithfully preserves the original accuracy signals, the result would provide empirical support for the reliability of Elo-based pairwise aggregation as a proxy for accuracy rankings in generative model evaluation. This could reduce dependence on direct scoring methods that are sensitive to judge strength and would offer a concrete causal finding on preference drivers. The multi-benchmark scope adds potential robustness, though only if conversion artifacts are ruled out.

major comments (2)

[Abstract and conversion description] The central claim of Spearman correlation >0.9 between Elo and accuracy rankings requires that the five benchmarks' conversion to free-form generative tasks preserves the original accuracy signal and relative model orderings. No verification, comparison to source accuracies, or details on how correctness is redefined in open-ended format are supplied, leaving open the possibility that the high correlation is driven by conversion artifacts rather than pairwise aggregation properties.
[Results and statistical analysis (presumed §5)] The claim that Elo 'substantially outperform[s] direct evaluation when the judge is weak' is load-bearing for the practical recommendation, yet no specification of the weak judge model, direct-evaluation protocol, number of models/pairs, or statistical comparison (e.g., significance test on the difference in correlations) is provided to substantiate the outperformance.

minor comments (2)

[Abstract] The five benchmarks are referred to only as 'well-known' without naming them (e.g., MMLU, GSM8K), which reduces reproducibility and contextualization of the accuracy signal.
The manuscript should report confidence intervals or standard errors on the reported Spearman correlations and clarify the total number of pairwise judgments performed.
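One way to produce the requested confidence intervals is a percentile bootstrap over questions. The sketch below is an assumption about how such an interval could be computed, not the paper's procedure: it takes per-question correctness and a per-question score for each model (for example, the fraction of pairwise comparisons won on that question), resamples questions with replacement, and recomputes the Spearman correlation of the per-model means. All data are synthetic.

```python
import random

def ranks(values):
    """1-based ranks with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out

def spearman(x, y):
    """Pearson correlation of ranks; None if either side is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else None

def bootstrap_ci(acc, score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for Spearman, resampling questions with replacement."""
    rng = random.Random(seed)
    n_q = len(acc[0])
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_q) for _ in range(n_q)]
        mean_acc = [sum(row[q] for q in idx) / n_q for row in acc]
        mean_score = [sum(row[q] for q in idx) / n_q for row in score]
        rho = spearman(mean_acc, mean_score)
        if rho is not None:  # skip degenerate resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Synthetic stand-in data: 6 models x 200 questions (not from the paper).
rng = random.Random(1)
skills = [0.9, 0.8, 0.65, 0.5, 0.35, 0.2]
acc = [[1 if rng.random() < s else 0 for _ in range(200)] for s in skills]
score = [[a * 0.6 + rng.random() * 0.4 for a in row] for row in acc]
lo, hi = bootstrap_ci(acc, score)
print(lo, hi)
```

Resampling questions keeps the set of ranked models fixed, so the interval reflects uncertainty from the benchmark sample and not from the choice of models.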

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications from the manuscript and indicate where revisions will be made to improve transparency.


Referee: [Abstract and conversion description] The central claim of Spearman correlation >0.9 between Elo and accuracy rankings requires that the five benchmarks' conversion to free-form generative tasks preserves the original accuracy signal and relative model orderings. No verification, comparison to source accuracies, or details on how correctness is redefined in open-ended format are supplied, leaving open the possibility that the high correlation is driven by conversion artifacts rather than pairwise aggregation properties.

Authors: The Methods section describes the conversion of each benchmark to free-form format by stripping multiple-choice options and redefining correctness via reference matching or semantic equivalence to the original ground truth. While relative model orderings are preserved by construction, we agree that explicit verification strengthens the claim. In revision we will add a table and text comparing per-model accuracies (and their rank correlations) between the original benchmarks and the converted generative versions.
revision: yes

Referee: [Results and statistical analysis (presumed §5)] The claim that Elo 'substantially outperform[s] direct evaluation when the judge is weak' is load-bearing for the practical recommendation, yet no specification of the weak judge model, direct-evaluation protocol, number of models/pairs, or statistical comparison (e.g., significance test on the difference in correlations) is provided to substantiate the outperformance.

Authors: Section 5 specifies the weak judge (gpt-oss-20b), the direct-evaluation protocol (scalar 1-10 scoring), the set of 10 models, and the pair counts; bootstrap tests on the correlation differences are reported in the supplementary material. To make these elements immediately visible without requiring the reader to locate them, we will expand the main-text description and add the exact p-values to the primary results table.
revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical correlation measurement


The paper reports an empirical study: five benchmarks are converted to free-form generative tasks, pairwise judgments are collected, Elo rankings are computed, and their Spearman correlation with ground-truth accuracy rankings is measured (reported >0.9). No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the abstract or described chain. The central claim is an observed statistical agreement between two independently computed rankings, not a result forced by definition or prior self-referential work. The conversion step is an experimental design choice whose validity is an external assumption, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.



Reference graph

Works this paper leans on

27 extracted references · 2 canonical work pages

[1] Statistical power analysis for the behavioral sciences. 2013.

[2] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons.

[3] Answer matching outperforms multiple choice for language model evaluation. arXiv preprint arXiv:2507.02856.

[4] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. International Conference on Machine Learning.

[5] Yang Liu, Dan Iter, Yichong Xu, Shuo Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. Conference on Empirical Methods in Natural Language Processing.

[6] Tianle Li, Anastasios Angelopoulos, and Wei-Lin Chiang. Does style matter? Disentangling style and substance in Chatbot Arena.

[7] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, et al. Chatbot Arena: An open platform for evaluating LLMs by human preference.

[8] Richard Kelley and Duncan Wilson.

[9] Long-form factuality in large language models. Advances in Neural Information Processing Systems.

[10] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark.

[11] Arpad E. Elo. The Rating of Chessplayers, Past and Present.

[12] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.

[13] Nishant Balepur, Rachel Rudinger, and J. Boyd-Graber. Annual Meeting of the Association for Computational Linguistics.

[14] Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. Efficient elicitation of annotations for human evaluation of machine translation.

[15] Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? A study on judgement bias. doi:10.18653/v1/2024.emnlp-main.474. 2024.

[16] Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, and Leshem Choshen. Do these LLM benchmarks agree? Fixing benchmark evaluation with BenchBench.

[17] Mirac Suzgun, Nathan Scales, Sch… Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.824. 2023.

[18] Kaiser Sun, Adina Williams, and Dieuwke Hupkes. The validity of evaluation results: Assessing concurrence across compositionality benchmarks.

[19] Length-Controlled AlpacaEval: A Simple Debiasing of Automatic Evaluators. First Conference on Language Modeling.

[20] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.

[21] Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. Annual Meeting of the Association for Computational Linguistics.

[22] Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-Preference Bias in … 2024.

[23] Nelson F. Liu, Tony Lee, Robin Jia, and Percy Liang. Do Question Answering Modeling Improvements Hold Across Benchmarks?

[24] Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xingxu Xie, Wei Ye, Shikun Zhang, and Yue Zhang. PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. International Conference on Learning Representations.

[25] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark.

[26] Train-before-Test Harmonizes Language Model Rankings. NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling. 2025.

[27] ElicitationGPT: Text elicitation mechanisms via language models. arXiv preprint arXiv:2406.09363.
