Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness

Chaitanya Dwivedi; Himanshu Gupta; Neeraj Varshney; Pratik Jayarao

arxiv: 2509.13332 · v2 · submitted 2025-09-09 · 💻 cs.AI · cs.CL


Pith reviewed 2026-05-18 17:27 UTC · model grok-4.3


keywords: LLM as judge · explicit reasoning · thinking models · accuracy and efficiency · bias robustness · RewardBench · multilingual evaluation


The pith

As LLM judges, thinking models score roughly 10 percentage points higher in accuracy than non-thinking ones, at under twice the compute cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares thinking and non-thinking LLMs as automated judges using small Qwen 3 models. Thinking models generate explicit reasoning before deciding and reach roughly ten percentage points higher accuracy on RewardBench tasks. Non-thinking models stay behind even after several augmentation strategies such as few-shot examples, rubrics, and reference-based checks, and those strategies require more than eight times the computation for smaller gains. Thinking models also hold up better against positional, bandwagon, identity, diversity, and random biases, showing six percent higher consistency on average. The same pattern appears in multilingual tests.

Core claim

The central claim is that explicit reasoning improves LLM performance in the judge role. Thinking models achieve approximately 10 percentage points higher accuracy with under 2x overhead, while augmentation strategies for non-thinking models produce only modest gains at over 8x cost. Thinking models also show significantly greater consistency across positional, bandwagon, identity, diversity, and random bias conditions, averaging 6 percent higher robustness, and these benefits extend to multilingual settings.
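The cost multipliers in this claim can be made concrete with a rough estimate. The sketch below assumes the common rule of thumb of about 2 FLOPs per parameter per processed token for a forward pass, which may not match the paper's exact accounting. The token counts are hypothetical placeholders chosen for illustration, not figures from the paper.

```python
def forward_flops(n_params: float, prompt_tokens: int, output_tokens: int) -> float:
    """Approximate inference FLOPs as 2 * parameters * tokens processed.

    Assumption: the ~2N FLOPs-per-token rule of thumb; attention cost is ignored.
    """
    return 2.0 * n_params * (prompt_tokens + output_tokens)


def relative_cost(n_params, baseline, variant):
    """Cost of `variant` relative to `baseline`; each is (prompt_tokens, output_tokens)."""
    return forward_flops(n_params, *variant) / forward_flops(n_params, *baseline)


# Hypothetical token counts, chosen only for illustration.
N_PARAMS = 4e9            # e.g. a 4B-parameter judge
direct = (600, 10)        # non-thinking: prompt plus a short verdict
thinking = (600, 500)     # thinking: same prompt plus a reasoning trace
few_shot = (5000, 10)     # non-thinking with long in-context examples

print(f"thinking vs direct: {relative_cost(N_PARAMS, direct, thinking):.2f}x")
print(f"few-shot vs direct: {relative_cost(N_PARAMS, direct, few_shot):.2f}x")
```

Under this approximation the model size cancels out of the ratio, so the multiplier depends only on how many tokens each strategy adds to the prompt or the output.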

What carries the argument

The direct comparison of thinking models that output explicit reasoning steps before a final judgment against non-thinking models that output judgments directly, measured on RewardBench tasks for accuracy, FLOPs, and bias consistency.
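Scoring both kinds of judge on the same footing requires reducing each raw output to a verdict. The following is a minimal sketch of that step, assuming the reasoning trace is wrapped in `<think>...</think>` and the final verdict is written as `[[A]]` or `[[B]]`. Both formats are assumptions made for illustration and may differ from the paper's prompts.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
VERDICT = re.compile(r"\[\[([AB])\]\]")


def extract_verdict(raw_output: str):
    """Return 'A', 'B', or None from a judge's raw text.

    Assumptions: reasoning is wrapped in <think>...</think> and the final
    verdict is written as [[A]] or [[B]]; both formats are hypothetical here.
    """
    visible = THINK_BLOCK.sub("", raw_output)   # drop the reasoning trace
    matches = VERDICT.findall(visible)
    return matches[-1] if matches else None     # the last verdict wins


def accuracy(outputs, gold):
    """Fraction of items whose parsed verdict equals the gold label."""
    correct = sum(extract_verdict(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)
```

Stripping the trace before matching means a verdict mentioned mid-reasoning is not mistaken for the final decision, and the same parser handles non-thinking outputs that contain no trace at all.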

If this is right

Thinking models provide a better accuracy-efficiency trade-off for automated judging than prompting or aggregation enhancements to non-thinking models.
Explicit reasoning yields roughly 6 percent higher consistency on average under positional, identity, and other common bias conditions.
The accuracy and robustness advantages of reasoning persist when evaluation moves beyond English to other languages.
Complex augmentation pipelines for direct-output models deliver smaller returns at substantially higher computational expense.
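One way to picture the consistency measurements behind the bias claims above is a position-swap check: judge each pair twice with the response order reversed, and count how often the same response wins. This is a sketch under assumptions. The paper's consistency metric may be defined differently, and the toy judges are hypothetical stand-ins for real models.

```python
from typing import Callable, Sequence, Tuple

Judge = Callable[[str, str, str], str]  # (question, first, second) -> "A" or "B"


def positional_consistency(judge: Judge, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Share of items where the judge prefers the same response in both orders."""
    consistent = 0
    for question, resp_1, resp_2 in pairs:
        original = judge(question, resp_1, resp_2)   # resp_1 shown as "A"
        swapped = judge(question, resp_2, resp_1)    # resp_1 shown as "B"
        # Map the swapped verdict back to the original labelling.
        swapped_mapped = "A" if swapped == "B" else "B"
        consistent += original == swapped_mapped
    return consistent / len(pairs)


# Toy judges standing in for real models (hypothetical).
def always_first(question, first, second):
    return "A"  # maximally position-biased: always picks the first slot


def prefers_longer(question, first, second):
    return "A" if len(first) >= len(second) else "B"


pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "long long long", "x"),
]
print(positional_consistency(always_first, pairs))
print(positional_consistency(prefers_longer, pairs))
```

A judge that always picks the first slot scores zero on this check, while a judge whose preference depends only on the responses themselves scores one.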

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Judge systems in benchmarks or reward modeling could simplify by defaulting to reasoning-enabled models instead of layered prompting tricks.
Similar gains might appear in other decision tasks where LLMs must stay consistent under biased or noisy inputs.
Training objectives that encourage step-by-step reasoning could become standard for models intended as evaluators.

Load-bearing premise

The performance gaps arise mainly from the presence or absence of explicit reasoning rather than from unstated differences in how the thinking and non-thinking model variants were trained or prompted.

What would settle it

Train or fine-tune otherwise identical base models with and without an explicit reasoning objective, then run both on the same RewardBench and bias suites to check whether the accuracy and robustness gaps remain.

Figures

Figures reproduced from arXiv: 2509.13332 by Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Pratik Jayarao.

**Figure 2.** The plots compare average accuracy against relative computational cost (FLOPs) for … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Prompt for Baseline setting [PITH_FULL_IMAGE:figures/full_fig_p016_3.png]

**Figure 4.** Prompt for LLMaaJ w In Context Examples [PITH_FULL_IMAGE:figures/full_fig_p017_4.png]

**Figure 5.** Prompt for LLMaaJ w Reference [PITH_FULL_IMAGE:figures/full_fig_p018_5.png]

**Figure 6.** Prompt for LLMaaJ w Rubric Prompt [PITH_FULL_IMAGE:figures/full_fig_p019_6.png]

**Figure 7.** Rubric for MRewardBench subset: alpacaeval-easy [PITH_FULL_IMAGE:figures/full_fig_p020_7.png]

**Figure 8.** Rubric for MRewardBench subset: alpacaeval-hard [PITH_FULL_IMAGE:figures/full_fig_p020_8.png]

**Figure 9.** Rubric for MRewardBench subset: alpacaeval-length [PITH_FULL_IMAGE:figures/full_fig_p021_9.png]

**Figure 10.** Rubric for MRewardBench subset: mt-bench-easy [PITH_FULL_IMAGE:figures/full_fig_p021_10.png]

**Figure 11.** Rubric for MRewardBench subset: mt-bench-med [PITH_FULL_IMAGE:figures/full_fig_p022_11.png]

**Figure 12.** Rubric for MRewardBench subset: mt-bench-hard [PITH_FULL_IMAGE:figures/full_fig_p022_12.png]

**Figure 13.** Rubric for MRewardBench subset: llmbar-natural [PITH_FULL_IMAGE:figures/full_fig_p023_13.png]

**Figure 14.** Rubric for MRewardBench subset: llmbar-adver-neighbor [PITH_FULL_IMAGE:figures/full_fig_p023_14.png]

**Figure 15.** Rubric for MRewardBench subset: llmbar-adver-GPTInst [PITH_FULL_IMAGE:figures/full_fig_p024_15.png]

**Figure 16.** Rubric for MRewardBench subset: llmbar-adver-GPTOut [PITH_FULL_IMAGE:figures/full_fig_p024_16.png]

**Figure 17.** Rubric for MRewardBench subset: llmbar-adver-manual [PITH_FULL_IMAGE:figures/full_fig_p025_17.png]

**Figure 18.** Rubric for MRewardBench subset: refusals-dangerous [PITH_FULL_IMAGE:figures/full_fig_p025_18.png]

**Figure 19.** Rubric for MRewardBench subset: refusals-offensive [PITH_FULL_IMAGE:figures/full_fig_p026_19.png]

**Figure 20.** Rubric for MRewardBench subset: xstest-should-refuse [PITH_FULL_IMAGE:figures/full_fig_p026_20.png]

**Figure 21.** Rubric for MRewardBench subset: xstest-should-respond [PITH_FULL_IMAGE:figures/full_fig_p026_21.png]

**Figure 22.** Rubric for MRewardBench subset: donotanswer [PITH_FULL_IMAGE:figures/full_fig_p027_22.png]

**Figure 23.** Rubric for MRewardBench subset: hep-python [PITH_FULL_IMAGE:figures/full_fig_p027_23.png]

**Figure 24.** Rubric for MRewardBench subset: hep-js. Rubric for MRewardBench subset: hep-java: Pairwise judge for HumanEvalPack (Java). Steps: 1) Read the method/class spec in the User Question; read code from Assistant A and Assistant B. 2) Checks: - Functional correctness: meets the spec; would pass tests. - API contract: correct method/class signatures, visibility, and types. - Edge cases & complexity: covers edge ca…

**Figure 25.** Rubric for MRewardBench subset: hep-java [PITH_FULL_IMAGE:figures/full_fig_p028_25.png]

**Figure 26.** Rubric for MRewardBench subset: hep-go. Rubric for MRewardBench subset: hep-cpp: Pairwise judge for HumanEvalPack (C++). Steps: 1) Read the function spec in the User Question; read code from Assistant A and Assistant B. 2) Checks: - Functional correctness: logic meets the spec; would pass tests. - API contract: correct signature, headers, and namespaces. - Edge cases & complexity: covers edge cases; appropr…

**Figure 27.** Rubric for MRewardBench subset: hep-cpp [PITH_FULL_IMAGE:figures/full_fig_p029_27.png]

**Figure 28.** Rubric for MRewardBench subset: hep-rust [PITH_FULL_IMAGE:figures/full_fig_p030_28.png]

**Figure 29.** Rubric for MRewardBench subset: math-prm [PITH_FULL_IMAGE:figures/full_fig_p030_29.png]

**Figure 30.** Verbosity Prompt. Bandwagon Bias: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail…

**Figure 31.** Prompt to evaluate LLMaaJ w Bandwagon Bias [PITH_FULL_IMAGE:figures/full_fig_p031_31.png]

**Figure 32.** Prompt to evaluate LLMaaJ w Diversity Bias [PITH_FULL_IMAGE:figures/full_fig_p032_32.png]

**Figure 33.** Prompt to evaluate LLMaaJ w Identity Bias [PITH_FULL_IMAGE:figures/full_fig_p033_33.png]

**Figure 34.** Prompt to evaluate LLMaaJ w Distraction Bias [PITH_FULL_IMAGE:figures/full_fig_p034_34.png]

Original abstract

As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of "thinking" and "non-thinking" LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Our results show that thinking models achieve approximately 10% points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our work results in several important findings that provide systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.
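Of the augmentation strategies listed in the abstract, n-best aggregation is the simplest to illustrate. The sketch below assumes aggregation by majority vote over several sampled verdicts. This is one plausible reading, not necessarily the paper's procedure, and the function name is hypothetical.

```python
from collections import Counter


def n_best_verdict(verdicts):
    """Majority vote over sampled verdicts; returns None on a tie or no valid vote.

    Assumption: each sample is already reduced to "A" or "B"; anything else
    (e.g. an unparseable output) is ignored.
    """
    counts = Counter(v for v in verdicts if v in ("A", "B"))
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between A and B
    return ranked[0][0]


print(n_best_verdict(["A", "B", "A"]))
```

Because every sampled verdict requires its own forward pass, the cost of this strategy grows linearly with the number of samples.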

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts a systematic empirical comparison of 'thinking' (explicit reasoning) versus 'non-thinking' variants of small open-source Qwen3 models (0.6B, 1.7B, 4B parameters) used as LLM judges. On RewardBench tasks, it reports that thinking models deliver approximately 10 percentage points higher accuracy with under 2x computational overhead, outperforming non-thinking models even after applying augmentation strategies such as few-shot learning (>8x cost), rubric-guided judging, reference-based evaluation, and n-best aggregation. Additional analyses show thinking models exhibit 6% higher average consistency under positional, bandwagon, identity, diversity, and random biases, with benefits extending to multilingual settings.

Significance. If the central comparison is fair and the quantitative results are reproducible, the work supplies concrete evidence that explicit reasoning steps at inference time can improve accuracy, efficiency, and robustness in the LLM-as-a-judge setting. This is relevant for reward modeling and automated evaluation pipelines. The evaluation across three model scales, multiple augmentation baselines, bias conditions, and a multilingual extension constitutes a systematic contribution that could inform practical choices between reasoning-enabled and augmented non-reasoning judges.

major comments (2)

The abstract and experimental claims attribute the ~10pp accuracy advantage and 6% robustness gain primarily to the presence of explicit reasoning at inference time. However, it is not stated whether the thinking and non-thinking Qwen3 variants (0.6B/1.7B/4B) share identical base pre-training and post-training or whether the thinking variants received additional reasoning-oriented fine-tuning or data. This distinction is load-bearing for the central claim, because any upstream differences would confound the comparison with augmented non-thinking baselines (few-shot, rubric, etc.).
The abstract reports specific quantitative results (10-point accuracy gap, 6% robustness gain, cost multipliers under 2x vs. >8x) without reference to statistical tests, exact prompt templates, data splits, number of evaluation runs, or variance across random seeds. These details are necessary to establish that the reported differences reliably support the superiority claims rather than reflecting prompt sensitivity or single-run noise.

minor comments (1)

The abstract contains a repeated sentence ('Our results show that...') that could be consolidated for conciseness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful and constructive comments. We address each major comment below, providing clarifications and committing to revisions that strengthen the transparency and rigor of our claims without altering the core findings.

Point-by-point responses

Referee: The abstract and experimental claims attribute the ~10pp accuracy advantage and 6% robustness gain primarily to the presence of explicit reasoning at inference time. However, it is not stated whether the thinking and non-thinking Qwen3 variants (0.6B/1.7B/4B) share identical base pre-training and post-training or whether the thinking variants received additional reasoning-oriented fine-tuning or data. This distinction is load-bearing for the central claim, because any upstream differences would confound the comparison with augmented non-thinking baselines (few-shot, rubric, etc.).

Authors: We thank the referee for identifying this important point of clarification. The Qwen3 thinking and non-thinking variants share the same base pre-training data and model architecture as released in the official Qwen3 series. The primary difference lies in post-training: the thinking variants receive additional supervised fine-tuning on reasoning traces and chain-of-thought data to enable explicit reasoning at inference time, while the non-thinking variants use the base post-training without this emphasis. This setup allows us to isolate the effect of explicit reasoning steps during judging. To address the concern, we will revise the manuscript by adding a new paragraph in Section 3 (Experimental Setup) that explicitly describes the shared pre-training, the post-training distinctions with citations to the Qwen3 technical report, and how this relates to the augmentation baselines. This ensures the central claim is not confounded. revision: yes
Referee: The abstract reports specific quantitative results (10-point accuracy gap, 6% robustness gain, cost multipliers under 2x vs. >8x) without reference to statistical tests, exact prompt templates, data splits, number of evaluation runs, or variance across random seeds. These details are necessary to establish that the reported differences reliably support the superiority claims rather than reflecting prompt sensitivity or single-run noise.

Authors: We agree that referencing these methodological details is essential for supporting the quantitative claims. The full manuscript already provides the exact prompt templates (including thinking and non-thinking variants) in Appendix A, describes the RewardBench data splits and evaluation protocol in Section 3.1, and notes that main results are averaged over three independent runs with different random seeds. However, we did not include formal statistical tests or explicit variance reporting in the abstract or main results tables. In the revision, we will (1) add cross-references in the abstract to the relevant sections and appendix, (2) report standard deviations alongside the mean accuracies and robustness scores, and (3) add statistical significance tests (e.g., paired t-tests or McNemar’s test) for the key comparisons between thinking and augmented non-thinking models. These additions will be placed in Section 4 and a new subsection on statistical analysis. revision: yes
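For readers unfamiliar with the tests named in this response, here is a minimal sketch of an exact two-sided McNemar test on paired correctness flags from two judges evaluated on the same items. The function name and inputs are hypothetical; the sketch illustrates the general technique, not the authors' analysis code.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    Returns (b, c, p_value), where b counts items only judge A got right
    and c counts items only judge B got right.
    """
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if bb and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under H0 the discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy example: judge A is right on ten items where judge B is wrong.
print(mcnemar_exact([True] * 10, [False] * 10))
```

Only the discordant pairs enter the test, which is why it suits comparisons of two judges scored on an identical item set.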

Circularity Check

0 steps flagged

No circularity in empirical comparison of thinking vs non-thinking judges

Full rationale

This paper reports direct experimental results from evaluating open-source Qwen3 thinking and non-thinking model variants on RewardBench and related tasks for accuracy, FLOPs efficiency, augmentation strategies, and bias robustness. There are no mathematical derivations, equations, fitted parameters, or first-principles claims that reduce reported outcomes to quantities defined by the experiment itself. Claims rest on standard benchmark measurements and comparisons to public baselines rather than self-definitional loops, self-citation load-bearing uniqueness theorems, or ansatzes smuggled via prior work. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions in the LLM-as-a-judge literature plus the premise that the thinking/non-thinking distinction is cleanly implemented in the chosen model family.

axioms (1)

domain assumption: Small Qwen 3 models can be reliably configured to produce explicit reasoning traces or direct judgments.
The paper treats the thinking versus non-thinking distinction as a controllable experimental variable.

pith-pipeline@v0.9.0 · 5819 in / 1361 out tokens · 64270 ms · 2026-05-18T17:27:20.876594+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "thinking models achieve approximately 10% points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lost in Translation: Do LVLM Judges Generalize Across Languages?
cs.CL 2026-04 unverdicted novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

Reference graph

Works this paper leans on

162 extracted references · 162 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Xing, Haotong Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, E. Xing, Haotong Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[2]

Can Large Language Models Be an Alternative to Human Evaluations? InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

Cheng-Han Chiang and Hung-yi Lee. Can Large Language Models Be an Alternative to Human Evaluations? InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

work page 2023
[3]

Leveraging large language models for NLG evaluation: Advances and challenges

Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. Leveraging large language models for NLG evaluation: Advances and challenges. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16028–16045, Miami, Florida, U...

work page 2024
[4]

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, 2024

Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, 2024

work page 2024
[5]

An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4

Hui Huang, Xingyuan Bu, Hongli Zhou, Yingqi Qu, Jing Liu, Muyun Yang, Bing Xu, and Tiejun Zhao. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguist...

work page 2025
[6]

Self-Taught Evaluators, 2024

Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li. Self-Taught Evaluators, 2024

work page 2024
[7]

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

work page 2024
[8]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page 2025
[9]

RewardBench: Evaluating reward models for language modeling, 2024

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. RewardBench: Evaluating reward models for language modeling, 2024

work page 2024
[10]

M-RewardBench: Evaluating reward models in multilingual settings

Srishti Gureja, Lester James Validad Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Triandi Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, and Marzieh Fadaee. M-RewardBench: Evaluating reward models in multilingual settings. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of...

work page 2025
[11]

A Survey on LLM-as-a-Judge, 2024

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, and Jian Guo. A Survey on LLM-as-a-Judge, 2024

work page 2024
[12]

Justice or prejudice? quantifying biases in llm-as-a-judge, 2024

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge, 2024. 10

work page 2024
[13]

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge, 2024

Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge, 2024

work page 2024
[14]

Optimization-based Prompt Injection Attack to LLM-as-a-Judge

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. Optimization-based Prompt Injection Attack to LLM-as-a-Judge. InProceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2024

work page 2024
[15]

Chang, and Prithviraj Ammanabrolu

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, and Prithviraj Ammanabrolu. Critique-out-Loud Reward Models, 2024

work page 2024
[16]

ChatGPT outperforms crowd workers for text-annotation tasks.Proceedings of the National Academy of Sciences, 2023

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. ChatGPT outperforms crowd workers for text-annotation tasks.Proceedings of the National Academy of Sciences, 2023

work page 2023
[17]

BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge

Terry Tong, Fei Wang, Zhe Zhao, and Muhao Chen. BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge. InInternational Conference on Learning Representations, 2025

work page 2025
[18]

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks, 2025

Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, and Rong Tan. CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks, 2025

work page 2025
[19]

Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons, 2025

Isik Baran Sandan, Tu Anh Dinh, and Jan Niehues. Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons, 2025

work page 2025
[20]

How Reliable is Multilingual LLM-as-a-Judge?, 2025

Xiyan Fu and Wei Liu. How Reliable is Multilingual LLM-as-a-Judge?, 2025

work page 2025
[21]

UQLM: A Python Package for Uncertainty Quantification in Large Language Models, 2025

Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, and Zeya Ahmad. UQLM: A Python Package for Uncertainty Quantification in Large Language Models, 2025

work page 2025
[22]

Ryan, Danmei Xu, Chris Nivera, and Daniel Campos

Michael J. Ryan, Danmei Xu, Chris Nivera, and Daniel Campos. EnronQA: Towards Personal- ized RAG over Private Documents, 2025

work page 2025
[23]

Weyssow, Aton Kamanda, Xin Zhou, and H

M. Weyssow, Aton Kamanda, Xin Zhou, and H. Sahraoui. CodeUltraFeedback: An LLM-as-a- Judge Dataset for Aligning Large Language Models to Coding Preferences.ACM Transactions on Software Engineering and Methodology, 2024

work page 2024
[24]

YESciEval: Robust LLM-as-a- Judge for Scientific Question Answering

Jennifer D’Souza, Hamed Babaei Giglou, and Quentin Münch. YESciEval: Robust LLM-as-a- Judge for Scientific Question Answering. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

work page 2025
[25]

Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, J

E. Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, J. Caskey, M. Oguss, Graham Wills, Guanhua Chen, D. Dligach, et al. Automating Evaluation of AI Text Generation in Healthcare with a Large Language Model (LLM)-as-a-Judge, 2025

work page 2025
[26]

Brill, 2025

Giuseppe Contissa and Galileo Sartor.Large Language Models in the Justice Domain. Brill, 2025

work page 2025
[27]

When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance, 2025

Peizhang Shao, Linrui Xu, Jinxi Wang, Wei Zhou, and Xingyu Wu. When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance, 2025

work page 2025
[28]

Test-time computing: from system-1 thinking to system-2 thinking.ArXiv, abs/2501.02497, 2025

Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, and Min Zhang. Test-time computing: from system-1 thinking to system-2 thinking.ArXiv, abs/2501.02497, 2025

work page arXiv 2025
[29]

A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems.Trans

Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, PeiFeng Wang, Silvio Savarese, Caiming Xiong, and Shafiq Joty. A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems.Trans. Mach. Learn. Res., 2025, Apr 2025

work page 2025
[30]

Efficient inference for large reasoning models: A survey,

Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, and Bryan Hooi. Efficient inference for large reasoning models: A survey.ArXiv, abs/2503.23077, Mar 2025

work page arXiv 2025
[31]

Yuxiao Qu, Matthew Y . R. Yang, Amrith Rajagopal Setlur, Lewis Tunstall, Edward Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning.ArXiv, abs/2503.07572, Mar 2025. 11

work page arXiv 2025
[32]

Bag of tricks for inference-time computa- tion of llm reasoning.arXiv preprint arXiv:2502.07191, 2025

Fan Liu, WenShuo Chao, Naiqiang Tan, and Hao Liu. Bag of tricks for inference-time computa- tion of llm reasoning.ArXiv, abs/2502.07191, Feb 2502

work page arXiv
[33]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wangxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.ArXiv, abs/2503.09567, Mar 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Forest-of-thought: Scaling test-time compute for enhancing llm reasoning.ArXiv, abs/2412.09078, Dec 2412

Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing llm reasoning.ArXiv, abs/2412.09078, Dec 2412

work page arXiv
[35]

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, J. Feng, Chen Gao, and Yong Li. Towards large reasoning models: A survey of reinforced reasoning with large language models.ArXiv, abs/250...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Rohin Manvi, Anikait Singh, and Stefano Ermon. Adaptive inference-time compute: Llms can predict if they can do better, even mid-generation. ArXiv, abs/2410.02725, Oct 2024
[37]

Wei Li, Yanbin Wei, Qiushi Huang, Jiangyue Yan, Yang Chen, James T. Kwok, and Yu Zhang. Dynamicmind: A tri-mode thinking system for large language models. ArXiv, abs/2506.05936, Jun 2025
[38]

Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. Towards thinking-optimal scaling of test-time compute for llm reasoning. ArXiv, abs/2502.18080, Feb 2025
[39]

Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling. ArXiv, abs/2502.06703, Feb 2025
[40]

C. Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. ArXiv, abs/2408.03314, Aug 2024
[41]

Yunho Jin, Gu-Yeon Wei, and David Brooks. The energy cost of reasoning: Analyzing energy usage in llms with test-time compute. ArXiv, abs/2505.14733, May 2025
[42]

Lin Shi, Chiyu Ma, Wenhua Liang, Weicheng Ma, and Soroush Vosoughi. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge, 2024
[43]

Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023
[44]

Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, and Chris Tanner. No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding, 2025
[45]

Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-Preference Bias in LLM-as-a-Judge, 2024
[46]

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking Cognitive Biases in Large Language Models as Evaluators. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023
[47]

Tzu-Heng Huang, Harit Vishwakarma, and Frederic Sala. Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation, 2025
[48]

Lina Berrayana, Sean Rooney, Luis Garcés-Erice, and Ioana Giurgiu. Are Bias Evaluation Methods Biased?, 2025
[49]

Multiple Authors. JUDICIOUS: Evaluating Robustness of Large Language Models in the Legal Realm. Technical report, eScholarship, University of California, 2025. URL: https://escholarship.org/content/qt3w69j2wd/qt3w69j2wd.pdf
[50]

Ashish Sardana. Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?, 2025
[51]

Vyas Raina, Adian Liusie, and Mark J. F. Gales. Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
[52]

Pratik Jayarao and Arpit Sharma. Retraining distilbert for a voice shopping assistant by using universal dependencies, 2021
[53]

Pratik Jayarao and Aman Srivastava. Intent detection for code-mix utterances in task oriented dialogue systems. In 2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), pages 583–587, 2018
[54]

Chaitanya Dwivedi, Shima Nofallah, Maryam Pouryahya, Janani Iyer, Kenneth Leidal, Chuhan Chung, Timothy Watkins, Andrew Billin, Robert Myers, John Abel, and Ali Behrooz. Multi stain graph fusion for multimodal integration in pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1835–1845, June 2022
[55]

Ankur Mallick, Chaitanya Dwivedi, Bhavya Kailkhura, Gauri Joshi, and T. Yong-Jin Han. Deep kernels with probabilistic embeddings for small-data learning. In Cassio de Campos and Marloes H. Maathuis, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pages 9...
[56]

Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, and Sahiti Yerramilli. Countqa: How well do mllms count in the wild? arXiv preprint arXiv:2508.06585, 2025
[57]

Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, and Nilay Pande. Huemanity: Probing fine-grained visual perception in mllms. arXiv preprint arXiv:2506.03194, 2025
[58]

Sahiti Yerramilli, Nilay Pande, Rynaa Grover, and Jayant Sravan Tamarapalli. Geochain: Multimodal chain-of-thought for geographic reasoning. arXiv preprint arXiv:2506.00785, 2025
[59]

Vidhi Jain, Jayant Sravan Tamarapalli, Sahiti Yerramilli, and Yonatan Bisk. Maea: Multimodal attribution for embodied ai. arXiv preprint arXiv:2307.13850, 2023
[60]

Sahiti Yerramilli, Jayant Sravan Tamarapalli, Jonathan Francis, and Eric Nyberg. Attribution regularization for multimodal paradigms. arXiv preprint arXiv:2404.02359, 2024
[61]

Aishwarya Jadhav, Jeffery Cao, Abhishree Shetty, Urvashi Kumar, Aditi Sharma, Ben Sukboontip, Jayant Tamarapalli, Jingyi Zhang, and Aniruddh Koul. Ai guide dog: Egocentric path prediction on smartphone. Proceedings of the AAAI Symposium Series, 5(1):220–227, May 2025
[62]

Sahiti Yerramilli, Jayant Sravan Tamarapalli, Tanmay Girish Kulkarni, Jonathan Francis, and Eric Nyberg. Semantic augmentation in images using language. arXiv preprint arXiv:2404.02353, 2024

lang Chat Chat Hard Safety Reasoning Average Chat Chat Hard Safety Reasoning Average
ar 96.00% 61.19% 76.82% 74.70% 77.18% 92.92% 69.57% 81.37% 95.70% 84.89%
cs 95...
[63]

showcase the ability of using encoder-based SLMs to generate embeddings for task-oriented multi-turn dialogue systems. While [53] showcases the impact of multilingual and code-mix training on language models, [54] demonstrates the value of using disparate data sources during model training
[64]

apply Bayesian inference to improve model performance in the small-data regime. Recent work benchmarks Multimodal Large Language Model (MLLM) weaknesses in counting (CountQA [56]), perception (HueManity [57]), and geographic reasoning (GeoChain [58]). To address underlying flaws like unimodal dominance, research has explored multimodal attribution (‘MAE...
[66]

Checks:
- Instruction following: directly satisfies the User Question and all stated constraints.
- Factuality: statements are correct and non-speculative.
- Completeness: all required parts are covered without gaps.
- Clarity: clear, organized, easy to follow.
- Reasoning-aware: if steps are shown, they are consistent and lead to a correct result (steps ...
[67]

Penalize: confident errors, ignored constraints, irrelevant fluff

[68]

Decision: choose Assistant A if it better satisfies these checks for the User Question; otherwise choose Assistant B

[69]

Figure 7: Rubric for MRewardBench subset: alpacaeval-easy

Rubric for MRewardBench subset: alpacaeval-hard
Pairwise judge for instruction following (hard). Steps:

Neutrality: ignore presentation order, assistant names, and response length; ignore decorative style.
[71]

Checks:
- Multi-constraint satisfaction: meets all explicit constraints, formats, and edge cases.
- Factual rigor: accurate, grounded; no hallucinations.
- Disambiguation: sensibly resolves underspecification and states assumptions when needed.
- Reasoning-aware: if steps are shown, they are sound and consistent.
- Clarity: structured, readable, non-verbose
[72]

Penalize: constraint violations, invented details, overconfident but wrong logic

[73]

Decision: choose Assistant A if it better satisfies constraints and correctness (and handles ambiguity/clarity better when close); otherwise choose Assistant B

[74]

Figure 8: Rubric for MRewardBench subset: alpacaeval-hard

Rubric for MRewardBench subset: alpacaeval-length
Pairwise judge for length-bias stress. Steps:

Neutrality: ignore presentation order, assistant names, and length.
[76]

Checks (judge content, not length):
- Instruction adherence: precisely satisfies the User Question.
- Factuality: correct and grounded.
- Efficiency of content: avoids padding; every sentence adds value.
- Reasoning-aware: steps, if present, are consistent and correct
[77]

Penalize: padding/verbosity without value, missed constraints, inaccuracies

[78]

Decision: choose Assistant A if its content better fulfills the checks; otherwise choose Assistant B. Do not use length as a tie-breaker
[79]

Figure 9: Rubric for MRewardBench subset: alpacaeval-length

Rubric for MRewardBench subset: mt-bench-easy
Pairwise judge for multi-turn dialogue (easy). Steps:

Neutrality: ignore presentation order, assistant names, and response length.
[80]

Read the full conversation context in the User Question and the final-turn answers from Assistant A and Assistant B

[81]

Checks:
- Turn consistency: tracks prior turns; no contradictions.
- Final task fulfillment: satisfies the final request/format in the User Question.
- Factual accuracy: information is correct.
- Clarity & tone: clear, appropriately concise, helpful.
- Reasoning-aware: if steps are shown, they are coherent with the dialogue
[82]

Penalize: loss of context, incorrect facts, meandering/off-task replies

[83]

Decision: choose Assistant A if it better fulfills the final turn while staying consistent and accurate; otherwise choose Assistant B



[1]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, E. Xing, Haotong Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, 2023

[2]

Cheng-Han Chiang and Hung-yi Lee. Can Large Language Models Be an Alternative to Human Evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

[3]

Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. Leveraging large language models for NLG evaluation: Advances and challenges. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16028–16045, Miami, Florida, U...

[4]

Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, 2024

[5]

Hui Huang, Xingyuan Bu, Hongli Zhou, Yingqi Qu, Jing Liu, Muyun Yang, Bing Xu, and Tiejun Zhao. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguist...

[6]

Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li. Self-Taught Evaluators, 2024

[7]

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

[8]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[9]

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. RewardBench: Evaluating reward models for language modeling, 2024

[10]

Srishti Gureja, Lester James Validad Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Triandi Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, and Marzieh Fadaee. M-RewardBench: Evaluating reward models in multilingual settings. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of...

[11]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, and Jian Guo. A Survey on LLM-as-a-Judge, 2024

[12]

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge, 2024

[13]

Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge, 2024

[14]

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. Optimization-based Prompt Injection Attack to LLM-as-a-Judge. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2024

[15]

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, and Prithviraj Ammanabrolu. Critique-out-Loud Reward Models, 2024

[16]

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 2023

[17]

Terry Tong, Fei Wang, Zhe Zhao, and Muhao Chen. BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge. In International Conference on Learning Representations, 2025

[18]

Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, and Rong Tan. CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks, 2025

[19]

Isik Baran Sandan, Tu Anh Dinh, and Jan Niehues. Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons, 2025

[20]

Xiyan Fu and Wei Liu. How Reliable is Multilingual LLM-as-a-Judge?, 2025

[21]

Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, and Zeya Ahmad. UQLM: A Python Package for Uncertainty Quantification in Large Language Models, 2025

[22]

Michael J. Ryan, Danmei Xu, Chris Nivera, and Daniel Campos. EnronQA: Towards Personalized RAG over Private Documents, 2025

[23]

M. Weyssow, Aton Kamanda, Xin Zhou, and H. Sahraoui. CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences. ACM Transactions on Software Engineering and Methodology, 2024

[24]

Jennifer D’Souza, Hamed Babaei Giglou, and Quentin Münch. YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

[25]

E. Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, J. Caskey, M. Oguss, Graham Wills, Guanhua Chen, D. Dligach, et al. Automating Evaluation of AI Text Generation in Healthcare with a Large Language Model (LLM)-as-a-Judge, 2025

[26]

Giuseppe Contissa and Galileo Sartor. Large Language Models in the Justice Domain. Brill, 2025

[27]

Peizhang Shao, Linrui Xu, Jinxi Wang, Wei Zhou, and Xingyu Wu. When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance, 2025

[28]

Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, and Min Zhang. Test-time computing: from system-1 thinking to system-2 thinking. ArXiv, abs/2501.02497, 2025

[29]

Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, PeiFeng Wang, Silvio Savarese, Caiming Xiong, and Shafiq Joty. A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems. Trans. Mach. Learn. Res., 2025, Apr 2025

[30]

Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, and Bryan Hooi. Efficient inference for large reasoning models: A survey. ArXiv, abs/2503.23077, Mar 2025

[31] [31]

Yuxiao Qu, Matthew Y . R. Yang, Amrith Rajagopal Setlur, Lewis Tunstall, Edward Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning.ArXiv, abs/2503.07572, Mar 2025. 11

work page arXiv 2025

[32] [32]

Bag of tricks for inference-time computa- tion of llm reasoning.arXiv preprint arXiv:2502.07191, 2025

Fan Liu, WenShuo Chao, Naiqiang Tan, and Hao Liu. Bag of tricks for inference-time computa- tion of llm reasoning.ArXiv, abs/2502.07191, Feb 2502

work page arXiv

[33] [33]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wangxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.ArXiv, abs/2503.09567, Mar 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Forest-of-thought: Scaling test-time compute for enhancing llm reasoning.ArXiv, abs/2412.09078, Dec 2412

Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing llm reasoning.ArXiv, abs/2412.09078, Dec 2412

work page arXiv

[35] [35]

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, J. Feng, Chen Gao, and Yong Li. Towards large reasoning models: A survey of reinforced reasoning with large language models.ArXiv, abs/250...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Adaptive inference-time compute: Llms can predict if they can do better, even mid-generation.ArXiv, abs/2410.02725, Oct 2410

Rohin Manvi, Anikait Singh, and Stefano Ermon. Adaptive inference-time compute: Llms can predict if they can do better, even mid-generation.ArXiv, abs/2410.02725, Oct 2410

work page arXiv

[37] [37]

Kwok, and Yu Zhang

Wei Li, Yanbin Wei, Qiushi Huang, Jiangyue Yan, Yang Chen, James T. Kwok, and Yu Zhang. Dynamicmind: A tri-mode thinking system for large language models.ArXiv, abs/2506.05936, Jun 2025

work page arXiv 2025

[38] [38]

arXiv preprint arXiv:2502.18080 , year =

Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. Towards thinking-optimal scaling of test-time compute for llm reasoning.ArXiv, abs/2502.18080, Feb 2502

work page arXiv

[39] [39]

Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling

Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling.ArXiv, abs/2502.06703, Feb 2025

work page arXiv 2025

[40] [40]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

C. Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.ArXiv, abs/2408.03314, Aug 2408

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

The energy cost of reasoning: Analyzing energy usage in llms with test-time compute.ArXiv, abs/2505.14733, May 2505

Yunho Jin, Gu-Yeon Wei, and David Brooks. The energy cost of reasoning: Analyzing energy usage in llms with test-time compute.ArXiv, abs/2505.14733, May 2505

work page arXiv

[42] [42]

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge, 2024

Lin Shi, Chiyu Ma, Wenhua Liang, Weicheng Ma, and Soroush V osoughi. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge, 2024

work page 2024

[43] [43]

Large Language Models are not Fair Evaluators

Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

work page 2023

[44] [44]

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding, 2025

Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, and Chris Tanner. No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding, 2025

work page 2025

[45] [45]

Self-Preference Bias in LLM-as-a-Judge, 2024

Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-Preference Bias in LLM-as-a-Judge, 2024

work page 2024

[46] [46]

Benchmarking Cognitive Biases in Large Language Models as Evaluators

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking Cognitive Biases in Large Language Models as Evaluators. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

work page 2023

[47] [47]

Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation, 2025

Tzu-Heng Huang, Harit Vishwakarma, and Frederic Sala. Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation, 2025

work page 2025

[48] [48]

Are Bias Evaluation Methods Biased ?, 2025

Lina Berrayana, Sean Rooney, Luis Garc’es-Erice, and Ioana Giurgiu. Are Bias Evaluation Methods Biased ?, 2025

work page 2025

[49] [49]

JUDICIOUS: Evaluating Robustness of Large Language Models in the Legal Realm

Multiple Authors. JUDICIOUS: Evaluating Robustness of Large Language Models in the Legal Realm. Technical report, eScholarship, University of California, 2025. URL: https://escholarship.org/content/qt3w69j2wd/qt3w69j2wd.pdf

work page 2025

[50] [50]

Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?, 2025

Ashish Sardana. Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?, 2025. 12

work page 2025

[51] [51]

Vyas Raina, Adian Liusie, and Mark J. F. Gales. Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

work page 2024

[52] [52]

Retraining distilbert for a voice shopping assistant by using universal dependencies, 2021

Pratik Jayarao and Arpit Sharma. Retraining distilbert for a voice shopping assistant by using universal dependencies, 2021

work page 2021

[53] [53]

Intent detection for code-mix utterances in task oriented dialogue systems

Pratik Jayarao and Aman Srivastava. Intent detection for code-mix utterances in task oriented dialogue systems. In2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), pages 583–587, 2018

work page 2018

[54] [54]

Multi stain graph fusion for multimodal integration in pathology

Chaitanya Dwivedi, Shima Nofallah, Maryam Pouryahya, Janani Iyer, Kenneth Leidal, Chuhan Chung, Timothy Watkins, Andrew Billin, Robert Myers, John Abel, and Ali Behrooz. Multi stain graph fusion for multimodal integration in pathology. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1835–1845, June 2022

work page 2022

[55] [55]

Yong-Jin Han

Ankur Mallick, Chaitanya Dwivedi, Bhavya Kailkhura, Gauri Joshi, and T. Yong-Jin Han. Deep kernels with probabilistic embeddings for small-data learning. In Cassio de Campos and Marloes H. Maathuis, editors,Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 ofProceedings of Machine Learning Research, pages 9...

work page 2021

[56] [56]

Countqa: How well do mllms count in the wild?arXiv preprint arXiv:2508.06585, 2025

Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, and Sahiti Yerramilli. Countqa: How well do mllms count in the wild?arXiv preprint arXiv:2508.06585, 2025

work page arXiv 2025

[57] [57]

Huemanity: Probing fine-grained visual perception in mllms.arXiv preprint arXiv:2506.03194, 2025

Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, and Nilay Pande. Huemanity: Probing fine-grained visual perception in mllms.arXiv preprint arXiv:2506.03194, 2025

work page arXiv 2025

[58] [58]

Geochain: Multimodal chain-of-thought for geographic reasoning.arXiv preprint arXiv:2506.00785, 2025

Sahiti Yerramilli, Nilay Pande, Rynaa Grover, and Jayant Sravan Tamarapalli. Geochain: Multimodal chain-of-thought for geographic reasoning.arXiv preprint arXiv:2506.00785, 2025

work page arXiv 2025

[59] [59]

Maea: Multimodal attribution for embodied ai.arXiv preprint arXiv:2307.13850, 2023

Vidhi Jain, Jayant Sravan Tamarapalli, Sahiti Yerramilli, and Yonatan Bisk. Maea: Multimodal attribution for embodied ai.arXiv preprint arXiv:2307.13850, 2023

work page arXiv 2023

[60] [60]

Attribution regularization for multimodal paradigms.arXiv preprint arXiv:2404.02359, 2024

Sahiti Yerramilli, Jayant Sravan Tamarapalli, Jonathan Francis, and Eric Nyberg. Attribution regularization for multimodal paradigms.arXiv preprint arXiv:2404.02359, 2024

work page arXiv 2024

[61] [61]

Ai guide dog: Egocentric path prediction on smartphone.Proceedings of the AAAI Symposium Series, 5(1):220–227, May 2025

Aishwarya Jadhav, Jeffery Cao, Abhishree Shetty, Urvashi Kumar, Aditi Sharma, Ben Suk- boontip, Jayant Tamarapalli, Jingyi Zhang, and Aniruddh Koul. Ai guide dog: Egocentric path prediction on smartphone.Proceedings of the AAAI Symposium Series, 5(1):220–227, May 2025

work page 2025

[62] [62]

[[A]]" /

Sahiti Yerramilli, Jayant Sravan Tamarapalli, Tanmay Girish Kulkarni, Jonathan Francis, and Eric Nyberg. Semantic augmentation in images using language.arXiv preprint arXiv:2404.02353, 2024. 13 lang Chat Chat Hard Safety Reasoning Average Chat Chat Hard Safety Reasoning Average ar 96.00% 61.19% 76.82% 74.70% 77.18% 92.92% 69.57% 81.37% 95.70% 84.89% cs 95...

work page arXiv 2024

[63] [63]

showcase the ability of using encoder-based SLMs to generate embeddings for task-oriented multi-turn dialogue systems. While [53] showcases the impact of multilingual and code-mix training on language models, [54] demonstrates the value of using disparate data sources during model training

[64] [64]

[[A]]" if assistant A is better,

apply Bayesian inference to improve model performance in the small-data regime. Recent work benchmarks Multimodal Large Language Model (MLLM) weaknesses in counting (‘CountQA’ [56]), perception (‘HueManity’ [57]), and geographic reasoning (‘GeoChain’ [58]). To address underlying flaws like unimodal dominance, research has explored multimodal attribution (‘MAE...

[65] [66]

Checks:
- Instruction following: directly satisfies the User Question and all stated constraints.
- Factuality: statements are correct and non-speculative.
- Completeness: all required parts are covered without gaps.
- Clarity: clear, organized, easy to follow.
- Reasoning-aware: if steps are shown, they are consistent and lead to a correct result (steps ...

[66] [67]

Penalize: confident errors, ignored constraints, irrelevant fluff


[67] [68]

Decision: choose Assistant A if it better satisfies these checks for the User Question; otherwise choose Assistant B
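The Checks, Penalize, and Decision items of this rubric can be assembled into a pairwise judge prompt. The sketch below is illustrative only and is not the authors' prompt template: `Rubric`, `render_prompt`, and `parse_verdict` are hypothetical names, the `[[A]]`/`[[B]]` verdict wording is assumed from the `[[A]]` fragment that appears elsewhere on this page, and the truncated Reasoning-aware check is left out rather than guessed.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rubric:
    title: str
    checks: List[str]
    penalize: str
    decision: str


# Text copied from the alpacaeval-easy rubric; the truncated check is omitted.
ALPACAEVAL_EASY = Rubric(
    title="alpacaeval-easy",
    checks=[
        "Instruction following: directly satisfies the User Question and all stated constraints.",
        "Factuality: statements are correct and non-speculative.",
        "Completeness: all required parts are covered without gaps.",
        "Clarity: clear, organized, easy to follow.",
    ],
    penalize="confident errors, ignored constraints, irrelevant fluff",
    decision=(
        "choose Assistant A if it better satisfies these checks for the "
        "User Question; otherwise choose Assistant B"
    ),
)


def render_prompt(rubric: Rubric, question: str, answer_a: str, answer_b: str) -> str:
    """Lay out the rubric and the two candidate answers as one prompt string."""
    checks = "\n".join(f"- {check}" for check in rubric.checks)
    return (
        f"Checks:\n{checks}\n"
        f"Penalize: {rubric.penalize}\n"
        f"Decision: {rubric.decision}\n"
        'Answer with "[[A]]" or "[[B]]".\n\n'  # assumed verdict wording
        f"User Question: {question}\n"
        f"Assistant A: {answer_a}\n"
        f"Assistant B: {answer_b}\n"
    )


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return the last [[A]] or [[B]] tag in the output, or None if absent."""
    matches = re.findall(r"\[\[([AB])\]\]", judge_output)
    return matches[-1] if matches else None
```

Taking the last tag rather than the first is a design choice for thinking models, whose reasoning text may mention a tag before the final decision.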


[68] [69]

Neutrality: ignore presentation order, assistant names, and response length; ignore decorative style.

Figure 7: Rubric for MRewardBench subset: alpacaeval-easy

Rubric for MRewardBench subset: alpacaeval-hard
Pairwise judge for instruction following (hard).
Steps:

[69] [71]

Checks:
- Multi-constraint satisfaction: meets all explicit constraints, formats, and edge cases.
- Factual rigor: accurate, grounded; no hallucinations.
- Disambiguation: sensibly resolves underspecification and states assumptions when needed.
- Reasoning-aware: if steps are shown, they are sound and consistent.
- Clarity: structured, readable, non-verbose.

[70] [72]

Penalize: constraint violations, invented details, overconfident but wrong logic


[71] [73]

Decision: choose Assistant A if it better satisfies constraints and correctness (and handles ambiguity/clarity better when close); otherwise choose Assistant B
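The neutrality requirement to ignore presentation order can be probed by judging the same pair twice with the positions swapped and checking that the same underlying response wins. The sketch below is a hedged illustration, not the paper's evaluation harness: `Judge`, `is_position_consistent`, and both toy judges are hypothetical, and a real judge would call a model instead.

```python
from typing import Callable

# A judge takes (question, answer shown as A, answer shown as B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def is_position_consistent(judge: Judge, question: str, resp_1: str, resp_2: str) -> bool:
    """True if the same underlying response wins in both presentation orders."""
    original = judge(question, resp_1, resp_2)  # "A" refers to resp_1 here
    swapped = judge(question, resp_2, resp_1)   # "A" refers to resp_2 here
    winner_original = 1 if original == "A" else 2
    winner_swapped = 2 if swapped == "A" else 1
    return winner_original == winner_swapped


def always_first(question: str, answer_a: str, answer_b: str) -> str:
    """Toy judge with pure positional bias: always prefers the first slot."""
    return "A"


def prefers_exact_match(question: str, answer_a: str, answer_b: str) -> str:
    """Toy content-based judge: prefers the answer equal to the string '4'."""
    return "A" if answer_a.strip() == "4" else "B"


print(is_position_consistent(always_first, "What is 2+2?", "4", "5"))
print(is_position_consistent(prefers_exact_match, "What is 2+2?", "4", "5"))
```

The positionally biased toy judge fails the check and the content-based one passes it. The fraction of pairs that pass is one simple way to express consistency under positional bias.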


[72] [74]

Neutrality: ignore presentation order, assistant names, and length.

Figure 8: Rubric for MRewardBench subset: alpacaeval-hard

Rubric for MRewardBench subset: alpacaeval-length
Pairwise judge for length-bias stress.
Steps:

[73] [76]

Checks (judge content, not length):
- Instruction adherence: precisely satisfies the User Question.
- Factuality: correct and grounded.
- Efficiency of content: avoids padding; every sentence adds value.
- Reasoning-aware: steps, if present, are consistent and correct.

[74] [77]

Penalize: padding/verbosity without value, missed constraints, inaccuracies


[75] [78]

Decision: choose Assistant A if its content better fulfills the checks; otherwise choose Assistant B. Do not use length as a tie-breaker.

[76] [79]

Neutrality: ignore presentation order, assistant names, and response length.

Figure 9: Rubric for MRewardBench subset: alpacaeval-length

Rubric for MRewardBench subset: mt-bench-easy
Pairwise judge for multi-turn dialogue (easy).
Steps:

[77] [80]

Read the full conversation context in the User Question and the final-turn answers from Assistant A and Assistant B


[78] [81]

Checks:
- Turn consistency: tracks prior turns; no contradictions.
- Final task fulfillment: satisfies the final request/format in the User Question.
- Factual accuracy: information is correct.
- Clarity & tone: clear, appropriately concise, helpful.
- Reasoning-aware: if steps are shown, they are coherent with the dialogue.

[79] [82]

Penalize: loss of context, incorrect facts, meandering/off-task replies


[80] [83]

Decision: choose Assistant A if it better fulfills the final turn while staying consistent and accurate; otherwise choose Assistant B
