
arxiv: 2509.13332 · v2 · submitted 2025-09-09 · 💻 cs.AI · cs.CL

Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness

Pith reviewed 2026-05-18 17:27 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM as judge · explicit reasoning · thinking models · accuracy and efficiency · bias robustness · RewardBench · multilingual evaluation

The pith

As LLM judges, thinking models score roughly 10 percentage points higher in accuracy than non-thinking ones at under twice the compute cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares thinking and non-thinking LLMs as automated judges using small Qwen 3 models (0.6B, 1.7B, and 4B parameters). Thinking models generate explicit reasoning before deciding and reach roughly ten percentage points higher accuracy on RewardBench tasks at under twice the compute. Non-thinking models stay behind even with augmentation strategies such as few-shot examples, rubrics, reference-based checks, and n-best aggregation; few-shot prompting, for example, costs more than eight times the baseline computation for only modest gains. Thinking models also hold up better under positional, bandwagon, identity, diversity, and random bias conditions, showing six percent higher consistency on average. The same pattern appears in multilingual tests.

Core claim

The central claim is that explicit reasoning improves LLM performance in the judge role. Thinking models achieve approximately 10 percentage points higher accuracy with under 2x overhead, while augmentation strategies for non-thinking models produce only modest gains at over 8x cost. Thinking models also show significantly greater consistency across positional, bandwagon, identity, diversity, and random bias conditions, averaging 6 percent higher robustness, and these benefits extend to multilingual settings.
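To make the overhead figures concrete, here is a minimal sketch of how a relative-cost multiplier can be estimated. It assumes the common rule of thumb that a dense decoder spends about 2 FLOPs per parameter per processed token. The function names and all token counts are hypothetical placeholders chosen for illustration; they are not taken from the paper, whose own FLOPs accounting may differ.

```python
def approx_flops(n_params: float, prompt_tokens: int, output_tokens: int) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params * (prompt_tokens + output_tokens)


def relative_cost(n_params: float, baseline: tuple, variant: tuple) -> float:
    """Cost of `variant` divided by cost of `baseline`; each is (prompt, output) tokens."""
    return approx_flops(n_params, *variant) / approx_flops(n_params, *baseline)


N_PARAMS = 4e9  # nominal size of the largest model studied (4B)

# Hypothetical (prompt_tokens, output_tokens) per judgment, for illustration only.
BASELINE = (600, 10)     # direct verdict, no reasoning
THINKING = (600, 500)    # same prompt plus a reasoning trace
FEW_SHOT = (5400, 10)    # prompt inflated by in-context examples

thinking_ratio = relative_cost(N_PARAMS, BASELINE, THINKING)
few_shot_ratio = relative_cost(N_PARAMS, BASELINE, FEW_SHOT)
print(f"thinking: {thinking_ratio:.2f}x, few-shot: {few_shot_ratio:.2f}x")
```

Under this approximation the parameter count cancels in the ratio, so the multiplier depends only on how many tokens each setting processes. A reasoning trace adds output tokens once, whereas in-context examples enlarge every prompt.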

What carries the argument

The direct comparison of thinking models that output explicit reasoning steps before a final judgment against non-thinking models that output judgments directly, measured on RewardBench tasks for accuracy, FLOPs, and bias consistency.
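One way to read "bias consistency" for the positional case is sketched below: ask the judge the same question twice with the two responses swapped, and count how often the same underlying response wins. The `Judge` signature, the "A"/"B" verdict labels, and the two toy judges are assumptions made for this example; the paper's exact metric may be defined differently.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: (question, first_response, second_response) -> "A" or "B"
Judge = Callable[[str, str, str], str]


def positional_consistency(judge: Judge, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the same response wins in both presentation orders."""
    if not pairs:
        raise ValueError("need at least one (question, response_1, response_2) triple")
    consistent = 0
    for question, resp_1, resp_2 in pairs:
        verdict = judge(question, resp_1, resp_2)
        verdict_swapped = judge(question, resp_2, resp_1)
        # Map positional labels back to the underlying responses.
        winner = resp_1 if verdict == "A" else resp_2
        winner_swapped = resp_2 if verdict_swapped == "A" else resp_1
        consistent += winner == winner_swapped
    return consistent / len(pairs)


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias."""
    return "A"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge that ignores position and picks the longer response."""
    return "A" if len(first) >= len(second) else "B"


PAIRS = [("q1", "short", "a longer answer"), ("q2", "detailed reply here", "ok")]
biased_score = positional_consistency(always_first, PAIRS)
stable_score = positional_consistency(prefers_longer, PAIRS)
```

A judge that always picks the first slot scores 0.0 here, while one whose choice depends only on content scores 1.0. Real judges fall in between, which is where a thinking vs. non-thinking gap would show up.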

If this is right

  • Thinking models provide a better accuracy-efficiency trade-off for automated judging than prompting or aggregation enhancements to non-thinking models.
  • Explicit reasoning makes judgments more consistent under positional, bandwagon, identity, diversity, and random bias conditions, by roughly 6 percent on average.
  • The accuracy and robustness advantages of reasoning persist when evaluation moves beyond English to other languages.
  • Complex augmentation pipelines for direct-output models deliver smaller returns at substantially higher computational expense.
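As an illustration of why aggregation-style augmentation gets expensive, the sketch below shows one plausible form of n-best aggregation: sample n verdicts from the judge and take a majority vote. The function name and the tie handling are assumptions for this example; the paper's exact aggregation rule is not described here.

```python
from collections import Counter
from typing import Sequence


def aggregate_n_best(verdicts: Sequence[str]) -> str:
    """Majority vote over sampled verdicts; returns "tie" if the top two counts match."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


majority = aggregate_n_best(["A", "B", "A"])
split = aggregate_n_best(["A", "B"])
```

Each sampled verdict is a full judge call, so the compute for this scheme grows linearly with n, before any accuracy gain is counted.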

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Judge systems in benchmarks or reward modeling could simplify by defaulting to reasoning-enabled models instead of layered prompting tricks.
  • Similar gains might appear in other decision tasks where LLMs must stay consistent under biased or noisy inputs.
  • Training objectives that encourage step-by-step reasoning could become standard for models intended as evaluators.

Load-bearing premise

The performance gaps arise mainly from the presence or absence of explicit reasoning rather than from unstated differences in how the thinking and non-thinking model variants were trained or prompted.

What would settle it

Train or fine-tune otherwise identical base models with and without an explicit reasoning objective, then run both on the same RewardBench and bias suites to check whether the accuracy and robustness gaps remain.

Figures

Figures reproduced from arXiv: 2509.13332 by Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Pratik Jayarao.

Figure 1: Demonstrating Qwen-3 4B as a judge under thinking vs. non-thinking mode with various …
Figure 2: The plots compare average accuracy against relative computational cost (FLOPs) for …
Figure 3: Prompt for Baseline setting
Figure 4: Prompt for LLMaaJ w In Context Examples
Figure 5: Prompt for LLMaaJ w Reference
Figure 6: Prompt for LLMaaJ w Rubric Prompt
Figure 7: Rubric for MRewardBench subset: alpacaeval-easy
Figure 8: Rubric for MRewardBench subset: alpacaeval-hard
Figure 9: Rubric for MRewardBench subset: alpacaeval-length
Figure 10: Rubric for MRewardBench subset: mt-bench-easy
Figure 11: Rubric for MRewardBench subset: mt-bench-med
Figure 12: Rubric for MRewardBench subset: mt-bench-hard
Figure 13: Rubric for MRewardBench subset: llmbar-natural
Figure 14: Rubric for MRewardBench subset: llmbar-adver-neighbor
Figure 15: Rubric for MRewardBench subset: llmbar-adver-GPTInst
Figure 16: Rubric for MRewardBench subset: llmbar-adver-GPTOut
Figure 17: Rubric for MRewardBench subset: llmbar-adver-manual
Figure 18: Rubric for MRewardBench subset: refusals-dangerous
Figure 19: Rubric for MRewardBench subset: refusals-offensive
Figure 20: Rubric for MRewardBench subset: xstest-should-refuse
Figure 21: Rubric for MRewardBench subset: xstest-should-respond
Figure 22: Rubric for MRewardBench subset: donotanswer
Figure 23: Rubric for MRewardBench subset: hep-python
Figure 24: Rubric for MRewardBench subset: hep-js
Figure 25: Rubric for MRewardBench subset: hep-java
Figure 26: Rubric for MRewardBench subset: hep-go
Figure 27: Rubric for MRewardBench subset: hep-cpp
Figure 28: Rubric for MRewardBench subset: hep-rust
Figure 29: Rubric for MRewardBench subset: math-prm
Figure 30: Verbosity Prompt
Figure 31: Prompt to evaluate LLMaaJ w Bandwagon Bias
Figure 32: Prompt to evaluate LLMaaJ w Diversity Bias
Figure 33: Prompt to evaluate LLMaaJ w Identity Bias
Figure 34: Prompt to evaluate LLMaaJ w Distraction Bias

Original abstract

As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of "thinking" and "non-thinking" LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Our results show that thinking models achieve approximately 10% points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our work results in several important findings that provide systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts a systematic empirical comparison of 'thinking' (explicit reasoning) versus 'non-thinking' variants of small open-source Qwen3 models (0.6B, 1.7B, 4B parameters) used as LLM judges. On RewardBench tasks, it reports that thinking models deliver approximately 10 percentage points higher accuracy with under 2x computational overhead, outperforming non-thinking models even after applying augmentation strategies such as few-shot learning (>8x cost), rubric-guided judging, reference-based evaluation, and n-best aggregation. Additional analyses show thinking models exhibit 6% higher average consistency under positional, bandwagon, identity, diversity, and random biases, with benefits extending to multilingual settings.

Significance. If the central comparison is fair and the quantitative results are reproducible, the work supplies concrete evidence that explicit reasoning steps at inference time can improve accuracy, efficiency, and robustness in the LLM-as-a-judge setting. This is relevant for reward modeling and automated evaluation pipelines. The evaluation across three model scales, multiple augmentation baselines, bias conditions, and a multilingual extension constitutes a systematic contribution that could inform practical choices between reasoning-enabled and augmented non-reasoning judges.

major comments (2)
  1. The abstract and experimental claims attribute the ~10pp accuracy advantage and 6% robustness gain primarily to the presence of explicit reasoning at inference time. However, it is not stated whether the thinking and non-thinking Qwen3 variants (0.6B/1.7B/4B) share identical base pre-training and post-training or whether the thinking variants received additional reasoning-oriented fine-tuning or data. This distinction is load-bearing for the central claim, because any upstream differences would confound the comparison with augmented non-thinking baselines (few-shot, rubric, etc.).
  2. The abstract reports specific quantitative results (10-point accuracy gap, 6% robustness gain, cost multipliers under 2x vs. >8x) without reference to statistical tests, exact prompt templates, data splits, number of evaluation runs, or variance across random seeds. These details are necessary to establish that the reported differences reliably support the superiority claims rather than reflecting prompt sensitivity or single-run noise.
minor comments (1)
  1. The abstract contains a repeated sentence ('Our results show that...') that could be consolidated for conciseness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful and constructive comments. We address each major comment below, providing clarifications and committing to revisions that strengthen the transparency and rigor of our claims without altering the core findings.

point-by-point responses
  1. Referee: The abstract and experimental claims attribute the ~10pp accuracy advantage and 6% robustness gain primarily to the presence of explicit reasoning at inference time. However, it is not stated whether the thinking and non-thinking Qwen3 variants (0.6B/1.7B/4B) share identical base pre-training and post-training or whether the thinking variants received additional reasoning-oriented fine-tuning or data. This distinction is load-bearing for the central claim, because any upstream differences would confound the comparison with augmented non-thinking baselines (few-shot, rubric, etc.).

    Authors: We thank the referee for identifying this important point of clarification. The Qwen3 thinking and non-thinking variants share the same base pre-training data and model architecture as released in the official Qwen3 series. The primary difference lies in post-training: the thinking variants receive additional supervised fine-tuning on reasoning traces and chain-of-thought data to enable explicit reasoning at inference time, while the non-thinking variants use the base post-training without this emphasis. This setup allows us to isolate the effect of explicit reasoning steps during judging. To address the concern, we will revise the manuscript by adding a new paragraph in Section 3 (Experimental Setup) that explicitly describes the shared pre-training, the post-training distinctions with citations to the Qwen3 technical report, and how this relates to the augmentation baselines. This ensures the central claim is not confounded. revision: yes

  2. Referee: The abstract reports specific quantitative results (10-point accuracy gap, 6% robustness gain, cost multipliers under 2x vs. >8x) without reference to statistical tests, exact prompt templates, data splits, number of evaluation runs, or variance across random seeds. These details are necessary to establish that the reported differences reliably support the superiority claims rather than reflecting prompt sensitivity or single-run noise.

    Authors: We agree that referencing these methodological details is essential for supporting the quantitative claims. The full manuscript already provides the exact prompt templates (including thinking and non-thinking variants) in Appendix A, describes the RewardBench data splits and evaluation protocol in Section 3.1, and notes that main results are averaged over three independent runs with different random seeds. However, we did not include formal statistical tests or explicit variance reporting in the abstract or main results tables. In the revision, we will (1) add cross-references in the abstract to the relevant sections and appendix, (2) report standard deviations alongside the mean accuracies and robustness scores, and (3) add statistical significance tests (e.g., paired t-tests or McNemar’s test) for the key comparisons between thinking and augmented non-thinking models. These additions will be placed in Section 4 and a new subsection on statistical analysis. revision: yes
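For readers unfamiliar with the McNemar's test mentioned in this response, the following is a minimal standard-library sketch of the exact (binomial) version applied to two judges scored on the same items. The function name and the toy per-item correctness vectors are made up for illustration and are not results from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-item correctness.

    Returns (a_only, b_only, p_value), where a_only counts items only judge A
    got right and b_only counts items only judge B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both judges must be scored on the same items")
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only  # only discordant pairs carry information
    if n == 0:
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2.0 * tail)


# Toy data: judge A is right on all 10 items, judge B on only the first 2.
thinking_correct = [True] * 10
non_thinking_correct = [True] * 2 + [False] * 8
a_only, b_only, p_value = mcnemar_exact(thinking_correct, non_thinking_correct)
```

The test ignores items both judges get right or both get wrong, which suits this setting because thinking and non-thinking judges are evaluated on the identical RewardBench items rather than on independent samples.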

Circularity Check

0 steps flagged

No circularity in empirical comparison of thinking vs non-thinking judges

full rationale

This paper reports direct experimental results from evaluating open-source Qwen3 thinking and non-thinking model variants on RewardBench and related tasks for accuracy, FLOPs efficiency, augmentation strategies, and bias robustness. There are no mathematical derivations, equations, fitted parameters, or first-principles claims that reduce reported outcomes to quantities defined by the experiment itself. Claims rest on standard benchmark measurements and comparisons to public baselines rather than self-definitional loops, self-citation load-bearing uniqueness theorems, or ansatzes smuggled via prior work. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions in the LLM-as-a-judge literature plus the premise that the thinking/non-thinking distinction is cleanly implemented in the chosen model family.

axioms (1)
  • Domain assumption: Small Qwen 3 models can be reliably configured to produce explicit reasoning traces or direct judgments.
    The paper treats the thinking versus non-thinking distinction as a controllable experimental variable.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lost in Translation: Do LVLM Judges Generalize Across Languages?

    cs.CL 2026-04 unverdicted novelty 8.0

    MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

Reference graph

Works this paper leans on

162 extracted references · 162 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Xing, Haotong Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, E. Xing, Haotong Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems, 2023

  2. [2]

    Can Large Language Models Be an Alternative to Human Evaluations? InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

    Cheng-Han Chiang and Hung-yi Lee. Can Large Language Models Be an Alternative to Human Evaluations? InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

  3. [3]

    Leveraging large language models for NLG evaluation: Advances and challenges

    Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. Leveraging large language models for NLG evaluation: Advances and challenges. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16028–16045, Miami, Florida, U...

  4. [4]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, 2024

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, 2024

  5. [5]

    An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4

    Hui Huang, Xingyuan Bu, Hongli Zhou, Yingqi Qu, Jing Liu, Muyun Yang, Bing Xu, and Tiejun Zhao. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguist...

  6. [6]

    Self-Taught Evaluators, 2024

    Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li. Self-Taught Evaluators, 2024

  7. [7]

    Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

    Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  8. [8]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  9. [9]

    RewardBench: Evaluating reward models for language modeling, 2024

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. RewardBench: Evaluating reward models for language modeling, 2024

  10. [10]

    M-RewardBench: Evaluating reward models in multilingual settings

    Srishti Gureja, Lester James Validad Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Triandi Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, and Marzieh Fadaee. M-RewardBench: Evaluating reward models in multilingual settings. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of...

  11. [11]

    A Survey on LLM-as-a-Judge, 2024

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, and Jian Guo. A Survey on LLM-as-a-Judge, 2024

  12. [12]

    Justice or prejudice? quantifying biases in llm-as-a-judge, 2024

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge, 2024. 10

  13. [13]

    From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge, 2024

    Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge, 2024

  14. [14]

    Optimization-based Prompt Injection Attack to LLM-as-a-Judge

    Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. Optimization-based Prompt Injection Attack to LLM-as-a-Judge. InProceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2024

  15. [15]

    Chang, and Prithviraj Ammanabrolu

    Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, and Prithviraj Ammanabrolu. Critique-out-Loud Reward Models, 2024

  16. [16]

    ChatGPT outperforms crowd workers for text-annotation tasks.Proceedings of the National Academy of Sciences, 2023

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. ChatGPT outperforms crowd workers for text-annotation tasks.Proceedings of the National Academy of Sciences, 2023

  17. [17]

    BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge

    Terry Tong, Fei Wang, Zhe Zhao, and Muhao Chen. BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge. InInternational Conference on Learning Representations, 2025

  18. [18]

    CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks, 2025

    Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, and Rong Tan. CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks, 2025

  19. [19]

    Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons, 2025

    Isik Baran Sandan, Tu Anh Dinh, and Jan Niehues. Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons, 2025

  20. [20]

    How Reliable is Multilingual LLM-as-a-Judge?, 2025

    Xiyan Fu and Wei Liu. How Reliable is Multilingual LLM-as-a-Judge?, 2025

  21. [21]

    UQLM: A Python Package for Uncertainty Quantification in Large Language Models, 2025

    Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, and Zeya Ahmad. UQLM: A Python Package for Uncertainty Quantification in Large Language Models, 2025

  22. [22]

    Ryan, Danmei Xu, Chris Nivera, and Daniel Campos

    Michael J. Ryan, Danmei Xu, Chris Nivera, and Daniel Campos. EnronQA: Towards Personal- ized RAG over Private Documents, 2025

  23. [23]

    Weyssow, Aton Kamanda, Xin Zhou, and H

    M. Weyssow, Aton Kamanda, Xin Zhou, and H. Sahraoui. CodeUltraFeedback: An LLM-as-a- Judge Dataset for Aligning Large Language Models to Coding Preferences.ACM Transactions on Software Engineering and Methodology, 2024

  24. [24]

    YESciEval: Robust LLM-as-a- Judge for Scientific Question Answering

    Jennifer D’Souza, Hamed Babaei Giglou, and Quentin Münch. YESciEval: Robust LLM-as-a- Judge for Scientific Question Answering. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  25. [25]

    Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, J

    E. Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, J. Caskey, M. Oguss, Graham Wills, Guanhua Chen, D. Dligach, et al. Automating Evaluation of AI Text Generation in Healthcare with a Large Language Model (LLM)-as-a-Judge, 2025

  26. [26]

    Brill, 2025

    Giuseppe Contissa and Galileo Sartor.Large Language Models in the Justice Domain. Brill, 2025

  27. [27]

    When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance, 2025

    Peizhang Shao, Linrui Xu, Jinxi Wang, Wei Zhou, and Xingyu Wu. When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance, 2025

  28. [28]

    Test-time computing: from system-1 thinking to system-2 thinking.ArXiv, abs/2501.02497, 2025

    Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, and Min Zhang. Test-time computing: from system-1 thinking to system-2 thinking.ArXiv, abs/2501.02497, 2025

  29. [29]

    A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems.Trans

    Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, PeiFeng Wang, Silvio Savarese, Caiming Xiong, and Shafiq Joty. A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems.Trans. Mach. Learn. Res., 2025, Apr 2025

  30. [30]

    Efficient inference for large reasoning models: A survey,

    Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, and Bryan Hooi. Efficient inference for large reasoning models: A survey.ArXiv, abs/2503.23077, Mar 2025

  31. [31]

    Yuxiao Qu, Matthew Y . R. Yang, Amrith Rajagopal Setlur, Lewis Tunstall, Edward Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning.ArXiv, abs/2503.07572, Mar 2025. 11

  32. [32]

    Bag of tricks for inference-time computa- tion of llm reasoning.arXiv preprint arXiv:2502.07191, 2025

    Fan Liu, WenShuo Chao, Naiqiang Tan, and Hao Liu. Bag of tricks for inference-time computa- tion of llm reasoning.ArXiv, abs/2502.07191, Feb 2502

  33. [33]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wangxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.ArXiv, abs/2503.09567, Mar 2025

  34. [34]

    Forest-of-thought: Scaling test-time compute for enhancing llm reasoning.ArXiv, abs/2412.09078, Dec 2412

    Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. Forest-of-thought: Scaling test-time compute for enhancing llm reasoning.ArXiv, abs/2412.09078, Dec 2412

  35. [35]

    Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, J. Feng, Chen Gao, and Yong Li. Towards large reasoning models: A survey of reinforced reasoning with large language models.ArXiv, abs/250...

  36. [36]

    Adaptive inference-time compute: Llms can predict if they can do better, even mid-generation.ArXiv, abs/2410.02725, Oct 2410

    Rohin Manvi, Anikait Singh, and Stefano Ermon. Adaptive inference-time compute: Llms can predict if they can do better, even mid-generation.ArXiv, abs/2410.02725, Oct 2410

  37. [37]

    Kwok, and Yu Zhang

    Wei Li, Yanbin Wei, Qiushi Huang, Jiangyue Yan, Yang Chen, James T. Kwok, and Yu Zhang. Dynamicmind: A tri-mode thinking system for large language models.ArXiv, abs/2506.05936, Jun 2025

  38. [38]

    arXiv preprint arXiv:2502.18080 , year =

    Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. Towards thinking-optimal scaling of test-time compute for llm reasoning.ArXiv, abs/2502.18080, Feb 2502

  39. [39]

    Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling

    Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling. ArXiv, abs/2502.06703, Feb 2025

  40. [40]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    C. Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. ArXiv, abs/2408.03314, Aug 2024

  41. [41]

    The energy cost of reasoning: Analyzing energy usage in llms with test-time compute

    Yunho Jin, Gu-Yeon Wei, and David Brooks. The energy cost of reasoning: Analyzing energy usage in llms with test-time compute. ArXiv, abs/2505.14733, May 2025

  42. [42]

    Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge, 2024

    Lin Shi, Chiyu Ma, Wenhua Liang, Weicheng Ma, and Soroush Vosoughi. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge, 2024

  43. [43]

    Large Language Models are not Fair Evaluators

    Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

  44. [44]

    No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding, 2025

    Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, and Chris Tanner. No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding, 2025

  45. [45]

    Self-Preference Bias in LLM-as-a-Judge, 2024

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-Preference Bias in LLM-as-a-Judge, 2024

  46. [46]

    Benchmarking Cognitive Biases in Large Language Models as Evaluators

    Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking Cognitive Biases in Large Language Models as Evaluators. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

  47. [47]

    Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation, 2025

    Tzu-Heng Huang, Harit Vishwakarma, and Frederic Sala. Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation, 2025

  48. [48]

    Are Bias Evaluation Methods Biased?, 2025

    Lina Berrayana, Sean Rooney, Luis Garcés-Erice, and Ioana Giurgiu. Are Bias Evaluation Methods Biased?, 2025

  49. [49]

    JUDICIOUS: Evaluating Robustness of Large Language Models in the Legal Realm

    Multiple Authors. JUDICIOUS: Evaluating Robustness of Large Language Models in the Legal Realm. Technical report, eScholarship, University of California, 2025. URL: https://escholarship.org/content/qt3w69j2wd/qt3w69j2wd.pdf

  50. [50]

    Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?, 2025

    Ashish Sardana. Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?, 2025

  51. [51]

    Vyas Raina, Adian Liusie, and Mark J. F. Gales. Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  52. [52]

    Retraining distilbert for a voice shopping assistant by using universal dependencies, 2021

    Pratik Jayarao and Arpit Sharma. Retraining distilbert for a voice shopping assistant by using universal dependencies, 2021

  53. [53]

    Intent detection for code-mix utterances in task oriented dialogue systems

    Pratik Jayarao and Aman Srivastava. Intent detection for code-mix utterances in task oriented dialogue systems. In 2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), pages 583–587, 2018

  54. [54]

    Multi stain graph fusion for multimodal integration in pathology

    Chaitanya Dwivedi, Shima Nofallah, Maryam Pouryahya, Janani Iyer, Kenneth Leidal, Chuhan Chung, Timothy Watkins, Andrew Billin, Robert Myers, John Abel, and Ali Behrooz. Multi stain graph fusion for multimodal integration in pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1835–1845, June 2022

  55. [55]

    Deep kernels with probabilistic embeddings for small-data learning

    Ankur Mallick, Chaitanya Dwivedi, Bhavya Kailkhura, Gauri Joshi, and T. Yong-Jin Han. Deep kernels with probabilistic embeddings for small-data learning. In Cassio de Campos and Marloes H. Maathuis, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pages 9...

  56. [56]

    Countqa: How well do mllms count in the wild?

    Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, and Sahiti Yerramilli. Countqa: How well do mllms count in the wild? arXiv preprint arXiv:2508.06585, 2025

  57. [57]

    Huemanity: Probing fine-grained visual perception in mllms

    Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, and Nilay Pande. Huemanity: Probing fine-grained visual perception in mllms. arXiv preprint arXiv:2506.03194, 2025

  58. [58]

    Geochain: Multimodal chain-of-thought for geographic reasoning

    Sahiti Yerramilli, Nilay Pande, Rynaa Grover, and Jayant Sravan Tamarapalli. Geochain: Multimodal chain-of-thought for geographic reasoning. arXiv preprint arXiv:2506.00785, 2025

  59. [59]

    Maea: Multimodal attribution for embodied ai

    Vidhi Jain, Jayant Sravan Tamarapalli, Sahiti Yerramilli, and Yonatan Bisk. Maea: Multimodal attribution for embodied ai. arXiv preprint arXiv:2307.13850, 2023

  60. [60]

    Attribution regularization for multimodal paradigms

    Sahiti Yerramilli, Jayant Sravan Tamarapalli, Jonathan Francis, and Eric Nyberg. Attribution regularization for multimodal paradigms. arXiv preprint arXiv:2404.02359, 2024

  61. [61]

    Ai guide dog: Egocentric path prediction on smartphone

    Aishwarya Jadhav, Jeffery Cao, Abhishree Shetty, Urvashi Kumar, Aditi Sharma, Ben Sukboontip, Jayant Tamarapalli, Jingyi Zhang, and Aniruddh Koul. Ai guide dog: Egocentric path prediction on smartphone. Proceedings of the AAAI Symposium Series, 5(1):220–227, May 2025

  62. [62]

    Semantic augmentation in images using language

    Sahiti Yerramilli, Jayant Sravan Tamarapalli, Tanmay Girish Kulkarni, Jonathan Francis, and Eric Nyberg. Semantic augmentation in images using language. arXiv preprint arXiv:2404.02353, 2024.

    [Table residue: per-language results. Columns: lang, then Chat, Chat Hard, Safety, Reasoning, Average, with the five score columns repeated twice.]
    ar: 96.00%, 61.19%, 76.82%, 74.70%, 77.18% | 92.92%, 69.57%, 81.37%, 95.70%, 84.89%
    cs: 95...

  63. [63]

    showcase the ability of encoder-based SLMs to generate embeddings for task-oriented multi-turn dialogue systems, while [53] showcases the impact of multilingual and code-mix training on language models. [54] demonstrates the value of using disparate data sources during model training

  64. [64]

    [[A]]" if assistant A is better,

    apply Bayesian inference to improve model performance in the small-data regime. Recent work benchmarks Multimodal Large Language Model (MLLM) weaknesses in counting (CountQA [56]), perception (HueManity [57]), and geographic reasoning (GeoChain [58]). To address underlying flaws like unimodal dominance, research has explored multimodal attribution (‘MAE...

  65. [66]

    Checks:
    - Instruction following: directly satisfies the User Question and all stated constraints.
    - Factuality: statements are correct and non-speculative.
    - Completeness: all required parts are covered without gaps.
    - Clarity: clear, organized, easy to follow.
    - Reasoning-aware: if steps are shown, they are consistent and lead to a correct result (steps ...

  66. [67]

    Penalize: confident errors, ignored constraints, irrelevant fluff

  67. [68]

    Decision: choose Assistant A if it better satisfies these checks for the User Question; otherwise choose Assistant B

  68. [69]

    Neutrality: ignore presentation order, assistant names, and response length; ignore decorative style.

    Figure 7: Rubric for MRewardBench subset: alpacaeval-easy

    Rubric for MRewardBench subset: alpacaeval-hard
    Pairwise judge for instruction following (hard). Steps:

  69. [71]

    Checks:
    - Multi-constraint satisfaction: meets all explicit constraints, formats, and edge cases.
    - Factual rigor: accurate, grounded; no hallucinations.
    - Disambiguation: sensibly resolves underspecification and states assumptions when needed.
    - Reasoning-aware: if steps are shown, they are sound and consistent.
    - Clarity: structured, readable, non-verbose

  70. [72]

    Penalize: constraint violations, invented details, overconfident but wrong logic

  71. [73]

    Decision: choose Assistant A if it better satisfies constraints and correctness (and handles ambiguity/clarity better when close); otherwise choose Assistant B

  72. [74]

    Neutrality: ignore presentation order, assistant names, and length.

    Figure 8: Rubric for MRewardBench subset: alpacaeval-hard

    Rubric for MRewardBench subset: alpacaeval-length
    Pairwise judge for length-bias stress. Steps:

  73. [76]

    Checks (judge content, not length):
    - Instruction adherence: precisely satisfies the User Question.
    - Factuality: correct and grounded.
    - Efficiency of content: avoids padding; every sentence adds value.
    - Reasoning-aware: steps, if present, are consistent and correct

  74. [77]

    Penalize: padding/verbosity without value, missed constraints, inaccuracies

  75. [78]

    Decision: choose Assistant A if its content better fulfills the checks; otherwise choose Assistant B. Do not use length as a tie-breaker

  76. [79]

    Neutrality: ignore presentation order, assistant names, and response length.

    Figure 9: Rubric for MRewardBench subset: alpacaeval-length

    Rubric for MRewardBench subset: mt-bench-easy
    Pairwise judge for multi-turn dialogue (easy). Steps:

  77. [80]

    Read the full conversation context in the User Question and the final-turn answers from Assistant A and Assistant B

  78. [81]

    Checks:
    - Turn consistency: tracks prior turns; no contradictions.
    - Final task fulfillment: satisfies the final request/format in the User Question.
    - Factual accuracy: information is correct.
    - Clarity & tone: clear, appropriately concise, helpful.
    - Reasoning-aware: if steps are shown, they are coherent with the dialogue

  79. [82]

    Penalize: loss of context, incorrect facts, meandering/off-task replies

  80. [83]

    Decision: choose Assistant A if it better fulfills the final turn while staying consistent and accurate; otherwise choose Assistant B
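
    The rubric fragments above share one shape: a neutrality note, a list of checks, a penalty line, and a decision rule. The Python sketch below shows one way such parts might be assembled into a pairwise judge prompt and how a verdict could be read back. It is an illustration under assumptions, not the paper's implementation: the function names `build_rubric_prompt` and `parse_verdict`, the prompt layout, and the `[[A]]` / `[[B]]` verdict markers are all hypothetical choices made for this example.

```python
import re
from typing import Optional, Sequence


def build_rubric_prompt(
    question: str,
    answer_a: str,
    answer_b: str,
    checks: Sequence[str],
    penalties: str,
    decision: str,
) -> str:
    """Assemble a pairwise judge prompt from rubric parts (hypothetical layout)."""
    check_lines = "\n".join(f"- {c}" for c in checks)
    return (
        "Neutrality: ignore presentation order, assistant names, and response length.\n"
        f"Checks:\n{check_lines}\n"
        f"Penalize: {penalties}\n"
        f"Decision: {decision}\n"
        'Answer with "[[A]]" or "[[B]]".\n\n'
        f"[User Question]\n{question}\n\n"
        f"[Assistant A]\n{answer_a}\n\n"
        f"[Assistant B]\n{answer_b}\n"
    )


# Assumed verdict format: the letter A or B wrapped in double square brackets.
VERDICT = re.compile(r"\[\[([AB])\]\]")


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last verdict marker, or None if none is found."""
    found = VERDICT.findall(judge_output)
    return found[-1] if found else None
```

    Reading the last marker rather than the first is a deliberate choice in this sketch: a judge that writes out its reasoning may mention both options before committing, so the final marker is the more plausible verdict.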

Showing first 80 references.