Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness
Pith reviewed 2026-05-18 17:27 UTC · model grok-4.3
The pith
As LLM judges, thinking models score about 10 percentage points higher in accuracy than non-thinking ones at under twice the compute cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that explicit reasoning improves LLM performance in the judge role. Thinking models achieve approximately 10 percentage points higher accuracy with under 2x overhead, while augmentation strategies for non-thinking models produce only modest gains at over 8x cost. Thinking models also show significantly greater consistency across positional, bandwagon, identity, diversity, and random bias conditions, averaging 6 percent higher robustness, and these benefits extend to multilingual settings.
What carries the argument
The argument rests on a direct comparison: thinking models, which output explicit reasoning steps before a final judgment, against non-thinking models, which output judgments directly. Both are measured on RewardBench tasks for accuracy, FLOPs, and bias consistency.
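The FLOPs side of this comparison can be approximated with a common rule of thumb: a forward pass costs roughly 2 FLOPs per parameter per processed token. The sketch below uses that rule only; it is not the paper's accounting, it ignores attention cost and any prefill/decode differences, and every token count in it is a hypothetical value chosen for illustration.

```python
def forward_flops(n_params: float, n_tokens: int) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params * n_tokens


def cost_ratio(n_params: float, baseline_tokens: int, variant_tokens: int) -> float:
    """Cost of a variant relative to the baseline for the same model size."""
    return forward_flops(n_params, variant_tokens) / forward_flops(n_params, baseline_tokens)


# All numbers below are hypothetical, not taken from the paper.
N_PARAMS = 4e9           # a 4B-parameter judge
PROMPT_TOKENS = 600      # question plus the two candidate responses

baseline_tokens = PROMPT_TOKENS + 10       # direct verdict, a few output tokens
thinking_tokens = PROMPT_TOKENS + 500      # same prompt plus a reasoning trace
few_shot_tokens = 8 * PROMPT_TOKENS + 10   # prompt inflated by in-context examples

thinking_ratio = cost_ratio(N_PARAMS, baseline_tokens, thinking_tokens)
few_shot_ratio = cost_ratio(N_PARAMS, baseline_tokens, few_shot_tokens)

print(f"thinking overhead: {thinking_ratio:.2f}x")
print(f"few-shot overhead: {few_shot_ratio:.2f}x")
```

Under this approximation the parameter count cancels when both variants use the same model size, so the overhead reduces to a ratio of processed tokens. That is one plausible reason a reasoning trace can cost less than a long few-shot prompt.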
If this is right
- Thinking models provide a better accuracy-efficiency trade-off for automated judging than prompting or aggregation enhancements to non-thinking models.
- Explicit reasoning improves consistency under positional, identity, and other common bias conditions by roughly 6 percent on average.
- The accuracy and robustness advantages of reasoning persist when evaluation moves beyond English to other languages.
- Complex augmentation pipelines for direct-output models deliver smaller returns at substantially higher computational expense.
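The positional-bias point in the list above can be made concrete with a consistency check: ask the judge the same question twice with the two responses swapped, and count it as consistent only if the same underlying response wins both times. This is a minimal sketch of that idea, not the paper's protocol; the `Judge` signature, the toy judges, and the sample items are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: (question, first_response, second_response) -> "A" or "B",
# where "A" means the first-listed response wins.
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, items: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of items where the same response wins in both presentation orders."""
    consistent = 0
    total = 0
    for question, resp_1, resp_2 in items:
        original = judge(question, resp_1, resp_2)
        swapped = judge(question, resp_2, resp_1)
        # After swapping, a consistent judge must flip its letter.
        if (original, swapped) in {("A", "B"), ("B", "A")}:
            consistent += 1
        total += 1
    return consistent / total if total else 0.0


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias: always picks the first slot."""
    return "A"


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy judge that ignores position and prefers the longer response."""
    return "A" if len(first) >= len(second) else "B"


items = [
    ("q1", "short", "a much longer answer"),
    ("q2", "another fairly long answer", "tiny"),
]
print(position_consistency(always_first, items))
print(position_consistency(longer_wins, items))
```

The length-preferring toy judge is position-consistent but still biased, which is why the source lists positional bias as only one of several conditions.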
Where Pith is reading between the lines
- Judge systems in benchmarking or reward modeling could be simplified by defaulting to reasoning-enabled models instead of layered prompting tricks.
- Similar gains might appear in other decision tasks where LLMs must stay consistent under biased or noisy inputs.
- Training objectives that encourage step-by-step reasoning could become standard for models intended as evaluators.
Load-bearing premise
The performance gaps arise mainly from the presence or absence of explicit reasoning rather than from unstated differences in how the thinking and non-thinking model variants were trained or prompted.
What would settle it
Train or fine-tune otherwise identical base models with and without an explicit reasoning objective, then run both on the same RewardBench and bias suites to check whether the accuracy and robustness gaps remain.
Original abstract
As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of "thinking" and "non-thinking" LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Our results show that thinking models achieve approximately 10% points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our work results in several important findings that provide systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.
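Of the augmentation strategies named in the abstract, n-best aggregation is the easiest to illustrate. A minimal sketch, assuming it means sampling several verdicts from the same judge and taking a majority vote (the abstract does not spell out the aggregation rule, so the function name and tie handling here are assumptions):

```python
from collections import Counter
from typing import Sequence


def aggregate_verdicts(verdicts: Sequence[str]) -> str:
    """Majority vote over sampled verdicts; returns "tie" if the top two counts match."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Hypothetical verdicts from five samples of the same non-thinking judge.
samples = ["A", "B", "A", "A", "B"]
print(aggregate_verdicts(samples))
```

Because each sample is a separate judge call, the cost of this strategy grows roughly linearly with the number of samples. That makes its accuracy gain per unit of compute the relevant quantity to compare against a single thinking-mode call.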
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic empirical comparison of 'thinking' (explicit reasoning) versus 'non-thinking' variants of small open-source Qwen3 models (0.6B, 1.7B, 4B parameters) used as LLM judges. On RewardBench tasks, it reports that thinking models deliver approximately 10 percentage points higher accuracy with under 2x computational overhead, outperforming non-thinking models even after applying augmentation strategies such as few-shot learning (>8x cost), rubric-guided judging, reference-based evaluation, and n-best aggregation. Additional analyses show thinking models exhibit 6% higher average consistency under positional, bandwagon, identity, diversity, and random biases, with benefits extending to multilingual settings.
Significance. If the central comparison is fair and the quantitative results are reproducible, the work supplies concrete evidence that explicit reasoning steps at inference time can improve accuracy, efficiency, and robustness in the LLM-as-a-judge setting. This is relevant for reward modeling and automated evaluation pipelines. The evaluation across three model scales, multiple augmentation baselines, bias conditions, and a multilingual extension constitutes a systematic contribution that could inform practical choices between reasoning-enabled and augmented non-reasoning judges.
major comments (2)
- The abstract and experimental claims attribute the ~10pp accuracy advantage and 6% robustness gain primarily to the presence of explicit reasoning at inference time. However, it is not stated whether the thinking and non-thinking Qwen3 variants (0.6B/1.7B/4B) share identical base pre-training and post-training or whether the thinking variants received additional reasoning-oriented fine-tuning or data. This distinction is load-bearing for the central claim, because any upstream differences would confound the comparison with augmented non-thinking baselines (few-shot, rubric, etc.).
- The abstract reports specific quantitative results (10-point accuracy gap, 6% robustness gain, cost multipliers under 2x vs. >8x) without reference to statistical tests, exact prompt templates, data splits, number of evaluation runs, or variance across random seeds. These details are necessary to establish that the reported differences reliably support the superiority claims rather than reflecting prompt sensitivity or single-run noise.
minor comments (1)
- The abstract contains a repeated sentence ('Our results show that...') that could be consolidated for conciseness.
Simulated Author's Rebuttal
We thank the referee for their insightful and constructive comments. We address each major comment below, providing clarifications and committing to revisions that strengthen the transparency and rigor of our claims without altering the core findings.
- Referee: The abstract and experimental claims attribute the ~10pp accuracy advantage and 6% robustness gain primarily to the presence of explicit reasoning at inference time. However, it is not stated whether the thinking and non-thinking Qwen3 variants (0.6B/1.7B/4B) share identical base pre-training and post-training or whether the thinking variants received additional reasoning-oriented fine-tuning or data. This distinction is load-bearing for the central claim, because any upstream differences would confound the comparison with augmented non-thinking baselines (few-shot, rubric, etc.).
Authors: We thank the referee for identifying this important point of clarification. The Qwen3 thinking and non-thinking variants share the same base pre-training data and model architecture as released in the official Qwen3 series. The primary difference lies in post-training: the thinking variants receive additional supervised fine-tuning on reasoning traces and chain-of-thought data to enable explicit reasoning at inference time, while the non-thinking variants use the base post-training without this emphasis. This setup allows us to isolate the effect of explicit reasoning steps during judging. To address the concern, we will revise the manuscript by adding a new paragraph in Section 3 (Experimental Setup) that explicitly describes the shared pre-training, the post-training distinctions with citations to the Qwen3 technical report, and how this relates to the augmentation baselines. This ensures the central claim is not confounded.
Revision: yes
- Referee: The abstract reports specific quantitative results (10-point accuracy gap, 6% robustness gain, cost multipliers under 2x vs. >8x) without reference to statistical tests, exact prompt templates, data splits, number of evaluation runs, or variance across random seeds. These details are necessary to establish that the reported differences reliably support the superiority claims rather than reflecting prompt sensitivity or single-run noise.
Authors: We agree that referencing these methodological details is essential for supporting the quantitative claims. The full manuscript already provides the exact prompt templates (including thinking and non-thinking variants) in Appendix A, describes the RewardBench data splits and evaluation protocol in Section 3.1, and notes that main results are averaged over three independent runs with different random seeds. However, we did not include formal statistical tests or explicit variance reporting in the abstract or main results tables. In the revision, we will (1) add cross-references in the abstract to the relevant sections and appendix, (2) report standard deviations alongside the mean accuracies and robustness scores, and (3) add statistical significance tests (e.g., paired t-tests or McNemar’s test) for the key comparisons between thinking and augmented non-thinking models. These additions will be placed in Section 4 and a new subsection on statistical analysis.
Revision: yes
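For readers unfamiliar with McNemar's test, which the rebuttal proposes for the key comparisons, here is a minimal sketch of the exact (binomial) version applied to two judges scored on the same items. It looks only at the items where exactly one judge is correct. The function name and the correctness vectors are hypothetical, and nothing here reflects the paper's actual data.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-item correctness.

    Returns (b, c, p_value), where b counts items only judge A got right
    and c counts items only judge B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both judges must be scored on the same items")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes on 10 items where the two judges disagree.
thinking_correct = [True] * 8 + [False] * 2
baseline_correct = [False] * 8 + [True] * 2
b, c, p_value = mcnemar_exact(thinking_correct, baseline_correct)
print(b, c, p_value)
```

A paired test of this kind fits the setting because both judges see the same RewardBench items, so per-item agreement carries information that an unpaired comparison of two accuracy figures would discard.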
Circularity Check
No circularity in empirical comparison of thinking vs non-thinking judges
This paper reports direct experimental results from evaluating open-source Qwen3 thinking and non-thinking model variants on RewardBench and related tasks for accuracy, FLOPs efficiency, augmentation strategies, and bias robustness. There are no mathematical derivations, equations, fitted parameters, or first-principles claims that reduce reported outcomes to quantities defined by the experiment itself. Claims rest on standard benchmark measurements and comparisons to public baselines rather than self-definitional loops, self-citation load-bearing uniqueness theorems, or ansatzes smuggled via prior work. The study is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Small Qwen 3 models can be reliably configured to produce explicit reasoning traces or direct judgments.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "thinking models achieve approximately 10% points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Lost in Translation: Do LVLM Judges Generalize Across Languages?
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.