pith. machine review for the scientific record.

arxiv: 2412.05579 · v2 · submitted 2024-12-07 · 💻 cs.CL · cs.IR

Recognition: 2 theorem links


LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 23:03 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords LLMs-as-judges · LLM evaluation · AI assessment methods · evaluation survey · meta-evaluation · LLM limitations · natural language evaluation
0 comments

The pith

Large language models can evaluate AI outputs by generating natural language judgments that generalize across tasks and offer built-in explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey defines the LLMs-as-judges paradigm and organizes the literature around five perspectives: why these models work as evaluators, how to construct them into evaluation systems, where they apply, how to test their reliability, and what limits their use. It highlights their advantages in effectiveness, task flexibility, and interpretability compared with rigid metrics or full human review. The structure helps researchers see patterns in prompting techniques, domain uses, and bias checks. A reader cares because the map reduces the effort needed to adopt or improve LLM-based evaluation in both academic and practical settings.
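
As a concrete illustration of the paradigm the survey maps, the sketch below shows a single-call LLM judge: the judge model receives the task, a candidate response, and a rubric, and returns a short natural-language rationale plus a score. It assumes an OpenAI-compatible chat endpoint; the rubric, the 1-5 scale, and the model name are illustrative placeholders, not choices made by the survey.

```python
# Minimal LLMs-as-judges sketch: a judge model scores one candidate response.
# The rubric, model name, and 1-5 scale are illustrative assumptions.
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint is configured

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Task given to the system under test:
{task}

Candidate response:
{response}

Rate the response for factual accuracy and helpfulness on a 1-5 scale.
Reply as JSON: {{"score": <int 1-5>, "rationale": "<one short paragraph>"}}"""

def judge(task: str, response: str, model: str = "gpt-4o-mini") -> dict:
    """Return the judge's numeric score and free-text rationale for one response."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep judging as deterministic as the API allows
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, response=response)}],
    )
    return json.loads(completion.choices[0].message.content)

verdict = judge("Summarize the causes of the 2008 financial crisis.",
                "It was caused entirely by a single bank failing in 2010.")
print(verdict["score"], "-", verdict["rationale"])
```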

Core claim

The paper claims that LLMs-as-judges constitute a distinct evaluation approach in which models assess natural-language outputs through reasoning expressed in text, and it synthesizes the field by defining the paradigm, covering its functionality, detailing construction methods, surveying applications, presenting meta-evaluation techniques, and cataloging limitations to guide future development.

What carries the argument

The five-perspective framework (Functionality, Methodology, Applications, Meta-evaluation, Limitations) that organizes all reviewed work on using LLMs to judge responses.
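
One minimal way to picture the framework is as a tagging scheme over the reviewed literature. The sketch below encodes the five perspectives and files two works that appear in the paper's reference list under them; the specific perspective assignments are illustrative guesses, not taken from the survey's own categorization.

```python
# Toy encoding of the survey's five organizing perspectives, used only to show
# how a reviewed paper might be filed under the taxonomy.
from dataclasses import dataclass, field
from enum import Enum

class Perspective(Enum):
    FUNCTIONALITY = "why use LLM judges"
    METHODOLOGY = "how to construct an evaluation system"
    APPLICATIONS = "where to apply LLM judges"
    META_EVALUATION = "how to evaluate the judges themselves"
    LIMITATIONS = "what restricts their use"

@dataclass
class SurveyedWork:
    title: str
    perspectives: list[Perspective] = field(default_factory=list)

corpus = [
    SurveyedWork("ChatEval: Towards better LLM-based evaluators through multi-agent debate",
                 [Perspective.METHODOLOGY]),
    SurveyedWork("Humans or LLMs as the judge? A study on judgement biases",
                 [Perspective.META_EVALUATION, Perspective.LIMITATIONS]),
]

# Group works by perspective, mirroring the survey's organization.
by_perspective = {p: [w.title for w in corpus if p in w.perspectives]
                  for p in Perspective}
print(by_perspective[Perspective.LIMITATIONS])
```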

Load-bearing premise

The reviewed papers and the five-perspective structure together capture the full range of LLM-based evaluation work without major omissions or selection bias.

What would settle it

A substantial new evaluation method or study that cannot be placed in any of the five perspectives, and that demonstrates the surveyed LLM judges perform worse than the survey indicates.
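
In practice, settling that second condition comes down to meta-evaluation: scoring the same outputs with human annotators and with the LLM judge, then measuring agreement. The sketch below computes Spearman correlation and pairwise preference agreement on placeholder data; the scores are fabricated inputs for illustration, and the two metrics are common choices rather than a protocol prescribed by the survey.

```python
# Minimal meta-evaluation sketch: compare an LLM judge's scores with human gold
# ratings on the same outputs. All score values below are placeholders.
from itertools import combinations
from scipy.stats import spearmanr

human_scores = [5, 4, 2, 3, 1, 4, 2]   # gold ratings from annotators (placeholder)
judge_scores = [5, 3, 2, 4, 1, 4, 3]   # LLM judge's ratings of the same outputs (placeholder)

rho, p_value = spearmanr(human_scores, judge_scores)

# Pairwise agreement: how often the judge orders two outputs the same way humans do,
# ignoring pairs tied on either side.
pairs = list(combinations(range(len(human_scores)), 2))
agree = sum(1 for i, j in pairs
            if (human_scores[i] - human_scores[j]) * (judge_scores[i] - judge_scores[j]) > 0)
ties = sum(1 for i, j in pairs
           if human_scores[i] == human_scores[j] or judge_scores[i] == judge_scores[j])
agreement = agree / (len(pairs) - ties) if len(pairs) > ties else float("nan")

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), pairwise agreement = {agreement:.2f}")
```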

read the original abstract

The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as ''LLMs-as-judges''. This framework has attracted growing attention from both academia and industry due to their excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim aims to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to deliver a comprehensive survey of the LLMs-as-judges paradigm, structured around five perspectives: Functionality (why use LLM judges), Methodology (how to construct evaluation systems with LLMs), Applications (where to apply them), Meta-evaluation (how to evaluate the judges), and Limitations. It begins with a systematic definition, analyzes the framework's effectiveness and interpretability, explores domains of use, discusses evaluation methods for the judges themselves, examines limitations, and outlines future directions, while committing to maintain an open GitHub resource list for ongoing updates.

Significance. If the coverage proves complete and unbiased, the survey would offer substantial value by imposing structure on a fast-growing area of LLM-based evaluation, helping researchers and practitioners navigate methodological choices and applications. The public GitHub resource list is a concrete strength that supports reproducibility and community maintenance. However, the significance is reduced because the unverifiable selection process prevents readers from confirming that the five-perspective taxonomy and cited works accurately reflect the current state of the field without major omissions.

major comments (1)
  1. [Abstract and Introduction] The manuscript asserts a 'comprehensive survey' and 'systematic definition' of LLMs-as-judges but contains no description of the literature search protocol (databases, keywords, date cutoffs, inclusion/exclusion criteria, or PRISMA-style reporting). This directly undermines the central claim of providing an unbiased synthesis across the five perspectives, as readers cannot assess completeness or selection bias.
minor comments (1)
  1. [Abstract] Typo in the final sentence: 'we aim aims to provide' should be corrected to 'we aim to provide'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. The concern about missing literature search details is valid and directly impacts the transparency of our 'comprehensive' claim. We address it below and will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract and Introduction] The manuscript asserts a 'comprehensive survey' and 'systematic definition' of LLMs-as-judges but contains no description of the literature search protocol (databases, keywords, date cutoffs, inclusion/exclusion criteria, or PRISMA-style reporting). This directly undermines the central claim of providing an unbiased synthesis across the five perspectives, as readers cannot assess completeness or selection bias.

    Authors: We agree that the absence of an explicit search protocol reduces the verifiability of our coverage and weakens the 'comprehensive' and 'systematic' assertions. The initial manuscript relied on a combination of targeted keyword searches across arXiv, ACL Anthology, and Google Scholar (focusing on terms such as 'LLM-as-judge', 'LLM evaluator', 'LLM-based evaluation' from 2023 onward), manual curation of high-impact works, and ongoing monitoring via the public GitHub repository, but these steps were not documented. In the revised version we will insert a dedicated subsection (likely in the Introduction or as a new 'Survey Methodology' paragraph) that reports: (1) databases and repositories searched, (2) exact keywords and Boolean combinations, (3) date cutoff (December 2024), (4) inclusion criteria (peer-reviewed or arXiv papers that propose or evaluate LLM judges for natural-language evaluation tasks), and (5) exclusion criteria (non-English works, purely theoretical papers without empirical evaluation components). We will also note that the fast-moving nature of the field precludes a fully exhaustive PRISMA-style flow diagram, but the added description plus the living GitHub list will allow readers to assess selection bias. This change strengthens rather than alters the five-perspective taxonomy and core analysis. Revision: yes.
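
For readers who want to reproduce such a protocol, the sketch below turns the rebuttal's ingredients (keyword queries, a December 2024 cutoff, an inclusion filter) into a runnable query against the public arXiv export API. The keyword list and cutoff follow the rebuttal's description; the filtering logic is a simple stand-in for the authors' actual curation, which also spanned the ACL Anthology and Google Scholar.

```python
# Sketch of a documented literature search: exact keyword queries, a date cutoff,
# and a simple inclusion filter, run against the public arXiv export API.
# The keywords and cutoff mirror the rebuttal; the filter is a stand-in.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

KEYWORDS = ['"LLM-as-a-judge"', '"LLM evaluator"', '"LLM-based evaluation"']
CUTOFF = datetime(2024, 12, 31, tzinfo=timezone.utc)  # date cutoff from the rebuttal

query = " OR ".join(f"all:{kw}" for kw in KEYWORDS)
url = ("http://export.arxiv.org/api/query?"
       + urllib.parse.urlencode({"search_query": query,
                                 "start": 0, "max_results": 200,
                                 "sortBy": "submittedDate", "sortOrder": "descending"}))

ATOM = "{http://www.w3.org/2005/Atom}"
with urllib.request.urlopen(url) as resp:
    feed = ET.fromstring(resp.read())

included = []
for entry in feed.findall(f"{ATOM}entry"):
    title = entry.findtext(f"{ATOM}title").strip()
    published = datetime.fromisoformat(
        entry.findtext(f"{ATOM}published").replace("Z", "+00:00"))
    if published <= CUTOFF:  # inclusion criterion: submitted before the cutoff
        included.append((published.date(), title))

for date, title in included[:10]:
    print(date, title)
```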

Circularity Check

0 steps flagged

No circularity: survey contains no derivations, predictions, or self-referential reductions

full rationale

This is a literature survey paper with no equations, fitted parameters, predictions, or first-principles derivations. The claimed structure (five perspectives: Functionality, Methodology, Applications, Meta-evaluation, Limitations) is an organizational framework chosen by the authors rather than a result derived from data or prior results within the paper. No step reduces to its own inputs by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and the central claim of providing a 'comprehensive survey' is an assertion of coverage rather than a mathematical or predictive output that could be circular. Absence of a search protocol affects verifiability of completeness but does not create circular reasoning in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey, the paper introduces no free parameters, mathematical axioms, or invented entities; it relies on standard definitions from the cited literature.

pith-pipeline@v0.9.0 · 5562 in / 1039 out tokens · 43132 ms · 2026-05-11T23:03:41.209353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Linked passage: "This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations."

  • Foundation.LawOfExistence defect_zero_iff_one (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Linked passage: "We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?)."

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

  2. Analysis and Explainability of LLMs Via Evolutionary Methods

    cs.NE 2026-04 unverdicted novelty 7.0

    Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

  3. Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework

    cs.CL 2026-04 unverdicted novelty 7.0

    UL-XCoT maintains competitive accuracy on multilingual benchmarks while cutting decoding tokens by over 50% through per-query language selection and logic-space trajectory pruning.

  4. Do AI Coding Agents Log Like Humans? An Empirical Study

    cs.SE 2026-04 unverdicted novelty 7.0

    AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans perfo...

  5. Generative Experiences for Digital Mental Health Interventions: Evidence from a Randomized Study

    cs.HC 2026-04 unverdicted novelty 7.0

    GUIDE instantiates a generative experience paradigm for DMH and significantly reduced stress (p=.02) while improving user experience (p=.04) versus LLM cognitive restructuring in a preregistered RCT (N=237).

  6. Generative Experiences for Digital Mental Health Interventions: Evidence from a Randomized Study

    cs.HC 2026-04 unverdicted novelty 7.0

    A generative system for digital mental health support dynamically assembles personalized content and multimodal interaction flows, producing lower stress and better user experience than a fixed LLM baseline in a prere...

  7. Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

  8. A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

    cs.LG 2026-05 unverdicted novelty 6.0

    LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

  9. K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

    cs.CL 2026-04 unverdicted novelty 6.0

    K-MetBench shows LLMs have large gaps in interpreting meteorology diagrams and Korean-specific context, with smaller local models beating much larger global ones.

  10. Evian: Towards Explainable Visual Instruction-tuning Data Auditing

    cs.CV 2026-04 unverdicted novelty 6.0

    EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.

  11. QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantumQA dataset and verification-aware RL with adaptive reward fusion enable an 8B LLM to achieve performance competitive with proprietary models on quantum mechanics tasks.

  12. How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs perform substantially better as pragmatic listeners judging language than as speakers generating it, revealing weak alignment between the two roles.

  13. MLLM-as-a-Judge Exhibits Model Preference Bias

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.

  14. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  15. ARuleCon: Agentic Security Rule Conversion

    cs.CR 2026-04 unverdicted novelty 6.0

    ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.

  16. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

    cs.AI 2026-04 unverdicted novelty 6.0

    Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

  17. When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.

  18. Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

    cs.RO 2026-05 unverdicted novelty 5.0

    Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.

  19. Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

    cs.RO 2026-05 unverdicted novelty 5.0

    LLM agents in a collaborative 2D game exhibit emergent behaviors such as perspective-taking, theory of mind, and clarification, detected by LLM judges and rated positively by human participants.

  20. Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Model

    cs.CL 2026-05 unverdicted novelty 5.0

    Expert re-annotations of a German ABSA dataset serve as ground truth to evaluate how students, crowdworkers, and LLMs affect inter-annotator agreement and downstream performance on ACSA and TASD tasks using BERT, T5, ...

  21. LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

  22. STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

    cs.AI 2026-04 unverdicted novelty 5.0

    STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.

  23. Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    PSA-Eval reframes evaluation of trilingual public-space agents around traceable failures and regression testing, revealing cross-language score drift in a pilot despite high average performance.

  24. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  25. Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders

    cs.CL 2026-04 accept novelty 5.0

    The authors propose a three-layer trust framework for AI mental health systems and review current evaluation practices to highlight gaps between technical metrics and clinical requirements.

  26. Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

    cs.CL 2026-04 conditional novelty 5.0

    Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text...

  27. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

  28. Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study

    cs.LG 2026-04 unverdicted novelty 5.0

    Transformer models detect applicant gender in de-gendered academic recommendation letters via implicit linguistic patterns such as associations with words like 'emotional' and 'humanitarian', and removing these cues r...

  29. Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

    cs.CL 2026-04 unverdicted novelty 5.0

    Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.

  30. Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Repeated generations of exercise prescriptions by an LLM showed high semantic consistency but notable variability in quantitative details such as exercise intensity.

  31. A Systematic Approach for Large Language Models Debugging

    cs.AI 2026-04 unverdicted novelty 4.0

    This paper proposes a structured methodology for debugging LLMs that integrates issue detection, diagnosis, prompt and parameter refinement, and data adaptation to improve reproducibility and transparency.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 29 Pith papers · 23 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Samee Arif, Sualeha Farid, Abdul Hameed Azeemi, Awais Athar, and Agha Ali Raza. 2024. The fellowship of the llms: Multi-agent workflows for synthetic preference optimization dataset generation. arXiv preprint arXiv:2408.08688 (2024)

  3. [3]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511 (2023)

  4. [4]

    Nabiha Asghar. 2016. Yelp dataset challenge: Review rating prediction. arXiv preprint arXiv:1605.05362 (2016)

  5. [5]

    Zahra Ashktorab, Michael Desmond, Qian Pan, James M Johnson, Martin Santillan Cooper, Elizabeth M Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, and Werner Geyer. 2024. Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences. arXiv preprint arXiv:2410.00873 (2024)

  6. [6]

    A Askell, Y Bai, A Chen, D Drain, D Ganguli, T Henighan, A Jones, N Joseph, B Mann, N DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv. Preprint posted online December 1 (2021)

  7. [7]

    Golnoosh Babaei and Paolo Giudici. 2024. GPT classifications, with application to credit lending. Machine Learning with Applications 16 (2024), 100534

  8. [8]

    Sher Badshah and Hassan Sajjad. 2024. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text. arXiv preprint arXiv:2408.09235 (2024)

  9. [9]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  10. [10]

    Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, et al. 2024. Benchmarking foundation models with language-model-as-an-examiner. Advances in Neural Information Processing Systems 36 (2024)

  11. [11]

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)

  12. [12]

    Chaithanya Bandi and Abir Harrasse. 2024. Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates. arXiv preprint arXiv:2410.04663 (2024)

  13. [13]

    John J Bartko. 1966. The intraclass correlation coefficient as a measure of reliability. Psychological reports 19, 1 (1966), 3–11

  14. [14]

    Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, et al. 2024. Llms instead of human judges? a large scale empirical study across 20 nlp evaluation tasks. arXiv preprint arXiv:2406.18403 (2024)

  15. [15]

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 38. 17682–17690

  16. [16]

    Niels J Blunch. 1984. Position bias in multiple-choice questions. Journal of Marketing Research 21, 2 (1984), 216–220

  17. [17]

    Nathan Brake and Thomas Schaaf. 2024. Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency? arXiv preprint arXiv:2404.06503 (2024)

  18. [18]

    Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. 2022. Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128 (2022)

  19. [19]

    Jonathon D Brown. 1986. Evaluations of self and others: Self-enhancement biases in social judgments. Social cognition 4, 4 (1986), 353–376

  20. [20]

    Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, and Kai Chen. 2024. CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution. arXiv preprint arXiv:2410.16256 (2024)

  21. [21]

    Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, and Lei Meng. 2024. Enhancing Reinforcement Learning with Dense Rewards from Language Model Critic. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 9119–9138

  22. [22]

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 (2023)

  23. [23]

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45

  24. [24]

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. 2024. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. arXiv preprint arXiv:2402.04788 (2024)

  25. [25]

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669 (2024)

  26. [26]

    Hong Chen, Duc Minh Vo, Hiroya Takamura, Yusuke Miyao, and Hideki Nakayama. 2023. StoryER: Automatic story evaluation via ranking, rating and reasoning. Journal of Natural Language Processing 30, 1 (2023), 243–249

  27. [27]

    Junjie Chen, Weihang Su, Zhumin Chu, Haitao Li, Qinyao Ai, Yiqun Liu, Min Zhang, and Shaoping Ma. 2024. An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation. arXiv:2410.12265 [cs.CL] https://arxiv.org/abs/2410.12265

  28. [28]

    Jiefeng Chen, Jinsung Yoon, Sayna Ebrahimi, Sercan O Arik, Tomas Pfister, and Somesh Jha. 2023. Adaptation with self-evaluation to improve selective prediction in llms. arXiv preprint arXiv:2310.11689 (2023)

  29. [29]

    Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. 2024. Automated evaluation of large vision-language models on self-driving corner cases. arXiv preprint arXiv:2404.10595 (2024)

  30. [30]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  31. [31]

    Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2024. Internet of agents: Weaving a web of heterogeneous agents for collaborative intelligence. arXiv preprint arXiv:2407.07061 (2024)

  32. [32]

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023)

  33. [33]

    Yen-Shan Chen, Jing Jin, Peng-Ting Kuo, Chao-Wei Huang, and Yun-Nung Chen. 2024. LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation. arXiv preprint arXiv:2410.20833 (2024)

  34. [34]

    Cyril Chhun, Pierre Colombo, Chloé Clavel, and Fabian M Suchanek. 2022. Of human criteria and automatic metrics: A benchmark of the evaluation of story generation. arXiv preprint arXiv:2208.11646 (2022)

  35. [35]

    Cheng-Han Chiang, Wei-Chih Chen, Chun-Yi Kuan, Chienchou Yang, and Hung-yi Lee. 2024. Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course. arXiv preprint arXiv:2407.05216 (2024)

  36. [36]

    Cheng-Han Chiang and Hung-yi Lee. 2023. A closer look into automatic evaluation using large language models. arXiv preprint arXiv:2310.05657 (2023)

  37. [37]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) 2, 3 (2023), 6

  38. [38]

    Juhwan Choi, Jungmin Yun, Kyohoon Jin, and YoungBin Kim. 2024. Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation. arXiv preprint arXiv:2404.09682 (2024)

  39. [39]

    Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, and Yiqun Liu. 2024. Pre: A peer review based large language model evaluator. arXiv preprint arXiv:2401.15641 (2024)

  40. [40]

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research 25, 70 (2024), 1–53

  41. [41]

    Israel Cohen, Yiteng Huang, Jingdong Chen, Jacob Benesty, Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. Noise reduction in speech processing (2009), 1–4

  42. [42]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin. 2021. Overview of the TREC 2021 Deep Learning Track. In TREC. https://trec.nist.gov/pubs/trec30/papers/Overview-DL.pdf

  43. [43]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 Deep Learning Track. In TREC. https://trec.nist.gov/pubs/trec31/papers/Overview_deep.pdf

  44. [44]

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. 2024. ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback. In Forty-first International Conference on Machine Learning

  45. [45]

    Roland Daynauth and Jason Mars. 2024. Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments. arXiv preprint arXiv:2407.12847 (2024)

  46. [46]

    Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, and Yapeng Tian. 2024. Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach. arXiv preprint arXiv:2411.17760 (2024)

  47. [47]

    Mahesh Deshwal and Apoorva Chawla. 2024. PHUDGE: Phi-3 as Scalable Judge. arXiv preprint arXiv:2405.08029 (2024)

  48. [48]

    Laurence Dierickx, Arjen Van Dalen, Andreas L Opdahl, and Carl-Gustav Lindén. 2024. Striking the balance in using LLMs for fact-checking: A narrative literature review. In Multidisciplinary International Symposium on Disinformation in Open Online Media . Springer, 1–15

  49. [49]

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233 (2023)

  50. [50]

    Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2024. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2024)

  51. [51]

    Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, and Mitesh M Khapra. 2024. Finding Blind Spots in Evaluator LLMs with Interpretable Checklists. arXiv preprint arXiv:2406.13439 (2024)

  52. [52]

    Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei. 2024. Self-Boosting Large Language Models with Synthetic Preference Data. arXiv preprint arXiv:2410.06961 (2024)

  53. [53]

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)

  54. [54]

    Yijiang River Dong, Tiancheng Hu, and Nigel Collier. 2024. Can LLM be a Personalized Judge? arXiv preprint arXiv:2406.11657 (2024)

  55. [55]

    Florian E. Dorner, Vivian Y. Nastl, and Moritz Hardt. 2024. Limits to scalable evaluation at the frontier: LLM as Judge won’t beat twice the data. arXiv:2410.13341 [cs.LG] https://arxiv.org/abs/2410.13341

  56. [56]

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475 (2024)

  57. [57]

    Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751 (2017)

  58. [58]

    Aparna Elangovan, Jongwoo Ko, Lei Xu, Mahsa Elyasi, Ling Liu, Sravan Bodapati, and Dan Roth. 2024. Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge. arXiv preprint arXiv:2410.03775 (2024)

  59. [59]

    Alexander R Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics 9 (2021), 391–409

  60. [60]

    Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833 (2018)

  61. [61]

    Zhiting Fan, Ruizhe Chen, Ruiling Xu, and Zuozhu Liu. 2024. Biasalert: A plug-and-play tool for social bias detection in llms. arXiv preprint arXiv:2407.10241 (2024)

  62. [62]

    Yu Fei, Yifan Hou, Zeming Chen, and Antoine Bosselut. 2023. Mitigating label biases for in-context learning. arXiv preprint arXiv:2305.19148 (2023)

  63. [63]

    Chao Feng, Xinyu Zhang, and Zichu Fei. 2023. Knowledge solver: Teaching llms to search for domain knowledge from knowledge graphs. arXiv preprint arXiv:2309.03118 (2023)

  64. [64]

    Zhaopeng Feng, Yan Zhang, Hao Li, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, and Zuozhu Liu. 2024. Improving llm-based machine translation with systematic self-correction. arXiv preprint arXiv:2402.16379 (2024)

  65. [65]

    Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics 9 (2021), 1460–1474

  66. [66]

    Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation. 733–774

  68. [68]

    Robert M French. 2000. The Turing Test: the first 50 years. Trends in cognitive sciences 4, 3 (2000), 115–122

  69. [69]

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166 (2023)

  70. [70]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)

  71. [71]

    Yicheng Gao, Gonghan Xu, Zhe Wang, and Arman Cohan. 2024. Bayesian Calibration of Win Rate Estimation with LLM Evaluators. arXiv preprint arXiv:2411.04424 (2024)

  72. [72]

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120, 30 (2023), e2305016120

  73. [73]

    Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tur. 2023. Topical-chat: Towards knowledge-grounded open-domain conversations. arXiv preprint arXiv:2308.11995 (2023)

  74. [74]

    Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, and Minlie Huang. 2021. OpenMEVA: A benchmark for evaluating open-ended story generation metrics. arXiv preprint arXiv:2105.08920 (2021)

  76. [76]

    Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. 2024. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792 (2024)

  77. [77]

    Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong, et al. 2023. Evaluating large language models: A comprehensive survey. arXiv preprint arXiv:2310.19736 (2023)

  78. [78]

    Taneesh Gupta, Shivam Shandilya, Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Huaxiu Yao, and Saravan Rajmohan. 2024. Unveiling Context-Aware Criteria in Self-Assessing LLMs. arXiv preprint arXiv:2410.21545 (2024)

  80. [80]

    Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. 2023. Are large language model-based evaluators the solution to scaling up multilingual evaluation? arXiv preprint arXiv:2309.07462 (2023)

Showing first 80 references.