pith. machine review for the scientific record.

arxiv: 2412.05579 · v2 · submitted 2024-12-07 · 💻 cs.CL · cs.IR

Recognition: 2 theorem links


LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 23:03 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords LLMs-as-judges · LLM evaluation · AI assessment methods · evaluation survey · meta-evaluation · LLM limitations · natural language evaluation
0 comments

The pith

Large language models can evaluate AI outputs by generating natural language judgments that generalize across tasks and offer built-in explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey defines the LLMs-as-judges paradigm and organizes the literature around five perspectives: why these models work as evaluators, how to construct them into evaluation systems, where they apply, how to test their reliability, and what limits their use. It highlights their advantages in effectiveness, task flexibility, and interpretability compared with rigid metrics or full human review. The structure helps researchers see patterns in prompting techniques, domain uses, and bias checks. A reader cares because the map reduces the effort needed to adopt or improve LLM-based evaluation in both academic and practical settings.
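
As a concrete illustration of the paradigm the survey maps, the sketch below shows a single-call LLM judge: the judge model receives the task, a candidate response, and a rubric, and returns a short natural-language rationale plus a score. It assumes an OpenAI-compatible chat endpoint; the rubric, the 1-5 scale, and the model name are illustrative placeholders, not choices made by the survey.

```python
# Minimal LLMs-as-judges sketch: a judge model scores one candidate response.
# The rubric, model name, and 1-5 scale are illustrative assumptions.
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint is configured

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Task given to the system under test:
{task}

Candidate response:
{response}

Rate the response for factual accuracy and helpfulness on a 1-5 scale.
Reply as JSON: {{"score": <int 1-5>, "rationale": "<one short paragraph>"}}"""

def judge(task: str, response: str, model: str = "gpt-4o-mini") -> dict:
    """Return the judge's numeric score and free-text rationale for one response."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep judging as deterministic as the API allows
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, response=response)}],
    )
    return json.loads(completion.choices[0].message.content)

verdict = judge("Summarize the causes of the 2008 financial crisis.",
                "It was caused entirely by a single bank failing in 2010.")
print(verdict["score"], "-", verdict["rationale"])
```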

Core claim

The paper claims that LLMs-as-judges constitute a distinct evaluation approach in which models assess natural-language outputs through reasoning expressed in text, and it synthesizes the field by defining the paradigm, covering its functionality, detailing construction methods, surveying applications, presenting meta-evaluation techniques, and cataloging limitations to guide future development.

What carries the argument

The five-perspective framework (Functionality, Methodology, Applications, Meta-evaluation, Limitations) that organizes all reviewed work on using LLMs to judge responses.
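
One minimal way to picture the framework is as a tagging scheme over the reviewed literature. The sketch below encodes the five perspectives and files two works that appear in the paper's reference list under them; the specific perspective assignments are illustrative guesses, not taken from the survey's own categorization.

```python
# Toy encoding of the survey's five organizing perspectives, used only to show
# how a reviewed paper might be filed under the taxonomy.
from dataclasses import dataclass, field
from enum import Enum

class Perspective(Enum):
    FUNCTIONALITY = "why use LLM judges"
    METHODOLOGY = "how to construct an evaluation system"
    APPLICATIONS = "where to apply LLM judges"
    META_EVALUATION = "how to evaluate the judges themselves"
    LIMITATIONS = "what restricts their use"

@dataclass
class SurveyedWork:
    title: str
    perspectives: list[Perspective] = field(default_factory=list)

corpus = [
    SurveyedWork("ChatEval: Towards better LLM-based evaluators through multi-agent debate",
                 [Perspective.METHODOLOGY]),
    SurveyedWork("Humans or LLMs as the judge? A study on judgement biases",
                 [Perspective.META_EVALUATION, Perspective.LIMITATIONS]),
]

# Group works by perspective, mirroring the survey's organization.
by_perspective = {p: [w.title for w in corpus if p in w.perspectives]
                  for p in Perspective}
print(by_perspective[Perspective.LIMITATIONS])
```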

Load-bearing premise

The reviewed papers and the five-perspective structure together capture the full range of LLM-based evaluation work without major omissions or selection bias.

What would settle it

A substantial new evaluation method or study that cannot be placed in any of the five perspectives, and that demonstrates the surveyed LLM judges perform worse than the survey indicates.
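
In practice, settling that second condition comes down to meta-evaluation: scoring the same outputs with human annotators and with the LLM judge, then measuring agreement. The sketch below computes Spearman correlation and pairwise preference agreement on placeholder data; the scores are fabricated inputs for illustration, and the two metrics are common choices rather than a protocol prescribed by the survey.

```python
# Minimal meta-evaluation sketch: compare an LLM judge's scores with human gold
# ratings on the same outputs. All score values below are placeholders.
from itertools import combinations
from scipy.stats import spearmanr

human_scores = [5, 4, 2, 3, 1, 4, 2]   # gold ratings from annotators (placeholder)
judge_scores = [5, 3, 2, 4, 1, 4, 3]   # LLM judge's ratings of the same outputs (placeholder)

rho, p_value = spearmanr(human_scores, judge_scores)

# Pairwise agreement: how often the judge orders two outputs the same way humans do,
# ignoring pairs tied on either side.
pairs = list(combinations(range(len(human_scores)), 2))
agree = sum(1 for i, j in pairs
            if (human_scores[i] - human_scores[j]) * (judge_scores[i] - judge_scores[j]) > 0)
ties = sum(1 for i, j in pairs
           if human_scores[i] == human_scores[j] or judge_scores[i] == judge_scores[j])
agreement = agree / (len(pairs) - ties) if len(pairs) > ties else float("nan")

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), pairwise agreement = {agreement:.2f}")
```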

read the original abstract

The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as ''LLMs-as-judges''. This framework has attracted growing attention from both academia and industry due to their excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim aims to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to deliver a comprehensive survey of the LLMs-as-judges paradigm, structured around five perspectives: Functionality (why use LLM judges), Methodology (how to construct evaluation systems with LLMs), Applications (where to apply them), Meta-evaluation (how to evaluate the judges), and Limitations. It begins with a systematic definition, analyzes the framework's effectiveness and interpretability, explores domains of use, discusses evaluation methods for the judges themselves, examines limitations, and outlines future directions, while committing to maintain an open GitHub resource list for ongoing updates.

Significance. If the coverage proves complete and unbiased, the survey would offer substantial value by imposing structure on a fast-growing area of LLM-based evaluation, helping researchers and practitioners navigate methodological choices and applications. The public GitHub resource list is a concrete strength that supports reproducibility and community maintenance. However, the significance is reduced because the unverifiable selection process prevents readers from confirming that the five-perspective taxonomy and cited works accurately reflect the current state of the field without major omissions.

major comments (1)
  1. [Abstract and Introduction] The manuscript asserts a 'comprehensive survey' and 'systematic definition' of LLMs-as-judges but contains no description of the literature search protocol (databases, keywords, date cutoffs, inclusion/exclusion criteria, or PRISMA-style reporting). This directly undermines the central claim of providing an unbiased synthesis across the five perspectives, as readers cannot assess completeness or selection bias.
minor comments (1)
  1. [Abstract] Typo in the final sentence: 'we aim aims to provide' should be corrected to 'we aim to provide'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. The concern about missing literature search details is valid and directly impacts the transparency of our 'comprehensive' claim. We address it below and will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract and Introduction] The manuscript asserts a 'comprehensive survey' and 'systematic definition' of LLMs-as-judges but contains no description of the literature search protocol (databases, keywords, date cutoffs, inclusion/exclusion criteria, or PRISMA-style reporting). This directly undermines the central claim of providing an unbiased synthesis across the five perspectives, as readers cannot assess completeness or selection bias.

    Authors: We agree that the absence of an explicit search protocol reduces the verifiability of our coverage and weakens the 'comprehensive' and 'systematic' assertions. The initial manuscript relied on a combination of targeted keyword searches across arXiv, ACL Anthology, and Google Scholar (focusing on terms such as 'LLM-as-judge', 'LLM evaluator', 'LLM-based evaluation' from 2023 onward), manual curation of high-impact works, and ongoing monitoring via the public GitHub repository, but these steps were not documented. In the revised version we will insert a dedicated subsection (likely in the Introduction or as a new 'Survey Methodology' paragraph) that reports: (1) databases and repositories searched, (2) exact keywords and Boolean combinations, (3) date cutoff (December 2024), (4) inclusion criteria (peer-reviewed or arXiv papers that propose or evaluate LLM judges for natural-language evaluation tasks), and (5) exclusion criteria (non-English works, purely theoretical papers without empirical evaluation components). We will also note that the fast-moving nature of the field precludes a fully exhaustive PRISMA-style flow diagram, but the added description plus the living GitHub list will allow readers to assess selection bias. This change strengthens rather than alters the five-perspective taxonomy and core analysis. Revision: yes.
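
For readers who want to reproduce such a protocol, the sketch below turns the rebuttal's ingredients (keyword queries, a December 2024 cutoff, an inclusion filter) into a runnable query against the public arXiv export API. The keyword list and cutoff follow the rebuttal's description; the filtering logic is a simple stand-in for the authors' actual curation, which also spanned the ACL Anthology and Google Scholar.

```python
# Sketch of a documented literature search: exact keyword queries, a date cutoff,
# and a simple inclusion filter, run against the public arXiv export API.
# The keywords and cutoff mirror the rebuttal; the filter is a stand-in.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

KEYWORDS = ['"LLM-as-a-judge"', '"LLM evaluator"', '"LLM-based evaluation"']
CUTOFF = datetime(2024, 12, 31, tzinfo=timezone.utc)  # date cutoff from the rebuttal

query = " OR ".join(f"all:{kw}" for kw in KEYWORDS)
url = ("http://export.arxiv.org/api/query?"
       + urllib.parse.urlencode({"search_query": query,
                                 "start": 0, "max_results": 200,
                                 "sortBy": "submittedDate", "sortOrder": "descending"}))

ATOM = "{http://www.w3.org/2005/Atom}"
with urllib.request.urlopen(url) as resp:
    feed = ET.fromstring(resp.read())

included = []
for entry in feed.findall(f"{ATOM}entry"):
    title = entry.findtext(f"{ATOM}title").strip()
    published = datetime.fromisoformat(
        entry.findtext(f"{ATOM}published").replace("Z", "+00:00"))
    if published <= CUTOFF:  # inclusion criterion: submitted before the cutoff
        included.append((published.date(), title))

for date, title in included[:10]:
    print(date, title)
```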

Circularity Check

0 steps flagged

No circularity: survey contains no derivations, predictions, or self-referential reductions

full rationale

This is a literature survey paper with no equations, fitted parameters, predictions, or first-principles derivations. The claimed structure (five perspectives: Functionality, Methodology, Applications, Meta-evaluation, Limitations) is an organizational framework chosen by the authors rather than a result derived from data or prior results within the paper. No step reduces to its own inputs by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and the central claim of providing a 'comprehensive survey' is an assertion of coverage rather than a mathematical or predictive output that could be circular. Absence of a search protocol affects verifiability of completeness but does not create circular reasoning in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey, the paper introduces no free parameters, mathematical axioms, or invented entities; it relies on standard definitions from the cited literature.

pith-pipeline@v0.9.0 · 5562 in / 1039 out tokens · 43132 ms · 2026-05-11T23:03:41.209353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Linked passage: "This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations."

  • Foundation.LawOfExistence defect_zero_iff_one (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Linked passage: "We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?)."

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

  2. Analysis and Explainability of LLMs Via Evolutionary Methods

    cs.NE 2026-04 unverdicted novelty 7.0

    Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

  3. Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework

    cs.CL 2026-04 unverdicted novelty 7.0

    UL-XCoT maintains competitive accuracy on multilingual benchmarks while cutting decoding tokens by over 50% through per-query language selection and logic-space trajectory pruning.

  4. Do AI Coding Agents Log Like Humans? An Empirical Study

    cs.SE 2026-04 unverdicted novelty 7.0

    AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans perfo...

  5. Generative Experiences for Digital Mental Health Interventions: Evidence from a Randomized Study

    cs.HC 2026-04 unverdicted novelty 7.0

    GUIDE instantiates a generative experience paradigm for DMH and significantly reduced stress (p=.02) while improving user experience (p=.04) versus LLM cognitive restructuring in a preregistered RCT (N=237).

  6. Generative Experiences for Digital Mental Health Interventions: Evidence from a Randomized Study

    cs.HC 2026-04 unverdicted novelty 7.0

    A generative system for digital mental health support dynamically assembles personalized content and multimodal interaction flows, producing lower stress and better user experience than a fixed LLM baseline in a prere...

  7. Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

  8. A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

    cs.LG 2026-05 unverdicted novelty 6.0

    LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

  9. K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

    cs.CL 2026-04 unverdicted novelty 6.0

    K-MetBench shows LLMs have large gaps in interpreting meteorology diagrams and Korean-specific context, with smaller local models beating much larger global ones.

  10. Evian: Towards Explainable Visual Instruction-tuning Data Auditing

    cs.CV 2026-04 unverdicted novelty 6.0

    EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.

  11. QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantumQA dataset and verification-aware RL with adaptive reward fusion enable an 8B LLM to achieve performance competitive with proprietary models on quantum mechanics tasks.

  12. How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs perform substantially better as pragmatic listeners judging language than as speakers generating it, revealing weak alignment between the two roles.

  13. MLLM-as-a-Judge Exhibits Model Preference Bias

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.

  14. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  15. ARuleCon: Agentic Security Rule Conversion

    cs.CR 2026-04 unverdicted novelty 6.0

    ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.

  16. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

    cs.AI 2026-04 unverdicted novelty 6.0

    Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

  17. When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.

  18. Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

    cs.RO 2026-05 unverdicted novelty 5.0

    Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.

  19. Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

    cs.RO 2026-05 unverdicted novelty 5.0

    LLM agents in a collaborative 2D game exhibit emergent behaviors such as perspective-taking, theory of mind, and clarification, detected by LLM judges and rated positively by human participants.

  20. Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Model

    cs.CL 2026-05 unverdicted novelty 5.0

    Expert re-annotations of a German ABSA dataset serve as ground truth to evaluate how students, crowdworkers, and LLMs affect inter-annotator agreement and downstream performance on ACSA and TASD tasks using BERT, T5, ...

  21. LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

  22. STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

    cs.AI 2026-04 unverdicted novelty 5.0

    STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.

  23. Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    PSA-Eval reframes evaluation of trilingual public-space agents around traceable failures and regression testing, revealing cross-language score drift in a pilot despite high average performance.

  24. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  25. Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders

    cs.CL 2026-04 accept novelty 5.0

    The authors propose a three-layer trust framework for AI mental health systems and review current evaluation practices to highlight gaps between technical metrics and clinical requirements.

  26. Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

    cs.CL 2026-04 conditional novelty 5.0

    Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text...

  27. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

  28. Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study

    cs.LG 2026-04 unverdicted novelty 5.0

    Transformer models detect applicant gender in de-gendered academic recommendation letters via implicit linguistic patterns such as associations with words like 'emotional' and 'humanitarian', and removing these cues r...

  29. Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

    cs.CL 2026-04 unverdicted novelty 5.0

    Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.

  30. Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Repeated generations of exercise prescriptions by an LLM showed high semantic consistency but notable variability in quantitative details such as exercise intensity.

  31. A Systematic Approach for Large Language Models Debugging

    cs.AI 2026-04 unverdicted novelty 4.0

    This paper proposes a structured methodology for debugging LLMs that integrates issue detection, diagnosis, prompt and parameter refinement, and data adaptation to improve reproducibility and transparency.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 29 Pith papers · 23 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Samee Arif, Sualeha Farid, Abdul Hameed Azeemi, Awais Athar, and Agha Ali Raza. 2024. The fellowship of the llms: Multi-agent workflows for synthetic preference optimization dataset generation. arXiv preprint arXiv:2408.08688 (2024)

  3. [3]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511 (2023)

  4. [4]

    Nabiha Asghar. 2016. Yelp dataset challenge: Review rating prediction. arXiv preprint arXiv:1605.05362 (2016)

  5. [5]

    Zahra Ashktorab, Michael Desmond, Qian Pan, James M Johnson, Martin Santillan Cooper, Elizabeth M Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, and Werner Geyer. 2024. Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences. arXiv preprint arXiv:2410.00873 (2024)

  6. [6]

    A Askell, Y Bai, A Chen, D Drain, D Ganguli, T Henighan, A Jones, N Joseph, B Mann, N DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv. Preprint posted online December 1 (2021)

  7. [7]

    Golnoosh Babaei and Paolo Giudici. 2024. GPT classifications, with application to credit lending. Machine Learning with Applications 16 (2024), 100534

  8. [8]

    Sher Badshah and Hassan Sajjad. 2024. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text. arXiv preprint arXiv:2408.09235 (2024)

  9. [9]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  10. [10]

    Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, et al. 2024. Benchmarking foundation models with language-model-as-an-examiner. Advances in Neural Information Processing Systems 36 (2024)

  11. [11]

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)

  12. [12]

    Chaithanya Bandi and Abir Harrasse. 2024. Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates. arXiv preprint arXiv:2410.04663 (2024)

  13. [13]

    John J Bartko. 1966. The intraclass correlation coefficient as a measure of reliability. Psychological reports 19, 1 (1966), 3–11

  14. [14]

    Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, et al. 2024. Llms instead of human judges? a large scale empirical study across 20 nlp evaluation tasks. arXiv preprint arXiv:2406.18403 (2024)

  15. [15]

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 38. 17682–17690

  16. [16]

    Niels J Blunch. 1984. Position bias in multiple-choice questions. Journal of Marketing Research 21, 2 (1984), 216–220

  17. [17]

    Nathan Brake and Thomas Schaaf. 2024. Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency? arXiv preprint arXiv:2404.06503 (2024)

  18. [18]

    Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. 2022. Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128 (2022)

  19. [19]

    Jonathon D Brown. 1986. Evaluations of self and others: Self-enhancement biases in social judgments. Social cognition 4, 4 (1986), 353–376

  20. [20]

    Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, and Kai Chen. 2024. CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution. arXiv preprint arXiv:2410.16256 (2024)

  21. [21]

    Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, and Lei Meng. 2024. Enhancing Reinforcement Learning with Dense Rewards from Language Model Critic. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 9119–9138

  22. [22]

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 (2023)

  23. [23]

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45

  24. [24]

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. 2024. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. arXiv preprint arXiv:2402.04788 (2024)

  25. [25]

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669 (2024)

  26. [26]

    Hong Chen, Duc Minh Vo, Hiroya Takamura, Yusuke Miyao, and Hideki Nakayama. 2023. StoryER: Automatic story evaluation via ranking, rating and reasoning. Journal of Natural Language Processing 30, 1 (2023), 243–249

  27. [27]

    Junjie Chen, Weihang Su, Zhumin Chu, Haitao Li, Qinyao Ai, Yiqun Liu, Min Zhang, and Shaoping Ma. 2024. An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation. arXiv:2410.12265 [cs.CL] https://arxiv.org/abs/2410.12265

  28. [28]

    Jiefeng Chen, Jinsung Yoon, Sayna Ebrahimi, Sercan O Arik, Tomas Pfister, and Somesh Jha. 2023. Adaptation with self-evaluation to improve selective prediction in llms. arXiv preprint arXiv:2310.11689 (2023)

  29. [29]

    Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. 2024. Automated evaluation of large vision-language models on self-driving corner cases. arXiv preprint arXiv:2404.10595 (2024)

  30. [30]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  31. [31]

    Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2024. Internet of agents: Weaving a web of heterogeneous agents for collaborative intelligence. arXiv preprint arXiv:2407.07061 (2024)

  32. [32]

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023)

  33. [33]

    Yen-Shan Chen, Jing Jin, Peng-Ting Kuo, Chao-Wei Huang, and Yun-Nung Chen. 2024. LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation. arXiv preprint arXiv:2410.20833 (2024)

  34. [34]

    Cyril Chhun, Pierre Colombo, Chloé Clavel, and Fabian M Suchanek. 2022. Of human criteria and automatic metrics: A benchmark of the evaluation of story generation. arXiv preprint arXiv:2208.11646 (2022)

  35. [35]

    Cheng-Han Chiang, Wei-Chih Chen, Chun-Yi Kuan, Chienchou Yang, and Hung-yi Lee. 2024. Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course. arXiv preprint arXiv:2407.05216 (2024)

  36. [36]

    Cheng-Han Chiang and Hung-yi Lee. 2023. A closer look into automatic evaluation using large language models. arXiv preprint arXiv:2310.05657 (2023)

  37. [37]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) 2, 3 (2023), 6

  38. [38]

    Juhwan Choi, Jungmin Yun, Kyohoon Jin, and YoungBin Kim. 2024. Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation. arXiv preprint arXiv:2404.09682 (2024)

  39. [39]

    Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, and Yiqun Liu. 2024. Pre: A peer review based large language model evaluator. arXiv preprint arXiv:2401.15641 (2024)

  40. [40]

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research 25, 70 (2024), 1–53

  41. [41]

    Israel Cohen, Yiteng Huang, Jingdong Chen, Jacob Benesty, Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. Noise reduction in speech processing (2009), 1–4

  42. [42]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin. 2021. Overview of the TREC 2021 Deep Learning Track. In TREC. https://trec.nist.gov/pubs/trec30/papers/Overview-DL.pdf

  43. [43]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 Deep Learning Track. In TREC. https://trec.nist.gov/pubs/trec31/papers/Overview_deep.pdf

  44. [44]

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. 2024. ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback. In Forty-first International Conference on Machine Learning

  45. [45]

    Roland Daynauth and Jason Mars. 2024. Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments. arXiv preprint arXiv:2407.12847 (2024)

  46. [46]

    Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, and Yapeng Tian. 2024. Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach. arXiv preprint arXiv:2411.17760 (2024)

  47. [47]

    Mahesh Deshwal and Apoorva Chawla. 2024. PHUDGE: Phi-3 as Scalable Judge. arXiv preprint arXiv:2405.08029 (2024)

  48. [48]

    Laurence Dierickx, Arjen Van Dalen, Andreas L Opdahl, and Carl-Gustav Lindén. 2024. Striking the balance in using LLMs for fact-checking: A narrative literature review. In Multidisciplinary International Symposium on Disinformation in Open Online Media . Springer, 1–15

  49. [49]

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233 (2023)

  50. [50]

    Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2024. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2024)

  51. [51]

    Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, and Mitesh M Khapra. 2024. Finding Blind Spots in Evaluator LLMs with Interpretable Checklists. arXiv preprint arXiv:2406.13439 (2024)

  52. [52]

    Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei. 2024. Self-Boosting Large Language Models with Synthetic Preference Data. arXiv preprint arXiv:2410.06961 (2024)

  53. [53]

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)

  54. [54]

    Yijiang River Dong, Tiancheng Hu, and Nigel Collier. 2024. Can LLM be a Personalized Judge? arXiv preprint arXiv:2406.11657 (2024)

  55. [55]

    Florian E. Dorner, Vivian Y. Nastl, and Moritz Hardt. 2024. Limits to scalable evaluation at the frontier: LLM as Judge won’t beat twice the data. arXiv:2410.13341 [cs.LG] https://arxiv.org/abs/2410.13341

  56. [56]

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475 (2024)

  57. [57]

    Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751 (2017)

  58. [58]

    Aparna Elangovan, Jongwoo Ko, Lei Xu, Mahsa Elyasi, Ling Liu, Sravan Bodapati, and Dan Roth. 2024. Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge. arXiv preprint arXiv:2410.03775 (2024)

  59. [59]

    Alexander R Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics 9 (2021), 391–409

  60. [60]

    Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833 (2018)

  61. [61]

    Zhiting Fan, Ruizhe Chen, Ruiling Xu, and Zuozhu Liu. 2024. Biasalert: A plug-and-play tool for social bias detection in llms. arXiv preprint arXiv:2407.10241 (2024)

  62. [62]

    Yu Fei, Yifan Hou, Zeming Chen, and Antoine Bosselut. 2023. Mitigating label biases for in-context learning. arXiv preprint arXiv:2305.19148 (2023)

  63. [63]

    Chao Feng, Xinyu Zhang, and Zichu Fei. 2023. Knowledge solver: Teaching llms to search for domain knowledge from knowledge graphs. arXiv preprint arXiv:2309.03118 (2023)

  64. [64]

    Zhaopeng Feng, Yan Zhang, Hao Li, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, and Zuozhu Liu. 2024. Improving llm-based machine translation with systematic self-correction. arXiv preprint arXiv:2402.16379 (2024)

  65. [65]

    Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics 9 (2021), 1460–1474

  66. [66]

    Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation. 733–774

  68. [68]

    Robert M French. 2000. The Turing Test: the first 50 years. Trends in cognitive sciences 4, 3 (2000), 115–122

  69. [69]

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166 (2023)

  70. [70]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)

  71. [71]

    Yicheng Gao, Gonghan Xu, Zhe Wang, and Arman Cohan. 2024. Bayesian Calibration of Win Rate Estimation with LLM Evaluators. arXiv preprint arXiv:2411.04424 (2024)

  72. [72]

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120, 30 (2023), e2305016120

  73. [73]

    Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tur. 2023. Topical-chat: Towards knowledge-grounded open-domain conversations. arXiv preprint arXiv:2308.11995 (2023)

  74. [74]

    Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, and Minlie Huang. 2021. OpenMEVA: A benchmark for evaluating open-ended story generation metrics. arXiv preprint arXiv:2105.08920 (2021)

  76. [76]

    Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. 2024. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792 (2024)

  77. [77]

    Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong, et al. 2023. Evaluating large language models: A comprehensive survey. arXiv preprint arXiv:2310.19736 (2023)

  78. [78]

    Taneesh Gupta, Shivam Shandilya, Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Huaxiu Yao, and Saravan Rajmohan. 2024. Unveiling Context-Aware Criteria in Self-Assessing LLMs. arXiv preprint arXiv:2410.21545 (2024)

  80. [80]

    Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. 2023. Are large language model-based evaluators the solution to scaling up multilingual evaluation? arXiv preprint arXiv:2309.07462 (2023)

Showing first 80 references.