LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
Pith reviewed 2026-05-11 23:03 UTC · model grok-4.3
The pith
Large language models can evaluate AI outputs by generating natural language judgments that generalize across tasks and offer built-in explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that LLMs-as-judges constitute a distinct evaluation approach in which models assess natural-language outputs through reasoning expressed in text, and it synthesizes the field by defining the paradigm, covering its functionality, detailing construction methods, surveying applications, presenting meta-evaluation techniques, and cataloging limitations to guide future development.
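To make the paradigm concrete, here is a minimal pointwise-judging sketch; the `complete` callable stands in for whatever chat-completion API is in use, and the rubric, 1-5 scale, and "Score:" output format are illustrative choices rather than anything the survey prescribes.

```python
# Minimal single-response ("pointwise") LLM-as-judge sketch.
# `complete` is a placeholder for any text-completion call; the rubric,
# scale, and parsing below are illustrative, not the survey's method.
import re
from typing import Callable, Tuple

JUDGE_PROMPT = """You are an impartial evaluator.
Instruction: {instruction}
Response: {response}

Rate the response for helpfulness and factual accuracy on a 1-5 scale.
Give a one-sentence justification, then end with "Score: <n>"."""

def judge(instruction: str, response: str,
          complete: Callable[[str], str]) -> Tuple[int, str]:
    """Return (score, verdict): a numeric grade plus the natural-language judgment."""
    verdict = complete(JUDGE_PROMPT.format(instruction=instruction, response=response))
    match = re.search(r"Score:\s*([1-5])", verdict)
    score = int(match.group(1)) if match else 0  # 0 flags an unparseable verdict
    return score, verdict
```

Pairwise comparison, reference-guided grading, and multi-agent debate setups covered in the survey swap out the prompt and the parsing step but keep this same generate-then-parse loop, which is what gives the approach its built-in natural-language explanations.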
What carries the argument
The five-perspective framework (Functionality, Methodology, Applications, Meta-evaluation, Limitations) that organizes all reviewed work on using LLMs to judge responses.
Load-bearing premise
The reviewed papers and the five-perspective structure together capture the full range of LLM-based evaluation work without major omissions or selection bias.
What would settle it
A substantial new evaluation method or study that cannot be placed in any of the five perspectives and that demonstrates that the surveyed LLM judges perform worse than the survey indicates.
Original abstract
The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as ''LLMs-as-judges''. This framework has attracted growing attention from both academia and industry due to their excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim aims to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.
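The abstract's "How to evaluate LLM judges?" question is usually answered by measuring agreement between judge scores and human ratings; a minimal sketch with placeholder numbers (real meta-evaluation would use an annotated benchmark and typically several agreement measures, such as Spearman correlation or Cohen's kappa alongside Pearson's r):

```python
# Toy meta-evaluation: how well do judge scores track human ratings?
# The score lists are placeholders, not data from the survey.
from statistics import correlation  # Pearson's r, Python 3.10+

human_scores = [4, 2, 5, 3, 1, 4, 5, 2]
judge_scores = [5, 2, 4, 3, 2, 4, 5, 1]

pearson_r = correlation(human_scores, judge_scores)
exact_match = sum(h == j for h, j in zip(human_scores, judge_scores)) / len(human_scores)
print(f"Pearson r = {pearson_r:.2f}, exact agreement = {exact_match:.2f}")
```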
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver a comprehensive survey of the LLMs-as-judges paradigm, structured around five perspectives: Functionality (why use LLM judges), Methodology (how to construct evaluation systems with LLMs), Applications (where to apply them), Meta-evaluation (how to evaluate the judges), and Limitations. It begins with a systematic definition, analyzes the framework's effectiveness and interpretability, explores domains of use, discusses evaluation methods for the judges themselves, examines limitations, and outlines future directions, while committing to maintain an open GitHub resource list for ongoing updates.
Significance. If the coverage proves complete and unbiased, the survey would offer substantial value by imposing structure on a fast-growing area of LLM-based evaluation, helping researchers and practitioners navigate methodological choices and applications. The public GitHub resource list is a concrete strength that supports reproducibility and community maintenance. However, the significance is reduced because the unverifiable selection process prevents readers from confirming that the five-perspective taxonomy and cited works accurately reflect the current state of the field without major omissions.
Major comments (1)
- [Abstract and Introduction] The manuscript asserts a 'comprehensive survey' and 'systematic definition' of LLMs-as-judges but contains no description of the literature search protocol (databases, keywords, date cutoffs, inclusion/exclusion criteria, or PRISMA-style reporting). This directly undermines the central claim of providing an unbiased synthesis across the five perspectives, as readers cannot assess completeness or selection bias.
Minor comments (1)
- [Abstract] Typo in the final sentence ('we aim aims to provide') should be corrected to 'we aim to provide'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey. The concern about missing literature search details is valid and directly impacts the transparency of our 'comprehensive' claim. We address it below and will revise accordingly.
Point-by-point responses
- Referee: [Abstract and Introduction] The manuscript asserts a 'comprehensive survey' and 'systematic definition' of LLMs-as-judges but contains no description of the literature search protocol (databases, keywords, date cutoffs, inclusion/exclusion criteria, or PRISMA-style reporting). This directly undermines the central claim of providing an unbiased synthesis across the five perspectives, as readers cannot assess completeness or selection bias.
Authors: We agree that the absence of an explicit search protocol reduces the verifiability of our coverage and weakens the 'comprehensive' and 'systematic' assertions. The initial manuscript relied on a combination of targeted keyword searches across arXiv, ACL Anthology, and Google Scholar (focusing on terms such as 'LLM-as-judge', 'LLM evaluator', 'LLM-based evaluation' from 2023 onward), manual curation of high-impact works, and ongoing monitoring via the public GitHub repository, but these steps were not documented. In the revised version we will insert a dedicated subsection (likely in the Introduction or as a new 'Survey Methodology' paragraph) that reports: (1) databases and repositories searched, (2) exact keywords and Boolean combinations, (3) date cutoff (December 2024), (4) inclusion criteria (peer-reviewed or arXiv papers that propose or evaluate LLM judges for natural-language evaluation tasks), and (5) exclusion criteria (non-English works, purely theoretical papers without empirical evaluation components). We will also note that the fast-moving nature of the field precludes a fully exhaustive PRISMA-style flow diagram, but the added description plus the living GitHub list will allow readers to assess selection bias. This change strengthens rather than alters the five-perspective taxonomy and core analysis. revision: yes
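For concreteness, the keyword search the authors describe could also be reported as a reproducible query; the sketch below takes the search terms and the 2023-onward window from the rebuttal, while the use of the public arXiv export API, the result cap, and client-side date filtering are assumptions made only for illustration.

```python
# Illustrative literature-search query, not the survey's documented protocol.
# Keywords and date window follow the rebuttal; everything else is assumed.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
query = 'all:"LLM-as-judge" OR all:"LLM evaluator" OR all:"LLM-based evaluation"'
url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode({
    "search_query": query,
    "start": 0,
    "max_results": 100,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
})

with urllib.request.urlopen(url) as resp:
    feed = ET.fromstring(resp.read())

for entry in feed.findall(f"{ATOM}entry"):
    published = entry.findtext(f"{ATOM}published", "")
    title = " ".join(entry.findtext(f"{ATOM}title", "").split())
    if "2023" <= published[:4] <= "2024":  # 2023 onward, December 2024 cutoff per the rebuttal
        print(published[:10], title)
```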
Circularity Check
No circularity: survey contains no derivations, predictions, or self-referential reductions
Full rationale
This is a literature survey paper with no equations, fitted parameters, predictions, or first-principles derivations. The claimed structure (five perspectives: Functionality, Methodology, Applications, Meta-evaluation, Limitations) is an organizational framework chosen by the authors rather than a result derived from data or prior results within the paper. No step reduces to its own inputs by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and the central claim of providing a 'comprehensive survey' is an assertion of coverage rather than a mathematical or predictive output that could be circular. Absence of a search protocol affects verifiability of completeness but does not create circular reasoning in any derivation chain.
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel (tagged unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations."
- Foundation.LawOfExistence · defect_zero_iff_one (tagged unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 31 Pith papers
- ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
- Analysis and Explainability of LLMs Via Evolutionary Methods
Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.
- Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework
UL-XCoT maintains competitive accuracy on multilingual benchmarks while cutting decoding tokens by over 50% through per-query language selection and logic-space trajectory pruning.
- Do AI Coding Agents Log Like Humans? An Empirical Study
AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans perfo...
- Generative Experiences for Digital Mental Health Interventions: Evidence from a Randomized Study
GUIDE instantiates a generative experience paradigm for DMH and significantly reduced stress (p=.02) while improving user experience (p=.04) versus LLM cognitive restructuring in a preregistered RCT (N=237).
- Generative Experiences for Digital Mental Health Interventions: Evidence from a Randomized Study
A generative system for digital mental health support dynamically assembles personalized content and multimodal interaction flows, producing lower stress and better user experience than a fixed LLM baseline in a prere...
- Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.
- A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
- K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
K-MetBench shows LLMs have large gaps in interpreting meteorology diagrams and Korean-specific context, with smaller local models beating much larger global ones.
- Evian: Towards Explainable Visual Instruction-tuning Data Auditing
EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.
- QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning
QuantumQA dataset and verification-aware RL with adaptive reward fusion enable an 8B LLM to achieve performance competitive with proprietary models on quantum mechanics tasks.
- How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
LLMs perform substantially better as pragmatic listeners judging language than as speakers generating it, revealing weak alignment between the two roles.
- MLLM-as-a-Judge Exhibits Model Preference Bias
MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.
- Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
- ARuleCon: Agentic Security Rule Conversion
ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.
- Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
- When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.
- Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior
Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.
- Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior
LLM agents in a collaborative 2D game exhibit emergent behaviors such as perspective-taking, theory of mind, and clarification, detected by LLM judges and rated positively by human participants.
- Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Model
Expert re-annotations of a German ABSA dataset serve as ground truth to evaluate how students, crowdworkers, and LLMs affect inter-annotator agreement and downstream performance on ACSA and TASD tasks using BERT, T5, ...
- LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
- STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
- Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
PSA-Eval reframes evaluation of trilingual public-space agents around traceable failures and regression testing, revealing cross-language score drift in a pilot despite high average performance.
- Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.
- Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders
The authors propose a three-layer trust framework for AI mental health systems and review current evaluation practices to highlight gaps between technical metrics and clinical requirements.
- Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text...
- Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
- Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study
Transformer models detect applicant gender in de-gendered academic recommendation letters via implicit linguistic patterns such as associations with words like 'emotional' and 'humanitarian', and removing these cues r...
- Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.
- Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model
Repeated generations of exercise prescriptions by an LLM showed high semantic consistency but notable variability in quantitative details such as exercise intensity.
- A Systematic Approach for Large Language Models Debugging
This paper proposes a structured methodology for debugging LLMs that integrates issue detection, diagnosis, prompt and parameter refinement, and data adaptation to improve reproducibility and transparency.