pith. machine review for the scientific record. sign in

hub

A Survey on LLM-as-a-Judge

63 Pith papers cite this work. Polarity classification is still indexing.

63 Pith papers citing it
abstract

Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discussed practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

hub tools

citation-role summary

background 2

citation-polarity summary

claims ledger

  • abstract Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of L

co-cited works

years

2026 60 2025 3

roles

background 2

polarities

background 2

representative citing papers

FollowTable: A Benchmark for Instruction-Following Table Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 8.0

FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

citing papers explorer

Showing 50 of 63 citing papers.