JudgeBench: A Benchmark for Evaluating LLM-based Judges

Canonical reference. 75% of citing Pith papers cite this work as background.

30 Pith papers citing it
Background 75% of classified citations
abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.

citation-role summary

background: 5 · dataset: 3

representative citing papers

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

cs.AI · 2026-06-19 · unverdicted · novelty 7.0

Counsel is a new dataset of LLM-generated process critiques on agent benchmarks paired with human labels on error location and reasoning quality, achieving 0.78 Krippendorff alpha.

Green Shielding: A User-Centric Approach Towards Trustworthy AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.

Are LLMs Bad at Moral Reasoning?

cs.CY · 2026-06-10 · unverdicted · novelty 5.0

Reanalyzing MoReBench by assigning LLMs the task of generating scoring rubrics shows better calibration to human rubrics and suggests stronger LLM moral reasoning than previously reported.
