Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
3 Pith papers cite this work. Polarity classification is still indexing.
Citing papers
- Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
  DOLORES, an agent that uses a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
- HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
  HumorRank ranks nine LLMs on textual humor using GTVH-grounded pairwise tournaments and Adaptive Swiss aggregation on the SemEval-2026 MWAHAHA dataset, finding that mastery of comedic mechanisms matters more than scale.
- Leveraging RAG for Training-Free Alignment of LLMs
  RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.