Nine LLM judges on three NLI datasets with human labels provide only ~2 effective independent votes due to correlated errors, underperforming independent voting by 8-22 points and matched or beaten by the best single judge.
arXiv preprint arXiv:2602.08003
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Representative citing papers:
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
Citing papers:
- Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels
Nine LLM judges on three NLI datasets with human labels provide only ~2 effective independent votes due to correlated errors, underperforming independent voting by 8-22 points and matched or beaten by the best single judge.
- A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.