Pith Number

pith:MDF3373R

pith:2024:MDF3373RIIMDNQJZQ5BQYYKIRB
not attested · not anchored · not stored · refs resolved

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis, Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su

A panel of smaller diverse LLMs judges model outputs better than one large model while costing far less.

arxiv:2404.18796 v2 · 2024-04-29 · cs.CL · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{MDF3373RIIMDNQJZQ5BQYYKIRB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
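
To illustrate why a mirror can recompute the same state, here is a minimal Python sketch of an order-independent merge. The event fields (`id`, `ts`, `key`, `value`), the sample events, and the last-writer-wins rule are all assumptions made for illustration; this record does not describe the actual bundle schema or merge algorithm.

```python
import random

def merge(record, events):
    """Fold events into a copy of the record in a fixed, total order.

    Hypothetical event shape: {"id": str, "ts": str, "key": str, "value": object}.
    Sorting by (ts, id) makes the result independent of arrival order,
    and each event id is applied at most once.
    """
    state = dict(record)
    seen = set()
    for ev in sorted(events, key=lambda e: (e["ts"], e["id"])):
        if ev["id"] in seen:
            continue  # skip duplicate deliveries of the same event
        seen.add(ev["id"])
        state[ev["key"]] = ev["value"]  # last writer wins (assumed rule)
    return state

# Made-up sample data, not taken from the real bundle.
record = {"source": "arxiv:2404.18796"}
events = [
    {"id": "e1", "ts": "2026-05-18T00:00:00Z", "key": "citations", "value": "open"},
    {"id": "e2", "ts": "2026-05-19T00:00:00Z", "key": "citations", "value": "closed"},
    {"id": "e3", "ts": "2026-05-19T00:00:00Z", "key": "replications", "value": "open"},
]

# A mirror may receive events in any order, possibly with duplicates.
shuffled = events + [events[0]]
random.shuffle(shuffled)
assert merge(record, events) == merge(record, shuffled)
```

The point of the sketch is the total order: as long as every mirror sorts by the same key before folding, arrival order and duplicates do not affect the result.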

Claims

C1 strongest claim

Using a PoLL (Panel of LLM evaluators) composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because it is composed of disjoint model families, and is over seven times less expensive.

C2 weakest assumption

That the collective judgments of smaller models from disjoint families can capture nuanced quality signals at least as well as a single frontier model without systematic blind spots on the evaluated tasks.

C3 one-line summary

A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
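
The panel idea in the claims above can be sketched in a few lines of Python. The judge functions below are stand-ins rather than real models, and majority voting is an assumed aggregation rule; this record does not say how the paper pools the panel's verdicts.

```python
from collections import Counter
from typing import Callable, Sequence

# A judge maps (question, answer) to a verdict string.
Judge = Callable[[str, str], str]

def panel_verdict(judges: Sequence[Judge], question: str, answer: str) -> str:
    """Return the most common verdict across the panel.

    Ties resolve to the verdict that was cast first.
    """
    votes = [judge(question, answer) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-in judges; a real panel would query models
# from different families instead of applying string rules.
def strict_judge(question: str, answer: str) -> str:
    return "correct" if answer == "Paris" else "incorrect"

def lenient_judge(question: str, answer: str) -> str:
    return "correct" if "paris" in answer.lower() else "incorrect"

def length_biased_judge(question: str, answer: str) -> str:
    # Simulates a judge with a systematic bias toward long answers.
    return "correct" if len(answer) > 20 else "incorrect"

panel = [strict_judge, lenient_judge, length_biased_judge]
print(panel_verdict(panel, "Capital of France?", "Paris"))  # prints "correct"
```

Here the length-biased judge is outvoted by the other two, which is the intuition behind drawing panel members from disjoint model families: an idiosyncratic bias in one judge is less likely to decide the outcome.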

References

291 extracted · 291 resolved · 3 Pith anchors

[1] Anthropic. 2024. The Claude 3 model family: Opus, Sonnet, Haiku.
[3] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. In International Conference on Learning
[5] TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. 2017 · doi:10.18653/v1/p17-1147
[6] Dense Passage Retrieval for Open-Domain Question Answering. 2020 · arXiv:2004.04906
[7] Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81-93.

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:49.775178Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec

Aliases

arxiv: 2404.18796 · arxiv_version: 2404.18796v2 · doi: 10.48550/arxiv.2404.18796 · pith_short_12: MDF3373RIIMD · pith_short_16: MDF3373RIIMDNQJZ · pith_short_8: MDF3373R
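
The identifiers on this page appear to be related in a simple way: for the values shown here, the 26-character Pith Number equals the first 26 characters of the Base32 encoding of the canonical SHA-256 hash, and the `pith_short_*` aliases are prefixes of it. The Python sketch below checks that relationship. It is an observation about this record's values, not a documented derivation rule, and the helper name `base32_prefix` is made up for the example.

```python
import base64

# Values copied from this record.
canonical_hash = "60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec"
pith_number = "MDF3373RIIMDNQJZQ5BQYYKIRB"

def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode the raw digest bytes (RFC 4648 alphabet) and keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]

derived = base32_prefix(canonical_hash, 26)
print(derived == pith_number)  # True for the values on this page

# The short aliases are plain prefixes of the full number.
aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}
print(aliases)
```

If the relationship holds in general, a short alias can be checked against a canonical hash without any network access, though other records would need to be tested before relying on it.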
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4a9f15ac01f3cf9e3f8f70be156ffd8d06fd4f9e398b6820200a3162380b3d3d",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-04-29T15:33:23Z",
    "title_canon_sha256": "9d60ce3a7ac97b31664a1f2f06e1792a1a9bce153ac2c1053b4ae652505ac363"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2404.18796",
    "kind": "arxiv",
    "version": 2
  }
}
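
The verification command above can also be run offline against the JSON shown here. The Python sketch below applies the same canonicalization as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 encoding) and prints the SHA-256 digest. Whether the digest matches the expected value depends on this listing being exactly the `canonical_record` the API returns, which is assumed rather than verified here; the function name `canonical_sha256` is chosen for the example.

```python
import hashlib
import json

EXPECTED = "60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec"

# The canonical record as listed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "4a9f15ac01f3cf9e3f8f70be156ffd8d06fd4f9e398b6820200a3162380b3d3d",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-04-29T15:33:23Z",
        "title_canon_sha256": "9d60ce3a7ac97b31664a1f2f06e1792a1a9bce153ac2c1053b4ae652505ac363",
    },
    "schema_version": "1.0",
    "source": {"id": "2404.18796", "kind": "arxiv", "version": 2},
}

def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON fields happen to be written, only on their names and values.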