Pith Number

pith:MDF3373R

pith:2024:MDF3373RIIMDNQJZQ5BQYYKIRB

not attested · not anchored · not stored · refs resolved

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis, Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su

A panel of smaller diverse LLMs judges model outputs better than one large model while costing far less.

arxiv:2404.18796 v2 · 2024-04-29 · cs.CL · cs.AI

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{MDF3373RIIMDNQJZQ5BQYYKIRB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open

✓ Portable graph bundle: live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Using a Panel of LLM evaluators (PoLL) composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.

C2 · weakest assumption

That the collective judgments of smaller models from disjoint families can capture nuanced quality signals at least as well as a single frontier model without systematic blind spots on the evaluated tasks.

C3 · one-line summary

A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
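
The panel idea in the claims above can be illustrated with a small sketch. This page does not say how the individual judgments are combined, so simple majority voting is an assumption here, and the three toy judges are hypothetical stand-ins for calls to models from different families.

```python
from typing import Callable, Sequence

# A judge takes (question, reference_answer, candidate_answer) and returns
# True if it considers the candidate correct. In a real panel each judge would
# wrap a different LLM; these string-based judges are hypothetical stand-ins.
Judge = Callable[[str, str, str], bool]


def exact_match_judge(question: str, reference: str, candidate: str) -> bool:
    # Strict judge: candidate must equal the reference (case-insensitive).
    return candidate.strip().lower() == reference.strip().lower()


def containment_judge(question: str, reference: str, candidate: str) -> bool:
    # Lenient judge: reference must appear somewhere in the candidate.
    return reference.strip().lower() in candidate.lower()


def token_overlap_judge(question: str, reference: str, candidate: str) -> bool:
    # Middle ground: at least half of the reference tokens must be present.
    ref_tokens = set(reference.lower().split())
    cand_tokens = set(candidate.lower().split())
    return bool(ref_tokens) and len(ref_tokens & cand_tokens) / len(ref_tokens) >= 0.5


def panel_verdict(judges: Sequence[Judge], question: str,
                  reference: str, candidate: str) -> bool:
    """Return the majority vote of the panel (a tie counts as 'incorrect')."""
    votes = [judge(question, reference, candidate) for judge in judges]
    return sum(votes) * 2 > len(votes)


PANEL = [exact_match_judge, containment_judge, token_overlap_judge]
verdict = panel_verdict(PANEL, "Capital of France?", "Paris", "Paris is the capital")
print(verdict)
```

The design point is that no single judge decides the outcome: the strict judge rejects this candidate, but the other two outvote it.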

References

291 extracted · 291 resolved · 3 Pith anchors

[1] Anthropic. 2024. The Claude 3 model family: Opus, Sonnet, Haiku.

[3] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. In International Conference on Learning 2020

[5] TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension 2017 · doi:10.18653/v1/p17-1147

[6] Dense Passage Retrieval for Open-Domain Question Answering 2020 · arXiv:2004.04906

[7] Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81-93.

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

Scalable and Personalized Oral Assessments Using Voice AI

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

Reinforcing Human Behavior Simulation via Verbal Feedback

OpenCoderRank: Personalized Technical Assessments with Generative AI

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Receipt and verification

First computed	2026-05-17T23:38:49.775178Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec

Aliases

arxiv: 2404.18796 · arxiv_version: 2404.18796v2 · doi: 10.48550/arxiv.2404.18796 · pith_short_12: MDF3373RIIMD · pith_short_16: MDF3373RIIMDNQJZ · pith_short_8: MDF3373R
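
The identifiers on this page are consistent with one another in a way that can be checked locally. The sketch below assumes, based only on the values shown here and not on any statement by the page, that the 26-character Pith Number is the leading part of the RFC 4648 base32 encoding of the canonical SHA-256 hash, and that the short aliases are prefixes of it. The function name `pith_from_hash` is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec"
PITH_NUMBER = "MDF3373RIIMDNQJZQ5BQYYKIRB"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier is derived; the page itself
    does not document the derivation.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_from_hash(CANONICAL_HASH)
print(full == PITH_NUMBER)

# The short aliases listed above are prefixes of the full identifier.
for n in (8, 12, 16):
    print(n, full[:n])
```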

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "4a9f15ac01f3cf9e3f8f70be156ffd8d06fd4f9e398b6820200a3162380b3d3d",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-04-29T15:33:23Z",
    "title_canon_sha256": "9d60ce3a7ac97b31664a1f2f06e1792a1a9bce153ac2c1053b4ae652505ac363"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2404.18796",
    "kind": "arxiv",
    "version": 2
  }
}
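
The shell pipeline under "Verify this Pith Number yourself" can also be run offline against the record printed above. The following is a minimal sketch that applies the same canonicalization (sorted keys, compact separators, non-ASCII kept as-is, UTF-8 bytes) and compares the result with the canonical hash. The helper name `canonical_sha256` is hypothetical, and the comparison is only meaningful if the record shown on this page is complete and unmodified.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash the UTF-8 bytes."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as printed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "4a9f15ac01f3cf9e3f8f70be156ffd8d06fd4f9e398b6820200a3162380b3d3d",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-04-29T15:33:23Z",
        "title_canon_sha256": "9d60ce3a7ac97b31664a1f2f06e1792a1a9bce153ac2c1053b4ae652505ac363",
    },
    "schema_version": "1.0",
    "source": {"id": "2404.18796", "kind": "arxiv", "version": 2},
}

EXPECTED = "60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec"

digest = canonical_sha256(RECORD)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to store the fields.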