{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:MDF3373RIIMDNQJZQ5BQYYKIRB","short_pith_number":"pith:MDF3373R","schema_version":"1.0","canonical_sha256":"60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec","source":{"kind":"arxiv","id":"2404.18796","version":2},"attestation_state":"computed","paper":{"title":"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A panel of smaller diverse LLMs judges model outputs better than one large model while costing far less.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis, Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su","submitted_at":"2024-04-29T15:33:23Z","abstract_excerpt":"As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in th"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2404.18796","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2024-04-29T15:33:23Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"9d60ce3a7ac97b31664a1f2f06e1792a1a9bce153ac2c1053b4ae652505ac363","abstract_canon_sha256":"4a9f15ac01f3cf9e3f8f70be156ffd8d06fd4f9e398b6820200a3162380b3d3d"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:49.775758Z","signature_b64":"0WTkBfiV9IOSOoCaUOctuxVj74sOrTTMZM3WZbi1mBEVGodWW9KmyC8CxCYrSLdHtK0wyAFMk4rz0VAeCIOACQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"60cbbdff71421836c13987430c61488868ff86841ddc1fc1b48c7811f418ffec","last_reissued_at":"2026-05-17T23:38:49.775178Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:49.775178Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A panel of smaller diverse LLMs judges model outputs better than one large model while costing far less.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis, Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su","submitted_at":"2024-04-29T15:33:23Z","abstract_excerpt":"As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in th"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the collective judgments of smaller models from disjoint families can capture nuanced quality signals at least as well as a single frontier model without systematic blind spots on the evaluated tasks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A panel of smaller diverse LLMs judges model outputs better than one large model while costing far less.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"02805f7d5fa51e54d63a5dc3cc54822fc0bc1df59e83a04d6b565d8a25403e74"},"source":{"id":"2404.18796","kind":"arxiv","version":2},"verdict":{"id":"f3ec4dfc-1a89-4ab1-be91-28f11c01ba6f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T23:27:15.223901Z","strongest_claim":"using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.","one_line_summary":"A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the collective judgments of smaller models from disjoint families can capture nuanced quality signals at least as well as a single frontier model without systematic blind spots on the evaluated tasks.","pith_extraction_headline":"A panel of smaller diverse LLMs judges model outputs better than one large model while costing far less."},"references":{"count":291,"sample":[{"doi":"","year":2024,"title":"Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku","work_id":"b2994d79-7b31-437d-81e0-ab0c78132716","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. In International Conference on Learning","work_id":"7483e4be-735b-4e12-aa3d-2e1bcb6d1af4","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.18653/v1/p17-1147","year":2017,"title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","work_id":"d05a9c57-9d88-473a-aa65-efb13f9dee25","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Dense Passage Retrieval for Open-Domain Question Answering","work_id":"3d6f2008-b001-4542-ba3f-192f6880c74b","ref_index":6,"cited_arxiv_id":"2004.04906","is_internal_anchor":true},{"doi":"","year":1938,"title":"Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81--93","work_id":"d65e5c0f-9765-4dc9-97fc-dc5c301f3e21","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":291,"snapshot_sha256":"c0107f137dbf9cc663af39ca04552b1ef9b5ee02ed026d31d728565cac86c635","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a81cadf4bda0bd5e6a33aef6b34918be4254fe641e9801cc4342a15a982b6ddf"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2404.18796","created_at":"2026-05-17T23:38:49.775251+00:00"},{"alias_kind":"arxiv_version","alias_value":"2404.18796v2","created_at":"2026-05-17T23:38:49.775251+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2404.18796","created_at":"2026-05-17T23:38:49.775251+00:00"},{"alias_kind":"pith_short_12","alias_value":"MDF3373RIIMD","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"MDF3373RIIMDNQJZ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"MDF3373R","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":25,"internal_anchor_count":25,"sample":[{"citing_arxiv_id":"2603.18221","citing_title":"Scalable and Personalized Oral Assessments Using Voice AI","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20351","citing_title":"Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20506","citing_title":"Reinforcing Human Behavior Simulation via Verbal Feedback","ref_index":151,"is_internal_anchor":true},{"citing_arxiv_id":"2509.06774","citing_title":"OpenCoderRank: Personalized Technical Assessments with Generative AI","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2512.23213","citing_title":"Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2601.13262","citing_title":"CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2508.07407","citing_title":"A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems","ref_index":99,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02359","citing_title":"Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11232","citing_title":"Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09808","citing_title":"Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants","ref_index":95,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09041","citing_title":"BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2412.05579","citing_title":"LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods","ref_index":233,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23734","citing_title":"Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23178","citing_title":"Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06201","citing_title":"Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06161","citing_title":"Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01311","citing_title":"The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20726","citing_title":"Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18547","citing_title":"FUSE: Ensembling Verifiers with Zero Labeled Data","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10291","citing_title":"TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09409","citing_title":"Do AI Coding Agents Log Like Humans? An Empirical Study","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07986","citing_title":"Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06996","citing_title":"Self-Preference Bias in Rubric-Based Evaluation of Large Language Models","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13717","citing_title":"On Cost-Effective LLM-as-a-Judge Improvement Techniques","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03179","citing_title":"A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts","ref_index":40,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB","json":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB.json","graph_json":"https://pith.science/api/pith-number/MDF3373RIIMDNQJZQ5BQYYKIRB/graph.json","events_json":"https://pith.science/api/pith-number/MDF3373RIIMDNQJZQ5BQYYKIRB/events.json","paper":"https://pith.science/paper/MDF3373R"},"agent_actions":{"view_html":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB","download_json":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB.json","view_paper":"https://pith.science/paper/MDF3373R","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2404.18796&json=true","fetch_graph":"https://pith.science/api/pith-number/MDF3373RIIMDNQJZQ5BQYYKIRB/graph.json","fetch_events":"https://pith.science/api/pith-number/MDF3373RIIMDNQJZQ5BQYYKIRB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB/action/storage_attestation","attest_author":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB/action/author_attestation","sign_citation":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB/action/citation_signature","submit_replication":"https://pith.science/pith/MDF3373RIIMDNQJZQ5BQYYKIRB/action/replication_record"}},"created_at":"2026-05-17T23:38:49.775251+00:00","updated_at":"2026-05-17T23:38:49.775251+00:00"}