{"paper":{"title":"Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction","license":"http://creativecommons.org/licenses/by/4.0/","headline":"No AI agent reliably beats a physicist when reproducing LHC analyses from public papers alone.","cross_cats":["cs.AI","hep-ex","hep-ph"],"primary_cat":"cs.LG","authors_text":"Darius A. Faroughy, David Shih, Ian Pang, Siddharth Mishra-Sharma, Sofia Palacios Schweitzer","submitted_at":"2026-05-13T18:00:00Z","abstract_excerpt":"Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitab"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our results show that on average no agent reliably beats the physicist-in-the-loop solution.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That published papers and public software contain enough information for agents to fill gaps via physical reasoning and trial-and-error without access to internal experimental details.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"No AI agent reliably beats a physicist when reproducing LHC analyses from public papers alone.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"79d2969d7e0f5537009bce899eac411ff477b96b46735a8699b8c468ba9f4a2a"},"source":{"id":"2605.13950","kind":"arxiv","version":1},"verdict":{"id":"48a7af57-959e-461e-ab79-591db63a5778","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T06:00:10.695853Z","strongest_claim":"Our results show that on average no agent reliably beats the physicist-in-the-loop solution.","one_line_summary":"Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That published papers and public software contain enough information for agents to fill gaps via physical reasoning and trial-and-error without access to internal experimental details.","pith_extraction_headline":"No AI agent reliably beats a physicist when reproducing LHC analyses from public papers alone."},"references":{"count":44,"sample":[{"doi":"","year":2026,"title":"Plehn, Tilman and Schiller, Daniel and Schmal, Nikita. MadAgents. arXiv:2601.21015. 2026","work_id":"a4e4e427-11c0-45dd-9873-15c631a8576d","ref_index":1,"cited_arxiv_id":"2601.21015","is_internal_anchor":true},{"doi":"","year":2026,"title":"An End-to-end Architecture for Collider Physics and Beyond","work_id":"5f2644fd-344c-4e3b-9744-34fd19331d2a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"The FERMIACC: Agents for Particle Theory","work_id":"d5e30788-8b8d-45d3-b4d8-6bf1e5ebd110","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.21468/scipostphyscodeb.8","year":2022,"title":"A comprehensive guide to the physics and usage of PYTHIA 8.3","work_id":"a9948af2-1653-4e68-8ab8-abc838e3ac66","ref_index":4,"cited_arxiv_id":"2203.11601","is_internal_anchor":true},{"doi":"10.1007/jhep02(2014)057","year":2014,"title":"DELPHES 3, A modular framework for fast simulation of a generic collider experiment","work_id":"c7ea0115-e875-451b-96a0-93e66ef77099","ref_index":5,"cited_arxiv_id":"1307.6346","is_internal_anchor":true}],"resolved_work":44,"snapshot_sha256":"a22bed6b69799740c01fc4772365bd14cf03e45a16e829fae1c1d096c321048a","internal_anchors":16},"formal_canon":{"evidence_count":2,"snapshot_sha256":"e56b8e06c07838fd78b080c41909c70ee125b8ad6b48737a1142d0137fdc5eb0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}