{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:TB66XJQXP62HMOI5DRFOTEK6ZZ","short_pith_number":"pith:TB66XJQX","schema_version":"1.0","canonical_sha256":"987deba6177fb476391d1c4ae9915ece46a059401c86f525567a8b2f65e082fc","source":{"kind":"arxiv","id":"2605.15341","version":1},"attestation_state":"computed","paper":{"title":"LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Trajectory scoring changes which LLMs rank best at iterative scientific design and shows they fall short of Bayesian optimization.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Ankita Rathod, Fabi\\'an Barzuna, Marilyn Zhang, Mark E. Whiting, Tianfeng Chen","submitted_at":"2026-05-14T19:10:45Z","abstract_excerpt":"LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.15341","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.LG","submitted_at":"2026-05-14T19:10:45Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"df3aaacfb8fc9e9175420b6c83a2532fe721c0342a08e236cab51d3f28b3e088","abstract_canon_sha256":"fe18e1230cb354f369fa5ab8700ff90875aa6025ce808c88b1825da02bf4c18e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-20T00:00:53.384635Z","signature_b64":"LDFHj2wToUEDEBDUDT/thCKZ8paT8EBio+oDaI68r3IaQAhkQu5T2Reeq8P3ve4AVDvIUxFGSTHJIyJ1M4JJCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"987deba6177fb476391d1c4ae9915ece46a059401c86f525567a8b2f65e082fc","last_reissued_at":"2026-05-20T00:00:53.383758Z","signature_status":"signed_v1","first_computed_at":"2026-05-20T00:00:53.383758Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Trajectory scoring changes which LLMs rank best at iterative scientific design and shows they fall short of Bayesian optimization.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Ankita Rathod, Fabi\\'an Barzuna, Marilyn Zhang, Mark E. Whiting, Tianfeng Chen","submitted_at":"2026-05-14T19:10:45Z","abstract_excerpt":"LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks, domain-aware prompting matches the published-best approximately 10 percentage points less often than domain-agnostic prompting at iteration 30.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the oracle reward signal alignment with published-best configurations (and divergence from literature-typical ones) provides a valid external ground truth for judging LLM prompting choices, as invoked when reporting the 16 biology tasks and the 6-task subset where patterns are sharpest.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Trajectory scoring changes which LLMs rank best at iterative scientific design and shows they fall short of Bayesian optimization.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"eb2375ebca5ce945dbf9835e82d98a679deac407208ea0822b12b209c5ffcb89"},"source":{"id":"2605.15341","kind":"arxiv","version":1},"verdict":{"id":"d55862ab-08da-4094-869a-11c4ac6baa4e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T15:45:50.499801Z","strongest_claim":"Switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks, domain-aware prompting matches the published-best approximately 10 percentage points less often than domain-agnostic prompting at iteration 30.","one_line_summary":"LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the oracle reward signal alignment with published-best configurations (and divergence from literature-typical ones) provides a valid external ground truth for judging LLM prompting choices, as invoked when reporting the 16 biology tasks and the 6-task subset where patterns are sharpest.","pith_extraction_headline":"Trajectory scoring changes which LLMs rank best at iterative scientific design and shows they fall short of Bayesian optimization."},"integrity":{"clean":false,"summary":{"advisory":1,"critical":0,"by_detector":{"doi_compliance":{"total":1,"advisory":1,"critical":0,"informational":0}},"informational":0},"endpoint":"/pith/2605.15341/integrity.json","findings":[{"note":"DOI in the printed bibliography is fragmented by whitespace or line breaks. A longer candidate (10.64898/2026.02) was visible in the surrounding text but could not be confirmed against doi.org as printed.","detector":"doi_compliance","severity":"advisory","ref_index":14,"audited_at":"2026-05-19T15:54:06.990710Z","detected_doi":"10.64898/2026.02","finding_type":"recoverable_identifier","verdict_class":"incontrovertible","detected_arxiv_id":null}],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T16:01:18.123367Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T15:54:06.990710Z","status":"completed","version":"1.0.0","findings_count":1},{"name":"claim_evidence","ran_at":"2026-05-19T14:41:54.175901Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T13:33:22.756111Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"475dec5f17c1458e06f7f342a6851859fa970ae82019e210ab30a65f97c46e56"},"references":{"count":45,"sample":[{"doi":"","year":2026,"title":"Parth Asawa, Chris Glaze, Gabe Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, and Joseph E. Gonzalez. Con- tinual learning bench. https://continu","work_id":"63d625c0-7504-4977-9aed-18426f9bb4d2","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1038/s41586-023-06792-0","year":null,"title":"Autonomous chemical research with large language models","work_id":"e15cebd6-c137-47c6-975e-41b70ed20de9","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1911,"title":"On the Measure of Intelligence","work_id":"d8980a59-aa48-447b-8852-b7aca2b41b2c","ref_index":3,"cited_arxiv_id":"1911.01547","is_internal_anchor":true},{"doi":"","year":null,"title":"Towards an AI co-scientist","work_id":"485486b1-a1a2-4cde-bdda-768930c403e6","ref_index":4,"cited_arxiv_id":"2502.18864","is_internal_anchor":true},{"doi":"","year":2026,"title":"Ideabench: Benchmarking large language models for research idea generation","work_id":"046c9779-f697-4bc1-935f-dfb9765d93d6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":45,"snapshot_sha256":"bfd0a27323f7fa900d667f65c04146795c753334c4b5ee710f88b6a16022ffaa","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"fda620b32e4160c3db925ecedf93099e8f1a1f9acc2254b95f5409b79e122f91"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.15341","created_at":"2026-05-20T00:00:53.383904+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.15341v1","created_at":"2026-05-20T00:00:53.383904+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.15341","created_at":"2026-05-20T00:00:53.383904+00:00"},{"alias_kind":"pith_short_12","alias_value":"TB66XJQXP62H","created_at":"2026-05-20T00:00:53.383904+00:00"},{"alias_kind":"pith_short_16","alias_value":"TB66XJQXP62HMOI5","created_at":"2026-05-20T00:00:53.383904+00:00"},{"alias_kind":"pith_short_8","alias_value":"TB66XJQX","created_at":"2026-05-20T00:00:53.383904+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ","json":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ.json","graph_json":"https://pith.science/api/pith-number/TB66XJQXP62HMOI5DRFOTEK6ZZ/graph.json","events_json":"https://pith.science/api/pith-number/TB66XJQXP62HMOI5DRFOTEK6ZZ/events.json","paper":"https://pith.science/paper/TB66XJQX"},"agent_actions":{"view_html":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ","download_json":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ.json","view_paper":"https://pith.science/paper/TB66XJQX","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.15341&json=true","fetch_graph":"https://pith.science/api/pith-number/TB66XJQXP62HMOI5DRFOTEK6ZZ/graph.json","fetch_events":"https://pith.science/api/pith-number/TB66XJQXP62HMOI5DRFOTEK6ZZ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ/action/storage_attestation","attest_author":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ/action/author_attestation","sign_citation":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ/action/citation_signature","submit_replication":"https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ/action/replication_record"}},"created_at":"2026-05-20T00:00:53.383904+00:00","updated_at":"2026-05-20T00:00:53.383904+00:00"}