pith:TB66XJQX
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
Trajectory scoring changes which LLMs rank best at iterative scientific design and shows they fall short of Bayesian optimization.
arxiv:2605.15341 v1 · 2026-05-14 · cs.LG · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{TB66XJQXP62HMOI5DRFOTEK6ZZ}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks, domain-aware prompting matches the published-best approximately 10 percentage points less often than domain-agnostic prompting at iteration 30.
The claims rest on the assumption that the oracle reward signal's alignment with published-best configurations (and its divergence from literature-typical ones) provides a valid external ground truth for judging LLM prompting choices, as invoked when reporting the 16 biology tasks and the 6-task subset where the patterns are sharpest.
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
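To illustrate how the choice of scoring rule can change which model looks best, here is a minimal sketch. It assumes the trajectory score is the mean of the best-so-far reward curve, which is only one plausible definition and is not confirmed as the metric used in LEAP. The model names and reward traces are hypothetical toy values, not results from the paper.

```python
from itertools import accumulate


def best_so_far(rewards):
    """Running maximum of the rewards observed up to each iteration."""
    return list(accumulate(rewards, max))


def final_outcome_score(rewards):
    """Outcome-based score: best reward found by the last iteration."""
    return best_so_far(rewards)[-1]


def trajectory_score(rewards):
    """Assumed trajectory score: mean of the best-so-far curve.

    This is an illustrative choice, not necessarily the LEAP metric.
    """
    curve = best_so_far(rewards)
    return sum(curve) / len(curve)


# Hypothetical reward traces for two made-up models over 5 iterations.
runs = {
    "model_a": [0.2, 0.3, 0.4, 0.5, 0.9],  # slow start, strong finish
    "model_b": [0.7, 0.8, 0.8, 0.8, 0.8],  # fast start, early plateau
}

best_by_outcome = max(runs, key=lambda m: final_outcome_score(runs[m]))
best_by_trajectory = max(runs, key=lambda m: trajectory_score(runs[m]))

print("best by final outcome:", best_by_outcome)
print("best by trajectory:   ", best_by_trajectory)
```

In this toy case the model with the better final reward is not the one with the better trajectory score, because the trajectory score also rewards reaching good designs early.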
Receipt and verification
| First computed | 2026-05-20T00:00:53.383758Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
987deba6177fb476391d1c4ae9915ece46a059401c86f525567a8b2f65e082fc
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 987deba6177fb476391d1c4ae9915ece46a059401c86f525567a8b2f65e082fc
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "fe18e1230cb354f369fa5ab8700ff90875aa6025ce808c88b1825da02bf4c18e",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T19:10:45Z",
    "title_canon_sha256": "df3aaacfb8fc9e9175420b6c83a2532fe721c0342a08e236cab51d3f28b3e088"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15341",
    "kind": "arxiv",
    "version": 1
  }
}
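The hashing step of the verification command can also be run offline on the record shown above. The following is a minimal sketch that mirrors the serialization options in the page's one-liner (`sort_keys=True`, compact separators, `ensure_ascii=False`). The function name `canonical_sha256` is hypothetical, and the sketch assumes the JSON printed on this page is exactly the `canonical_record` returned by the API.

```python
import hashlib
import json


def canonical_sha256(record):
    """Hash a JSON-compatible object using the same serialization
    options as the verification one-liner on this page."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()
    return hashlib.sha256(payload).hexdigest()


# Record copied from the "Canonical record JSON" section above.
record = {
    "metadata": {
        "abstract_canon_sha256": "fe18e1230cb354f369fa5ab8700ff90875aa6025ce808c88b1825da02bf4c18e",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T19:10:45Z",
        "title_canon_sha256": "df3aaacfb8fc9e9175420b6c83a2532fe721c0342a08e236cab51d3f28b3e088",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15341", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye against the canonical hash listed on the page
```

Because keys are sorted before hashing, the digest does not depend on the key order in which the record is written or parsed.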