Pith Number

pith:TB66XJQX

pith:2026:TB66XJQXP62HMOI5DRFOTEK6ZZ
not attested · not anchored · not stored · refs resolved

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

Ankita Rathod, Fabián Barzuna, Marilyn Zhang, Mark E. Whiting, Tianfeng Chen

Trajectory scoring changes which LLMs rank best at iterative scientific design and shows they fall short of Bayesian optimization.

arxiv:2605.15341 v1 · 2026-05-14 · cs.LG · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{TB66XJQXP62HMOI5DRFOTEK6ZZ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons and exposes efficiency gains that outcome-based scoring overlooks. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks, domain-aware prompting matches the published-best configuration approximately 10 percentage points less often than domain-agnostic prompting at iteration 30.
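To make the distinction in C1 concrete, here is a minimal sketch of how the two scoring rules can disagree. It assumes trajectory scoring means averaging the best-so-far reward over all iterations; this record does not state the paper's exact metric, so that definition, the function names, and the two reward sequences are all hypothetical.

```python
from itertools import accumulate

def best_so_far(rewards):
    """Running maximum: the best reward found up to each iteration."""
    return list(accumulate(rewards, max))

def final_outcome_score(rewards):
    """Outcome scoring: only the best reward at the final iteration counts."""
    return best_so_far(rewards)[-1]

def trajectory_score(rewards):
    """Assumed trajectory scoring: mean of the best-so-far curve,
    which rewards reaching good designs early."""
    curve = best_so_far(rewards)
    return sum(curve) / len(curve)

# Made-up per-iteration rewards for two hypothetical models, same horizon.
runs = {
    "model_a": [0.2, 0.9, 0.9, 0.9, 0.9],   # finds a good design early
    "model_b": [0.1, 0.2, 0.3, 0.4, 0.95],  # slightly better, but only at the end
}

best_by_final = max(runs, key=lambda m: final_outcome_score(runs[m]))
best_by_trajectory = max(runs, key=lambda m: trajectory_score(runs[m]))

print("best by final outcome:", best_by_final)
print("best by trajectory:  ", best_by_trajectory)
```

In this toy case the two rules pick different models, which is the kind of decision change the claim reports.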

C2 · weakest assumption

The assumption that the oracle reward signal's alignment with published-best configurations (and its divergence from literature-typical ones) provides a valid external ground truth for judging LLM prompting choices. The paper invokes this assumption when reporting the 16 biology tasks and the 6-task subset where the patterns are sharpest.

C3 · one-line summary

LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.

References

45 extracted · 45 resolved · 6 Pith anchors

[1] Parth Asawa, Chris Glaze, Gabe Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, and Joseph E. Gonzalez. Continual learning bench. https://continu 2026
[2] Autonomous chemical research with large language models · doi:10.1038/s41586-023-06792-0
[3] On the Measure of Intelligence 1911 · arXiv:1911.01547
[4] Towards an AI co-scientist · arXiv:2502.18864
[5] Ideabench: Benchmarking large language models for research idea generation 2026

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-20T00:00:53.383758Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

987deba6177fb476391d1c4ae9915ece46a059401c86f525567a8b2f65e082fc

Aliases

arxiv: 2605.15341 · arxiv_version: 2605.15341v1 · doi: 10.48550/arxiv.2605.15341 · pith_short_12: TB66XJQXP62H · pith_short_16: TB66XJQXP62HMOI5 · pith_short_8: TB66XJQX
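The short aliases above are prefixes of the full 26-character Pith Number. For this record, the full number also coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. Whether that is the general derivation rule is an assumption; the page does not document it. The sketch below checks both observations using only values shown on this page; the helper name `short_alias` is hypothetical.

```python
import base64

# Values copied from this record.
canonical_hash = "987deba6177fb476391d1c4ae9915ece46a059401c86f525567a8b2f65e082fc"
pith_number = "TB66XJQXP62HMOI5DRFOTEK6ZZ"
aliases = {
    "pith_short_8": "TB66XJQX",
    "pith_short_12": "TB66XJQXP62H",
    "pith_short_16": "TB66XJQXP62HMOI5",
}

def short_alias(full: str, length: int) -> str:
    """Hypothetical helper: a short alias is the leading `length` characters."""
    return full[:length]

# Each alias name ends in its length, e.g. "pith_short_12" -> 12.
prefix_ok = all(
    short_alias(pith_number, int(name.rsplit("_", 1)[1])) == value
    for name, value in aliases.items()
)

# Observation for this record only (assumed, not documented, as a general rule):
# Base32-encode the raw SHA-256 bytes and keep the first 26 characters.
b32 = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")
matches_hash = b32[: len(pith_number)] == pith_number

print("aliases are prefixes:", prefix_ok)
print("number matches Base32 of hash:", matches_hash)
```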
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/TB66XJQXP62HMOI5DRFOTEK6ZZ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 987deba6177fb476391d1c4ae9915ece46a059401c86f525567a8b2f65e082fc
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "fe18e1230cb354f369fa5ab8700ff90875aa6025ce808c88b1825da02bf4c18e",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T19:10:45Z",
    "title_canon_sha256": "df3aaacfb8fc9e9175420b6c83a2532fe721c0342a08e236cab51d3f28b3e088"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15341",
    "kind": "arxiv",
    "version": 1
  }
}
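The verification one-liner above can also be run offline against a saved copy of the record. The following is a sketch that mirrors the same canonicalization (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and hashes the record as displayed here. The function name `canonical_sha256` is hypothetical, and the sketch does not assert that the digest of this transcription equals the published canonical hash; compare the printed value yourself.

```python
import hashlib
import json

def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "fe18e1230cb354f369fa5ab8700ff90875aa6025ce808c88b1825da02bf4c18e",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T19:10:45Z",
        "title_canon_sha256": "df3aaacfb8fc9e9175420b6c83a2532fe721c0342a08e236cab51d3f28b3e088",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15341", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the JSON file it was loaded from, which is what lets a mirror recompute the same value.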