Pith Number

pith:7QRGQQQJ

pith:2026:7QRGQQQJ3TSQ3V2JK3CI72UM4K
not attested · not anchored · not stored · refs pending

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Feiyang Cai, Feng Luo, Gang Li, Guijuan He, Jingjing Wang, Joshua Luo, Ling Liu, Srikanth Pilla, Tianyu Zhu, Yi Hu

An automated framework parses IUPAC names into structural metadata that guides LLMs in generating a dataset of approximately 163,000 molecule-description pairs, with 98.6% description precision measured on a 2,000-molecule validation subset.

arxiv:2602.02320 v4 · 2026-02-02 · cs.CL · cs.AI · q-bio.BM

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{7QRGQQQJ3TSQ3V2JK3CI72UM4K}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Using this framework, we curate a large-scale dataset of approximately 163k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of 2,000 molecules demonstrates a high description precision of 98.6%.

C2 · weakest assumption

The extended rule-based parser correctly extracts complete structural details from every IUPAC name into XML metadata, and the subsequent LLM generations faithfully reflect those details without introducing structural errors or hallucinations.

C3 · one line summary

An automated rule-based parser plus LLM pipeline creates a 163k-pair molecular structure-language dataset validated at 98.6% precision on a 2,000-sample subset.

Receipt and verification
First computed 2026-06-30T02:18:06.651175Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

fc22684209dce50dd74956c48fea8ce2af451c5c6476a80c3c7d183b66474880

Aliases

arxiv: 2602.02320 · arxiv_version: 2602.02320v4 · doi: 10.48550/arxiv.2602.02320 · pith_short_12: 7QRGQQQJ3TSQ · pith_short_16: 7QRGQQQJ3TSQ3V2J · pith_short_8: 7QRGQQQJ
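The identifiers on this page appear to be related in a simple way. Base32-encoding the canonical SHA-256 hash above (RFC 4648 alphabet) and keeping the first 26 characters reproduces the full Pith Number, and the pith_short_* aliases are its 8-, 12-, and 16-character prefixes. This is an observation drawn from the values shown for this one record, not a documented specification, so the Python sketch below is only a consistency check. The helper name pith_id_from_hash is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "fc22684209dce50dd74956c48fea8ce2af451c5c6476a80c3c7d183b66474880"
PITH_NUMBER = "7QRGQQQJ3TSQ3V2JK3CI72UM4K"
ALIASES = {8: "7QRGQQQJ", 12: "7QRGQQQJ3TSQ", 16: "7QRGQQQJ3TSQ3V2J"}


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: Base32-encode a hex digest and keep a prefix.

    Assumption: the scheme is inferred from this single record and may not
    hold for other records or future schema versions.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_id_from_hash(CANONICAL_HASH)
print("derived id matches Pith Number:", derived == PITH_NUMBER)

# Each short alias should be a plain prefix of the full identifier.
prefixes_ok = all(PITH_NUMBER[:n] == alias for n, alias in ALIASES.items())
print("short aliases are prefixes:", prefixes_ok)
```

If the relationship holds more generally, a short alias could be checked against a canonical hash without contacting the service, although a short prefix alone cannot rule out collisions.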
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/7QRGQQQJ3TSQ3V2JK3CI72UM4K \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: fc22684209dce50dd74956c48fea8ce2af451c5c6476a80c3c7d183b66474880
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "205f61779229925a96580ce3e3b266faa995dc72d5dd1a3453d755511b3b74ae",
    "cross_cats_sorted": [
      "cs.AI",
      "q-bio.BM"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-02-02T16:49:19Z",
    "title_canon_sha256": "50607597811c6c08878a9e94dbb41951ef704c5a6aded7042ce0e79d12aba4f2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2602.02320",
    "kind": "arxiv",
    "version": 4
  }
}
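The shell one-liner under "Verify this Pith Number yourself" can also be written as a short standalone Python script. The sketch below applies the same canonicalization (sorted keys, compact separators, UTF-8 without ASCII escaping) to the record printed above and hashes it with SHA-256. The helper names canonical_bytes and canonical_sha256 are hypothetical. Whether the digest equals the published hash depends on the record being copied exactly, so the script reports the comparison instead of assuming it.

```python
import hashlib
import json

# Canonical record as printed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "205f61779229925a96580ce3e3b266faa995dc72d5dd1a3453d755511b3b74ae",
    "cross_cats_sorted": ["cs.AI", "q-bio.BM"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-02-02T16:49:19Z",
    "title_canon_sha256": "50607597811c6c08878a9e94dbb41951ef704c5a6aded7042ce0e79d12aba4f2"
  },
  "schema_version": "1.0",
  "source": {"id": "2602.02320", "kind": "arxiv", "version": 4}
}
"""

# Hash published on this page.
EXPECTED = "fc22684209dce50dd74956c48fea8ce2af451c5c6476a80c3c7d183b66474880"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with the same options as the shell one-liner."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted and whitespace is removed before hashing, the result does not depend on how the JSON was indented or ordered when it was copied.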