pith:7QRGQQQJ
A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method
An automated framework parses IUPAC names into structural metadata that guides LLMs in generating a dataset of roughly 163,000 molecule-description pairs at 98.6% precision.
arxiv:2602.02320 v4 · 2026-02-02 · cs.CL · cs.AI · q-bio.BM
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{7QRGQQQJ3TSQ3V2JK3CI72UM4K}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Using this framework, we curate a large-scale dataset of approximately 163k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of 2,000 molecules demonstrates a high description precision of 98.6%.
The extended rule-based parser correctly extracts complete structural details from every IUPAC name into XML metadata, and the subsequent LLM generations faithfully reflect those details without introducing structural errors or hallucinations.
An automated rule-based parser plus LLM pipeline creates a 163k-pair molecular structure-language dataset validated at 98.6% precision on a 2,000-sample subset.
Receipt and verification
| First computed | 2026-06-30T02:18:06.651175Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
fc22684209dce50dd74956c48fea8ce2af451c5c6476a80c3c7d183b66474880
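As an observation on the values shown on this page (not a documented specification of the scheme), the 26-character Pith Number and the 8-character short ID both appear to be prefixes of the standard base32 encoding of the canonical SHA-256 hash. The following minimal Python sketch checks that relationship using only the identifiers printed above; the variable names are illustrative.

```python
import base64

# Values copied from this page.
canonical_hash = "fc22684209dce50dd74956c48fea8ce2af451c5c6476a80c3c7d183b66474880"
pith_number = "7QRGQQQJ3TSQ3V2JK3CI72UM4K"
short_id = "7QRGQQQJ"

# RFC 4648 base32 (alphabet A-Z, 2-7) of the raw 32-byte digest.
b32 = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")

# Assumption being tested: the identifiers are prefixes of that encoding.
print("pith number is a prefix:", b32.startswith(pith_number))
print("short id is a prefix:", b32.startswith(short_id))
```

If the assumption holds for other records as well, the identifier can be recomputed from the canonical hash alone, but that generalization is not confirmed by this page.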
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/7QRGQQQJ3TSQ3V2JK3CI72UM4K \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: fc22684209dce50dd74956c48fea8ce2af451c5c6476a80c3c7d183b66474880
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "205f61779229925a96580ce3e3b266faa995dc72d5dd1a3453d755511b3b74ae",
    "cross_cats_sorted": [
      "cs.AI",
      "q-bio.BM"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-02-02T16:49:19Z",
    "title_canon_sha256": "50607597811c6c08878a9e94dbb41951ef704c5a6aded7042ce0e79d12aba4f2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2602.02320",
    "kind": "arxiv",
    "version": 4
  }
}
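The verification command above can also be reproduced offline. The sketch below applies the same canonicalization as the shell one-liner (sorted keys, compact separators, non-ASCII preserved, UTF-8 bytes) to the record as printed on this page and compares the digest with the canonical hash. The helper names `canonical_bytes` and `canonical_sha256` are hypothetical, and a match assumes the record shown here is identical to the `canonical_record` returned by the API.

```python
import hashlib
import json

# Canonical record as shown on this page; key order is irrelevant
# because the serializer below sorts keys.
record = {
    "metadata": {
        "abstract_canon_sha256": "205f61779229925a96580ce3e3b266faa995dc72d5dd1a3453d755511b3b74ae",
        "cross_cats_sorted": ["cs.AI", "q-bio.BM"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-02-02T16:49:19Z",
        "title_canon_sha256": "50607597811c6c08878a9e94dbb41951ef704c5a6aded7042ce0e79d12aba4f2",
    },
    "schema_version": "1.0",
    "source": {"id": "2602.02320", "kind": "arxiv", "version": 4},
}

EXPECTED = "fc22684209dce50dd74956c48fea8ce2af451c5c6476a80c3c7d183b66474880"


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and no whitespace, then encode as UTF-8."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(obj) -> str:
    """Hex SHA-256 of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches canonical hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the byte string independent of how the JSON was pretty-printed, which is why the indented record on the page and the compact API response should hash to the same value.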