pith:CMDBZD66
MinerU: An Open-Source Solution for Precise Document Content Extraction
MinerU is an open-source tool that combines PDF-Extract-Kit models with custom rules to deliver high-precision document content extraction.
arxiv:2409.18839 v1 · 2024-09-27 · cs.CV
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{CMDBZD66D25STJEVFILEYMFOWV}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction.
Underlying assumption: the PDF-Extract-Kit models, together with the authors' preprocessing and postprocessing rules, generalize beyond the tested document collection, and the reported performance metrics reflect real-world usage without hidden data selection.
MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:49.166438Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
13061c8fde1ebb29a4952a164c30aeb575585f6faadc075ce8f64822a9da0bc3
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/CMDBZD66D25STJEVFILEYMFOWV \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 13061c8fde1ebb29a4952a164c30aeb575585f6faadc075ce8f64822a9da0bc3
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e86d00729bac6e949a7c8694a6ca9a66683ccb6e67b205c75372e165b03567f5",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-09-27T15:35:15Z",
    "title_canon_sha256": "818fa4332e6a9e69b932219b941c25e6d0005107a01f923c193435bdef7819b0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2409.18839",
    "kind": "arxiv",
    "version": 1
  }
}
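The verification command above can also be reproduced offline against the record shown here. The following is a minimal sketch, assuming the canonicalization is exactly what the shell one-liner does: key-sorted, compact-separator JSON, encoded as UTF-8 and hashed with SHA-256. The function name `canonical_hash` and the variable names are illustrative, not part of any Pith API, and whether the record as displayed on this page hashes to the published value has not been checked here.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 of key-sorted, compact UTF-8 JSON (mirrors the shell one-liner)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "e86d00729bac6e949a7c8694a6ca9a66683ccb6e67b205c75372e165b03567f5",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-09-27T15:35:15Z",
        "title_canon_sha256": "818fa4332e6a9e69b932219b941c25e6d0005107a01f923c193435bdef7819b0",
    },
    "schema_version": "1.0",
    "source": {"id": "2409.18839", "kind": "arxiv", "version": 1},
}

# Hash published on this page under "Canonical hash".
EXPECTED = "13061c8fde1ebb29a4952a164c30aeb575585f6faadc075ce8f64822a9da0bc3"

digest = canonical_hash(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because the keys are sorted before hashing, the digest does not depend on the order in which the JSON keys were written, only on their names and values.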