Pith Number

pith:CMDBZD66

pith:2024:CMDBZD66D25STJEVFILEYMFOWV
not attested · not anchored · not stored · refs resolved

MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang, Botian Shi, Bo Zhang, Chao Xu, Conghui He, Dahua Lin, Fan Wu, Fukai Shang, Kaiwen Liu, Linke Ouyang, Liqun Wei, Rui Xu, Wei Li, Xiaomeng Zhao, Yuan Qu, Yu Qiao, Zhihao Sui, Zhiyuan Zhao

MinerU is an open-source tool that combines PDF-Extract-Kit models with custom rules to deliver high-precision document content extraction.

arxiv:2409.18839 v1 · 2024-09-27 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{CMDBZD66D25STJEVFILEYMFOWV}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction.

C2 · weakest assumption

The PDF-Extract-Kit models, together with the authors' preprocessing and postprocessing rules, generalize beyond the tested document collection, and the reported performance metrics reflect real-world usage without hidden data selection.

C3 · one-line summary

MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.

References

42 extracted · 42 resolved · 14 Pith anchors

[1] GPT-4 Technical Report 2023 · arXiv:2303.08774
[2] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection 2023 · arXiv:2310.11511
[3] pix2tex - LaTeX OCR
[4] Nougat: Neural Optical Understanding for Academic Documents 2023 · arXiv:2308.13418
[5] Language Models are Few-Shot Learners 2020 · arXiv:2005.14165

Formal links

2 machine-checked theorem links

Cited by

29 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:49.166438Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

13061c8fde1ebb29a4952a164c30aeb575585f6faadc075ce8f64822a9da0bc3

Aliases

arxiv: 2409.18839 · arxiv_version: 2409.18839v1 · doi: 10.48550/arxiv.2409.18839 · pith_short_12: CMDBZD66D25S · pith_short_16: CMDBZD66D25STJEV · pith_short_8: CMDBZD66
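The identifiers on this page appear to be related in a simple way. The short aliases are prefixes of the full Pith Number, and the full number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The page does not state this derivation, so the sketch below records an observation about this one record rather than a documented specification. The helper name `pith_from_hash` is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "13061c8fde1ebb29a4952a164c30aeb575585f6faadc075ce8f64822a9da0bc3"
PITH_NUMBER = "CMDBZD66D25STJEVFILEYMFOWV"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: Base32-encode a hex digest and keep a prefix.

    Assumption: this mirrors how the identifier is formed. It has only been
    checked against the single record shown on this page.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_from_hash(CANONICAL_HASH)

# The short aliases (pith_short_8, _12, _16) are plain prefixes.
aliases = {n: derived[:n] for n in (8, 12, 16)}

print(derived)
print(aliases)
```

For this record the derived string equals the Pith Number above, and the three prefixes equal the `pith_short_8`, `pith_short_12`, and `pith_short_16` aliases. Other records may not follow the same pattern.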
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/CMDBZD66D25STJEVFILEYMFOWV \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 13061c8fde1ebb29a4952a164c30aeb575585f6faadc075ce8f64822a9da0bc3
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e86d00729bac6e949a7c8694a6ca9a66683ccb6e67b205c75372e165b03567f5",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-09-27T15:35:15Z",
    "title_canon_sha256": "818fa4332e6a9e69b932219b941c25e6d0005107a01f923c193435bdef7819b0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2409.18839",
    "kind": "arxiv",
    "version": 1
  }
}
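The shell pipeline under "Verify this Pith Number yourself" can also be run offline against the record printed above. This is a minimal Python sketch, assuming the canonicalization is exactly what that pipeline shows: sorted keys, compact separators, `ensure_ascii=False`, and UTF-8 encoding. The function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "e86d00729bac6e949a7c8694a6ca9a66683ccb6e67b205c75372e165b03567f5",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-09-27T15:35:15Z",
        "title_canon_sha256": "818fa4332e6a9e69b932219b941c25e6d0005107a01f923c193435bdef7819b0",
    },
    "schema_version": "1.0",
    "source": {"id": "2409.18839", "kind": "arxiv", "version": 1},
}


def canonical_sha256(record: dict) -> str:
    """Hash a record with the same JSON settings as the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


print(canonical_sha256(RECORD))
```

If the assumption holds, the printed digest should equal the Canonical hash listed on this page. A mismatch would point to a difference in canonicalization or a change in the record.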