pith. machine review for the scientific record.

arxiv: 2509.22186 · v2 · pith:KNM3LVVKnew · submitted 2025-09-26 · 💻 cs.CV · cs.CL

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Pith reviewed 2026-05-17 13:20 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords document parsing, vision-language model, high-resolution processing, layout analysis, coarse-to-fine strategy, efficient inference, table recognition, formula recognition

The pith

MinerU2.5 decouples global layout analysis on downsampled images from local content recognition on native-resolution crops to parse high-resolution documents with state-of-the-art accuracy and lower compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MinerU2.5, a 1.2-billion-parameter vision-language model for document parsing that splits the work into two stages. The first stage runs layout analysis on a downsampled version of the full page to locate structural elements such as text blocks, tables, and formulas. The second stage then crops those regions from the original image and recognizes their content at native resolution, preserving fine details without processing the entire high-resolution page at once. A supporting data engine generates large-scale training corpora for both pretraining and fine-tuning. If the separation works as intended, accurate document understanding becomes feasible at lower computational cost than with models that process the whole page at high resolution.
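The two stages described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the nearest-neighbour `downsample`, the fixed integer `factor`, the `Region` type, and the `detect_layout` / `recognize` callables are all hypothetical stand-ins for the model's layout and recognition passes, whose details this page does not give.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale page: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


@dataclass
class Region:
    label: str  # e.g. "text", "table", "formula"
    box: Box    # coordinates on the downsampled page


def downsample(image: Image, factor: int) -> Image:
    """Nearest-neighbour downsampling: keep every `factor`-th pixel."""
    return [row[::factor] for row in image[::factor]]


def to_native(box: Box, factor: int, width: int, height: int) -> Box:
    """Map a box from downsampled to native coordinates, clamped to the page."""
    x0, y0, x1, y1 = box
    return (min(x0 * factor, width), min(y0 * factor, height),
            min(x1 * factor, width), min(y1 * factor, height))


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangle described by `box` out of `image`."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def parse_page(
    image: Image,
    factor: int,
    detect_layout: Callable[[Image], List[Region]],
    recognize: Callable[[str, Image], str],
) -> List[Tuple[str, Box, str]]:
    height, width = len(image), len(image[0])
    # Stage 1: layout analysis sees only the downsampled page.
    regions = detect_layout(downsample(image, factor))
    # Stage 2: recognition sees only native-resolution crops.
    results = []
    for region in regions:
        native_box = to_native(region.box, factor, width, height)
        content = recognize(region.label, crop(image, native_box))
        results.append((region.label, native_box, content))
    return results


# Toy usage with stub models (assumptions, for illustration only).
page = [[r * 8 + c for c in range(8)] for r in range(8)]
stub_layout = lambda small: [Region("text", (1, 1, 3, 3))]
stub_recognize = lambda label, patch: f"{label}:{len(patch)}x{len(patch[0])}"
print(parse_page(page, 2, stub_layout, stub_recognize))
```

The point of the structure is that the recognizer never receives the full page: its input size depends on the detected regions, while the layout pass has a bounded input regardless of the original resolution.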

Core claim

MinerU2.5 employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis performed on downsampled images from local content recognition performed on native-resolution crops extracted according to the layout guidance, thereby achieving state-of-the-art recognition accuracy across multiple benchmarks while incurring significantly lower computational overhead than prior general-purpose or domain-specific models.

What carries the argument

The coarse-to-fine two-stage parsing strategy that uses downsampled layout analysis to guide targeted native-resolution crop recognition.

If this is right

  • High-resolution document parsing becomes practical under tighter compute budgets without sacrificing accuracy on text, formulas, or tables.
  • The model outperforms both general vision-language models and specialized document parsers on standard recognition benchmarks.
  • A single trained model can handle diverse document layouts because the layout stage supplies explicit guidance to the recognition stage.
  • Training data volume can be scaled efficiently since the data engine separately supports layout and content tasks.
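A back-of-the-envelope token count shows why the compute claim in the first bullet is plausible. The sketch assumes a ViT-style encoder that emits one token per square patch, and it uses the sum of squared sequence lengths as a rough proxy for full self-attention cost. The patch size, thumbnail size, and crop sizes are illustrative assumptions, not figures from the paper.

```python
import math


def visual_tokens(width, height, patch):
    """Tokens produced when an image is cut into patch x patch tiles."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def attention_proxy(sequence_lengths):
    """Sum of squared sequence lengths, a crude stand-in for attention cost."""
    return sum(n * n for n in sequence_lengths)


PATCH = 28                  # hypothetical patch size
page = (2480, 3508)         # A4 page rendered at 300 dpi
thumbnail = (1036, 1036)    # hypothetical layout-stage input size
crops = [(2000, 300), (2000, 1200), (900, 600), (2000, 800)]  # hypothetical regions

# Uniform processing: the whole page is one long sequence.
uniform = [visual_tokens(*page, PATCH)]
# Two-stage processing: one short layout sequence plus one sequence per crop.
two_stage = [visual_tokens(*thumbnail, PATCH)]
two_stage += [visual_tokens(w, h, PATCH) for w, h in crops]

print("tokens:", sum(uniform), "vs", sum(two_stage))
print("attention proxy ratio:", attention_proxy(two_stage) / attention_proxy(uniform))
```

Under these assumptions the total token count drops only moderately, while the attention proxy drops much more, because several short sequences replace one long one. Real savings depend on the decoder, batching, and how much of the page the crops cover, none of which this sketch models.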

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layout-first guidance pattern could be tested on other high-resolution vision tasks such as scene text or diagram understanding.
  • Adaptive downsampling rates based on predicted layout density might further reduce average compute while preserving accuracy.
  • The decoupling suggests that explicit structural intermediates remain useful even as end-to-end vision-language models grow larger.

Load-bearing premise

Layout analysis performed on downsampled images supplies sufficiently accurate region boundaries and element types to guide error-free recognition on the corresponding high-resolution crops.

What would settle it

A benchmark document containing dense text or complex table structures where the downsampled layout stage misplaces a boundary or misclassifies an element, producing measurable recognition errors in the high-resolution stage that exceed those of uniform high-resolution baselines.
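The failure mode described here can be made concrete: a box that is off by one pixel on the downsampled page is off by `factor` pixels on the original. A minimal sketch, assuming a fixed integer downsampling factor and axis-aligned boxes (both assumptions, since this page does not state the paper's settings), measures how much of a ground-truth region survives in the resulting crop. The box coordinates are invented for illustration.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; empty or inverted boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def coverage(truth, predicted):
    """Fraction of the ground-truth box inside the predicted crop.

    Assumes `truth` has non-zero area.
    """
    overlap = (max(truth[0], predicted[0]), max(truth[1], predicted[1]),
               min(truth[2], predicted[2]), min(truth[3], predicted[3]))
    return area(overlap) / area(truth)


def shifted(box, factor, dx=0, dy=0):
    """Native-resolution box after a (dx, dy) error in downsampled pixels."""
    x0, y0, x1, y1 = box
    return (x0 + dx * factor, y0 + dy * factor,
            x1 + dx * factor, y1 + dy * factor)


text_line = (100, 200, 500, 228)  # hypothetical 28-pixel-tall line of text
for factor in (2, 4, 8):
    crop_box = shifted(text_line, factor, dy=1)
    print(factor, round(coverage(text_line, crop_box), 3))
```

Thin elements such as single text lines or table rows lose the largest share of their content for a given offset, which is why dense layouts are the natural stress test.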

Original abstract

We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MinerU2.5, a 1.2B-parameter vision-language model for document parsing. It proposes a decoupled coarse-to-fine two-stage strategy: efficient global layout analysis on downsampled images followed by targeted content recognition on native-resolution crops extracted according to the detected layout. A data engine generates large-scale training corpora for pretraining and fine-tuning. The central claim is that this yields state-of-the-art recognition accuracy on multiple benchmarks while incurring significantly lower computational overhead than both general-purpose and domain-specific models.

Significance. If the empirical results and efficiency claims hold under rigorous evaluation, the decoupled architecture offers a practical route to high-resolution document parsing without full-image high-res processing throughout the network. The separation of layout and recognition stages could influence future work on resource-efficient VLMs for structured content such as tables and formulas. The data engine component also provides a reusable contribution for training such models.

major comments (2)
  1. [Method (two-stage parsing strategy) and Experiments] The central claim rests on the premise that layout elements detected on downsampled images supply sufficiently precise bounding boxes and structure labels for correct native-resolution crop extraction. No layout-stage precision metrics (e.g., bounding-box IoU, element detection F1), error-propagation analysis, or ablation comparing downsampled versus full-resolution guidance appear in the method or experiments sections. This is load-bearing because any misalignment in dense text blocks, formulas, or table cells directly feeds incorrect regions to the second-stage recognizer and undermines the reported SOTA accuracy.
  2. [Abstract and Experiments section] The abstract asserts SOTA performance and efficiency gains, yet the manuscript description supplies no quantitative benchmark scores, error bars, ablation tables, or direct comparisons against listed baselines. Without these, the strength of the efficiency-accuracy tradeoff cannot be assessed.
minor comments (2)
  1. [Method] Clarify the exact downsampling factor used in the first stage and whether it is fixed or adaptive; this detail affects reproducibility of the efficiency claims.
  2. [Figures] Add a figure or diagram explicitly showing the crop extraction pipeline, including how layout labels are mapped back to the original high-resolution image.
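The layout-stage metrics requested in major comment 1 are cheap to compute once predicted and ground-truth regions are available. The sketch below shows one common convention, assumed here rather than taken from the paper: a prediction counts as correct if it has the same label as an unmatched ground-truth box and their IoU is at least 0.5, with greedy one-to-one matching. The function names and sample boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def detection_f1(preds, truths, iou_threshold=0.5):
    """Precision, recall, F1 for lists of (label, box) pairs."""
    unmatched = list(truths)
    true_pos = 0
    for label, box in preds:
        best_index, best_score = None, iou_threshold
        for i, (t_label, t_box) in enumerate(unmatched):
            score = iou(box, t_box)
            if t_label == label and score >= best_score:
                best_index, best_score = i, score
        if best_index is not None:
            unmatched.pop(best_index)  # each ground-truth box matches once
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truths = [("text", (0, 0, 10, 10)), ("table", (20, 20, 40, 40))]
preds = [
    ("text", (0, 0, 10, 10)),            # correct
    ("table", (100, 100, 120, 120)),     # right label, wrong place
    ("formula", (20, 20, 40, 40)),       # right place, wrong label
]
print(detection_f1(preds, truths))
```

Reporting these numbers separately for boxes predicted on the downsampled page and on the full-resolution page would give the ablation the referee asks for.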

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their constructive and detailed feedback. The comments have helped us identify areas where additional validation and clarity would strengthen the manuscript. We address each major comment below and describe the revisions made.

point-by-point responses
  1. Referee: [Method (two-stage parsing strategy) and Experiments] The central claim rests on the premise that layout elements detected on downsampled images supply sufficiently precise bounding boxes and structure labels for correct native-resolution crop extraction. No layout-stage precision metrics (e.g., bounding-box IoU, element detection F1), error-propagation analysis, or ablation comparing downsampled versus full-resolution guidance appear in the method or experiments sections. This is load-bearing because any misalignment in dense text blocks, formulas, or table cells directly feeds incorrect regions to the second-stage recognizer and undermines the reported SOTA accuracy.

    Authors: We thank the referee for highlighting the importance of validating the layout stage explicitly. While the end-to-end results support the overall effectiveness of the decoupled approach, we agree that intermediate metrics and analysis would increase transparency and address potential concerns about error propagation. In the revised manuscript, we have added a dedicated evaluation of the layout stage, reporting bounding-box IoU and element detection F1 scores on downsampled images. We have also included an error-propagation analysis quantifying the impact of layout inaccuracies on final recognition accuracy, as well as an ablation study comparing downsampled layout guidance against full-resolution alternatives. These additions confirm that the precision achieved is sufficient for the second stage while preserving the efficiency benefits. revision: yes

  2. Referee: [Abstract and Experiments section] The abstract asserts SOTA performance and efficiency gains, yet the manuscript description supplies no quantitative benchmark scores, error bars, ablation tables, or direct comparisons against listed baselines. Without these, the strength of the efficiency-accuracy tradeoff cannot be assessed.

    Authors: We agree that the abstract and experiments section would benefit from more explicit quantitative support to allow readers to fully evaluate the claimed tradeoffs. In the revised manuscript, we have updated the abstract to include specific benchmark accuracy scores and efficiency metrics (such as parameter count, FLOPs, and inference speed) along with direct comparisons to baselines. The experiments section has been expanded with detailed tables containing numerical results, error bars from multiple runs where applicable, comprehensive ablation studies, and side-by-side comparisons against the listed general-purpose and domain-specific models. These changes make the strength of the efficiency-accuracy tradeoff readily assessable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture and benchmark results

full rationale

The paper describes a two-stage decoupled vision-language model trained on a data engine for document parsing. No equations, fitted parameters presented as predictions, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on empirical training and benchmark evaluation rather than any derivation that reduces to its own inputs by construction. The two-stage strategy (downsampled layout then native-resolution crops) is presented as an engineering choice whose validity is tested externally on benchmarks, not assumed or renamed from prior results within the paper itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes standard supervised vision-language training succeeds when data is generated by the described engine and that downsampled layout cues transfer reliably to high-resolution crops; no explicit free parameters or invented entities are named beyond the model architecture itself.

free parameters (1)
  • model parameter count
    1.2B parameters chosen as the scale for the vision-language backbone.
axioms (1)
  • domain assumption: Coarse layout analysis on downsampled images is sufficient to guide accurate high-resolution crop extraction and recognition.
    Invoked in the description of the two-stage strategy.

pith-pipeline@v0.9.0 · 5707 in / 1156 out tokens · 85811 ms · 2026-05-17T13:20:11.638576+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

    cs.CV 2026-05 conditional novelty 8.0

    PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

  3. Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval

    cs.CL 2026-05 unverdicted novelty 7.0

    Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations ...

  4. Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

  5. MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

  6. Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.

  7. Visual-ERM: Reward Modeling for Visual Equivalence

    cs.CV 2026-03 unverdicted novelty 7.0

    Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.

  8. FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

    cs.CV 2025-11 unverdicted novelty 7.0

    FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.

  9. Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models

    physics.optics 2026-05 unverdicted novelty 6.0

    JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.

  10. The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs improve high-resolution reasoning by framing it as sequential Bayesian optimal experimental design, using a coverage-resolution proxy and the FOVEA procedure to acquire task-relevant visual evidence, yielding gai...

  11. The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs suffer from a perceptual bandwidth bottleneck; the paper formalizes active visual reasoning as sequential Bayesian optimal experimental design, derives a coverage-resolution proxy objective, and introduces the tr...

  12. InstructTable: Improving Table Structure Recognition Through Instructions

    cs.CV 2026-04 unverdicted novelty 6.0

    InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...

  13. CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

    cs.AI 2026-04 unverdicted novelty 6.0

    CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

  14. Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing

    cs.CV 2026-04 unverdicted novelty 6.0

    A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.

  15. Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

    cs.CV 2026-03 conditional novelty 6.0

    PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.

  16. Logics-Parsing-Omni Technical Report

    cs.AI 2026-03 unverdicted novelty 6.0

    Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict evidence alignment.

  17. ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch

    cs.CV 2026-01 conditional novelty 6.0

    ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.

  18. PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

    cs.CV 2026-01 unverdicted novelty 5.0

    PaddleOCR-VL-1.5 is a 0.9B VLM achieving 94.5% SOTA accuracy on OmniDocBench v1.5, with added robustness to physical distortions and support for seal recognition plus text spotting.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 17 Pith papers · 17 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Wukong-reader: Multi-modal pre-training for fine-grained visual document understanding.arXiv preprint arXiv:2212.09621, 2022

    Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, et al. Wukong-reader: Multi-modal pre-training for fine-grained visual document understanding.arXiv preprint arXiv:2212.09621, 2022

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Nougat: Neural Optical Understanding for Academic Documents

    Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents.arXiv preprint arXiv:2308.13418, 2023

  5. [5]

    Ocrflux.https://github.com/chatdoc-com/OCRFlux, 2025

    chatdoc com. Ocrflux.https://github.com/chatdoc-com/OCRFlux, 2025. Accessed:2025-09-25

  6. [6]

    Ocean-ocr: Towards general ocr application via a vision-language model.arXiv preprint arXiv:2501.15558, 2025

    Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, et al. Ocean-ocr: Towards general ocr application via a vision-language model.arXiv preprint arXiv:2501.15558, 2025

  7. [7]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  8. [8]

    PaddleOCR 3.0 Technical Report

    Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report.arXiv preprint arXiv:2507.05595, 2025

  9. [9]

    Vision grid transformer for document layout analysis

    Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. Vision grid transformer for document layout analysis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19462–19472, 2023

  10. [10]

    Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36: 2252–2274, 2023

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36: 2252–2274, 2023

  11. [11]

    Dolphin: Document image parsing via heterogeneous anchor prompting.arXiv preprint arXiv:2505.14059, 2025

    Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chun- hui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting.arXiv preprint arXiv:2505.14059, 2025

  12. [12]

    OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning.arXiv preprint arXiv:2501.00321, 2024

  13. [13]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

  14. [14]

    Layoutlmv3: Pre-training for document ai with unified text and image masking

    Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. InProceedings of the 30th ACM international conference on multimedia, pages 4083–4091, 2022

  15. [15]

    Ocr-free document understanding transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. InEuropean Conference on Computer Vision, pages 498–517. Springer, 2022

  16. [16]

    Gon- zalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gon- zalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  17. [17]

    Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm

    Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025. 26 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

  18. [18]

    Doctr: Document transformer for structured information extraction in documents

    Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R Manmatha, and Vijay Mahadevan. Doctr: Document transformer for structured information extraction in documents. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19584–19594, 2023

  19. [19]

    Revolutionizing retrieval-augmented generation with enhanced pdf structure recognition.arXiv preprint arXiv:2401.12599, 2024

    Demiao Lin. Revolutionizing retrieval-augmented generation with enhanced pdf structure recognition.arXiv preprint arXiv:2401.12599, 2024

  20. [20]

    Hrvda: High-resolution visual document assistant

    Chaohu Liu, Kun Yin, Haoyu Cao, Xinghua Jiang, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, and Linli Xu. Hrvda: High-resolution visual document assistant. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15534–15545, 2024

  21. [21]

    Pp-formulanet: Bridging accuracy and efficiency in advanced formula recognition.arXiv preprint arXiv:2503.18382, 2025

    Hongen Liu, Cheng Cui, Yuning Du, Yi Liu, and Gang Pan. Pp-formulanet: Bridging accuracy and efficiency in advanced formula recognition.arXiv preprint arXiv:2503.18382, 2025

  22. [22]

    Points-reader: Distillation-free adaptation of vision-language models for document conversion.arXiv preprint arXiv:2509.01215, 2025

    Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, et al. Points-reader: Distillation-free adaptation of vision-language models for document conversion.arXiv preprint arXiv:2509.01215, 2025

  23. [23]

    arXiv preprint arXiv:2403.04473 (2024) 1, 3, 4, 9

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document.arXiv preprint arXiv:2403.04473, 2024

  24. [24]

    Docling: An efficient open- source toolkit for ai-driven document conversion,

    Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Ce- sar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for ai-driven document conversion.arXiv preprint arXiv:2501.17887, 2025

  25. [25]

    Optimized table tokenization for table structure recognition

    Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, and Peter Staar. Optimized table tokenization for table structure recognition. InInternational Conference on Document Analysis and Recognition, pages 37–50. Springer, 2023

  26. [26]

    Nanonets-ocr-s

    Souvik Mandalm. Nanonets-ocr-s. https://nanonets.com/research/nanonets-ocr-s/, 2025. Accessed:2025- 09-25

  27. [27]

    Mathpix.https://mathpix.com/, 2025

    Mathpix. Mathpix.https://mathpix.com/, 2025. Accessed:2025-09-25

  28. [28]

    Smoldocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion.arXiv preprint arXiv:2503.11576, 2025

    Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A Said Gurbuz, et al. Smoldocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion.arXiv preprint arXiv:2503.11576, 2025

  29. [29]

    Native visual understanding: Resolving resolution dilemmas in vision- language models.arXiv preprint arXiv:2506.12776, 2025

    Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, et al. Native visual understanding: Resolving resolution dilemmas in vision- language models.arXiv preprint arXiv:2506.12776, 2025

  30. [30]

    Pdf-extract-kit

    OpenDataLab. Pdf-extract-kit. https://github.com/opendatalab/PDF-Extract-Kit, 2025. Accessed:2025- 09-25

  31. [31]

    Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations

    Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025

  32. [32]

    Marker.https://github.com/datalab-to/marker, 2025

    Vik Paruchuri. Marker.https://github.com/datalab-to/marker, 2025. Accessed:2025-09-25

  33. [33]

    Surya: A lightweight document ocr and analysis toolkit

    Vikas Paruchuri and Datalab Team. Surya: A lightweight document ocr and analysis toolkit. https: //github.com/VikParuchuri/surya, 2025. Accessed:2025-09-25

  34. [34]

    Doclaynet: A large human-annotated dataset for document-layout segmentation

    Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter Staar. Doclaynet: A large human-annotated dataset for document-layout segmentation. InProceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3743–3751, 2022

  35. [35]

    Available: https://arxiv.org/abs/2502.18443

    Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models.arXiv preprint arXiv:2502.18443, 2025. 27 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

  36. [36]

    Rapid table.https://github.com/RapidAI/RapidTable, 2024

    RapidAI. Rapid table.https://github.com/RapidAI/RapidTable, 2024. Accessed: 2025-9-25

  37. [37]

    dots.ocr: Multilingual document layout parsing in a single vision-language model

    rednote. dots.ocr: Multilingual document layout parsing in a single vision-language model. https://github. com/rednote-hilab/dots.ocr, 2025. Accessed:2025-09-25

  38. [38]

    Real-time single image and video super-resolution using an efficient sub-pixel convolu- tional neural network

    Wenzhe Shi, Jose Caballero, Ferenc Husz ´ar, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolu- tional neural network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016

  39. [39]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  40. [40]

    Unifying vision, text, and layout for universal document processing

    Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19254–19264, 2023

  41. [41]

    Mistral-ocr

    Mistral AI Team. Mistral-ocr. https://mistral.ai/news/mistral-ocr?utm source=ai-bot.cn, 2025. Accessed:2025-09-25

  42. [42]

    Qwen2 Technical Report

    Qwen Team. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024

  43. [43]

    Omniparser: A unified framework for text spotting key information extraction and table recognition

    Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, and Zhibo Yang. Omniparser: A unified framework for text spotting key information extraction and table recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15641–15653, 2024

  44. [44]

    Yolov10: Real-time end-to-end object detection. Advances in Neural Information Processing Systems, 37:107984–108011, 2024

    Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, et al. Yolov10: Real-time end-to-end object detection. Advances in Neural Information Processing Systems, 37:107984–108011, 2024

  45. [45]

    Unimernet: A universal network for real-world mathematical expression recognition. arXiv preprint arXiv:2404.15254, 2024

    Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. Unimernet: A universal network for real-world mathematical expression recognition. arXiv preprint arXiv:2404.15254, 2024

  46. [46]

    MinerU: An Open-Source Solution for Precise Document Content Extraction

    Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024

  47. [47]

    Image over text: Transforming formula recognition evaluation with character detection matching

    Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Botian Shi, Bo Zhang, and Conghui He. Image over text: Transforming formula recognition evaluation with character detection matching. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19681–19690, 2025

  48. [48]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  49. [49]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  50. [50]

    Layoutreader: Pre-training of text and layout for reading order detection. arXiv preprint arXiv:2108.11591, 2021

    Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. Layoutreader: Pre-training of text and layout for reading order detection. arXiv preprint arXiv:2108.11591, 2021

  51. [51]

    Vrdu: A benchmark for visually-rich document understanding

    Zilong Wang, Yichao Zhou, Wei Wei, Chen-Yu Lee, and Sandeep Tata. Vrdu: A benchmark for visually-rich document understanding. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5184–5193, 2023

  52. [52]

    General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

    Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024

  53. [53]

    Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy

    Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, et al. Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy. arXiv preprint arXiv:2412.02210, 2024

  54. [54]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

  55. [55]

    MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning D...

  56. [56]

    Ocr hinders rag: Evaluating the cascading impact of ocr on retrieval-augmented generation. arXiv preprint arXiv:2412.02592, 2024

    Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, and Wentao Zhang. Ocr hinders rag: Evaluating the cascading impact of ocr on retrieval-augmented generation. arXiv preprint arXiv:2412.02592, 2024

  57. [57]

    Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

    Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, and Wentao Zhang. Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction. arXiv preprint arXiv:2410.21169, 2024

  58. [58]

    Retrieval-Augmented Generation for AI-Generated Content: A Survey

    Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

  59. [59]

    Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception. arXiv preprint arXiv:2410.12628, 2024

    Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception. arXiv preprint arXiv:2410.12628, 2024

  60. [60]

    Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024

  61. [61]

    Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context

    Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong, and Nancy Xin Ru Wang. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 697–706, 2021

  62. [62]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. In European conference on computer vision, pages 564–580. Springer, 2020

  63. [63]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025