arxiv: 2509.22186 · v2 · pith:KNM3LVVKnew · submitted 2025-09-26 · 💻 cs.CV · cs.CL

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Junbo Niu , Zheng Liu , Zhuangcheng Gu , Bin Wang , Linke Ouyang , Zhiyuan Zhao , Tao Chu , Tianyao He

show 53 more authors

Fan Wu Qintong Zhang Zhenjiang Jin Guang Liang Rui Zhang Wenzheng Zhang Yuan Qu Zhifei Ren Yuefeng Sun Yuanhong Zheng Dongsheng Ma Zirui Tang Boyu Niu Ziyang Miao Hejun Dong Siyi Qian Junyuan Zhang Jingzhou Chen Fangdong Wang Xiaomeng Zhao Liqun Wei Wei Li Shasha Wang Ruiliang Xu Yuanyuan Cao Lu Chen Qianqian Wu Huaiyu Gu Lindong Lu Keming Wang Dechen Lin Guanlin Shen Xuanhe Zhou Linfeng Zhang Yuhang Zang Xiaoyi Dong Jiaqi Wang Bo Zhang Lei Bai Pei Chu Weijia Li Jiang Wu Lijun Wu Zhenxiang Li Guangyu Wang Zhongying Tu Chao Xu Kai Chen Yu Qiao Bowen Zhou Dahua Lin Wentao Zhang Conghui He

This is my paper

Pith reviewed 2026-05-17 13:20 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords document parsingvision-language modelhigh-resolution processinglayout analysiscoarse-to-fine strategyefficient inferencetable recognitionformula recognition

0 comments

The pith

MinerU2.5 decouples global layout analysis on downsampled images from local content recognition on native-resolution crops to parse high-resolution documents with state-of-the-art accuracy and lower compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MinerU2.5 as a 1.2 billion parameter vision-language model for document parsing that splits the work into two stages. The first stage runs efficient layout analysis on a downsampled version of the full page to locate structural elements such as text blocks, tables, and formulas. The second stage then extracts and recognizes content only from the corresponding high-resolution crops, preserving fine details without processing the entire high-resolution image at once. A supporting data engine creates large-scale training sets for both stages. If the separation works as intended, accurate document understanding becomes feasible at lower computational cost than models that handle high-resolution inputs uniformly.

Core claim

MinerU2.5 employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis performed on downsampled images from local content recognition performed on native-resolution crops extracted according to the layout guidance, thereby achieving state-of-the-art recognition accuracy across multiple benchmarks while incurring significantly lower computational overhead than prior general-purpose or domain-specific models.

What carries the argument

The coarse-to-fine two-stage parsing strategy that uses downsampled layout analysis to guide targeted native-resolution crop recognition.

If this is right

High-resolution document parsing becomes practical under tighter compute budgets without sacrificing accuracy on text, formulas, or tables.
The model outperforms both general vision-language models and specialized document parsers on standard recognition benchmarks.
A single trained model can handle diverse document layouts because the layout stage supplies explicit guidance to the recognition stage.
Training data volume can be scaled efficiently since the data engine separately supports layout and content tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layout-first guidance pattern could be tested on other high-resolution vision tasks such as scene text or diagram understanding.
Adaptive downsampling rates based on predicted layout density might further reduce average compute while preserving accuracy.
The decoupling suggests that explicit structural intermediates remain useful even as end-to-end vision-language models grow larger.

Load-bearing premise

Layout analysis performed on downsampled images supplies sufficiently accurate region boundaries and element types to guide error-free recognition on the corresponding high-resolution crops.

What would settle it

A benchmark document containing dense text or complex table structures where the downsampled layout stage misplaces a boundary or misclassifies an element, producing measurable recognition errors in the high-resolution stage that exceed those of uniform high-resolution baselines.

read the original abstract

We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MinerU2.5's explicit split between downsampled layout and native-res crops is a practical efficiency move, but the paper needs clearer checks that the first stage actually gives clean guidance without crop errors.

read the letter

The main point on this paper is the two-stage decoupling: run cheap layout analysis on a downsampled image to locate blocks, then extract and recognize native-resolution crops for the details. That split is the concrete new piece, along with the data engine they built to feed both stages. It lets them claim lower compute than full high-res models while still hitting strong numbers on document benchmarks, which is the part that could matter for real pipelines handling scans or PDFs. The efficiency angle holds up as a direct response to the cost of processing dense pages end-to-end. The results section apparently shows it beating both general VLMs and specialist parsers on recognition tasks, and the architecture description is clear enough to reproduce the flow. The soft spot is the guidance assumption. Downsampling for the layout pass can soften boundaries in tight text, formulas, or table grids, and any offset in the boxes feeds bad crops to the second stage. The abstract and stress-test note flag the lack of layout precision metrics or an ablation that compares downsampled guidance against full-resolution alternatives on the same data. If those checks are missing or weak in the full text, the SOTA claims rest more on end-to-end numbers than on proving the decoupling itself is robust. This is aimed at groups building document AI tools who already care about throughput on high-res inputs. A reader who needs a drop-in parser with documented speed gains would find the architecture and training setup useful. It deserves peer review because the core design is testable and the efficiency results are the kind of thing that gets adopted if the numbers check out. Send it to referees but ask specifically for layout-stage error analysis and any crop-quality ablations.

Referee Report

2 major / 2 minor

Summary. The paper introduces MinerU2.5, a 1.2B-parameter vision-language model for document parsing. It proposes a decoupled coarse-to-fine two-stage strategy: efficient global layout analysis on downsampled images followed by targeted content recognition on native-resolution crops extracted according to the detected layout. A data engine generates large-scale training corpora for pretraining and fine-tuning. The central claim is that this yields state-of-the-art recognition accuracy on multiple benchmarks while incurring significantly lower computational overhead than both general-purpose and domain-specific models.

Significance. If the empirical results and efficiency claims hold under rigorous evaluation, the decoupled architecture offers a practical route to high-resolution document parsing without full-image high-res processing throughout the network. The separation of layout and recognition stages could influence future work on resource-efficient VLMs for structured content such as tables and formulas. The data engine component also provides a reusable contribution for training such models.

major comments (2)

[Method (two-stage parsing strategy) and Experiments] The central claim rests on the premise that layout elements detected on downsampled images supply sufficiently precise bounding boxes and structure labels for correct native-resolution crop extraction. No layout-stage precision metrics (e.g., bounding-box IoU, element detection F1), error-propagation analysis, or ablation comparing downsampled versus full-resolution guidance appear in the method or experiments sections. This is load-bearing because any misalignment in dense text blocks, formulas, or table cells directly feeds incorrect regions to the second-stage recognizer and undermines the reported SOTA accuracy.
[Abstract and Experiments section] The abstract asserts SOTA performance and efficiency gains, yet the manuscript description supplies no quantitative benchmark scores, error bars, ablation tables, or direct comparisons against listed baselines. Without these, the strength of the efficiency-accuracy tradeoff cannot be assessed.

minor comments (2)

[Method] Clarify the exact downsampling factor used in the first stage and whether it is fixed or adaptive; this detail affects reproducibility of the efficiency claims.
[Figures] Add a figure or diagram explicitly showing the crop extraction pipeline, including how layout labels are mapped back to the original high-resolution image.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their constructive and detailed feedback. The comments have helped us identify areas where additional validation and clarity would strengthen the manuscript. We address each major comment below and describe the revisions made.

read point-by-point responses

Referee: [Method (two-stage parsing strategy) and Experiments] The central claim rests on the premise that layout elements detected on downsampled images supply sufficiently precise bounding boxes and structure labels for correct native-resolution crop extraction. No layout-stage precision metrics (e.g., bounding-box IoU, element detection F1), error-propagation analysis, or ablation comparing downsampled versus full-resolution guidance appear in the method or experiments sections. This is load-bearing because any misalignment in dense text blocks, formulas, or table cells directly feeds incorrect regions to the second-stage recognizer and undermines the reported SOTA accuracy.

Authors: We thank the referee for highlighting the importance of validating the layout stage explicitly. While the end-to-end results support the overall effectiveness of the decoupled approach, we agree that intermediate metrics and analysis would increase transparency and address potential concerns about error propagation. In the revised manuscript, we have added a dedicated evaluation of the layout stage, reporting bounding-box IoU and element detection F1 scores on downsampled images. We have also included an error-propagation analysis quantifying the impact of layout inaccuracies on final recognition accuracy, as well as an ablation study comparing downsampled layout guidance against full-resolution alternatives. These additions confirm that the precision achieved is sufficient for the second stage while preserving the efficiency benefits. revision: yes
Referee: [Abstract and Experiments section] The abstract asserts SOTA performance and efficiency gains, yet the manuscript description supplies no quantitative benchmark scores, error bars, ablation tables, or direct comparisons against listed baselines. Without these, the strength of the efficiency-accuracy tradeoff cannot be assessed.

Authors: We agree that the abstract and experiments section would benefit from more explicit quantitative support to allow readers to fully evaluate the claimed tradeoffs. In the revised manuscript, we have updated the abstract to include specific benchmark accuracy scores and efficiency metrics (such as parameter count, FLOPs, and inference speed) along with direct comparisons to baselines. The experiments section has been expanded with detailed tables containing numerical results, error bars from multiple runs where applicable, comprehensive ablation studies, and side-by-side comparisons against the listed general-purpose and domain-specific models. These changes make the strength of the efficiency-accuracy tradeoff readily assessable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture and benchmark results

full rationale

The paper describes a two-stage decoupled vision-language model trained on a data engine for document parsing. No equations, fitted parameters presented as predictions, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on empirical training and benchmark evaluation rather than any derivation that reduces to its own inputs by construction. The two-stage strategy (downsampled layout then native-resolution crops) is presented as an engineering choice whose validity is tested externally on benchmarks, not assumed or renamed from prior results within the paper itself.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach assumes standard supervised vision-language training succeeds when data is generated by the described engine and that downsampled layout cues transfer reliably to high-resolution crops; no explicit free parameters or invented entities are named beyond the model architecture itself.

free parameters (1)

model parameter count
1.2B parameters chosen as the scale for the vision-language backbone.

axioms (1)

domain assumption Coarse layout analysis on downsampled images is sufficient to guide accurate high-resolution crop extraction and recognition.
Invoked in the description of the two-stage strategy.

pith-pipeline@v0.9.0 · 5707 in / 1156 out tokens · 85811 ms · 2026-05-17T13:20:11.638576+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
cs.CL 2026-05 accept novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings
cs.CV 2026-05 conditional novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval
cs.CL 2026-05 unverdicted novelty 7.0

Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations ...
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
Visual-ERM: Reward Modeling for Visual Equivalence
cs.CV 2026-03 unverdicted novelty 7.0

Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR
cs.CV 2025-11 unverdicted novelty 7.0

FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.
Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models
physics.optics 2026-05 unverdicted novelty 6.0

JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
cs.CV 2026-05 unverdicted novelty 6.0

VLMs improve high-resolution reasoning by framing it as sequential Bayesian optimal experimental design, using a coverage-resolution proxy and the FOVEA procedure to acquire task-relevant visual evidence, yielding gai...
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
cs.CV 2026-05 unverdicted novelty 6.0

VLMs suffer from a perceptual bandwidth bottleneck; the paper formalizes active visual reasoning as sequential Bayesian optimal experimental design, derives a coverage-resolution proxy objective, and introduces the tr...
InstructTable: Improving Table Structure Recognition Through Instructions
cs.CV 2026-04 unverdicted novelty 6.0

InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
cs.AI 2026-04 unverdicted novelty 6.0

CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
cs.CV 2026-04 unverdicted novelty 6.0

A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
cs.CV 2026-03 conditional novelty 6.0

PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
Logics-Parsing-Omni Technical Report
cs.AI 2026-03 unverdicted novelty 6.0

Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict evidence alignment.
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
cs.CV 2026-01 conditional novelty 6.0

ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
cs.CV 2026-01 unverdicted novelty 5.0

PaddleOCR-VL-1.5 is a 0.9B VLM achieving 94.5% SOTA accuracy on OmniDocBench v1.5, with added robustness to physical distortions and support for seal recognition plus text spotting.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 17 Pith papers · 17 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Wukong-reader: Multi-modal pre-training for fine-grained visual document understanding.arXiv preprint arXiv:2212.09621, 2022

Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, et al. Wukong-reader: Multi-modal pre-training for fine-grained visual document understanding.arXiv preprint arXiv:2212.09621, 2022

work page arXiv 2022
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents.arXiv preprint arXiv:2308.13418, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Ocrflux.https://github.com/chatdoc-com/OCRFlux, 2025

chatdoc com. Ocrflux.https://github.com/chatdoc-com/OCRFlux, 2025. Accessed:2025-09-25

work page 2025
[6]

Ocean-ocr: Towards general ocr application via a vision-language model.arXiv preprint arXiv:2501.15558, 2025

Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, et al. Ocean-ocr: Towards general ocr application via a vision-language model.arXiv preprint arXiv:2501.15558, 2025

work page arXiv 2025
[7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report.arXiv preprint arXiv:2507.05595, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Vision grid transformer for document layout analysis

Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. Vision grid transformer for document layout analysis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19462–19472, 2023

work page 2023
[10]

Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36: 2252–2274, 2023

Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36: 2252–2274, 2023

work page 2023
[11]

Dolphin: Document image parsing via heterogeneous anchor prompting.arXiv preprint arXiv:2505.14059, 2025

Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chun- hui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting.arXiv preprint arXiv:2505.14059, 2025

work page arXiv 2025
[12]

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning.arXiv preprint arXiv:2501.00321, 2024

work page internal anchor Pith review arXiv 2024
[13]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Layoutlmv3: Pre-training for document ai with unified text and image masking

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. InProceedings of the 30th ACM international conference on multimedia, pages 4083–4091, 2022

work page 2022
[15]

Ocr-free document understanding transformer

Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. InEuropean Conference on Computer Vision, pages 498–517. Springer, 2022

work page 2022
[16]

Gon- zalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gon- zalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[17]

Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm

Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025. 26 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

work page arXiv 2025
[18]

Doctr: Document transformer for structured information extraction in documents

Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R Manmatha, and Vijay Mahadevan. Doctr: Document transformer for structured information extraction in documents. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19584–19594, 2023

work page 2023
[19]

Revolutionizing retrieval-augmented generation with enhanced pdf structure recognition.arXiv preprint arXiv:2401.12599, 2024

Demiao Lin. Revolutionizing retrieval-augmented generation with enhanced pdf structure recognition.arXiv preprint arXiv:2401.12599, 2024

work page arXiv 2024
[20]

Hrvda: High-resolution visual document assistant

Chaohu Liu, Kun Yin, Haoyu Cao, Xinghua Jiang, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, and Linli Xu. Hrvda: High-resolution visual document assistant. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15534–15545, 2024

work page 2024
[21]

Pp-formulanet: Bridging accuracy and efficiency in advanced formula recognition.arXiv preprint arXiv:2503.18382, 2025

Hongen Liu, Cheng Cui, Yuning Du, Yi Liu, and Gang Pan. Pp-formulanet: Bridging accuracy and efficiency in advanced formula recognition.arXiv preprint arXiv:2503.18382, 2025

work page arXiv 2025
[22]

Points-reader: Distillation-free adaptation of vision-language models for document conversion.arXiv preprint arXiv:2509.01215, 2025

Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, et al. Points-reader: Distillation-free adaptation of vision-language models for document conversion.arXiv preprint arXiv:2509.01215, 2025

work page arXiv 2025
[23]

arXiv preprint arXiv:2403.04473 (2024) 1, 3, 4, 9

Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document.arXiv preprint arXiv:2403.04473, 2024

work page arXiv 2024
[24]

Docling: An efficient open- source toolkit for ai-driven document conversion,

Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Ce- sar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for ai-driven document conversion.arXiv preprint arXiv:2501.17887, 2025

work page arXiv 2025
[25]

Optimized table tokenization for table structure recognition

Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, and Peter Staar. Optimized table tokenization for table structure recognition. InInternational Conference on Document Analysis and Recognition, pages 37–50. Springer, 2023

work page 2023
[26]

Nanonets-ocr-s

Souvik Mandalm. Nanonets-ocr-s. https://nanonets.com/research/nanonets-ocr-s/, 2025. Accessed:2025- 09-25

work page 2025
[27]

Mathpix.https://mathpix.com/, 2025

Mathpix. Mathpix.https://mathpix.com/, 2025. Accessed:2025-09-25

work page 2025
[28]

Smoldocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion.arXiv preprint arXiv:2503.11576, 2025

Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A Said Gurbuz, et al. Smoldocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion.arXiv preprint arXiv:2503.11576, 2025

work page arXiv 2025
[29]

Native visual understanding: Resolving resolution dilemmas in vision- language models.arXiv preprint arXiv:2506.12776, 2025

Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, et al. Native visual understanding: Resolving resolution dilemmas in vision- language models.arXiv preprint arXiv:2506.12776, 2025

work page arXiv 2025
[30]

Pdf-extract-kit

OpenDataLab. Pdf-extract-kit. https://github.com/opendatalab/PDF-Extract-Kit, 2025. Accessed:2025- 09-25

work page 2025
[31]

Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025

work page 2025
[32]

Marker.https://github.com/datalab-to/marker, 2025

Vik Paruchuri. Marker.https://github.com/datalab-to/marker, 2025. Accessed:2025-09-25

work page 2025
[33]

Surya: A lightweight document ocr and analysis toolkit

Vikas Paruchuri and Datalab Team. Surya: A lightweight document ocr and analysis toolkit. https: //github.com/VikParuchuri/surya, 2025. Accessed:2025-09-25

work page 2025
[34]

Doclaynet: A large human-annotated dataset for document-layout segmentation

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter Staar. Doclaynet: A large human-annotated dataset for document-layout segmentation. InProceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3743–3751, 2022

work page 2022
[35]

Available: https://arxiv.org/abs/2502.18443

Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models.arXiv preprint arXiv:2502.18443, 2025. 27 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

work page arXiv 2025
[36]

Rapid table.https://github.com/RapidAI/RapidTable, 2024

RapidAI. Rapid table.https://github.com/RapidAI/RapidTable, 2024. Accessed: 2025-9-25

work page 2024
[37]

dots.ocr: Multilingual document layout parsing in a single vision-language model

rednote. dots.ocr: Multilingual document layout parsing in a single vision-language model. https://github. com/rednote-hilab/dots.ocr, 2025. Accessed:2025-09-25

work page 2025
[38]

Real-time single image and video super-resolution using an efficient sub-pixel convolu- tional neural network

Wenzhe Shi, Jose Caballero, Ferenc Husz ´ar, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolu- tional neural network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016

work page 2016
[39]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024
[40]

Unifying vision, text, and layout for universal document processing

Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19254–19264, 2023

work page 2023
[41]

Mistral-ocr

Mistral AI Team. Mistral-ocr. https://mistral.ai/news/mistral-ocr?utm source=ai-bot.cn, 2025. Accessed:2025-09-25

work page 2025
[42]

Qwen2 Technical Report

Qwen Team. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Omniparser: A unified framework for text spotting key information extraction and table recognition

Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, and Zhibo Yang. Omniparser: A unified framework for text spotting key information extraction and table recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15641–15653, 2024

work page 2024
[44]

Yolov10: Real-time end-to-end object detection.Advances in Neural Information Processing Systems, 37:107984–108011, 2024

Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, et al. Yolov10: Real-time end-to-end object detection.Advances in Neural Information Processing Systems, 37:107984–108011, 2024

work page 2024
[45]

Unimernet: A universal network for real-world mathematical expression recognition.arXiv preprint arXiv:2404.15254, 2024

Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. Unimernet: A universal network for real-world mathematical expression recognition.arXiv preprint arXiv:2404.15254, 2024

work page arXiv 2024
[46]

MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction.arXiv preprint arXiv:2409.18839, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Image over text: Transforming formula recognition evaluation with character detection matching

Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Botian Shi, Bo Zhang, and Conghui He. Image over text: Transforming formula recognition evaluation with character detection matching. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19681–19690, 2025

work page 2025
[48]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Layoutreader: Pre-training of text and layout for reading order detection.arXiv preprint arXiv:2108.11591, 2021

Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. Layoutreader: Pre-training of text and layout for reading order detection.arXiv preprint arXiv:2108.11591, 2021

work page arXiv 2021
[51]

Vrdu: A benchmark for visually-rich document understanding

Zilong Wang, Yichao Zhou, Wei Wei, Chen-Yu Lee, and Sandeep Tata. Vrdu: A benchmark for visually-rich document understanding. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5184–5193, 2023

work page 2023
[52]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model.arXiv preprint arXiv:2409.01704, 2024

work page internal anchor Pith review arXiv 2024
[53]

Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy,

Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, et al. Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy.arXiv preprint arXiv:2412.02210, 2024. 28 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Doc...

work page arXiv 2024
[54]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Ocr hinders rag: Evaluating the cascading impact of ocr on retrieval-augmented generation.arXiv preprint arXiv:2412.02592, 2024

Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, and Wentao Zhang. Ocr hinders rag: Evaluating the cascading impact of ocr on retrieval-augmented generation.arXiv preprint arXiv:2412.02592, 2024

work page arXiv 2024
[57]

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, and Wentao Zhang. Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction.arXiv preprint arXiv:2410.21169, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Retrieval-Augmented Generation for AI-Generated Content: A Survey

Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey.arXiv preprint arXiv:2402.19473, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Doclayout-yolo: Enhancing document layout anal- ysis through diverse synthetic data and global-to-local adaptive perception.arXiv preprint arXiv:2410.12628, 2024

Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout anal- ysis through diverse synthetic data and global-to-local adaptive perception.arXiv preprint arXiv:2410.12628, 2024

work page arXiv 2024
[60]

Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37:62557–62583, 2024

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37:62557–62583, 2024

work page 2024
[61]

Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context

Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong, and Nancy Xin Ru Wang. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 697–706, 2021

work page 2021
[62]

Image-based table recognition: data, model, and evaluation

Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. InEuropean conference on computer vision, pages 564–580. Springer, 2020

work page 2020
[63]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 29 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Appe...

work page internal anchor Pith review Pith/arXiv arXiv 2025