MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Pith reviewed 2026-05-17 13:20 UTC · model grok-4.3
The pith
MinerU2.5 decouples global layout analysis on downsampled images from local content recognition on native-resolution crops to parse high-resolution documents with state-of-the-art accuracy and lower compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MinerU2.5 employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis performed on downsampled images from local content recognition performed on native-resolution crops extracted according to the layout guidance, thereby achieving state-of-the-art recognition accuracy across multiple benchmarks while incurring significantly lower computational overhead than prior general-purpose or domain-specific models.
What carries the argument
The coarse-to-fine two-stage parsing strategy that uses downsampled layout analysis to guide targeted native-resolution crop recognition.
If this is right
- High-resolution document parsing becomes practical under tighter compute budgets without sacrificing accuracy on text, formulas, or tables.
- The model outperforms both general vision-language models and specialized document parsers on standard recognition benchmarks.
- A single trained model can handle diverse document layouts because the layout stage supplies explicit guidance to the recognition stage.
- Training data volume can be scaled efficiently since the data engine separately supports layout and content tasks.
Where Pith is reading between the lines
- The same layout-first guidance pattern could be tested on other high-resolution vision tasks such as scene text or diagram understanding.
- Adaptive downsampling rates based on predicted layout density might further reduce average compute while preserving accuracy.
- The decoupling suggests that explicit structural intermediates remain useful even as end-to-end vision-language models grow larger.
Load-bearing premise
Layout analysis performed on downsampled images supplies sufficiently accurate region boundaries and element types to guide error-free recognition on the corresponding high-resolution crops.
What would settle it
A benchmark document containing dense text or complex table structures where the downsampled layout stage misplaces a boundary or misclassifies an element, producing measurable recognition errors in the high-resolution stage that exceed those of uniform high-resolution baselines.
read the original abstract
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MinerU2.5, a 1.2B-parameter vision-language model for document parsing. It proposes a decoupled coarse-to-fine two-stage strategy: efficient global layout analysis on downsampled images followed by targeted content recognition on native-resolution crops extracted according to the detected layout. A data engine generates large-scale training corpora for pretraining and fine-tuning. The central claim is that this yields state-of-the-art recognition accuracy on multiple benchmarks while incurring significantly lower computational overhead than both general-purpose and domain-specific models.
Significance. If the empirical results and efficiency claims hold under rigorous evaluation, the decoupled architecture offers a practical route to high-resolution document parsing without full-image high-res processing throughout the network. The separation of layout and recognition stages could influence future work on resource-efficient VLMs for structured content such as tables and formulas. The data engine component also provides a reusable contribution for training such models.
major comments (2)
- [Method (two-stage parsing strategy) and Experiments] The central claim rests on the premise that layout elements detected on downsampled images supply sufficiently precise bounding boxes and structure labels for correct native-resolution crop extraction. No layout-stage precision metrics (e.g., bounding-box IoU, element detection F1), error-propagation analysis, or ablation comparing downsampled versus full-resolution guidance appear in the method or experiments sections. This is load-bearing because any misalignment in dense text blocks, formulas, or table cells directly feeds incorrect regions to the second-stage recognizer and undermines the reported SOTA accuracy.
- [Abstract and Experiments section] The abstract asserts SOTA performance and efficiency gains, yet the manuscript description supplies no quantitative benchmark scores, error bars, ablation tables, or direct comparisons against listed baselines. Without these, the strength of the efficiency-accuracy tradeoff cannot be assessed.
minor comments (2)
- [Method] Clarify the exact downsampling factor used in the first stage and whether it is fixed or adaptive; this detail affects reproducibility of the efficiency claims.
- [Figures] Add a figure or diagram explicitly showing the crop extraction pipeline, including how layout labels are mapped back to the original high-resolution image.
Simulated Author's Rebuttal
We sincerely thank the referee for their constructive and detailed feedback. The comments have helped us identify areas where additional validation and clarity would strengthen the manuscript. We address each major comment below and describe the revisions made.
read point-by-point responses
-
Referee: [Method (two-stage parsing strategy) and Experiments] The central claim rests on the premise that layout elements detected on downsampled images supply sufficiently precise bounding boxes and structure labels for correct native-resolution crop extraction. No layout-stage precision metrics (e.g., bounding-box IoU, element detection F1), error-propagation analysis, or ablation comparing downsampled versus full-resolution guidance appear in the method or experiments sections. This is load-bearing because any misalignment in dense text blocks, formulas, or table cells directly feeds incorrect regions to the second-stage recognizer and undermines the reported SOTA accuracy.
Authors: We thank the referee for highlighting the importance of validating the layout stage explicitly. While the end-to-end results support the overall effectiveness of the decoupled approach, we agree that intermediate metrics and analysis would increase transparency and address potential concerns about error propagation. In the revised manuscript, we have added a dedicated evaluation of the layout stage, reporting bounding-box IoU and element detection F1 scores on downsampled images. We have also included an error-propagation analysis quantifying the impact of layout inaccuracies on final recognition accuracy, as well as an ablation study comparing downsampled layout guidance against full-resolution alternatives. These additions confirm that the precision achieved is sufficient for the second stage while preserving the efficiency benefits. revision: yes
-
Referee: [Abstract and Experiments section] The abstract asserts SOTA performance and efficiency gains, yet the manuscript description supplies no quantitative benchmark scores, error bars, ablation tables, or direct comparisons against listed baselines. Without these, the strength of the efficiency-accuracy tradeoff cannot be assessed.
Authors: We agree that the abstract and experiments section would benefit from more explicit quantitative support to allow readers to fully evaluate the claimed tradeoffs. In the revised manuscript, we have updated the abstract to include specific benchmark accuracy scores and efficiency metrics (such as parameter count, FLOPs, and inference speed) along with direct comparisons to baselines. The experiments section has been expanded with detailed tables containing numerical results, error bars from multiple runs where applicable, comprehensive ablation studies, and side-by-side comparisons against the listed general-purpose and domain-specific models. These changes make the strength of the efficiency-accuracy tradeoff readily assessable. revision: yes
Circularity Check
No circularity: empirical architecture and benchmark results
full rationale
The paper describes a two-stage decoupled vision-language model trained on a data engine for document parsing. No equations, fitted parameters presented as predictions, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on empirical training and benchmark evaluation rather than any derivation that reduces to its own inputs by construction. The two-stage strategy (downsampled layout then native-resolution crops) is presented as an engineering choice whose validity is tested externally on benchmarks, not assumed or renamed from prior results within the paper itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- model parameter count
axioms (1)
- domain assumption Coarse layout analysis on downsampled images is sufficient to guide accurate high-resolution crop extraction and recognition.
Forward citations
Cited by 18 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
-
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations ...
-
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
-
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
-
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
-
Visual-ERM: Reward Modeling for Visual Equivalence
Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.
-
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR
FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.
-
Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models
JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.
-
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
VLMs improve high-resolution reasoning by framing it as sequential Bayesian optimal experimental design, using a coverage-resolution proxy and the FOVEA procedure to acquire task-relevant visual evidence, yielding gai...
-
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
VLMs suffer from a perceptual bandwidth bottleneck; the paper formalizes active visual reasoning as sequential Bayesian optimal experimental design, derives a coverage-resolution proxy objective, and introduces the tr...
-
InstructTable: Improving Table Structure Recognition Through Instructions
InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...
-
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
-
Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
-
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
-
Logics-Parsing-Omni Technical Report
Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict evidence alignment.
-
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
-
PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
PaddleOCR-VL-1.5 is a 0.9B VLM achieving 94.5% SOTA accuracy on OmniDocBench v1.5, with added robustness to physical distortions and support for seal recognition plus text spotting.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, et al. Wukong-reader: Multi-modal pre-training for fine-grained visual document understanding.arXiv preprint arXiv:2212.09621, 2022
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Nougat: Neural Optical Understanding for Academic Documents
Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents.arXiv preprint arXiv:2308.13418, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Ocrflux.https://github.com/chatdoc-com/OCRFlux, 2025
chatdoc com. Ocrflux.https://github.com/chatdoc-com/OCRFlux, 2025. Accessed:2025-09-25
work page 2025
-
[6]
Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, et al. Ocean-ocr: Towards general ocr application via a vision-language model.arXiv preprint arXiv:2501.15558, 2025
-
[7]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
PaddleOCR 3.0 Technical Report
Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report.arXiv preprint arXiv:2507.05595, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Vision grid transformer for document layout analysis
Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. Vision grid transformer for document layout analysis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19462–19472, 2023
work page 2023
-
[10]
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36: 2252–2274, 2023
work page 2023
-
[11]
Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chun- hui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting.arXiv preprint arXiv:2505.14059, 2025
-
[12]
Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning.arXiv preprint arXiv:2501.00321, 2024
work page internal anchor Pith review arXiv 2024
-
[13]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Layoutlmv3: Pre-training for document ai with unified text and image masking
Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. InProceedings of the 30th ACM international conference on multimedia, pages 4083–4091, 2022
work page 2022
-
[15]
Ocr-free document understanding transformer
Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. InEuropean Conference on Computer Vision, pages 498–517. Springer, 2022
work page 2022
-
[16]
Gon- zalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gon- zalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[17]
Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm
Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025. 26 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
-
[18]
Doctr: Document transformer for structured information extraction in documents
Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R Manmatha, and Vijay Mahadevan. Doctr: Document transformer for structured information extraction in documents. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19584–19594, 2023
work page 2023
-
[19]
Demiao Lin. Revolutionizing retrieval-augmented generation with enhanced pdf structure recognition.arXiv preprint arXiv:2401.12599, 2024
-
[20]
Hrvda: High-resolution visual document assistant
Chaohu Liu, Kun Yin, Haoyu Cao, Xinghua Jiang, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, and Linli Xu. Hrvda: High-resolution visual document assistant. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15534–15545, 2024
work page 2024
-
[21]
Hongen Liu, Cheng Cui, Yuning Du, Yi Liu, and Gang Pan. Pp-formulanet: Bridging accuracy and efficiency in advanced formula recognition.arXiv preprint arXiv:2503.18382, 2025
-
[22]
Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, et al. Points-reader: Distillation-free adaptation of vision-language models for document conversion.arXiv preprint arXiv:2509.01215, 2025
-
[23]
arXiv preprint arXiv:2403.04473 (2024) 1, 3, 4, 9
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document.arXiv preprint arXiv:2403.04473, 2024
-
[24]
Docling: An efficient open- source toolkit for ai-driven document conversion,
Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Ce- sar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for ai-driven document conversion.arXiv preprint arXiv:2501.17887, 2025
-
[25]
Optimized table tokenization for table structure recognition
Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, and Peter Staar. Optimized table tokenization for table structure recognition. InInternational Conference on Document Analysis and Recognition, pages 37–50. Springer, 2023
work page 2023
-
[26]
Souvik Mandalm. Nanonets-ocr-s. https://nanonets.com/research/nanonets-ocr-s/, 2025. Accessed:2025- 09-25
work page 2025
-
[27]
Mathpix.https://mathpix.com/, 2025
Mathpix. Mathpix.https://mathpix.com/, 2025. Accessed:2025-09-25
work page 2025
-
[28]
Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A Said Gurbuz, et al. Smoldocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion.arXiv preprint arXiv:2503.11576, 2025
-
[29]
Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, et al. Native visual understanding: Resolving resolution dilemmas in vision- language models.arXiv preprint arXiv:2506.12776, 2025
-
[30]
OpenDataLab. Pdf-extract-kit. https://github.com/opendatalab/PDF-Extract-Kit, 2025. Accessed:2025- 09-25
work page 2025
-
[31]
Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025
work page 2025
-
[32]
Marker.https://github.com/datalab-to/marker, 2025
Vik Paruchuri. Marker.https://github.com/datalab-to/marker, 2025. Accessed:2025-09-25
work page 2025
-
[33]
Surya: A lightweight document ocr and analysis toolkit
Vikas Paruchuri and Datalab Team. Surya: A lightweight document ocr and analysis toolkit. https: //github.com/VikParuchuri/surya, 2025. Accessed:2025-09-25
work page 2025
-
[34]
Doclaynet: A large human-annotated dataset for document-layout segmentation
Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter Staar. Doclaynet: A large human-annotated dataset for document-layout segmentation. InProceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3743–3751, 2022
work page 2022
-
[35]
Available: https://arxiv.org/abs/2502.18443
Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models.arXiv preprint arXiv:2502.18443, 2025. 27 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
-
[36]
Rapid table.https://github.com/RapidAI/RapidTable, 2024
RapidAI. Rapid table.https://github.com/RapidAI/RapidTable, 2024. Accessed: 2025-9-25
work page 2024
-
[37]
dots.ocr: Multilingual document layout parsing in a single vision-language model
rednote. dots.ocr: Multilingual document layout parsing in a single vision-language model. https://github. com/rednote-hilab/dots.ocr, 2025. Accessed:2025-09-25
work page 2025
-
[38]
Wenzhe Shi, Jose Caballero, Ferenc Husz ´ar, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolu- tional neural network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016
work page 2016
-
[39]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
work page 2024
-
[40]
Unifying vision, text, and layout for universal document processing
Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19254–19264, 2023
work page 2023
-
[41]
Mistral AI Team. Mistral-ocr. https://mistral.ai/news/mistral-ocr?utm source=ai-bot.cn, 2025. Accessed:2025-09-25
work page 2025
-
[42]
Qwen Team. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
Omniparser: A unified framework for text spotting key information extraction and table recognition
Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, and Zhibo Yang. Omniparser: A unified framework for text spotting key information extraction and table recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15641–15653, 2024
work page 2024
-
[44]
Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, et al. Yolov10: Real-time end-to-end object detection.Advances in Neural Information Processing Systems, 37:107984–108011, 2024
work page 2024
-
[45]
Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. Unimernet: A universal network for real-world mathematical expression recognition.arXiv preprint arXiv:2404.15254, 2024
-
[46]
MinerU: An Open-Source Solution for Precise Document Content Extraction
Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction.arXiv preprint arXiv:2409.18839, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Image over text: Transforming formula recognition evaluation with character detection matching
Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Botian Shi, Bo Zhang, and Conghui He. Image over text: Transforming formula recognition evaluation with character detection matching. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19681–19690, 2025
work page 2025
-
[48]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. Layoutreader: Pre-training of text and layout for reading order detection.arXiv preprint arXiv:2108.11591, 2021
-
[51]
Vrdu: A benchmark for visually-rich document understanding
Zilong Wang, Yichao Zhou, Wei Wei, Chen-Yu Lee, and Sandeep Tata. Vrdu: A benchmark for visually-rich document understanding. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5184–5193, 2023
work page 2023
-
[52]
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model.arXiv preprint arXiv:2409.01704, 2024
work page internal anchor Pith review arXiv 2024
-
[53]
Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, et al. Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy.arXiv preprint arXiv:2412.02210, 2024. 28 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Doc...
-
[54]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning D...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, and Wentao Zhang. Ocr hinders rag: Evaluating the cascading impact of ocr on retrieval-augmented generation.arXiv preprint arXiv:2412.02592, 2024
-
[57]
Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, and Wentao Zhang. Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction.arXiv preprint arXiv:2410.21169, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
Retrieval-Augmented Generation for AI-Generated Content: A Survey
Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey.arXiv preprint arXiv:2402.19473, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. Doclayout-yolo: Enhancing document layout anal- ysis through diverse synthetic data and global-to-local adaptive perception.arXiv preprint arXiv:2410.12628, 2024
-
[60]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37:62557–62583, 2024
work page 2024
-
[61]
Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong, and Nancy Xin Ru Wang. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 697–706, 2021
work page 2021
-
[62]
Image-based table recognition: data, model, and evaluation
Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. InEuropean conference on computer vision, pages 564–580. Springer, 2020
work page 2020
-
[63]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 29 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Appe...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.