MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Pith reviewed 2026-05-10 18:53 UTC · model grok-4.3
The pith
MinerU2.5-Pro shows that data engineering alone can push a 1.2B document parser past all larger models to 95.69 on OmniDocBench v1.6.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-of-the-art document parsing models of many architectures and sizes show highly consistent failure patterns on the same hard samples, which indicates that the bottleneck is shared deficiencies in training data rather than architectural differences. MinerU2.5-Pro keeps its 1.2B-parameter architecture unchanged and advances performance purely through data engineering: Diversity-and-Difficulty-Aware Sampling scales the data from under 10M to 65.5M samples while mitigating distribution shift; Cross-Model Consistency Verification generates reliable annotations from model consensus; and the Judge-and-Refine pipeline corrects hard-sample labels via render-then-verify iteration. Combined with a three-stage progressive training strategy of large-scale pre-training, hard-sample fine-tuning, and GRPO alignment, these data improvements lift the model to 95.69 on OmniDocBench v1.6.
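The consensus mechanism behind Cross-Model Consistency Verification can be pictured with a small sketch. The similarity measure, the 0.9 threshold, and the `consistency_verify` helper below are illustrative assumptions, not the authors' implementation:

```python
from difflib import SequenceMatcher


def agreement(a: str, b: str) -> float:
    """Pairwise similarity between two model outputs, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def consistency_verify(outputs, easy_thresh=0.9):
    """Cross-model consistency check over heterogeneous model outputs.

    High-agreement samples receive a consensus annotation (the output
    most similar to all others); low-agreement samples are flagged
    'hard' for downstream refinement instead of being auto-labeled.
    """
    n = len(outputs)
    # Mean pairwise agreement of each output against the rest.
    scores = [
        sum(agreement(outputs[i], outputs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    best = max(range(n), key=lambda i: scores[i])
    if scores[best] >= easy_thresh:
        return "easy", outputs[best]
    return "hard", None
```

In the paper's pipeline the "hard" branch feeds the Judge-and-Refine stage rather than being discarded; here it simply returns no annotation.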
What carries the argument
The Data Engine, built around Diversity-and-Difficulty-Aware Sampling, Cross-Model Consistency Verification, and the Judge-and-Refine pipeline, together with a three-stage progressive training strategy of large-scale pre-training, hard-sample fine-tuning, and GRPO alignment.
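As a rough illustration of what difficulty-weighted, diversity-balanced sampling could look like: the domain tags, the weighting scheme, and the `dd_aware_sample` helper are hypothetical, sketched only to show how diversity and difficulty can be balanced in one draw:

```python
import random


def dd_aware_sample(pool, k, seed=0, floor=0.1):
    """Diversity-and-Difficulty-Aware Sampling, illustrative sketch.

    pool: list of dicts with a 'domain' tag and a 'difficulty' score
    in [0, 1]. Draws are spread evenly across domains (diversity) and,
    within each domain, weighted toward harder samples (difficulty).
    `floor` keeps easy samples from vanishing from the mix entirely.
    """
    rng = random.Random(seed)
    by_domain = {}
    for s in pool:
        by_domain.setdefault(s["domain"], []).append(s)
    per_domain = max(1, k // len(by_domain))
    picked = []
    for domain in sorted(by_domain):
        cand = by_domain[domain]
        weights = [floor + s["difficulty"] for s in cand]
        # Sampling with replacement; a production pipeline would dedupe.
        picked.extend(rng.choices(cand, weights=weights, k=per_domain))
    return picked
```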
If this is right
- Data engineering and staged training can deliver larger gains than increasing model size in document parsing.
- A revised OmniDocBench v1.6 with corrected element-matching biases and a dedicated Hard subset gives a more reliable evaluation of progress on difficult cases.
- The same data engine and progressive training approach can be applied to other models without changing their architectures.
- Consistent failure patterns across models imply that data improvements transfer across different architectures and scales.
- Performance above 95 on the benchmark becomes achievable without models exceeding 1.2B parameters.
Where Pith is reading between the lines
- If data deficiencies explain most failures across models, then shared public datasets of verified hard samples could accelerate progress for the whole field.
- The method suggests that computational budgets might shift from training ever-larger models toward curation and verification of training data.
- Similar cross-model verification and refinement steps could improve training data quality in other vision-language tasks that exhibit overlapping error patterns.
- Future benchmarks may need to prioritize hard-sample coverage and annotation accuracy to drive further data-centric advances.
Load-bearing premise
The Cross-Model Consistency Verification and Judge-and-Refine pipeline produce unbiased, high-accuracy annotations for hard samples without introducing systematic selection biases or errors that affect the reported benchmark gains.
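The render-then-verify loop this premise depends on can be sketched abstractly. `render`, `verify`, and `refine` are placeholder callables, and the pass threshold and iteration cap are assumptions, not values from the paper:

```python
def judge_and_refine(annotation, render, verify, refine,
                     max_iters=3, pass_score=0.95):
    """Render-then-verify iterative correction, illustrative sketch.

    render:  annotation -> rendered page (any comparable form)
    verify:  rendered page -> quality score in [0, 1] from a judge
    refine:  (annotation, score) -> corrected annotation
    Stops as soon as the rendered annotation passes the judge.
    """
    for _ in range(max_iters):
        score = verify(render(annotation))
        if score >= pass_score:
            return annotation, score
        annotation = refine(annotation, score)
    return annotation, verify(render(annotation))
```

The referee's worry maps directly onto this sketch: if `verify` shares biases with the models that produced `annotation`, the loop converges to a consistent but possibly wrong label.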
What would settle it
Re-training the original baseline model on the new 65.5M-sample dataset with the three-stage strategy fails to reproduce the 2.71-point gain on OmniDocBench v1.6, or the set of consistently hard samples shifts after retraining.
read the original abstract
Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architectural differences. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art purely through data engineering and training strategy design while retaining the 1.2B-parameter architecture of MinerU2.5 unchanged. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while mitigating distribution shift; Cross-Model Consistency Verification leverages output consensus among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy--large-scale pre-training, hard sample fine-tuning, and GRPO alignment--sequentially exploits these data at different quality tiers. On the evaluation front, we rectify element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including those based on models with over 200x more parameters.
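The abstract's GRPO stage refers to Group Relative Policy Optimization, which standardizes each sampled output's reward against its group rather than using a learned value function. A minimal sketch of that advantage computation, not the training code itself:

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO, illustrative sketch.

    For a group of outputs sampled for the same prompt, each output's
    advantage is its reward standardized against the group:
        A_i = (r_i - mean(r)) / std(r)
    """
    n = len(rewards)
    mu = sum(rewards) / n
    var = sum((r - mu) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        # All rewards equal: no learning signal for this group.
        return [0.0] * n
    return [(r - mu) / std for r in rewards]
```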
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that document parsing performance is limited by training data deficiencies rather than architecture, as evidenced by consistent failure patterns across diverse models. It introduces MinerU2.5-Pro, which retains the 1.2B-parameter MinerU2.5 architecture but uses a co-designed Data Engine (Diversity-and-Difficulty-Aware Sampling to reach 65.5M samples, Cross-Model Consistency Verification for difficulty and annotations, and Judge-and-Refine for hard-sample correction) plus a three-stage training strategy (pre-training, hard-sample fine-tuning, GRPO alignment) to achieve 95.69 on the authors' rectified OmniDocBench v1.6, a 2.71-point gain over the identical-architecture baseline and surpassing models with >200x parameters.
Significance. If the gains are attributable solely to the data strategies, this is a notable demonstration that systematic data engineering can outperform architectural scaling in document parsing, where models share failure modes on hard samples. The observation of cross-model consistency on difficult examples and the scale of the 65.5M-sample dataset with progressive training tiers are concrete strengths that could shift research emphasis toward data curation pipelines.
major comments (3)
- [OmniDocBench v1.6 protocol] OmniDocBench v1.6 protocol (abstract and evaluation description): the rectification of element-matching biases from v1.5 and creation of the Hard subset must be shown to be independent of the Cross-Model Consistency Verification pipeline, because both the training annotations for hard samples and the benchmark changes rely on model consensus; any shared error patterns would directly inflate the reported 2.71-point gain on the Hard subset.
- [Data Engine] Data Engine description (abstract): no ablation results isolate the contribution of Diversity-and-Difficulty-Aware Sampling, Cross-Model Consistency Verification, or Judge-and-Refine to the 2.71-point improvement; without these controls it remains unclear whether the gains stem from the claimed data strategies or from unstated factors such as training schedule changes.
- [Three-stage training strategy] Three-stage training strategy (abstract): the paper provides no error analysis, human audit, or cross-validation of the Judge-and-Refine annotations against an independent source, leaving open the possibility that correlated model biases on hard samples are reinforced in both the 65.5M training set and the test protocol.
minor comments (2)
- [Data Engine] The abstract states expansion 'from under 10M to 65.5M samples' but does not specify the exact sources, diversity metrics, or difficulty thresholds used in sampling; these implementation details should be added for reproducibility.
- [Training strategy] GRPO alignment hyperparameters are listed as free parameters in the approach but no values or selection procedure are provided; this should be clarified in the training section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, agreeing that further clarifications and analyses will improve the manuscript. We will incorporate the suggested revisions in the next version.
read point-by-point responses
-
Referee: [OmniDocBench v1.6 protocol] OmniDocBench v1.6 protocol (abstract and evaluation description): the rectification of element-matching biases from v1.5 and creation of the Hard subset must be shown to be independent of the Cross-Model Consistency Verification pipeline, because both the training annotations for hard samples and the benchmark changes rely on model consensus; any shared error patterns would directly inflate the reported 2.71-point gain on the Hard subset.
Authors: We acknowledge the need to explicitly demonstrate independence to avoid any perception of circularity. The OmniDocBench v1.6 rectification was based on statistical analysis of element-matching discrepancies in v1.5 outputs combined with manual review of sampled cases, using a distinct collection of models and evaluation scripts separate from the Cross-Model Consistency Verification models employed for training data curation. The Hard subset threshold was derived from aggregate failure rates across a wide range of publicly available models not involved in our Data Engine. In the revised manuscript we will add a dedicated subsection in the evaluation protocol description that details the exact models, scripts, and manual audit procedures used for benchmark updates, along with a comparison showing that performance gains remain consistent when evaluated against the original v1.5 protocol on the same test samples. revision: yes
-
Referee: [Data Engine] Data Engine description (abstract): no ablation results isolate the contribution of Diversity-and-Difficulty-Aware Sampling, Cross-Model Consistency Verification, or Judge-and-Refine to the 2.71-point improvement; without these controls it remains unclear whether the gains stem from the claimed data strategies or from unstated factors such as training schedule changes.
Authors: We agree that component-wise ablations are necessary to isolate the contributions of each Data Engine module. Although the current manuscript emphasizes end-to-end results, we will add a new ablation study subsection. This will report three controlled experiments that reuse the identical three-stage training schedule and hyper-parameters: (i) baseline with only Diversity-and-Difficulty-Aware Sampling, (ii) addition of Cross-Model Consistency Verification, and (iii) full pipeline including Judge-and-Refine. Incremental accuracy deltas on OmniDocBench v1.6 will be presented to quantify the marginal benefit of each strategy while holding training schedule constant. revision: yes
-
Referee: [Three-stage training strategy] Three-stage training strategy (abstract): the paper provides no error analysis, human audit, or cross-validation of the Judge-and-Refine annotations against an independent source, leaving open the possibility that correlated model biases on hard samples are reinforced in both the 65.5M training set and the test protocol.
Authors: This concern about potential bias reinforcement is well-taken. We will expand the training strategy section with a human audit of 2,000 randomly sampled Judge-and-Refine outputs, reporting inter-annotator agreement (Cohen's kappa) and error-type breakdown. We will also add cross-validation results comparing the refined annotations against an independent held-out model ensemble and a small manually annotated reference set. The revised text will include quantitative error analysis demonstrating the reduction in annotation inconsistencies for hard samples and will discuss the use of heterogeneous model ensembles within Cross-Model Consistency Verification as a safeguard against correlated biases. revision: yes
Circularity Check
No significant circularity in the derivation chain.
full rationale
The paper attributes its 2.71-point gain on OmniDocBench v1.6 solely to data curation (Diversity-and-Difficulty-Aware Sampling, Cross-Model Consistency Verification, Judge-and-Refine) and a three-stage training schedule applied to an unchanged 1.2B architecture. These steps are described as independent engineering choices whose outputs are evaluated against an explicitly rectified external benchmark protocol; no equations, fitted parameters, or self-definitions reduce the reported scores to the inputs by construction. Benchmark rectification and hard-sample selection are presented as separate processes from the training pipeline, with no load-bearing self-citation chain or ansatz that forces the result. The comparison to the same-architecture baseline remains falsifiable on the shared protocol.
Axiom & Free-Parameter Ledger
free parameters (2)
- Diversity-and-difficulty sampling thresholds and weights
- GRPO alignment hyperparameters
axioms (2)
- domain assumption: Output consensus among heterogeneous models reliably indicates sample difficulty and produces accurate annotations
- domain assumption: Render-then-verify iterative correction improves annotation quality for hard samples without introducing new systematic errors
Forward citations
Cited by 2 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.