PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
Pith reviewed 2026-05-16 09:36 UTC · model grok-4.3
The pith
PaddleOCR-VL-1.5 is a 0.9 billion parameter model that achieves 94.5 percent accuracy on OmniDocBench v1.5 for in-the-wild document parsing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaddleOCR-VL-1.5 achieves a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5. The model is a multi-task 0.9B VLM that incorporates seal recognition and text spotting capabilities. To evaluate robustness, the Real5-OmniDocBench benchmark is introduced, covering distortions including scanning, skew, warping, screen-photography, and illumination. The enhanced model attains SOTA performance on this benchmark as well.
What carries the argument
The 0.9B parameter multi-task vision-language model extended with seal recognition and text spotting tasks, benchmarked on OmniDocBench v1.5 and the proposed Real5-OmniDocBench for physical distortion robustness.
If this is right
- The model maintains high accuracy on document parsing even with real-world physical distortions.
- Incorporating seal recognition and text spotting broadens the range of applications without exceeding 0.9B parameters.
- The compact size supports efficient deployment in various environments.
- Strong results on the new benchmark indicate improved handling of practical document capture scenarios.
Where Pith is reading between the lines
- Such compact multi-task models could enable real-time document processing on mobile devices.
- Future work might test the model on a wider variety of document languages and formats to confirm broad applicability.
- The benchmark creation process could be replicated for other vision tasks to standardize robustness evaluation.
- Integration with existing OCR pipelines might accelerate adoption in industry settings.
Load-bearing premise
That the Real5-OmniDocBench benchmark sufficiently represents the diversity of real-world physical distortions in document images, and that benchmark performance predicts real deployment success.
What would settle it
Demonstrating that PaddleOCR-VL-1.5 underperforms competing models on a collection of actual user-submitted document photos with distortions not represented in Real5-OmniDocBench would challenge the claim.
Original abstract
We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaddleOCR-VL-1.5, a 0.9B-parameter vision-language model for multi-task document parsing that incorporates OCR, seal recognition, and text spotting. It claims a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5 and introduces the Real5-OmniDocBench benchmark to demonstrate robustness against five categories of real-world physical distortions (scanning, skew, warping, screen-photography, and illumination), reporting SOTA results on this new benchmark while maintaining high efficiency.
Significance. If the performance claims hold under detailed scrutiny, the work would demonstrate that ultra-compact VLMs can deliver robust in-the-wild document understanding, which is valuable for deployment on resource-constrained devices. The open release of code supports reproducibility. However, the absence of training data details, evaluation protocols, and independent validation currently limits the ability to assess whether the results represent a genuine advance or are tied to benchmark-specific choices.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The central SOTA claim of 94.5% accuracy on OmniDocBench v1.5 is stated without any description of the training dataset, baseline models, evaluation protocol, or statistical significance testing, rendering the performance improvement unverifiable from the provided information.
- [§3.2 and §4.3] §3.2 and §4.3 (Real5-OmniDocBench): The newly proposed benchmark's five distortion categories are presented as comprehensive for real-world conditions, yet no evidence is given of external cross-validation, held-out real-world corpora, or checks for train-test contamination, which is load-bearing for the robustness claim given that the benchmark is curated by the same authors.
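One way the train-test contamination check requested in the second major comment could be approached is sketched below. This is an illustration only, not the authors' procedure: it assumes page-level text is available for both training and benchmark pages, and the function names, the character 5-gram shingling, and the 0.8 Jaccard threshold are all hypothetical choices.

```python
import re


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams of a lowercased, whitespace-normalized text."""
    norm = re.sub(r"\s+", " ", text.lower()).strip()
    if len(norm) < n:
        return {norm} if norm else set()
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_contaminated(train_pages: dict[str, str],
                      test_pages: dict[str, str],
                      threshold: float = 0.8,  # hypothetical cut-off
                      n: int = 5) -> list[tuple[str, str, float]]:
    """Return (test_id, train_id, score) for every pair at or above threshold."""
    train_sh = {tid: shingles(text, n) for tid, text in train_pages.items()}
    hits = []
    for sid, text in test_pages.items():
        s = shingles(text, n)
        for tid, t in train_sh.items():
            score = jaccard(s, t)
            if score >= threshold:
                hits.append((sid, tid, score))
    return hits
```

The all-pairs loop is quadratic in the number of pages, so a corpus-scale check would likely need hashing or an index. Text overlap also says nothing about image-level duplicates, which would need a separate check.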
minor comments (2)
- [Abstract] The abstract mentions extension to seal recognition and text spotting but does not specify how these tasks are integrated into the multi-task training objective or evaluated separately.
- [§2] Model size is repeatedly stated as 0.9B, but no breakdown of parameter allocation across vision encoder, language model, or task heads is provided to support the efficiency claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for improving the verifiability and robustness claims in our manuscript. We will revise the paper to address these points by adding the requested details on training, evaluation, and benchmark construction.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experiments): The central SOTA claim of 94.5% accuracy on OmniDocBench v1.5 is stated without any description of the training dataset, baseline models, evaluation protocol, or statistical significance testing, rendering the performance improvement unverifiable from the provided information.
  Authors: We agree that the manuscript currently provides insufficient detail to fully verify the 94.5% SOTA claim. In the revised version, we will expand §4 with: (1) a complete description of the training dataset composition and sources, (2) explicit listing of all baseline models and their configurations, (3) the full evaluation protocol including metrics, data splits, and preprocessing steps, and (4) statistical significance testing (e.g., bootstrap confidence intervals and paired comparisons) to substantiate the performance gains. These additions will be placed in both the main text and supplementary material.
  Revision: yes
- Referee: [§3.2 and §4.3] §3.2 and §4.3 (Real5-OmniDocBench): The newly proposed benchmark's five distortion categories are presented as comprehensive for real-world conditions, yet no evidence is given of external cross-validation, held-out real-world corpora, or checks for train-test contamination, which is load-bearing for the robustness claim given that the benchmark is curated by the same authors.
  Authors: We acknowledge the concern regarding potential bias in a self-curated benchmark. In the revision, we will add to §3.2 and §4.3: detailed construction methodology for each of the five distortion categories, explicit checks for train-test contamination with our training data, and any internal cross-validation performed. We will also include a limitations discussion and examples from additional held-out real-world images. Full external independent validation is beyond the scope of a single paper but will be facilitated by open-sourcing the benchmark.
  Revision: partial
- Independent external cross-validation of Real5-OmniDocBench by third-party researchers, which cannot be provided by the authors at this stage.
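The bootstrap confidence intervals and paired comparisons mentioned in the first response could be computed along the following lines. This is a minimal sketch, assuming per-page scores are available for two models on the same benchmark pages; the function name, the percentile method, and the default resample count are illustrative assumptions, not details taken from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-page score difference (A - B).

    scores_a[i] and scores_b[i] must be scores of the two models on the
    same page i, so resampling pages keeps the pairing intact.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample pages with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi
```

If the resulting interval excludes zero, the gap between the two models is unlikely to be an artifact of which pages happened to be sampled. It does not address whether the benchmark itself is representative.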
Circularity Check
No circularity: SOTA claims rest on standard benchmark evaluation without self-referential reduction.
Full rationale
The paper reports empirical accuracy (94.5% on OmniDocBench v1.5) and robustness on a newly proposed Real5-OmniDocBench benchmark. No equations, derivations, or predictions are present that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Model training and multi-task extension follow conventional supervised learning; benchmark scores are externally falsifiable and do not loop back to the paper's own inputs. This is the normal non-circular case for an empirical VLM paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- model size = 0.9B
axioms (1)
- domain assumption: Standard OCR and document parsing benchmarks measure meaningful real-world capability
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce PP-DocLayoutV3... mask-based detection head... Global Pointer Mechanism... Voting-based Ranking strategy
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Distortion-Aware Data Augmentation pipeline... simulates complex physical deformations
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 8 Pith papers
- How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
  PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
- MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing
  MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.
- GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
  GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.
- MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
  A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
- Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
  A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
- LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
  LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
- JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
  JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
- JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
  JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.