PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Changda Zhou; Cheng Cui; Dianhai Yu; Hongen Liu; Jiaxuan Liu; Manhui Lin; Suyin Liang; Tingquan Gao; Ting Sun; Xueqing Wang

arxiv: 2601.21957 · v2 · submitted 2026-01-29 · 💻 cs.CV

Full author list: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

Pith reviewed 2026-05-16 09:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords: document parsing · vision-language model · OCR · multi-task · benchmark · robustness · compact VLM · seal recognition

The pith

PaddleOCR-VL-1.5 is a 0.9-billion-parameter model that reports 94.5 percent accuracy on OmniDocBench v1.5 and is built for in-the-wild document parsing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces PaddleOCR-VL-1.5, an upgraded 0.9B vision-language model designed for robust document parsing in real-world conditions. It reports a new state-of-the-art accuracy of 94.5 percent on the OmniDocBench v1.5 benchmark. The authors also propose Real5-OmniDocBench to test performance against physical distortions like skew, warping, and illumination changes from scanning and photography. The model extends to handle seal recognition and text spotting tasks while staying compact and efficient. A sympathetic reader would care because accurate document parsing from imperfect photos and scans is essential for digitization, archiving, and automation in many industries.

Core claim

PaddleOCR-VL-1.5 achieves a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5. The model is a multi-task 0.9B VLM that incorporates seal recognition and text spotting capabilities. To evaluate robustness, the Real5-OmniDocBench benchmark is introduced, covering distortions including scanning, skew, warping, screen-photography, and illumination. The enhanced model attains SOTA performance on this benchmark as well.
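To make two of the five distortion categories concrete, here is a minimal sketch of synthetic skew and uneven illumination applied to a grayscale image stored as a list of pixel rows. It is only an illustration: Real5-OmniDocBench is described as covering physical distortions from scanning and photography, and the function names `apply_skew` and `apply_illumination`, their parameters, and the synthetic approach itself are assumptions, not the authors' benchmark construction.

```python
import math


def apply_illumination(img, strength=0.5):
    """Darken a grayscale image linearly from the left edge to the right edge.

    Hypothetical helper: models a light source on the left of the page.
    """
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x, px in enumerate(row):
            # Gain falls from 1.0 at the left edge to (1 - strength) at the right.
            gain = 1.0 - strength * (x / max(width - 1, 1))
            new_row.append(int(round(px * gain)))
        out.append(new_row)
    return out


def apply_skew(img, angle_deg=5.0, fill=255):
    """Rotate a grayscale image about its center with nearest-neighbor sampling.

    Hypothetical helper: pixels that map outside the source get the fill value.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for each output pixel.
            dx, dy = x - cx, y - cy
            sx = cos_a * dx + sin_a * dy + cx
            sy = -sin_a * dx + cos_a * dy + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


if __name__ == "__main__":
    page = [[200, 200, 200], [200, 200, 200]]
    print(apply_illumination(page, strength=0.5))
    print(apply_skew(page, angle_deg=10.0))
```

Synthetic transforms like these are cheap to generate but may not match what a camera or scanner actually produces, which is the gap a physically captured benchmark is meant to close.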

What carries the argument

The 0.9B parameter multi-task vision-language model extended with seal recognition and text spotting tasks, benchmarked on OmniDocBench v1.5 and the proposed Real5-OmniDocBench for physical distortion robustness.

If this is right

The model maintains high accuracy on document parsing even with real-world physical distortions.
Incorporating seal recognition and text spotting broadens the applications without exceeding 0.9B parameters.
The compact size supports efficient deployment in various environments.
Strong results on the new benchmark indicate improved handling of practical document capture scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such compact multi-task models could enable real-time document processing on mobile devices.
Future work might test the model on a wider variety of document languages and formats to confirm broad applicability.
The benchmark creation process could be replicated for other vision tasks to standardize robustness evaluation.
Integration with existing OCR pipelines might accelerate adoption in industry settings.

Load-bearing premise

That the Real5-OmniDocBench benchmark sufficiently represents the diversity of real-world physical distortions in document images, and that benchmark performance predicts real deployment success.

What would settle it

Demonstrating that PaddleOCR-VL-1.5 underperforms competing models on a collection of actual user-submitted document photos with distortions not represented in Real5-OmniDocBench would challenge the claim.

Original abstract

We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A compact 0.9B document VLM with a new robustness benchmark, but claims rest on author-curated test data that needs close scrutiny.

The main point is PaddleOCR-VL-1.5, a 0.9B VLM that reports 94.5% SOTA on OmniDocBench v1.5 along with a new Real5-OmniDocBench for testing robustness to physical distortions and extensions to seal recognition and text spotting.

The paper does a solid job extending the model family to handle more document-related tasks while staying ultra-compact. Introducing a benchmark focused on scanning, skew, warping, screen-photography, and illumination fills a practical need for evaluating in-the-wild performance. For applications in document digitization, this combination of small size and robustness testing could be directly useful if the results hold up.

The soft spots are in the experimental details and benchmark independence. The authors created the benchmark themselves, which raises the usual questions about whether the distortion categories or document selection inadvertently favor their approach. The provided abstract gives no information on training data, evaluation protocols, or analysis of failures, making it tough to confirm the claims without the full paper. The stress-test concern about potential bias or leakage in the benchmark is reasonable and needs addressing.

This work is aimed at practitioners in computer vision and document processing who prioritize efficient models for real-world conditions. Readers interested in applied robustness or compact VLMs would find the benchmark and task extensions worth looking at. It deserves a serious referee to verify the data handling and benchmark validity. I would recommend sending it for peer review, with attention to the benchmark construction and any missing experimental specifics.

Referee Report

2 major / 2 minor

Summary. The paper introduces PaddleOCR-VL-1.5, a 0.9B-parameter vision-language model for multi-task document parsing that incorporates OCR, seal recognition, and text spotting. It claims a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5 and introduces the Real5-OmniDocBench benchmark to demonstrate robustness against five categories of real-world physical distortions (scanning, skew, warping, screen-photography, and illumination), reporting SOTA results on this new benchmark while maintaining high efficiency.

Significance. If the performance claims hold under detailed scrutiny, the work would demonstrate that ultra-compact VLMs can deliver robust in-the-wild document understanding, which is valuable for deployment on resource-constrained devices. The open release of code supports reproducibility. However, the absence of training data details, evaluation protocols, and independent validation currently limits the ability to assess whether the results represent a genuine advance or are tied to benchmark-specific choices.

major comments (2)

[Abstract and §4 (Experiments)] The central SOTA claim of 94.5% accuracy on OmniDocBench v1.5 is stated without any description of the training dataset, baseline models, evaluation protocol, or statistical significance testing, rendering the performance improvement unverifiable from the provided information.
[§3.2 and §4.3 (Real5-OmniDocBench)] The newly proposed benchmark's five distortion categories are presented as comprehensive for real-world conditions, yet no evidence is given of external cross-validation, held-out real-world corpora, or checks for train-test contamination, which is load-bearing for the robustness claim given that the benchmark is curated by the same authors.
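One simple form the train-test contamination check raised in the second major comment could take is an n-gram overlap test between benchmark ground-truth text and training text. The sketch below is an assumption for illustration only: the paper as summarized here describes no such procedure, and the helper names `shingles`, `jaccard`, and `flag_contaminated`, the 5-gram size, and the 0.8 threshold are all hypothetical choices.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams in a lowercased, punctuation-free text."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        # Short texts become a single shingle so they can still be compared.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_contaminated(train_texts, test_texts, n=5, threshold=0.8):
    """List (test_index, best_similarity) for test items that nearly match training text."""
    train_sets = [shingles(t, n) for t in train_texts]
    flagged = []
    for i, test in enumerate(test_texts):
        test_set = shingles(test, n)
        best = max((jaccard(test_set, s) for s in train_sets), default=0.0)
        if best >= threshold:
            flagged.append((i, best))
    return flagged


if __name__ == "__main__":
    train = ["The quick brown fox jumps over the lazy dog"]
    test = [
        "the quick brown fox jumps over the lazy dog!",
        "completely different invoice text with totals and dates",
    ]
    print(flag_contaminated(train, test))
```

A text-level check of this kind only catches documents whose transcription overlaps; for a benchmark built from photographed pages, image-level near-duplicate detection would likely be needed as well.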

minor comments (2)

[Abstract] The abstract mentions extension to seal recognition and text spotting but does not specify how these tasks are integrated into the multi-task training objective or evaluated separately.
[§2] Model size is repeatedly stated as 0.9B, but no breakdown of parameter allocation across vision encoder, language model, or task heads is provided to support the efficiency claims.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the verifiability and robustness claims in our manuscript. We will revise the paper to address these points by adding the requested details on training, evaluation, and benchmark construction.

Referee [Abstract and §4 (Experiments)]: The central SOTA claim of 94.5% accuracy on OmniDocBench v1.5 is stated without any description of the training dataset, baseline models, evaluation protocol, or statistical significance testing, rendering the performance improvement unverifiable from the provided information.

Authors: We agree that the manuscript currently provides insufficient detail to fully verify the 94.5% SOTA claim. In the revised version, we will expand §4 with: (1) a complete description of the training dataset composition and sources, (2) explicit listing of all baseline models and their configurations, (3) the full evaluation protocol including metrics, data splits, and preprocessing steps, and (4) statistical significance testing (e.g., bootstrap confidence intervals and paired comparisons) to substantiate the performance gains. These additions will be placed in both the main text and supplementary material.
Revision: yes

Referee [§3.2 and §4.3 (Real5-OmniDocBench)]: The newly proposed benchmark's five distortion categories are presented as comprehensive for real-world conditions, yet no evidence is given of external cross-validation, held-out real-world corpora, or checks for train-test contamination, which is load-bearing for the robustness claim given that the benchmark is curated by the same authors.

Authors: We acknowledge the concern regarding potential bias in a self-curated benchmark. In the revision, we will add to §3.2 and §4.3: detailed construction methodology for each of the five distortion categories, explicit checks for train-test contamination with our training data, and any internal cross-validation performed. We will also include a limitations discussion and examples from additional held-out real-world images. Full external independent validation is beyond the scope of a single paper but will be facilitated by open-sourcing the benchmark.
Revision: partial
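The bootstrap confidence intervals and paired comparisons named in the first response could be sketched as below. This is a generic paired bootstrap over per-document scores, not the authors' protocol: the function name `paired_bootstrap_ci`, the resample count, the seed, and the assumption that both models are scored on the same documents in the same order are all illustrative.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-document score difference (A - B).

    scores_a and scores_b must be aligned: index i is the same document
    scored by model A and by model B.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample documents with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    # Made-up per-document scores, for demonstration only.
    model_a = [0.96, 0.91, 0.99, 0.88, 0.95, 0.93]
    model_b = [0.94, 0.90, 0.97, 0.89, 0.92, 0.93]
    mean_diff, lower, upper = paired_bootstrap_ci(model_a, model_b)
    print(f"mean difference {mean_diff:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the gap between the two models is unlikely to be an artifact of which documents happened to be in the test set; an interval that straddles zero would suggest the headline difference is within noise.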

standing simulated objections not resolved

Independent external cross-validation of Real5-OmniDocBench by third-party researchers, which cannot be provided by the authors at this stage.

Circularity Check

0 steps flagged

No circularity: SOTA claims rest on standard benchmark evaluation without self-referential reduction.

full rationale

The paper reports empirical accuracy (94.5% on OmniDocBench v1.5) and robustness on a newly proposed Real5-OmniDocBench benchmark. No equations, derivations, or predictions are present that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Model training and multi-task extension follow conventional supervised learning; benchmark scores are externally falsifiable and do not loop back to the paper's own inputs. This is the normal non-circular case for an empirical VLM paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning assumptions about benchmark validity and the transferability of test-set accuracy to real deployments; no new entities are postulated.

free parameters (1)

model size = 0.9B
The 0.9B parameter count is a deliberate design choice for efficiency.

axioms (1)

domain assumption: Standard OCR and document parsing benchmarks measure meaningful real-world capability
The SOTA claims depend on OmniDocBench v1.5 and the new Real5-OmniDocBench being representative.

pith-pipeline@v0.9.0 · 5476 in / 1386 out tokens · 47768 ms · 2026-05-16T09:36:42.769153+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Paper passage: "We introduce PP-DocLayoutV3... mask-based detection head... Global Pointer Mechanism... Voting-based Ranking strategy"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "Distortion-Aware Data Augmentation pipeline... simulates complex physical deformations"

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
cs.CV · 2026-05 · conditional · novelty 8.0
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing
cs.AI · 2026-05 · unverdicted · novelty 7.0
MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.

GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
cs.CL · 2026-04 · unverdicted · novelty 7.0
GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
cs.CV · 2026-04 · unverdicted · novelty 7.0
A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing
cs.CV · 2026-04 · unverdicted · novelty 6.0
A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
cs.CV · 2026-05 · unverdicted · novelty 5.0
LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR · 2026-05 · unverdicted · novelty 4.0
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR · 2026-05 · unverdicted · novelty 4.0
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.


[1] [1]

arXiv preprint arXiv:2506.05218 , year=

Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. Monkeyocr: Document parsing with a structure- recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025. 15

work page arXiv 2025

[2] [2]

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. Mineru2. 5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186, 2025

work page internal anchor Pith review arXiv 2025

[3] [3]

Dolphin: Document image parsing via heterogeneous anchor prompting.arXiv preprint arXiv:2505.14059, 2025

Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. arXiv preprint arXiv:2505.14059, 2025

work page arXiv 2025

[4] [4]

arXiv preprint arXiv:2509.01215 , year=

Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, et al. Points-reader: Distillation-free adaptation of vision-language models for document conversion. arXiv preprint arXiv:2509.01215, 2025

work page arXiv 2025

[5] [5]

Ernie 4.5 technical report, 2025

Baidu-ERNIE-Team. Ernie 4.5 technical report, 2025

work page 2025

[6] [6]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Retrieval- augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval- augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020

[9] [9]

Paddleocr-vl: Boosting multilingual document parsing via a 0.9 b ultra-compact vision-language model.arXiv preprint arXiv:2510.14528, 2025

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. Paddleocr-vl: Boosting multilin- gual document parsing via a 0.9 b ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025

work page arXiv 2025

[10] [10]

DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, et al. MonkeyOCR v1.5 technical report: Unlocking robust document parsing for complex patterns. arXiv preprint arXiv:2511.10390, 2025

[12] Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, et al. HunyuanOCR technical report. arXiv preprint arXiv:2511.19575, 2025

[13] Ting Sun, Cheng Cui, Yuning Du, and Yi Liu. PP-DocLayout: A unified document layout detection model to accelerate large-scale data construction. arXiv preprint arXiv:2503.17213, 2025

[14] Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025

[15] Google DeepMind. Gemini 3.0. https://blog.google/products-and-platforms/products/gemini/gemini-3-collection/, 2025

[16] Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16965–16974, 2024

[17] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’ Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023

[18] PaddlePaddle Authors. PaddleFormers. https://github.com/PaddlePaddle/PaddleFormers, 2025

[19] Yanjun Ma, Dianhai Yu, Tian Wu, and Haifeng Wang. PaddlePaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Computing, 1(1):105–115, 2019

[20] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[22] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025

[23] Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Botian Shi, Bo Zhang, and Conghui He. Image over text: Transforming formula recognition evaluation with character detection matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19681–19690, June 2025

[24] Vik Paruchuri. Marker. https://github.com/datalab-to/marker, 2025. Accessed: 2025-09-25

[25] opendatalab. MinerU2.0-2505-0.9B. https://huggingface.co/opendatalab/MinerU2.0-2505-0.9B, 2025

[26] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

[27] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

[28] OpenAI. GPT-5.2 system card, 2025. URL https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf

[29] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

[30] Google DeepMind. Gemini 2.5. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025

[31] chatdoc-com. OCRFlux. https://github.com/chatdoc-com/OCRFlux, 2025. Accessed: 2025-09-25

[32] Mistral AI Team. Mistral-OCR. https://mistral.ai/news/mistral-ocr?utm_source=ai-bot.cn, 2025

[33] Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443, 2025

[34] Souvik Mandal, Ashish Talewar, Paras Ahuja, and Prathamesh Juvatkar. Nanonets-OCR-s: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging, 2025

[35] rednote-hilab. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025

[36] Changda Zhou, Ziyue Gao, Xueqing Wang, Tingquan Gao, Cheng Cui, Jing Tang, and Yi Liu. Real5-OmniDocBench: A full-scale physical reconstruction benchmark for robust document parsing in the wild, 2026. URL https://arxiv.org/abs/2603.04205

[37] Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction. arXiv preprint arXiv:2510.12798, 2025

[38] PaddlePaddle Authors. FastDeploy. https://github.com/PaddlePaddle/FastDeploy, 2025

[39] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

[40] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024

Appendix A. Comparison of PaddleOCR-VL-1.5 and 1.0 Mode...