
arxiv: 2601.21957 · v2 · submitted 2026-01-29 · 💻 cs.CV

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Pith reviewed 2026-05-16 09:36 UTC · model grok-4.3

keywords: document parsing, vision-language model, OCR, multi-task, benchmark, robustness, compact VLM, seal recognition

The pith

PaddleOCR-VL-1.5 is a 0.9 billion parameter model that achieves 94.5 percent accuracy on OmniDocBench v1.5 for in-the-wild document parsing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces PaddleOCR-VL-1.5, an upgraded 0.9B vision-language model designed for robust document parsing in real-world conditions. It reports a new state-of-the-art accuracy of 94.5 percent on the OmniDocBench v1.5 benchmark. The authors also propose Real5-OmniDocBench to test performance against physical distortions like skew, warping, and illumination changes from scanning and photography. The model extends to handle seal recognition and text spotting tasks while staying compact and efficient. A sympathetic reader would care because accurate document parsing from imperfect photos and scans is essential for digitization, archiving, and automation in many industries.

Core claim

PaddleOCR-VL-1.5 achieves a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5. The model is a multi-task 0.9B VLM that incorporates seal recognition and text spotting capabilities. To evaluate robustness, the Real5-OmniDocBench benchmark is introduced, covering distortions including scanning, skew, warping, screen-photography, and illumination. The enhanced model attains SOTA performance on this benchmark as well.

What carries the argument

The 0.9B parameter multi-task vision-language model extended with seal recognition and text spotting tasks, benchmarked on OmniDocBench v1.5 and the proposed Real5-OmniDocBench for physical distortion robustness.

If this is right

  • The model maintains high accuracy on document parsing even with real-world physical distortions.
  • Incorporating seal recognition and text spotting broadens the applications without exceeding 0.9B parameters.
  • The compact size supports efficient deployment in various environments.
  • Strong results on the new benchmark indicate improved handling of practical document capture scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such compact multi-task models could enable real-time document processing on mobile devices.
  • Future work might test the model on a wider variety of document languages and formats to confirm broad applicability.
  • The benchmark creation process could be replicated for other vision tasks to standardize robustness evaluation.
  • Integration with existing OCR pipelines might accelerate adoption in industry settings.

Load-bearing premise

That the Real5-OmniDocBench benchmark sufficiently represents the diversity of real-world physical distortions in document images, and that benchmark performance predicts real deployment success.

What would settle it

Demonstrating that PaddleOCR-VL-1.5 underperforms competing models on a collection of actual user-submitted document photos with distortions not represented in Real5-OmniDocBench would challenge the claim.

Original abstract

We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PaddleOCR-VL-1.5, a 0.9B-parameter vision-language model for multi-task document parsing that incorporates OCR, seal recognition, and text spotting. It claims a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5 and introduces the Real5-OmniDocBench benchmark to demonstrate robustness against five categories of real-world physical distortions (scanning, skew, warping, screen-photography, and illumination), reporting SOTA results on this new benchmark while maintaining high efficiency.

Significance. If the performance claims hold under detailed scrutiny, the work would demonstrate that ultra-compact VLMs can deliver robust in-the-wild document understanding, which is valuable for deployment on resource-constrained devices. The open release of code supports reproducibility. However, the absence of training data details, evaluation protocols, and independent validation currently limits the ability to assess whether the results represent a genuine advance or are tied to benchmark-specific choices.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central SOTA claim of 94.5% accuracy on OmniDocBench v1.5 is stated without any description of the training dataset, baseline models, evaluation protocol, or statistical significance testing, rendering the performance improvement unverifiable from the provided information.
  2. [§3.2 and §4.3 (Real5-OmniDocBench)] The newly proposed benchmark's five distortion categories are presented as comprehensive for real-world conditions, yet no evidence is given of external cross-validation, held-out real-world corpora, or checks for train-test contamination. These are load-bearing for the robustness claim, given that the benchmark is curated by the same authors.
minor comments (2)
  1. [Abstract] The abstract mentions extension to seal recognition and text spotting but does not specify how these tasks are integrated into the multi-task training objective or evaluated separately.
  2. [§2] Model size is repeatedly stated as 0.9B, but no breakdown of parameter allocation across vision encoder, language model, or task heads is provided to support the efficiency claims.
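The train-test contamination check requested in major comment 2 could be approached in several ways; the paper does not describe one. The following is a minimal sketch, assuming ground-truth transcripts are available for both benchmark and training pages. It flags pairs whose character n-gram Jaccard similarity exceeds a threshold. The function names, the 8-character n-gram size, and the 0.8 threshold are all illustrative assumptions, not values taken from the paper.

```python
def char_ngrams(text, n=8):
    """Return the set of character n-grams of whitespace-normalized, lowercased text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_contaminated(test_pages, train_pages, threshold=0.8, n=8):
    """Return (test_id, train_id, score) for every pair at or above the threshold.

    test_pages and train_pages map a page id to its transcript (hypothetical inputs).
    This compares every pair, so it is quadratic and only suitable for small sets.
    """
    train_sets = {pid: char_ngrams(text, n) for pid, text in train_pages.items()}
    flagged = []
    for test_id, text in test_pages.items():
        grams = char_ngrams(text, n)
        for train_id, train_grams in train_sets.items():
            score = jaccard(grams, train_grams)
            if score >= threshold:
                flagged.append((test_id, train_id, score))
    return flagged


# Toy usage with made-up transcripts.
test_pages = {
    "t1": "Invoice number 12345 issued by Example Corp",
    "t2": "completely different content here",
}
train_pages = {"a": "invoice  number 12345 issued by example corp"}
flagged = flag_contaminated(test_pages, train_pages)
```

A text-level comparison is used here because photographed or warped copies of a page would not match on pixel-level hashes; at corpus scale an approximate method such as MinHash would likely be needed instead of the all-pairs loop.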

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the verifiability and robustness claims in our manuscript. We will revise the paper to address these points by adding the requested details on training, evaluation, and benchmark construction.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central SOTA claim of 94.5% accuracy on OmniDocBench v1.5 is stated without any description of the training dataset, baseline models, evaluation protocol, or statistical significance testing, rendering the performance improvement unverifiable from the provided information.

    Authors: We agree that the manuscript currently provides insufficient detail to fully verify the 94.5% SOTA claim. In the revised version, we will expand §4 with: (1) a complete description of the training dataset composition and sources, (2) explicit listing of all baseline models and their configurations, (3) the full evaluation protocol including metrics, data splits, and preprocessing steps, and (4) statistical significance testing (e.g., bootstrap confidence intervals and paired comparisons) to substantiate the performance gains. These additions will be placed in both the main text and supplementary material. revision: yes

  2. Referee: [§3.2 and §4.3 (Real5-OmniDocBench)] The newly proposed benchmark's five distortion categories are presented as comprehensive for real-world conditions, yet no evidence is given of external cross-validation, held-out real-world corpora, or checks for train-test contamination. These are load-bearing for the robustness claim, given that the benchmark is curated by the same authors.

    Authors: We acknowledge the concern regarding potential bias in a self-curated benchmark. In the revision, we will add to §3.2 and §4.3: detailed construction methodology for each of the five distortion categories, explicit checks for train-test contamination with our training data, and any internal cross-validation performed. We will also include a limitations discussion and examples from additional held-out real-world images. Full external independent validation is beyond the scope of a single paper but will be facilitated by open-sourcing the benchmark. revision: partial
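The first response promises bootstrap confidence intervals and paired comparisons without saying how they would be computed. One common form is a paired percentile bootstrap over per-page scores, sketched below under stated assumptions: both models are scored on the same pages, per-page scores are available as plain lists, and the function name, resample count, and example numbers are hypothetical rather than drawn from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-page difference (A minus B).

    scores_a[i] and scores_b[i] must be the two models' scores on the same page.
    Returns (observed_mean_difference, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample pages with replacement, keeping each page's paired difference intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Toy usage with made-up per-page scores for two hypothetical models.
model_a = [0.95, 0.91, 0.97, 0.88, 0.93, 0.96]
model_b = [0.93, 0.90, 0.94, 0.89, 0.90, 0.95]
mean_diff, lower, upper = paired_bootstrap_ci(model_a, model_b)
```

If the resulting interval excludes zero, the gap between the two models is unlikely to be an artifact of which pages happened to be sampled. Resampling pages rather than individual scores preserves the pairing, which is why this is usually tighter than comparing two independent intervals.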

standing simulated objections not resolved
  • Independent external cross-validation of Real5-OmniDocBench by third-party researchers, which cannot be provided by the authors at this stage.

Circularity Check

0 steps flagged

No circularity: SOTA claims rest on standard benchmark evaluation without self-referential reduction.

Full rationale

The paper reports empirical accuracy (94.5% on OmniDocBench v1.5) and robustness on a newly proposed Real5-OmniDocBench benchmark. No equations, derivations, or predictions are present that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Model training and multi-task extension follow conventional supervised learning; benchmark scores are externally falsifiable and do not loop back to the paper's own inputs. This is the normal non-circular case for an empirical VLM paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning assumptions about benchmark validity and the transferability of test-set accuracy to real deployments; no new entities are postulated.

free parameters (1)
  • model size = 0.9B
    The 0.9B parameter count is a deliberate design choice for efficiency.
axioms (1)
  • domain assumption: Standard OCR and document parsing benchmarks measure meaningful real-world capability
    The SOTA claims depend on OmniDocBench v1.5 and the new Real5-OmniDocBench being representative.

pith-pipeline@v0.9.0 · 5476 in / 1386 out tokens · 47768 ms · 2026-05-16T09:36:42.769153+00:00 · methodology



Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

    cs.CV 2026-05 conditional novelty 8.0

    PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

  2. MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

    cs.AI 2026-05 unverdicted novelty 7.0

    MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.

  3. GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

    cs.CL 2026-04 unverdicted novelty 7.0

    GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.

  4. MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

  5. Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing

    cs.CV 2026-04 unverdicted novelty 6.0

    A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.

  6. LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

    cs.CV 2026-05 unverdicted novelty 5.0

    LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.

  7. JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

  8. JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
