pith. machine review for the scientific record.

arxiv: 2601.21957 · v2 · submitted 2026-01-29 · 💻 cs.CV

Recognition: unknown

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Authors on Pith: no claims yet
classification: 💻 cs.CV
keywords: model, benchmark, paddleocr-vl-1, sota, accuracy, achieving, attains, capabilities
Original abstract

We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

    cs.CV 2026-05 conditional novelty 8.0

    PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

  2. GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

    cs.CL 2026-04 unverdicted novelty 7.0

    GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.

  3. MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

  4. Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing

    cs.CV 2026-04 unverdicted novelty 6.0

    A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.

  5. LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

    cs.CV 2026-05 unverdicted novelty 5.0

    LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.

  6. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.