ABot-OCR Technical Report

Kaitao Jiang; Kangning Niu; Mu Xu; Ruiyan Gong; Tianlun Li; Xiaolong Cheng

arxiv: 2605.27978 · v1 · pith:XWMZAUOM · submitted 2026-05-27 · 💻 cs.CV

Pith reviewed 2026-06-29 13:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords end-to-end OCR · vision-language model · document to Markdown · reinforcement learning · OmniDocBench · multilingual text recognition · data engine

The pith

An end-to-end vision-language model transcribes document images to Markdown in one pass and achieves state-of-the-art end-to-end scores on OmniDocBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ABot-OCR, which processes a full page image through a single forward pass of a vision-language model to produce clean Markdown output. This design removes the separate modules of a conventional pipeline, each of which can introduce errors that carry into the next step. Training relies on a dedicated data engine that supplies large-scale, structurally consistent examples, and on a new reinforcement learning technique called Decoupled Heterogeneous Document Optimization that targets both accurate text and well-formed markup. On OmniDocBench v1.5 and v1.6 the model scores 92.81 and 93.30, the highest reported among end-to-end methods, which narrows the gap to strong pipeline approaches. The model also performs well on text recognition in ten languages.

Core claim

ABot-OCR is an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By developing a dedicated data engine for large-scale, structurally consistent supervision and proposing Decoupled Heterogeneous Document Optimization as a structure-constrained reinforcement learning method, the approach sharpens textual accuracy and enforces markup well-formedness. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines, with additional confirmation from multilingual evaluations.

What carries the argument

Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that improves textual accuracy and enforces markup well-formedness beyond supervised fine-tuning.
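
As a rough illustration of what a structure-constrained reward could look like, the sketch below blends a text-similarity score with a binary well-formedness check on the predicted Markdown. It is not the paper's method: the page does not describe how Decoupled Heterogeneous Document Optimization computes its rewards, so the function names `is_well_formed` and `reward`, the three structural checks, and the `structure_weight` parameter are all assumptions made for this example.

```python
import difflib

FENCE = "`" * 3  # Markdown code-fence marker


def is_well_formed(markdown: str) -> bool:
    """Cheap structural checks on a Markdown string (illustrative, not exhaustive)."""
    lines = markdown.splitlines()
    # Fenced code blocks must open and close: an even number of fence lines.
    fence_lines = sum(1 for ln in lines if ln.lstrip().startswith(FENCE))
    if fence_lines % 2 != 0:
        return False
    # Display-math delimiters must come in pairs.
    if markdown.count("$$") % 2 != 0:
        return False
    # Consecutive pipe-table rows must share one column count.
    prev_cols = None
    for ln in lines:
        row = ln.strip()
        if len(row) > 1 and row.startswith("|") and row.endswith("|"):
            cols = row.count("|") - 1
            if prev_cols is not None and cols != prev_cols:
                return False
            prev_cols = cols
        else:
            prev_cols = None  # a non-table line ends the current table
    return True


def reward(prediction: str, reference: str, structure_weight: float = 0.5) -> float:
    """Hypothetical reward: weighted mix of text similarity and well-formedness."""
    text_score = difflib.SequenceMatcher(None, prediction, reference).ratio()
    structure_score = 1.0 if is_well_formed(prediction) else 0.0
    return (1.0 - structure_weight) * text_score + structure_weight * structure_score
```

Under these assumptions, a prediction that matches the reference text exactly but leaves a code fence unclosed receives only the text half of the reward, which is the kind of pressure toward well-formed markup that the summary attributes to the method.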

If this is right

Removes the error accumulation typical of modular pipelines by using a single forward pass.
Achieves higher scores than previous end-to-end systems on standard document benchmarks.
Demonstrates strong performance across ten diverse languages in text recognition.
Enables direct production of well-formed Markdown without additional post-processing steps.
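
The error-accumulation point in the first item can be made concrete with a small calculation. The sketch below assumes a three-stage pipeline whose stages fail independently; the stage names and accuracies are hypothetical and are not taken from the paper.

```python
from math import prod


def pipeline_accuracy(stage_accuracies):
    """End-to-end accuracy when every stage must succeed and failures are independent."""
    return prod(stage_accuracies)


# Hypothetical per-stage accuracies: layout detection, reading order, recognition.
stages = [0.98, 0.97, 0.96]
overall = pipeline_accuracy(stages)
print(f"{overall:.4f}")  # prints 0.9126
```

Even with each stage above 95%, the product falls to roughly 91%. A single-pass model avoids this multiplication, though it has its own error sources that this toy calculation does not capture.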

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-pass design could reduce computational overhead in high-volume document processing.
Similar techniques might apply to other tasks requiring structured output from images, such as chart understanding.
Future work could test the model on more varied document types beyond the current benchmarks.

Load-bearing premise

The dedicated data engine provides large-scale, structurally consistent supervision sufficient for the single model to handle complex document layouts effectively.

What would settle it

Independently re-running ABot-OCR on OmniDocBench v1.5 and v1.6 and obtaining scores below 92.81 and 93.30 respectively, or scores that fail to narrow the gap to pipeline systems, would challenge the reported results.

Original abstract

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ABot-OCR claims SOTA scores on OmniDocBench with an end-to-end VLM plus RL, but the abstract supplies zero tables, baselines, or method details to support any of it.

The core pitch is an end-to-end vision-language model that maps a page image to clean Markdown in one pass, backed by a custom data engine and a reinforcement learning procedure called Decoupled Heterogeneous Document Optimization. The report says this beats other end-to-end systems on OmniDocBench v1.5 and v1.6 (92.81 and 93.30) and narrows the gap to pipeline methods, plus it handles ten languages.

What stands out as new is the specific pairing of the data engine with that named RL step for enforcing markup structure. The goal of cutting out modular error accumulation is clear enough on paper.

The problem is that none of the claims can be checked. There are no results tables, no listed baselines, no ablation numbers, no description of the data engine, and no equations or pseudocode for the RL method. The abstract just states the scores and calls the evaluations extensive. Without those pieces the SOTA assertion and the superiority over supervised fine-tuning stay unverified.

This is a short technical report, not a full methods paper. It would interest people already working on document parsing who want to test an end-to-end alternative if code or weights appear later. Right now there is not enough substance to discuss in a reading group or to cite. It does not look ready for peer review either; the central numerical claims need the missing experimental record before a referee could do anything useful with them.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces ABot-OCR, an end-to-end vision-language model that transcribes document page images directly into clean Markdown in a single forward pass. It describes a dedicated data engine for large-scale structurally consistent supervision and proposes Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method intended to improve textual accuracy and enforce markup well-formedness beyond supervised fine-tuning. The paper claims state-of-the-art scores of 92.81 and 93.30 on OmniDocBench v1.5 and v1.6 among end-to-end systems, substantially narrowing the gap to strong pipeline baselines, along with robust multilingual performance across ten languages.

Significance. If the performance claims hold under detailed scrutiny, the work would represent a meaningful step toward simplifying document parsing by replacing error-prone modular pipelines with a unified model. The combination of large-scale data engineering and RL-based structure enforcement is a plausible direction for improving fidelity in layout-sensitive tasks. The multilingual results, if quantified, would further support generalizability claims.

major comments (2)

[Abstract] Abstract: The central empirical claims—that ABot-OCR achieves SOTA scores of 92.81 (v1.5) and 93.30 (v1.6) among end-to-end systems and narrows the gap to pipeline baselines—are stated without any results table, list of compared systems, definition of the evaluation metric, baseline scores, or experimental protocol. These numbers are load-bearing for the paper's primary contribution yet cannot be verified or reproduced from the provided text.
[Abstract] Abstract: Assertions regarding the superiority of Decoupled Heterogeneous Document Optimization over supervised fine-tuning and the enabling role of the dedicated data engine are presented without technical descriptions, equations, ablation studies, or implementation details. These elements are required to substantiate how the proposed components produce the reported gains.
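
To show the kind of metric definition the first comment asks for, here is a minimal sketch of a normalized edit distance between predicted and reference text, a measure commonly reported for text recognition. It is offered only as an example of a well-defined metric; this page does not state how the OmniDocBench scores of 92.81 and 93.30 are computed, and the function names below are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical strings."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return edit_distance(prediction, reference) / longest
```

A report that states its metric at this level of precision, together with the baseline scores, would let a reader reproduce the comparison.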

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the specific comments on the abstract. We agree that the abstract should better support its central claims for readers who may not immediately consult the full text. We will revise the abstract accordingly while preserving its brevity.

Referee: [Abstract] Abstract: The central empirical claims—that ABot-OCR achieves SOTA scores of 92.81 (v1.5) and 93.30 (v1.6) among end-to-end systems and narrows the gap to pipeline baselines—are stated without any results table, list of compared systems, definition of the evaluation metric, baseline scores, or experimental protocol. These numbers are load-bearing for the paper's primary contribution yet cannot be verified or reproduced from the provided text.

Authors: We agree the abstract as written does not define the metric or list baselines. The full manuscript contains a results section with the comparison table, the OmniDocBench metric definition, and the experimental protocol. To address the concern directly in the abstract, we will add a short clause referencing the evaluation benchmark and the fact that scores are reported against both end-to-end and pipeline systems. This will make the claims verifiable from the abstract alone.

Revision: yes

Referee: [Abstract] Abstract: Assertions regarding the superiority of Decoupled Heterogeneous Document Optimization over supervised fine-tuning and the enabling role of the dedicated data engine are presented without technical descriptions, equations, ablation studies, or implementation details. These elements are required to substantiate how the proposed components produce the reported gains.

Authors: The abstract is a high-level summary; the technical description, equations for the structure-constrained RL objective, ablation studies, and implementation details of both the data engine and Decoupled Heterogeneous Document Optimization appear in Sections 3 and 4 of the manuscript. We will revise the abstract to include one additional sentence that briefly characterizes the data engine and the RL method, directing readers to the relevant sections for the supporting evidence.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims only

The abstract as reviewed contains no equations, derivations, or predictions. All load-bearing claims are empirical benchmark scores (92.81 / 93.30 on OmniDocBench) presented as experimental outcomes of the model, data engine, and RL method. No self-definitional steps, fitted inputs renamed as predictions, or self-citation chains appear. The derivation chain is empty, so no reduction to inputs by construction is possible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical derivations, free parameters, or postulated entities; all claims are empirical.

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Beagle: Automated extraction and interpretation of visualizations from the web

Leilani Battle, Peitong Duan, Zachery Miranda, Dana Mukusheva, Remco Chang, and Michael Stonebraker. Beagle: Automated extraction and interpretation of visualizations from the web. InProceedings of the 2018 CHI Conference on Human Factorsin Computing Systems, CHI ’18, pages 1–8, New York, NY, USA, 2018. Association for Computing Machinery. Dataset/tool fo...

2018

[3] [3]

LaTeX-OCR.https://lukas-blecher.github.io/LaTeX-OCR/, 2022

Lukas Blecher. LaTeX-OCR.https://lukas-blecher.github.io/LaTeX-OCR/, 2022. Optical character recogni- tion toolkit for mathematical expressions; accessed 2026-05-13

2022

[4] [4]

Logics-parsing technical report, 2025

Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, 14 and Minggang Wu. Logics-parsing technical report, 2025. URLhttps://arxiv.org/abs/2509.19760. We report results using the Logics-Parsing-v2 released model

work page arXiv 2025

[5] [5]

PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model, 2025

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model, 2025. URLhttps://arxiv.org/abs/2510.14528

work page arXiv 2025

[6] [6]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025. URLhttps://arxiv.org/abs/2507.05595

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. PaddleOCR-VL-1.5: Towards a multi-task 0.9B VLM for robust in-the-Wild document parsing, 2026. URLhttps://arxiv.org/abs/ 2601.21957

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8]

Kenny Davila, Rupak Lazarus, Fei Xu, Nicole Rodríguez Alcántara, Srirangaraj Setlur, Venu Govindaraju, Ajoy Mondal, and C. V. Jawahar. CHART-Info 2024: A dataset for chart analysis and recognition. In Proceedings of the 27th International Conference on Pattern Recognition (ICPR). Springer, 2024. doi: 10.1007/978-3-031-78495-8_19. URL https://doi.org/10.1007/978-3-031-78495-8_19.

[10]

Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, and Yu-Gang Jiang. Unirec-0.1b: Unified text and formula recognition with 0.1b parameters, 2025. URL https://arxiv.org/abs/2512.21095.

[11]

Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, and Jie Tang. GLM-OCR technical report, 2026. URL https://arxiv.org/abs/2603.10910.

[12]

Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21919–21936, 2025.

[13]

Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao, Dingkang Yang, Yongkun Du, Xuecheng Wu, Jingqun Tang, Yang Liu, Hong Chen, and Can Huang. Dolphin-v2: Universal document parsing via scalable anchor prompting. URL https://arxiv.org/abs/2602.05384.

[15]

FireRed Team. FireRed-OCR technical report. arXiv preprint arXiv:2603.01840, 2025.

[16]

Philippe Gervais, Anastasiia Fadeeva, and Andrii Maksai. MathWriting: A dataset for handwritten mathematical expression recognition, 2025. URL https://arxiv.org/abs/2404.10690.

[17]

Google DeepMind. Gemini 3 Pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025. Model card; accessed 2026-05-13.

[18]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025.

[19]

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.

[20]

Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering, 2018. URL https://arxiv.org/abs/1801.08163.

[21]

Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. Chart-to-text: A large-scale benchmark for chart summarization, 2022. URL https://arxiv.org/abs/2203.06486.

[22]

Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025. URL https://arxiv.org/abs/2512.02498.

[23]

Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm. URL https://arxiv.org/abs/2506.05218. We evaluate the MonkeyOCR-pro-3B checkpoint.

[25]

Cheng-Lin Liu, Fei Yin, Da-Han Wang, and Qiu-Feng Wang. CASIA online and offline Chinese handwriting databases. In Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), pages 37–41. IEEE Computer Society, 2011. doi: 10.1109/ICDAR.2011.17.

[26]

J. Liu, M. Zhang, et al. GDPO: Group decoupled preference optimization for multi-reward reinforcement learning of language models. arXiv preprint arXiv:2602.xxxxx, 2026.

[27]

Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for AI-driven document conversion. arXiv preprint arXiv:2501.17887, 2025.

[28]

Ovis2.5 Technical Report

Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, S...

[29]

Junyu Luo, Zekun Li, Jinpeng Wang, and Chin-Yew Lin. ChartOCR: Data extraction from charts images via a deep hybrid framework. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1917–1925, 2021. URL https://openaccess.thecvf.com/content/WACV2021/html/Luo_ChartOCR_Data_Extraction_From_Charts_Images_via_a_Deep_H...

[30]

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning, 2022. URL https://arxiv.org/abs/2203.10244.

[31]

Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning, 2023. URL https://arxiv.org/abs/2305.14761.

[32]

Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots, 2020. URL https://arxiv.org/abs/1909.00997.

[33]

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, ...

[34]

Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. In The 64th Annual Meeting of the Association for Computational Linguistics – Industry Track, 2025.

[35]

OleehyO. latex-formulas-80M (Hugging Face dataset). https://huggingface.co/datasets/OleehyO/latex-formulas-80M, 2025. Large-scale rendered formula images with LaTeX supervision; accessed May 28, 2026.

[36]

OpenAI. GPT-4V(ision) system card, 2023.

[37]

L. Ouyang, Y. Qu, H. Zhou, et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. arXiv preprint arXiv:2412.07626, 2024.

[38]

Qwen Team. Qwen3-VL: Technical report. Technical report, Alibaba DAMO Academy, 2025.

[39]

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.

[40]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[41]

Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[42]

Patch-as-decodable-token: Towards unified multi-modal vision tasks in MLLMs

Benny J. Tang, Angie Boggust, and Arvind Satyanarayan. VisText: A benchmark for semantically rich chart captioning, 2023. URL https://arxiv.org/abs/2307.05356.

[43]

Hunyuanocr technical report, 2025

Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Xinsong Zhang, Jinnian Zhang, Houwen Peng, Hongming Yang, Senhao Xie, Longsha Zhou, Ge Pei, Binghong Wu, Rui Yan, Kan Wu, Jieneng Yang, Bochao Wang, Kai Liu, Jianchen Zhu, Jie Jiang...

[44]

Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. UniMERNet: A universal network for real-world mathematical expression recognition, 2024. URL https://arxiv.org/abs/2404.15254.

[45] [46]

Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, et al. MinerU2.5-Pro: Pushing the limits of data-centric document parsing at scale. arXiv preprint arXiv:2604.04771, 2026.

[46] [47]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Hou,...

[47] [48]

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General OCR theory: Towards OCR-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024.

[48] [49]

Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR 2: Visual causal flow, 2026. URL https://arxiv.org/abs/2601.20552.

[49] [50]

Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li, Huang Chen, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Qianyu Li, Antai Guo, Yanzhen Liao, Yanqiu Qu, Haodong Lin, Chengxu He, and Shuangyin Liu. Youtu-parsing: Perception, structuring and recognition via high-parallelism decoding. URL https://arxiv.org/abs/2601.20430.

[51] [52]

Zhiyuan Zhao, Hengrui Kang, Bin Wang, and Conghui He. DocLayout-YOLO: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception. arXiv preprint arXiv:2410.12628, 2024.

[52] [53]

Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: Data, model, and evaluation. In European Conference on Computer Vision, 2020.

[53] [54]

Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation, 2020. URL https://arxiv.org/abs/1911.10683.

[54] [55]

Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, and Zhixiong Zeng. OCRVerse: Towards holistic OCR in end-to-end vision-language models, 2026. URL https://arxiv.org/abs/2601.21639.

Appendix A Qualitative Examples

The following cases demonstrate the model’s capability to bridge the gap between ...