
arxiv: 2605.27978 · v1 · pith:XWMZAUOMnew · submitted 2026-05-27 · 💻 cs.CV

ABot-OCR Technical Report

Pith reviewed 2026-06-29 13:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords end-to-end OCR · vision-language model · document to Markdown · reinforcement learning · OmniDocBench · multilingual text recognition · data engine

The pith

An end-to-end vision-language model transcribes document images to Markdown in one pass and achieves state-of-the-art end-to-end scores on OmniDocBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ABot-OCR, which processes a full page image through a single forward pass of a vision-language model to produce clean Markdown output. This design removes the need for separate modules that can introduce errors at each step. Training relies on a dedicated data engine for consistent large-scale examples and a new reinforcement learning technique called Decoupled Heterogeneous Document Optimization to ensure both accurate text and proper formatting. Results on OmniDocBench v1.5 and v1.6 show top scores among end-to-end methods at 92.81 and 93.30, reducing the difference from strong pipeline approaches. The model also performs well on text recognition in ten languages.

Core claim

ABot-OCR is an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By developing a dedicated data engine for large-scale, structurally consistent supervision and proposing Decoupled Heterogeneous Document Optimization as a structure-constrained reinforcement learning method, the approach sharpens textual accuracy and enforces markup well-formedness. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines, with additional confirmation from multilingual evaluations.

What carries the argument

Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that improves textual accuracy and enforces markup well-formedness beyond supervised fine-tuning.
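To make "structure-constrained" concrete, here is a minimal sketch of a reward that scores text similarity and penalizes malformed markup. It is an illustration only, not the paper's Decoupled Heterogeneous Document Optimization objective, which is not described in the text reviewed here. The names `is_well_formed`, `reward`, and `structure_penalty` are hypothetical, the well-formedness checks are deliberately simple, and `difflib.SequenceMatcher` stands in for whatever text metric the authors actually use.

```python
from difflib import SequenceMatcher

FENCE = "`" * 3  # Markdown code-fence marker


def is_well_formed(markdown: str) -> bool:
    """Cheap structural checks (assumed for illustration, not a full parser)."""
    lines = markdown.splitlines()

    # Every opened code fence must be closed: an even number of fence lines.
    if sum(1 for ln in lines if ln.lstrip().startswith(FENCE)) % 2:
        return False

    # Display-math delimiters must come in pairs.
    if markdown.count("$$") % 2:
        return False

    # All rows of a contiguous pipe table must have the same cell count.
    widths = []
    for ln in lines + [""]:  # the sentinel flushes a table at end of input
        s = ln.strip()
        if len(s) > 1 and s.startswith("|") and s.endswith("|"):
            widths.append(s.count("|") - 1)
        else:
            if len(set(widths)) > 1:
                return False
            widths = []
    return True


def reward(pred: str, ref: str, structure_penalty: float = 0.5) -> float:
    """Text similarity in [0, 1], scaled down when the markup is malformed.

    structure_penalty is a hypothetical knob, not a value from the paper.
    """
    text_score = SequenceMatcher(None, pred, ref).ratio()
    if not is_well_formed(pred):
        return text_score * (1.0 - structure_penalty)
    return text_score


if __name__ == "__main__":
    ref = "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    ragged = "# Title\n\n| a | b |\n|---|---|\n| 1 |\n"
    print(reward(ref, ref), reward(ragged, ref))
```

Scaling the text score rather than adding a separate term means a malformed transcription can never outscore a well-formed one with the same text similarity. That is one plausible reading of "enforces markup well-formedness", and the paper may combine the signals differently.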

If this is right

  • Removes the error accumulation typical of modular pipelines by using a single forward pass.
  • Achieves higher scores than previous end-to-end systems on standard document benchmarks.
  • Demonstrates strong performance across ten diverse languages in text recognition.
  • Enables direct production of well-formed Markdown without additional post-processing steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-pass design could reduce computational overhead in high-volume document processing.
  • Similar techniques might apply to other tasks requiring structured output from images, such as chart understanding.
  • Future work could test the model on more varied document types beyond the current benchmarks.

Load-bearing premise

The dedicated data engine provides large-scale, structurally consistent supervision sufficient for the single model to handle complex document layouts effectively.

What would settle it

Re-running ABot-OCR on OmniDocBench v1.5 and v1.6 and obtaining scores below the reported 92.81 and 93.30, or finding that the gap to pipeline baselines does not narrow, would challenge the reported results.

Original abstract

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces ABot-OCR, an end-to-end vision-language model that transcribes document page images directly into clean Markdown in a single forward pass. It describes a dedicated data engine for large-scale structurally consistent supervision and proposes Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method intended to improve textual accuracy and enforce markup well-formedness beyond supervised fine-tuning. The paper claims state-of-the-art scores of 92.81 and 93.30 on OmniDocBench v1.5 and v1.6 among end-to-end systems, substantially narrowing the gap to strong pipeline baselines, along with robust multilingual performance across ten languages.

Significance. If the performance claims hold under detailed scrutiny, the work would represent a meaningful step toward simplifying document parsing by replacing error-prone modular pipelines with a unified model. The combination of large-scale data engineering and RL-based structure enforcement is a plausible direction for improving fidelity in layout-sensitive tasks. The multilingual results, if quantified, would further support generalizability claims.

major comments (2)
  1. [Abstract] The central empirical claims (that ABot-OCR achieves SOTA scores of 92.81 on v1.5 and 93.30 on v1.6 among end-to-end systems and narrows the gap to pipeline baselines) are stated without any results table, list of compared systems, definition of the evaluation metric, baseline scores, or experimental protocol. These numbers are load-bearing for the paper's primary contribution yet cannot be verified or reproduced from the provided text.
  2. [Abstract] Assertions regarding the superiority of Decoupled Heterogeneous Document Optimization over supervised fine-tuning and the enabling role of the dedicated data engine are presented without technical descriptions, equations, ablation studies, or implementation details. These elements are required to substantiate how the proposed components produce the reported gains.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the specific comments on the abstract. We agree that the abstract should better support its central claims for readers who may not immediately consult the full text. We will revise the abstract accordingly while preserving its brevity.

Point-by-point responses
  1. Referee: [Abstract] The central empirical claims (that ABot-OCR achieves SOTA scores of 92.81 on v1.5 and 93.30 on v1.6 among end-to-end systems and narrows the gap to pipeline baselines) are stated without any results table, list of compared systems, definition of the evaluation metric, baseline scores, or experimental protocol. These numbers are load-bearing for the paper's primary contribution yet cannot be verified or reproduced from the provided text.

    Authors: We agree the abstract as written does not define the metric or list baselines. The full manuscript contains a results section with the comparison table, the OmniDocBench metric definition, and the experimental protocol. To address the concern directly in the abstract, we will add a short clause referencing the evaluation benchmark and the fact that scores are reported against both end-to-end and pipeline systems. This will make the claims verifiable from the abstract alone. revision: yes

  2. Referee: [Abstract] Assertions regarding the superiority of Decoupled Heterogeneous Document Optimization over supervised fine-tuning and the enabling role of the dedicated data engine are presented without technical descriptions, equations, ablation studies, or implementation details. These elements are required to substantiate how the proposed components produce the reported gains.

    Authors: The abstract is a high-level summary; the technical description, equations for the structure-constrained RL objective, ablation studies, and implementation details of both the data engine and Decoupled Heterogeneous Document Optimization appear in Sections 3 and 4 of the manuscript. We will revise the abstract to include one additional sentence that briefly characterizes the data engine and the RL method, directing readers to the relevant sections for the supporting evidence. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims only

Full rationale

The abstract contains no equations, derivations, or predictions. All load-bearing claims are empirical benchmark scores (92.81 and 93.30 on OmniDocBench v1.5 and v1.6) presented as experimental outcomes of the model, data engine, and RL method. No self-definitional steps, fitted inputs renamed as predictions, or self-citation chains appear. The derivation chain is empty, so no reduction to inputs by construction is possible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical derivations, free parameters, or postulated entities; all claims are empirical.

pith-pipeline@v0.9.1-grok · 5694 in / 1140 out tokens · 36448 ms · 2026-06-29T13:26:09.462195+00:00 · methodology

