ABot-OCR Technical Report
Pith reviewed 2026-06-29 13:26 UTC · model grok-4.3
The pith
An end-to-end vision-language model transcribes document images to Markdown in one pass and achieves state-of-the-art end-to-end scores on OmniDocBench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ABot-OCR is an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. It pairs a dedicated data engine, which supplies large-scale, structurally consistent supervision, with Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and enforces markup well-formedness. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30, respectively, among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines; multilingual evaluations provide additional confirmation.
What carries the argument
Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that improves textual accuracy and enforces markup well-formedness beyond supervised fine-tuning.
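The text above does not specify how the structure constraint enters the reward, so the following is only a hedged sketch of the general idea: a text-accuracy term combined with a penalty for malformed markup. Every name here (`text_similarity`, `is_well_formed`, `reward`, `structure_penalty`) is hypothetical, the checks are toy heuristics, and none of it should be read as the authors' actual objective.

```python
import difflib

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this listing simple


def text_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 1]; difflib's ratio stands in for an edit-distance metric."""
    return difflib.SequenceMatcher(None, pred, ref).ratio()


def is_well_formed(markdown: str) -> bool:
    """Toy structural checks (assumed for illustration, not taken from the paper)."""
    lines = markdown.splitlines()
    # Code fences must open and close in pairs.
    if sum(1 for line in lines if line.lstrip().startswith(FENCE)) % 2 != 0:
        return False
    # Display-math delimiters must be balanced.
    if markdown.count("$$") % 2 != 0:
        return False
    # Rows of one table must share the same number of pipe characters.
    # (Escaped pipes inside cells are not handled by this heuristic.)
    expected = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("|"):
            pipes = stripped.count("|")
            if expected is None:
                expected = pipes
            elif pipes != expected:
                return False
        else:
            expected = None  # a non-table line ends the current table
    return True


def reward(pred: str, ref: str, structure_penalty: float = 0.5) -> float:
    """Text-accuracy score minus a fixed penalty when the markup is malformed."""
    score = text_similarity(pred, ref)
    if not is_well_formed(pred):
        score -= structure_penalty
    return score
```

Under these assumptions, a prediction that matches the reference text but breaks a table row scores strictly lower than a well-formed one, which is the kind of pressure a structure-constrained reward is meant to apply.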
If this is right
- Removes the error accumulation typical of modular pipelines by using a single forward pass.
- Achieves higher scores than previous end-to-end systems on standard document benchmarks.
- Demonstrates strong performance across ten diverse languages in text recognition.
- Enables direct production of well-formed Markdown without additional post-processing steps.
Where Pith is reading between the lines
- The single-pass design could reduce computational overhead in high-volume document processing.
- Similar techniques might apply to other tasks requiring structured output from images, such as chart understanding.
- Future work could test the model on more varied document types beyond the current benchmarks.
Load-bearing premise
The dedicated data engine provides large-scale, structurally consistent supervision sufficient for the single model to handle complex document layouts effectively.
What would settle it
Running ABot-OCR on OmniDocBench v1.5 and v1.6 and finding scores below 92.81 and 93.30, respectively, or finding that the gap to pipeline baselines does not narrow, would challenge the reported results.
Original abstract
We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ABot-OCR, an end-to-end vision-language model that transcribes document page images directly into clean Markdown in a single forward pass. It describes a dedicated data engine for large-scale structurally consistent supervision and proposes Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method intended to improve textual accuracy and enforce markup well-formedness beyond supervised fine-tuning. The paper claims state-of-the-art scores of 92.81 and 93.30 on OmniDocBench v1.5 and v1.6 among end-to-end systems, substantially narrowing the gap to strong pipeline baselines, along with robust multilingual performance across ten languages.
Significance. If the performance claims hold under detailed scrutiny, the work would represent a meaningful step toward simplifying document parsing by replacing error-prone modular pipelines with a unified model. The combination of large-scale data engineering and RL-based structure enforcement is a plausible direction for improving fidelity in layout-sensitive tasks. The multilingual results, if quantified, would further support generalizability claims.
Major comments (2)
- [Abstract] The central empirical claims (that ABot-OCR achieves SOTA scores of 92.81 on v1.5 and 93.30 on v1.6 among end-to-end systems and narrows the gap to pipeline baselines) are stated without any results table, list of compared systems, definition of the evaluation metric, baseline scores, or experimental protocol. These numbers are load-bearing for the paper's primary contribution yet cannot be verified or reproduced from the provided text.
- [Abstract] Assertions regarding the superiority of Decoupled Heterogeneous Document Optimization over supervised fine-tuning and the enabling role of the dedicated data engine are presented without technical descriptions, equations, ablation studies, or implementation details. These elements are required to substantiate how the proposed components produce the reported gains.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the specific comments on the abstract. We agree that the abstract should better support its central claims for readers who may not immediately consult the full text. We will revise the abstract accordingly while preserving its brevity.
Point-by-point responses
- Referee: [Abstract] The central empirical claims (that ABot-OCR achieves SOTA scores of 92.81 on v1.5 and 93.30 on v1.6 among end-to-end systems and narrows the gap to pipeline baselines) are stated without any results table, list of compared systems, definition of the evaluation metric, baseline scores, or experimental protocol. These numbers are load-bearing for the paper's primary contribution yet cannot be verified or reproduced from the provided text.
  Authors: We agree the abstract as written does not define the metric or list baselines. The full manuscript contains a results section with the comparison table, the OmniDocBench metric definition, and the experimental protocol. To address the concern directly in the abstract, we will add a short clause referencing the evaluation benchmark and the fact that scores are reported against both end-to-end and pipeline systems. This will make the claims verifiable from the abstract alone.
  Revision: yes
- Referee: [Abstract] Assertions regarding the superiority of Decoupled Heterogeneous Document Optimization over supervised fine-tuning and the enabling role of the dedicated data engine are presented without technical descriptions, equations, ablation studies, or implementation details. These elements are required to substantiate how the proposed components produce the reported gains.
  Authors: The abstract is a high-level summary; the technical description, equations for the structure-constrained RL objective, ablation studies, and implementation details of both the data engine and Decoupled Heterogeneous Document Optimization appear in Sections 3 and 4 of the manuscript. We will revise the abstract to include one additional sentence that briefly characterizes the data engine and the RL method, directing readers to the relevant sections for the supporting evidence.
  Revision: yes
Circularity Check
No circularity; empirical claims only
The paper contains no equations, derivations, or predictions. All load-bearing claims are empirical benchmark scores (92.81 / 93.30 on OmniDocBench) presented as experimental outcomes of the model, data engine, and RL method. No self-definitional steps, fitted inputs renamed as predictions, or self-citation chains appear. The derivation chain is empty, so no reduction to inputs by construction is possible.