PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training
Pith reviewed 2026-06-28 10:36 UTC · model grok-4.3
The pith
PaddleOCR-VL-1.6 reaches 96.33% on OmniDocBench v1.6 by refining under-optimized regions from its 1.5 predecessor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaddleOCR-VL-1.6 improves on PaddleOCR-VL-1.5 by identifying under-optimized regions where model behavior is unstable or supervision is weak, then applies targeted data enhancement and a progressive post-training recipe of curated selection plus reinforcement learning to reach 96.33% on OmniDocBench v1.6 without indiscriminate corpus growth.
What carries the argument
Region-aware data optimization framework that detects weak regions from the prior model, applies targeted enhancement to data and supervision signals, paired with progressive post-training via curated data selection and reinforcement learning.
If this is right
- Performance gains occur through staged, region-specific optimization instead of uniform data scaling.
- The model maintains strong competitiveness with top-tier vision-language models at a compact 0.9B scale.
- The same post-training recipe can be reused across the PaddleOCR-VL series for further upgrades.
- Error concentration analysis becomes a practical step before any new training round.
Where Pith is reading between the lines
- The method could lower the data volume needed for future upgrades in similar compact document models.
- Region-level error diagnosis may transfer to other vision-language tasks where performance plateaus after initial training.
- If the weak-region pattern repeats across model families, targeted refinement might become a standard efficiency lever before scaling data or parameters.
Load-bearing premise
Remaining errors after PaddleOCR-VL-1.5 concentrate in identifiable under-optimized regions whose targeted fixes will raise scores without degrading other areas or introducing new biases.
What would settle it
Run the region-aware optimization and progressive post-training on PaddleOCR-VL-1.5 and measure whether the OmniDocBench v1.6 score fails to rise above the 1.5 baseline or drops in previously strong regions.
read the original abstract
We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions from the previous model, applies targeted enhancement to these regions, and improves the reliability of supervision signals. It further adopts a progressive post-training recipe based on curated data selection and reinforcement learning, pushing model performance to a higher level through staged optimization. PaddleOCR-VL-1.6 achieves a new state-of-the-art score of 96.33% on OmniDocBench v1.6, demonstrates strong competitiveness against top-tier VLMs, and provides a practical post-training recipe for the PaddleOCR-VL series.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PaddleOCR-VL-1.6 as an upgrade to the 0.9B PaddleOCR-VL-1.5 model for document parsing. It proposes a region-aware data optimization framework that identifies under-optimized regions (where behavior is unstable, data coverage sparse, or supervision unreliable) from the prior model, applies targeted enhancement, and improves supervision reliability. This is combined with a progressive post-training recipe using curated data selection and reinforcement learning. The central claim is that these steps yield a new state-of-the-art score of 96.33% on OmniDocBench v1.6 while remaining competitive with top-tier VLMs and supplying a practical recipe for the PaddleOCR-VL series.
Significance. If the experimental support and controls were provided, the work would offer a concrete, efficiency-oriented alternative to indiscriminate data scaling for compact VLMs in document parsing. The emphasis on identifying and refining weak regions plus staged RL post-training could be reusable for other vision-language tasks where error concentration is observable.
major comments (3)
- [Abstract] Abstract: the SOTA claim of 96.33% on OmniDocBench v1.6 is stated without any experimental details, baselines, ablation studies, error analysis, or quantitative definition of 'under-optimized regions,' so the data-to-claim link cannot be evaluated.
- [Abstract] Abstract: the method relies on internal data curation whose independence from the OmniDocBench v1.6 evaluation set is not shown; without this or cross-benchmark results, observed gains could arise from selection effects rather than the proposed framework.
- [Abstract] Abstract: no ablation isolates the contribution of region-aware targeted selection plus progressive RL from random data addition or standard post-training, leaving the central assumption (that remaining errors concentrate in identifiable regions whose targeted fix produces reliable gains without side effects) untested.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. The full manuscript contains the requested experimental details, baselines, ablations, and clarifications in Sections 3 and 4, but we agree the abstract should be expanded for better self-containment. We will revise the abstract accordingly while preserving its conciseness. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract] Abstract: the SOTA claim of 96.33% on OmniDocBench v1.6 is stated without any experimental details, baselines, ablation studies, error analysis, or quantitative definition of 'under-optimized regions,' so the data-to-claim link cannot be evaluated.
Authors: The abstract prioritizes brevity, but the full paper provides these elements: baselines and SOTA comparisons in Table 1 and Section 4.1, ablations in Table 3 and Section 4.3, error analysis in Section 4.4, and a quantitative definition of under-optimized regions (instability, sparsity, supervision unreliability) in Section 3.1 with metrics. We will revise the abstract to include a brief clause summarizing key supporting results and the region definition. revision: yes
-
Referee: [Abstract] Abstract: the method relies on internal data curation whose independence from the OmniDocBench v1.6 evaluation set is not shown; without this or cross-benchmark results, observed gains could arise from selection effects rather than the proposed framework.
Authors: Section 3.2 and Appendix A explicitly state that curation uses training splits with no overlap to OmniDocBench v1.6 (verified via deduplication and separate validation). Cross-benchmark results on additional document parsing sets are in Section 4.2. We will add an explicit sentence to the abstract confirming evaluation-set independence to eliminate any ambiguity. revision: yes
-
Referee: [Abstract] Abstract: no ablation isolates the contribution of region-aware targeted selection plus progressive RL from random data addition or standard post-training, leaving the central assumption (that remaining errors concentrate in identifiable regions whose targeted fix produces reliable gains without side effects) untested.
Authors: Table 3 and Section 4.3 present ablations that isolate region-aware selection and progressive RL against random data addition and standard post-training baselines, demonstrating targeted gains without negative side effects on other regions. These directly test the central assumption. We will revise the abstract to reference these ablation outcomes in a single clause. revision: yes
Circularity Check
No significant circularity detected.
full rationale
The provided abstract and description contain no equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work. The central claim is an empirical SOTA score on OmniDocBench v1.6 obtained via described region refinement and post-training; this does not reduce by construction to the inputs. The paper is self-contained against the stated external benchmark with no exhibited self-definitional or renaming steps. This is the expected honest outcome for an empirical methods paper without the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption OmniDocBench v1.6 is a representative and unbiased measure of document parsing capability
Reference graph
Works this paper leans on
-
[1]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks, 2021. URLhttps://arxiv.org/abs/2005.11401
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[2]
Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm, 2026
Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm, 2026. URL https://arxiv.org/ abs/2506.05218
-
[3]
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Dolphin: Document image parsing via heterogeneous anchor prompting, 2025
Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, and Can Huang. Dolphin: Document image parsing via heterogeneous anchor prompting, 2025. URL https://arxiv.org/abs/25 05.14059
2025
-
[5]
Points-reader: Distillation-free adaptation of vision-language models for document conversion, 2025
Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, and Jie Zhou. Points-reader: Distillation-free adaptation of vision-language models for document conversion, 2025. URL https://arxiv.org/ abs/2509.01215
-
[6]
Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model, 2025. URL https://arxiv.org/abs/2510.14528
-
[7]
Deepseek-ocr: Contexts optical compression,
Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression,
-
[8]
URLhttps://arxiv.org/abs/2510.18234. 21
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Hunyuanocr technical report, 2025
Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Xinsong Zhang, Jinnian Zhang, Houwen Peng, Hongming Yang, Senhao Xie, Longsha Zhou, Ge Pei, Binghong Wu, Rui Yan, Kan Wu, Jieneng Yang, Bochao Wang, Kai Liu, Jianchen Zhu, Jie Jiang...
-
[11]
Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025
2025
-
[12]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/ 2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdul- mohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023
2023
-
[14]
Ernie 4.5 technical report, 2025
Baidu-ERNIE-Team. Ernie 4.5 technical report, 2025
2025
-
[15]
Qianfan-ocr: A unified end-to-end model for document intelligence, 2026
Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, and Dou Shen. Qianfan-ocr: A unified end-to-end model for document intelligence, 2026. URL https://arxiv.org/abs/2603.13398
-
[16]
Glm-ocr technical report, 2026
Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, and Jie Tang. Glm-ocr technical report, 2026. URL https://arxiv.org/abs/2603.10910
-
[17]
Mineru2.5-pro: Pushing the limits of data-centric document parsing at scale, 2026
Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhen- jiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, et al. Mineru2.5-pro: Pushing the limits of data-centric document parsing at scale, 2026. URL https://arxiv.org/abs/2604.0 4771
2026
- [19]
-
[21]
URLhttps://arxiv.org/abs/2601.21957. 22
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. Advances in Neural Information Processing Systems, 38:113222– 113244, 2026. URLhttps://arxiv.org/abs/2503.14476
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[23]
Changda Zhou, Ziyue Gao, Xueqing Wang, Tingquan Gao, Cheng Cui, Jing Tang, and Yi Liu. Real5-omnidocbench: A full-scale physical reconstruction benchmark for robust document parsing in the wild, 2026. URLhttps://arxiv.org/abs/2603.04205
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
Image over text: Transforming formula recognition evaluation with character detection matching
Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Botian Shi, Bo Zhang, and Conghui He. Image over text: Transforming formula recognition evaluation with character detection matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19681–19690, June 2025
2025
-
[26]
URLhttps://arxiv.org/abs/2508.18265
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, et al. Kimi k2.5: Visual agentic intelligence, 2026. URL https://arxiv.org/abs/2602 .02276
2026
-
[28]
Gpt-5.2 system card, 2025
OpenAI. Gpt-5.2 system card, 2025. URL https://cdn.openai.com/pdf/3a4153c8-c 748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf
2025
-
[29]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URLhttps://arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Gemini 3.0
Google DeepMind. Gemini 3.0. https://blog.google/products-and-platforms/p roducts/gemini/gemini-3-collection/, 2025
2025
-
[31]
Ovis: Structural embedding alignment for multimodal large language model, 2024
Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model, 2024. URL https://arxiv.org/abs/2405.20797
-
[32]
Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, S...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Nanonets-ocr-s: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging, 2025
Souvik Mandal, Ashish Talewar, Paras Ahuja, and Prathamesh Juvatkar. Nanonets-ocr-s: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging, 2025
2025
-
[34]
Mistral-ocr
Mistral AI Team. Mistral-ocr. https://mistral.ai/news/mistral-ocr?utm_sourc e=ai-bot.cn, 2025. 23
2025
-
[35]
olmocr: Unlocking trillions of tokens in pdfs with vision language models
Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Ran- gapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models. arXiv preprint arXiv:2502.18443, 2025. URL https://arxiv.org/abs/2502.18443
-
[36]
Ocrverse: Towards holistic ocr in end-to-end vision-language models, 2026
Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, and Zhixiong Zeng. Ocrverse: Towards holistic ocr in end-to-end vision-language models, 2026. URLhttps://arxiv.org/abs/2601.21639
-
[37]
Unirec-0.1b: Unified text and formula recognition with 0.1b parameters, 2025
Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, and Yu-Gang Jiang. Unirec-0.1b: Unified text and formula recognition with 0.1b parameters, 2025. URLhttps://arxiv.org/abs/2512.21095
-
[38]
dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025
Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025. URLhttps://arxiv. org/abs/2512.02498
-
[39]
Xiaohongshu Inc. Super Intelligence Team. Firered-ocr technical report. 2026. URL https://arxiv.org/abs/2603.01840
-
[40]
Logics-Parsing-Omni Technical Report
Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Yan Gao, et al. Logics-parsing-omni technical report. arXiv preprint arXiv:2603.09677, 2026. URLhttps://arxiv.org/abs/2603.09677
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[41]
Youtu-parsing: Perception, structuring and recognition via high-parallelism decoding, 2026
Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li, Huang Chen, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Qianyu Li, Antai Guo, Yanzhen Liao, Yanqiu Qu, Haodong Lin, Chengxu He, and Shuangyin Liu. Youtu-parsing: Perception, structuring and recognition via high-parallelism decoding, 2026. URL https://arxiv. org/abs/2601.20430
-
[42]
Omnidocbench 1.6.https://opendatalab.com/omnidocbench, 2026
OpenDataLab. Omnidocbench 1.6.https://opendatalab.com/omnidocbench, 2026
2026
-
[43]
Vik Paruchuri. Marker. https://github.com/datalab-to/marker , 2025. Accessed: 2025-09-25
2025
-
[44]
PaddleOCR 3.0 Technical Report
Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025. URLhttps://arxiv.org/abs/2507.05595
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Gemini 2.5
Google DeepMind. Gemini 2.5. https://blog.google/technology/google-deepm ind/gemini-model-thinking-updates-march-2025/, 2025
2025
-
[47]
Mineru2.0-2505-0.9b
OpenDataLab. Mineru2.0-2505-0.9b. https://huggingface.co/opendatalab/Miner U2.0-2505-0.9B, 2025
2025
-
[48]
chatdoc-com. Ocrflux. https://github.com/chatdoc-com/OCRFlux, 2024. Accessed: 2025-05-28
2024
-
[49]
Trivia: Self-supervised fine-tuning of vision-language models for table recognition, 2026
Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, and Conghui He. Trivia: Self-supervised fine-tuning of vision-language models for table recognition, 2026. URLhttps://arxiv.org/abs/2512.01248. 24
-
[50]
Deplot: One- shot visual language reasoning by plot-to-table translation, 2023
Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. Deplot: One- shot visual language reasoning by plot-to-table translation, 2023. URL https://arxiv. org/abs/2212.10505
-
[51]
Tinychart: Efficient chart understanding with visual token merging and program- of-thoughts learning
Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. Tinychart: Efficient chart understanding with visual token merging and program- of-thoughts learning. arXiv preprint arXiv:2404.16635, 2024. URL https://arxiv.org/ abs/2404.16635
-
[52]
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, and Xiangyu Zhang. General ocr theory: Towards ocr-2.0 via a unified end-to-end model, 2024. URL https://arxiv.or g/abs/2409.01704
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
Onechart: Purify the chart structural extraction via one auxiliary token
Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Onechart: Purify the chart structural extraction via one auxiliary token. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 147–155, 2024
2024
-
[54]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Detect anything via next point prediction
Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction. arXiv preprint arXiv:2510.12798, 2025. URLhttps://arxiv.org/abs/2510.12798. 25
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.