Unlimited OCR Works

arxiv: 2606.23050 · v1 · pith:65UZ4NNGnew · submitted 2026-06-22 · 💻 cs.CV · cs.CL

Youyang Yin, Huanhuan Liu, YY, Qunyi Xie, Chaorun Liu, Shiqi Yang, Shaohua Wang, Zhanlong Liu, Hao Zou, Jinyue Chen, Shu Wei, Jingjing Wu, Mingxin Huang, Zhen Wu, Guibin Wang, Tengyu Du, Lei Jia

Pith reviewed 2026-06-26 09:01 UTC · model grok-4.3

keywords OCR · attention mechanism · KV cache · long context · sequence modeling · decoder design

The pith

Replacing attention with R-SWA keeps the KV cache constant so OCR can process dozens of pages in one pass under a 32K limit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Unlimited OCR to handle long documents without the memory and speed penalties that come from growing KV caches in standard LLM-based OCR decoders. It starts from DeepSeek OCR and swaps every attention layer for Reference Sliding Window Attention, which the authors designed to hold KV cache size fixed no matter how long the output becomes. This combination of the encoder's high compression and the new constant-cache decoder lets the model transcribe many pages at once while staying inside normal context windows. The same attention change is presented as a general tool for any parsing task that needs to copy or transcribe long sequences.

Core claim

By replacing all attention layers in the decoder with Reference Sliding Window Attention, Unlimited OCR maintains a constant KV cache size throughout decoding, allowing transcription of dozens of pages in a single forward pass under a 32K maximum length.

What carries the argument

Reference Sliding Window Attention (R-SWA), an attention mechanism that reduces computation costs while enforcing constant KV cache size for the entire decoding process.
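
As a rough illustration of how a sliding-window cache can stay bounded, the Python sketch below keeps a pinned block of reference entries plus a fixed-length window of recent entries. The source does not describe R-SWA's internals, so the class name BoundedKVCache, the pinned reference block, and the window size are all assumptions made for illustration, not the authors' design.

```python
from collections import deque


class BoundedKVCache:
    """Toy cache: pinned reference entries plus a fixed-size window of recent entries.

    Hypothetical structure for illustration only; not the paper's R-SWA.
    """

    def __init__(self, reference, window):
        self.reference = list(reference)    # assumed to be kept for the whole decode
        self.recent = deque(maxlen=window)  # the oldest entry is evicted automatically

    def append(self, entry):
        """Store the key/value entry for a newly decoded token."""
        self.recent.append(entry)

    def visible(self):
        """Entries the next decoding step could attend to."""
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


cache = BoundedKVCache(reference=["img0", "img1", "img2"], window=4)
sizes = []
for step in range(10):
    cache.append(f"tok{step}")
    sizes.append(len(cache))

print(sizes)            # grows until the window fills, then stays flat
print(cache.visible())  # reference block plus the four most recent tokens
```

Once the window fills, the entry count stops at len(reference) + window regardless of how many further tokens are decoded, which is the property the constant-cache claim relies on.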

If this is right

OCR models can now handle multi-page documents without splitting or repeated passes.
The constant cache removes the progressive slowdown that normally appears as output length grows.
R-SWA can be swapped into other sequence-to-sequence tasks that require long output sequences.
The design emulates human working memory by avoiding ever-growing state during copying tasks.
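
To make the memory side of these points concrete, the sketch below applies the standard KV cache size estimate: one key and one value vector per layer, per KV head, per cached token. The layer count, head count, head size, prefix length, window length, and the assumption that the prefix stays cached are hypothetical values chosen for illustration; none of them come from the paper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Standard estimate; the factor 2 covers one key and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def cached_tokens_full(prefix, generated):
    """Full attention keeps every prefix token and every generated token."""
    return prefix + generated


def cached_tokens_windowed(prefix, generated, window):
    """Assumed windowed scheme: the prefix stays, only the last `window` outputs are kept."""
    return prefix + min(generated, window)


# Hypothetical decoder shape, not taken from the paper.
LAYERS, KV_HEADS, HEAD_DIM = 12, 8, 64
PREFIX, WINDOW = 1_000, 512

for generated in (1_000, 10_000, 30_000):
    full = kv_cache_bytes(cached_tokens_full(PREFIX, generated), LAYERS, KV_HEADS, HEAD_DIM)
    windowed = kv_cache_bytes(
        cached_tokens_windowed(PREFIX, generated, WINDOW), LAYERS, KV_HEADS, HEAD_DIM
    )
    print(f"{generated:>6} output tokens: full {full / 2**20:.1f} MiB, windowed {windowed / 2**20:.1f} MiB")
```

Under these assumptions the full-attention figure grows linearly with the output length, while the windowed figure stops growing once the output exceeds the window.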

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Tasks like automatic speech recognition or machine translation could adopt the same attention change to handle long inputs or outputs efficiently.
If R-SWA truly preserves accuracy, it offers a drop-in replacement for standard attention in any decoder that copies or transcribes text.
Future work could test whether the constant cache also reduces peak memory enough to run larger models on the same hardware.

Load-bearing premise

That swapping in R-SWA keeps the original OCR accuracy and language-modeling benefits intact even though no accuracy numbers or baseline comparisons are shown.

What would settle it

Run Unlimited OCR and the baseline DeepSeek OCR on the same set of multi-page documents and measure character error rate; if error rate rises sharply with R-SWA, the claim that accuracy is preserved fails.
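
The measurement proposed here can be sketched in a few lines. Character error rate is conventionally the Levenshtein edit distance between the reference transcription and the model output, divided by the reference length. The function names and sample strings below are illustrative only, and the sketch does not reproduce any evaluation from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character out of four.
print(cer("abcd", "abxd"))  # 0.25
```

Running the same function over the outputs of both models on identical multi-page inputs would give the comparison described above.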

Original abstract

Recently, end-to-end OCR models, exemplified by DeepSeek OCR, have once again thrust OCR into the spotlight. A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation. This stands in stark contrast to humans, who exhibit no such decline in efficiency during long-horizon copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate human parsing working memory. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces attention computation costs while maintaining a constant KV cache throughout the entire decoding process. By combining the high compression rate of DeepSeek OCR's encoder with our constant KV cache design, Unlimited OCR can transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism - beyond OCR, it is equally applicable to tasks such as ASR, translation, etc. Codes and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Unlimited OCR claims constant KV cache via R-SWA but supplies zero evidence that OCR accuracy is preserved.

The key thing here is that Unlimited OCR replaces the decoder attention with R-SWA to hold the KV cache at constant size, which in principle lets the model handle long document outputs without the usual memory and speed penalties. The paper also releases the code and weights.

What stands out as new is the R-SWA mechanism, described as a general-purpose attention mechanism for parsing tasks like OCR, ASR, and translation. The model builds directly on the DeepSeek OCR baseline by swapping in this attention variant across all decoder layers.

The work does a decent job framing the practical problem: LLM decoders accumulate KV cache on long sequences, unlike human reading. Combining the encoder's compression with constant cache is a reasonable engineering goal.

The soft spots are substantial though. There are no experimental results whatsoever: no error rates, no ablations, no comparisons showing that R-SWA preserves accuracy relative to the original attention. The claim that it maintains the language prior benefits rests entirely on assertion. The concern that accuracy preservation goes unevaluated is accurate based on what's provided.

This is aimed at practitioners in document AI who need to transcribe multi-page inputs efficiently. Someone looking for a method with demonstrated performance gains or rigorous validation will not find it here.

I would not recommend sending this to peer review in its current state. It reads more like an early technical note than a complete paper.

Referee Report

2 major / 0 minor

Summary. The paper proposes Unlimited OCR, an extension of DeepSeek OCR that replaces all decoder attention layers with Reference Sliding Window Attention (R-SWA). This is claimed to enforce a constant KV cache size while preserving the LLM decoder's language prior, enabling transcription of dozens of pages in one forward pass under a 32K context limit. R-SWA is further positioned as a general-purpose mechanism applicable beyond OCR to tasks such as ASR and translation.

Significance. If the accuracy-preservation claim holds with supporting measurements, the work would address a practical bottleneck in long-context LLM-based OCR and parsing models by decoupling memory usage from output length. The public release of code and weights would further strengthen its potential impact as a reusable attention variant.

major comments (2)

[Abstract] The central claim that R-SWA 'maintains' the language prior benefits and overall OCR accuracy of the DeepSeek OCR baseline is asserted without any CER/WER numbers, ablation tables, or direct comparisons, leaving the accuracy-preservation step as an unevaluated assumption rather than a demonstrated result.
[Abstract] No quantitative results, ablation studies, or error analysis are supplied to support the performance claims (dozens of pages in a single 32K pass) or the generality claim for ASR/translation, which are load-bearing for the paper's contribution.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on the need for empirical support. We respond point-by-point below.

Referee: [Abstract] The central claim that R-SWA 'maintains' the language prior benefits and overall OCR accuracy of the DeepSeek OCR baseline is asserted without any CER/WER numbers, ablation tables, or direct comparisons, leaving the accuracy-preservation step as an unevaluated assumption rather than a demonstrated result.

Authors: We agree the comment is correct: the abstract asserts preservation of the language prior without supporting measurements or comparisons. The technical report centers on the R-SWA design for constant KV cache while retaining the original decoder structure, but this does not constitute a demonstration. We will revise the abstract to describe R-SWA as intended to preserve the prior rather than claiming it maintains accuracy.

Revision: yes

Referee: [Abstract] No quantitative results, ablation studies, or error analysis are supplied to support the performance claims (dozens of pages in a single 32K pass) or the generality claim for ASR/translation, which are load-bearing for the paper's contribution.

Authors: We agree the comment is correct: the abstract states the ability to transcribe dozens of pages and positions R-SWA as general-purpose without quantitative results, ablations, or error analysis. The constant-KV-cache property is a direct consequence of the sliding-window design, but the specific performance numbers and cross-task applicability are not demonstrated. We will revise the abstract and add a limitations paragraph to qualify these statements as design implications rather than evaluated outcomes.

Revision: yes

standing simulated objections not resolved

The request for CER/WER numbers, ablation tables, error analysis, or results on ASR/translation tasks remains unresolved, because the current manuscript is a method-focused technical report without experimental evaluations.

Circularity Check

0 steps flagged

No derivation chain present; architectural proposal only

The manuscript describes an engineering modification: replace decoder attention layers in an external baseline (DeepSeek OCR) with a new mechanism called R-SWA to enforce constant KV cache size. No equations, no fitted parameters, no derived predictions, and no self-citations appear in the provided text. The central claim is a design assertion rather than a mathematical reduction, so no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities can be identified with any specificity. The R-SWA mechanism is described at a high level without implementation details or assumptions listed.

Reference graph

Works this paper leans on

37 extracted references · 15 linked inside Pith

[1] Nanonets-ocr-s, 2025. URL https://huggingface.co/nanonets/Nanonets-OCR-s
[2] Ocrverse, 2025. URL https://github.com/DocTron-hub/OCRVerse
[3] Ocrflux, 2025. URL https://github.com/chatdoc-com/OCRFlux
[4] G. AI. Gemini 2.5-pro, 2025. URL https://gemini.google.com/
[5] alibaba, 2026. URL https://github.com/alibaba/Logics-Parsing
[6] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023
[8] URL https://arxiv.org/abs/2511.21631
[9] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
[10] L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023
[11] C. Cui, T. Sun, S. Liang, et al. Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025
[12] C. Cui, T. Sun, M. Lin, T. Gao, Y. Zhang, J. Liu, X. Wang, Z. Zhang, C. Zhou, H. Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025
[13] D. Dong, M. Zheng, D. Xu, C. Luo, B. Zhuang, Y. Li, R. He, H. Wang, W. Zhang, W. Wang, et al. Qianfan-ocr: A unified end-to-end model for document intelligence. arXiv preprint arXiv:2603.13398, 2026
[14] H. Feng, S. Wei, X. Fei, W. Shi, Y. Han, L. Liao, J. Lu, B. Wu, Q. Liu, C. Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. arXiv preprint arXiv:2505.14059, 2025
[15] A. Huang, C. Yao, C. Han, F. Wan, H. Guo, H. Lv, H. Zhou, J. Wang, J. Zhou, J. Sun, et al. Step3-vl-10b technical report. arXiv preprint arXiv:2601.09668, 2026
[16] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023
[17] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023
[18] Z. Li, Y. Liu, Q. Liu, Z. Ma, Z. Zhang, S. Zhang, Z. Guo, J. Zhang, X. Wang, and X. Bai. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025
[19] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025
[20] Y. Liu, Z. Zhao, L. Tian, et al. Points-reader: Distillation-free adaptation of vision-language models for document conversion. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1576–1601, November 2025
[21] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016
[22] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019
[23] OpenAI. Gpt-4 technical report, 2023
[24] L. Ouyang, Y. Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, Z. Zhao, M. Jiang, X. Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025
[25] J. Poznanski, A. Rangapur, J. Borchardt, J. Dunkelberger, R. Huff, D. Lin, C. Wilhelm, K. Lo, and L. Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models. arXiv preprint arXiv:2502.18443, 2025
[26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
[27] Rednote. dots.ocr, 2025. URL https://github.com/rednote-hilab/dots.ocr
[28] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019
[29] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
[30] H. V. Team, P. Lyu, X. Wan, G. Li, S. Peng, W. Wang, L. Wu, H. Shen, Y. Zhou, C. Tang, et al. Hunyuanocr technical report. arXiv preprint arXiv:2511.19575, 2025
[31] B. Wang, C. Xu, X. Zhao, L. Ouyang, F. Wu, Z. Zhao, R. Xu, K. Liu, Y. Qu, F. Shang, et al. Mineru: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024
[32] W. Wang, Z. Gao, L. Gu, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025
[33] H. Wei, L. Kong, J. Chen, L. Zhao, Z. Ge, J. Yang, J. Sun, C. Han, and X. Zhang. Vary: Scaling up the vision vocabulary for large vision-language model. In European Conference on Computer Vision, pages 408–424. Springer, 2024
[34] H. Wei, C. Liu, J. Chen, J. Wang, L. Kong, Y. Xu, Z. Ge, L. Zhao, J. Sun, Y. Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024
[35] H. Wei, Y. Sun, and Y. Li. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025
[36] H. Wei, Y. Sun, and Y. Li. Deepseek-ocr 2: Visual causal flow. arXiv preprint arXiv:2601.20552, 2026
[37] H. Wu, H. Lou, X. Li, Z. Zhong, Z. Sun, P. Chen, X. Zhou, K. Zuo, Y. Chen, X. Tang, et al. Firered-ocr technical report. arXiv preprint arXiv:2603.01840, 2026
[38] J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025
