arxiv: 2409.01704 · v1 · pith:VPF5YSOVnew · submitted 2024-09-03 · 💻 cs.CV

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Pith reviewed 2026-05-17 20:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords: general OCR · unified end-to-end model · OCR-2.0 · prompt-based output · document recognition · multi-task vision · scene text understanding · formula and table parsing

The pith

A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional OCR systems fall short as demand grows for processing diverse man-made visual content. The authors redefine plain texts, math and molecular formulas, tables, charts, sheet music, and geometric shapes collectively as characters under a General OCR Theory. This leads to the GOT model, a 580-million-parameter end-to-end system with a high-compression encoder and long-context decoder. The model accepts scene or document images in slices or full pages and produces plain or formatted outputs such as markdown, tikz, or smiles based on simple prompts while supporting coordinate- or color-guided interactive recognition. If the approach holds, it could consolidate many separate OCR tools into one versatile system for broader practical use.

Core claim

The General OCR Theory treats all artificial optical signals as characters and introduces the GOT model as a unified end-to-end solution for OCR-2.0. With 580M parameters, the model combines a high-compression encoder and a long-context decoder to process both scene-style and document-style images in slice or whole-page formats. It generates plain or structured results through prompt control and enables interactive features such as region-level recognition guided by coordinates or colors, while incorporating dynamic resolution and multi-page handling.

What carries the argument

The GOT model: a high-compression encoder feeding a long-context decoder, with input prompts that select the output format and optionally restrict recognition to a guided region.
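To make the prompt-control idea concrete, here is a minimal sketch of how a caller might assemble a prompt from an output format plus optional region guidance. The template strings, the `build_prompt` helper, and the bracketed region prefix are all hypothetical; the page does not state the prompt templates GOT actually uses, so treat this as an illustration of the interface shape rather than the model's real API.

```python
from typing import Optional, Tuple

# Hypothetical templates: GOT's real prompt strings are not given on this page.
FORMAT_PROMPTS = {
    "plain": "OCR:",
    "markdown": "OCR with format (markdown):",
    "tikz": "OCR with format (tikz):",
    "smiles": "OCR with format (smiles):",
    "kern": "OCR with format (kern):",
}


def build_prompt(
    fmt: str = "plain",
    box: Optional[Tuple[int, int, int, int]] = None,
    color: Optional[str] = None,
) -> str:
    """Build a prompt that selects an output format and, optionally, a region.

    `box` is (x1, y1, x2, y2) in pixels; `color` names a colored frame drawn
    on the image. Both encodings are assumptions made for illustration.
    """
    if fmt not in FORMAT_PROMPTS:
        raise ValueError(f"unknown output format: {fmt!r}")
    if box is not None and color is not None:
        raise ValueError("give either a box or a color, not both")

    prefix = ""
    if box is not None:
        x1, y1, x2, y2 = box
        if not (x1 < x2 and y1 < y2):
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")
        prefix = f"[{x1},{y1},{x2},{y2}] "
    elif color is not None:
        prefix = f"[{color}] "

    return prefix + FORMAT_PROMPTS[fmt]
```

Keeping format selection and region guidance in one string means a single decoder sees every task through the same input channel, which is the property the unified-model claim depends on.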

Load-bearing premise

That one 580M-parameter end-to-end model using prompt-based formatting can sustain high accuracy across all character types and input styles without interference or the need for separate task-specific parts.

What would settle it

A controlled test showing that accuracy on mathematical formulas falls when the same model is also trained on tables and charts, or that the model requires separate fine-tuning for different character types to reach competitive performance.
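One way to phrase that test in code is sketched below: compare the unified model's per-task score with a task-specific baseline and flag every task where the unified model trails by more than a tolerance. The function name, the score dictionaries, and the numbers in the usage lines are placeholders chosen for illustration; none of them are results from the paper.

```python
from typing import Dict, List


def negative_transfer(
    unified: Dict[str, float],
    specialist: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the tasks where the unified model trails the specialist.

    Scores are "higher is better" (for example accuracy in [0, 1]).
    A task is flagged when specialist - unified exceeds `tolerance`.
    """
    missing = set(specialist) - set(unified)
    if missing:
        raise KeyError(f"unified model has no score for: {sorted(missing)}")
    return sorted(
        task
        for task, score in specialist.items()
        if score - unified[task] > tolerance
    )


if __name__ == "__main__":
    # Placeholder numbers, NOT measurements from the paper.
    unified_scores = {"formula": 0.90, "table": 0.88}
    specialist_scores = {"formula": 0.93, "table": 0.87}
    print(negative_transfer(unified_scores, specialist_scores, tolerance=0.01))
```

An empty list under a sensible tolerance would support the premise; a non-empty one names the character types where joint training costs accuracy.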

Original abstract

Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GOT tries to unify OCR across text, formulas, tables and more in one 580M model via prompts, but the shared decoder's ability to avoid negative transfer across tasks is the part that still needs real proof.

The paper's main move is to treat everything from scene text to molecular formulas, tables, charts, sheet music, and geometric shapes as 'characters' under a single General OCR Theory, then deliver a 580M end-to-end model called GOT that uses a high-compression encoder plus long-context decoder and prompt conditioning to produce plain text, markdown, tikz, or smiles outputs. It also adds interactive region selection by coordinates or color and handles dynamic resolution plus multi-page inputs. That framing and the concrete architecture choices are what the authors put forward as the step toward OCR-2.0.

Referee Report

2 major / 2 minor

Summary. The paper proposes the General OCR Theory, redefining all artificial optical signals (plain texts, math/molecular formulas, tables, charts, sheet music, geometric shapes) as 'characters'. It introduces the GOT model: a 580M-parameter unified end-to-end system with a high-compression encoder and long-context decoder. GOT supports scene/document images (slice or whole-page), outputs plain or formatted results (markdown/tikz/smiles/kern) via prompts, enables interactive region-level recognition by coordinates or colors, and incorporates dynamic resolution and multi-page processing. The authors state that experiments provide sufficient results proving the model's superiority for advancing OCR-2.0.

Significance. If the empirical claims hold, this unified approach could reduce reliance on task-specific OCR pipelines and enable more flexible handling of diverse document and scene content. The prompt-based output formatting and interactive features are practical strengths. However, significance hinges on demonstrating that a single shared decoder maintains high accuracy across conflicting output formats without negative transfer, which is not yet substantiated by the available details.

major comments (2)

[Architecture] Model description (high-compression encoder + long-context decoder): no inductive bias, regularization, or task routing is described to prevent negative transfer when jointly optimizing cross-entropy over plain text, table markdown, SMILES, and TikZ outputs. This directly undermines the central claim that one 580M-parameter model can handle all listed character types without interference or accuracy degradation on structured tasks.
[Experiments] Experiments section: the abstract asserts 'sufficient results to prove the superiority of our model' yet supplies no per-task metrics, baseline comparisons, error bars, or analysis of interference between output vocabularies. Without these, the superiority claim for the unified end-to-end design cannot be evaluated and is load-bearing for the OCR-2.0 contribution.
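The per-task metrics and error bars requested above could be produced along the following lines. This is a sketch under stated assumptions: it uses character-level normalized edit distance as the metric (a common OCR choice, not confirmed here as the paper's metric) and a percentile bootstrap for the interval. The helper names and the `(task, prediction, reference)` sample layout are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer length; 0.0 is a perfect match."""
    denom = max(len(pred), len(ref))
    return edit_distance(pred, ref) / denom if denom else 0.0


def bootstrap_ci(values: Sequence[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(values) / n, lower, upper


def per_task_report(
    samples: Iterable[Tuple[str, str, str]],
) -> Dict[str, Tuple[float, float, float]]:
    """Group (task, prediction, reference) triples and score each task."""
    by_task: Dict[str, List[float]] = defaultdict(list)
    for task, pred, ref in samples:
        by_task[task].append(normalized_edit_distance(pred, ref))
    return {task: bootstrap_ci(vals) for task, vals in by_task.items()}
```

Reporting one interval per output format (plain text, markdown, SMILES, TikZ) rather than a single pooled number is what would let a reader see interference between formats if it exists.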

minor comments (2)

[Introduction] The introduction of 'General OCR Theory' as a named contribution would benefit from a concise formal statement or set of principles rather than solely descriptive text.
[Model] Notation for output formats (markdown/tikz/smiles/kern) is introduced via prompt but lacks an example of the exact prompt template used for each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our paper proposing the General OCR Theory and the GOT model. We address the major comments point by point below, providing clarifications and indicating revisions where appropriate.

Referee: [Architecture] Model description (high-compression encoder + long-context decoder): no inductive bias, regularization, or task routing is described to prevent negative transfer when jointly optimizing cross-entropy over plain text, table markdown, SMILES, and TikZ outputs. This directly undermines the central claim that one 580M-parameter model can handle all listed character types without interference or accuracy degradation on structured tasks.

Authors: We agree that the current manuscript does not provide a detailed description of mechanisms to mitigate negative transfer. The model relies on prompt engineering to specify the output format, which guides the decoder accordingly. The training data is curated to balance different tasks. To strengthen the paper, we will revise the architecture section to include more details on the training procedure, any regularization techniques employed, and how the shared decoder handles diverse outputs without significant interference.

revision: yes

Referee: [Experiments] Experiments section: the abstract asserts 'sufficient results to prove the superiority of our model' yet supplies no per-task metrics, baseline comparisons, error bars, or analysis of interference between output vocabularies. Without these, the superiority claim for the unified end-to-end design cannot be evaluated and is load-bearing for the OCR-2.0 contribution.

Authors: The experiments in the manuscript do include results across various OCR tasks demonstrating the model's capabilities. However, we acknowledge that a more granular breakdown with per-task metrics, explicit baseline comparisons, and analysis of potential negative transfer would strengthen the claims. In the revised version, we will expand the experiments section to include these details, including any available error bars from repeated evaluations and discussion on output vocabulary interference.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model proposal with no derivation chain or self-referential reductions

The paper proposes a redefinition of 'characters' to encompass diverse artificial optical signals and introduces the GOT model as an end-to-end architecture for OCR-2.0 tasks. No equations, first-principles derivations, or load-bearing predictions appear in the provided text. Claims rest on architectural description (high-compression encoder + long-context decoder) and empirical results rather than any step that reduces by construction to fitted inputs or prior self-citations. The central premise is a practical unification via prompting and dynamic resolution, which is independently testable against benchmarks and does not invoke uniqueness theorems or ansatzes from the authors' prior work. This is a standard model paper whose validity hinges on external validation, not internal definitional loops.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Review based on abstract only; no detailed equations or methods available. The 580M parameter count is stated but not decomposed. The new theory name and assumption of unified prompt-based handling constitute the main additions.

free parameters (1)

Total model parameters
580M parameter count given without breakdown into specific fitted values or hyperparameters.

axioms (1)

domain assumption: A single unified end-to-end architecture with prompt control can effectively process all listed artificial optical signal types at high accuracy.
Central premise invoked when positioning GOT as OCR-2.0 without task-specific modules.

invented entities (1)

General OCR Theory · no independent evidence
purpose: To provide a unifying conceptual framework that treats diverse man-made optical signals as equivalent 'characters'.
Introduced in the paper to motivate the model; no prior literature reference visible in abstract.

pith-pipeline@v0.9.0 · 5569 in / 1507 out tokens · 45207 ms · 2026-05-17T20:46:27.603037+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
cs.CV 2026-03 conditional novelty 7.0

OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
cs.CL 2026-02 unverdicted novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
cs.CV 2026-01 unverdicted novelty 7.0

LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...

FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR
cs.CV 2025-11 unverdicted novelty 7.0

FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
cs.CV 2024-12 accept novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

DocAtlas: Multilingual Document Understanding Across 80+ Languages
cs.CL 2026-05 unverdicted novelty 6.0

DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

TableSeq: Unified Generation of Structure, Content, and Layout
cs.CV 2026-04 unverdicted novelty 6.0

TableSeq unifies table structure recognition, content extraction, and cell localization by generating an interleaved autoregressive sequence of HTML tags, cell text, and discretized coordinate tokens from an input image.

InstructTable: Improving Table Structure Recognition Through Instructions
cs.CV 2026-04 unverdicted novelty 6.0

InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
cs.CV 2026-03 conditional novelty 6.0

PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.

Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
cs.CV 2026-03 unverdicted novelty 6.0

A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.

DeepSeek-OCR: Contexts Optical Compression
cs.CV 2025-10 unverdicted novelty 6.0

DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
cs.CV 2025-09 unverdicted novelty 6.0

MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

Muon is Scalable for LLM Training
cs.LG 2025-02 unverdicted novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making
cs.AI 2026-05 unverdicted novelty 5.0

SKG-VLA models each complaint as a structured scene via a Scene Knowledge Graph to improve policy-grounded multimodal reasoning and decision accuracy.

MinerU: An Open-Source Solution for Precise Document Content Extraction
cs.CV 2024-09 conditional novelty 4.0

MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.

Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
cs.CV 2026-01 unverdicted novelty 3.0

A 7B-parameter domain-specific image captioning model for ICT, trained in three stages on synthesized and annotated data, outperforms 32B-parameter general models on BLEU and expert accuracy metrics.
