pith. machine review for the scientific record.

arxiv: 2409.01704 · v1 · pith:VPF5YSOVnew · submitted 2024-09-03 · 💻 cs.CV

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Pith reviewed 2026-05-17 20:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords general OCR · unified end-to-end model · OCR-2.0 · prompt-based output · document recognition · multi-task vision · scene text understanding · formula and table parsing

The pith

A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional OCR systems fall short as demand grows for processing diverse man-made visual content. The authors redefine plain texts, math and molecular formulas, tables, charts, sheet music, and geometric shapes collectively as characters under a General OCR Theory. This leads to the GOT model, a 580-million-parameter end-to-end system with a high-compression encoder and long-context decoder. The model accepts scene or document images, either as slices or as full pages, and produces plain or formatted outputs such as markdown, TikZ, or SMILES in response to simple prompts. It also supports interactive recognition guided by coordinates or colors. If the approach holds, it could consolidate many separate OCR tools into one versatile system for broader practical use.

Core claim

The General OCR Theory treats all artificial optical signals as characters and introduces the GOT model as a unified end-to-end solution for OCR-2.0. With 580M parameters, the model combines a high-compression encoder and a long-context decoder to process both scene-style and document-style images in slice or whole-page formats. It generates plain or structured results through prompt control and enables interactive features such as region-level recognition guided by coordinates or colors, while incorporating dynamic resolution and multi-page handling.

What carries the argument

The GOT model: a high-compression encoder paired with a long-context decoder, which uses input prompts to select the output format and supports interactive region guidance.

Load-bearing premise

That one 580M-parameter end-to-end model using prompt-based formatting can sustain high accuracy across all character types and input styles without interference or the need for separate task-specific parts.

What would settle it

A controlled test showing that accuracy on mathematical formulas falls when the same model is also trained on tables and charts, or that the model requires separate fine-tuning for different character types to reach competitive performance.
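The controlled test described above could be scored along the following lines. This is a minimal sketch, not the paper's evaluation protocol: the choice of normalized edit distance as the metric, the helper names (`mean_error`, `interference_gap`), and the example strings are all assumptions introduced here for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer length; 0.0 is a perfect match."""
    longest = max(len(pred), len(ref))
    return 0.0 if longest == 0 else edit_distance(pred, ref) / longest


def mean_error(pairs: List[Tuple[str, str]]) -> float:
    """Average normalized edit distance over (prediction, reference) pairs."""
    return sum(normalized_edit_distance(p, r) for p, r in pairs) / len(pairs)


def interference_gap(
    single_task: List[Tuple[str, str]], joint: List[Tuple[str, str]]
) -> float:
    """Joint-model error minus single-task error on the same references.

    A positive gap would suggest the jointly trained model is worse on this task.
    """
    return mean_error(joint) - mean_error(single_task)


# Hypothetical formula outputs, invented purely for illustration.
references = ["x^2+y^2=z^2", "\\frac{a}{b}"]
single_task_preds = ["x^2+y^2=z^2", "\\frac{a}{b}"]
joint_preds = ["x^2+y^2=z^2", "\\frac{a}{6}"]

gap = interference_gap(
    list(zip(single_task_preds, references)),
    list(zip(joint_preds, references)),
)
print(f"formula interference gap: {gap:.3f}")
```

A real version of this test would compare a formula-only model against the jointly trained one on a held-out benchmark, with repeated runs to tell a genuine gap from noise.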

Original abstract

Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the General OCR Theory, redefining all artificial optical signals (plain texts, math/molecular formulas, tables, charts, sheet music, geometric shapes) as 'characters'. It introduces the GOT model: a 580M-parameter unified end-to-end system with a high-compression encoder and long-context decoder. GOT supports scene/document images (slice or whole-page), outputs plain or formatted results (markdown/tikz/smiles/kern) via prompts, enables interactive region-level recognition by coordinates or colors, and incorporates dynamic resolution and multi-page processing. The authors state that experiments provide sufficient results proving the model's superiority for advancing OCR-2.0.

Significance. If the empirical claims hold, this unified approach could reduce reliance on task-specific OCR pipelines and enable more flexible handling of diverse document and scene content. The prompt-based output formatting and interactive features are practical strengths. However, significance hinges on demonstrating that a single shared decoder maintains high accuracy across conflicting output formats without negative transfer, which is not yet substantiated by the available details.

major comments (2)
  1. [Architecture] Model description (high-compression encoder + long-context decoder): no inductive bias, regularization, or task routing is described to prevent negative transfer when jointly optimizing cross-entropy over plain text, table markdown, SMILES, and TikZ outputs. This directly undermines the central claim that one 580M-parameter model can handle all listed character types without interference or accuracy degradation on structured tasks.
  2. [Experiments] Experiments section: the abstract asserts 'sufficient results to prove the superiority of our model' yet supplies no per-task metrics, baseline comparisons, error bars, or analysis of interference between output vocabularies. Without these, the superiority claim for the unified end-to-end design cannot be evaluated and is load-bearing for the OCR-2.0 contribution.
minor comments (2)
  1. [Introduction] The introduction of 'General OCR Theory' as a named contribution would benefit from a concise formal statement or set of principles rather than solely descriptive text.
  2. [Model] Notation for output formats (markdown/tikz/smiles/kern) is introduced via prompt but lacks an example of the exact prompt template used for each.
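The joint objective that major comment 1 refers to can be written out to make the concern concrete. The notation is an assumption introduced here, not taken from the paper: $\mathcal{T}$ is the set of output formats, $\mathcal{D}_t$ the training data for format $t$, $p_t$ the prompt that selects it, $x$ the input image, $y$ the target token sequence, and $\lambda_t$ a hypothetical per-format weight.

```latex
\mathcal{L}(\theta)
  = \sum_{t \in \mathcal{T}} \lambda_t \,
    \mathbb{E}_{(x,\, y) \sim \mathcal{D}_t}
    \left[ - \sum_{i=1}^{|y|} \log p_\theta\!\left( y_i \mid y_{<i},\, x,\, p_t \right) \right]
```

Every term shares the same parameters $\theta$, so a gradient step that lowers the loss for one format can raise it for another. That is the negative transfer the referee asks the authors to rule out.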

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our paper proposing the General OCR Theory and the GOT model. We address the major comments point by point below, providing clarifications and indicating revisions where appropriate.

Point-by-point responses
  1. Referee: [Architecture] Model description (high-compression encoder + long-context decoder): no inductive bias, regularization, or task routing is described to prevent negative transfer when jointly optimizing cross-entropy over plain text, table markdown, SMILES, and TikZ outputs. This directly undermines the central claim that one 580M-parameter model can handle all listed character types without interference or accuracy degradation on structured tasks.

    Authors: We agree that the current manuscript does not provide a detailed description of mechanisms to mitigate negative transfer. The model relies on prompt engineering to specify the output format, which guides the decoder accordingly. The training data is curated to balance different tasks. To strengthen the paper, we will revise the architecture section to include more details on the training procedure, any regularization techniques employed, and how the shared decoder handles diverse outputs without significant interference. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract asserts 'sufficient results to prove the superiority of our model' yet supplies no per-task metrics, baseline comparisons, error bars, or analysis of interference between output vocabularies. Without these, the superiority claim for the unified end-to-end design cannot be evaluated and is load-bearing for the OCR-2.0 contribution.

    Authors: The experiments in the manuscript do include results across various OCR tasks demonstrating the model's capabilities. However, we acknowledge that a more granular breakdown with per-task metrics, explicit baseline comparisons, and analysis of potential negative transfer would strengthen the claims. In the revised version, we will expand the experiments section to include these details, including any available error bars from repeated evaluations and discussion on output vocabulary interference. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model proposal with no derivation chain or self-referential reductions

full rationale

The paper proposes a redefinition of 'characters' to encompass diverse artificial optical signals and introduces the GOT model as an end-to-end architecture for OCR-2.0 tasks. No equations, first-principles derivations, or load-bearing predictions appear in the provided text. Claims rest on architectural description (high-compression encoder + long-context decoder) and empirical results rather than any step that reduces by construction to fitted inputs or prior self-citations. The central premise is a practical unification via prompting and dynamic resolution, which is independently testable against benchmarks and does not invoke uniqueness theorems or ansatzes from the authors' prior work. This is a standard model paper whose validity hinges on external validation, not internal definitional loops.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Review based on abstract only; no detailed equations or methods available. The 580M parameter count is stated but not decomposed. The new theory name and assumption of unified prompt-based handling constitute the main additions.

free parameters (1)
  • Total model parameters
    580M parameter count given without breakdown into specific fitted values or hyperparameters.
axioms (1)
  • domain assumption: A single unified end-to-end architecture with prompt control can effectively process all listed artificial optical signal types at high accuracy.
    Central premise invoked when positioning GOT as OCR-2.0 without task-specific modules.
invented entities (1)
  • General OCR Theory (no independent evidence)
    purpose: To provide a unifying conceptual framework that treats diverse man-made optical signals as equivalent 'characters'.
    Introduced in the paper to motivate the model; no prior literature reference visible in abstract.

pith-pipeline@v0.9.0 · 5569 in / 1507 out tokens · 45207 ms · 2026-05-17T20:46:27.603037+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

  2. OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.

  3. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

    cs.CL 2026-02 unverdicted novelty 7.0

    Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

  4. Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

    cs.CV 2026-01 unverdicted novelty 7.0

    LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...

  5. FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

    cs.CV 2025-11 unverdicted novelty 7.0

    FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.

  6. OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

    cs.CV 2024-12 accept novelty 7.0

    OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

  7. DocAtlas: Multilingual Document Understanding Across 80+ Languages

    cs.CL 2026-05 unverdicted novelty 6.0

    DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

  8. TableSeq: Unified Generation of Structure, Content, and Layout

    cs.CV 2026-04 unverdicted novelty 6.0

    TableSeq unifies table structure recognition, content extraction, and cell localization by generating an interleaved autoregressive sequence of HTML tags, cell text, and discretized coordinate tokens from an input image.

  9. InstructTable: Improving Table Structure Recognition Through Instructions

    cs.CV 2026-04 unverdicted novelty 6.0

    InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...

  10. Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

    cs.CV 2026-03 conditional novelty 6.0

    PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.

  11. Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

    cs.CV 2026-03 unverdicted novelty 6.0

    A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.

  12. DeepSeek-OCR: Contexts Optical Compression

    cs.CV 2025-10 unverdicted novelty 6.0

    DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

  13. MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

    cs.CV 2025-09 unverdicted novelty 6.0

    MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

  14. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  15. SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making

    cs.AI 2026-05 unverdicted novelty 5.0

    SKG-VLA models each complaint as a structured scene via a Scene Knowledge Graph to improve policy-grounded multimodal reasoning and decision accuracy.

  16. MinerU: An Open-Source Solution for Precise Document Content Extraction

    cs.CV 2024-09 conditional novelty 4.0

    MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.

  17. Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain

    cs.CV 2026-01 unverdicted novelty 3.0

    A 7B-parameter domain-specific image captioning model for ICT, trained in three stages on synthesized and annotated data, outperforms 32B-parameter general models on BLEU and expert accuracy metrics.
