General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Pith reviewed 2026-05-17 20:46 UTC · model grok-4.3
The pith
A single unified model can recognize texts, formulas, tables, charts and more by treating them all as characters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The General OCR Theory treats all artificial optical signals as characters and introduces the GOT model as a unified end-to-end solution for OCR-2.0. With 580M parameters, the model combines a high-compression encoder and a long-context decoder to process both scene-style and document-style images, either as slices or as whole pages. It generates plain or structured results through prompt control and supports interactive features such as region-level recognition guided by coordinates or colors, and it incorporates dynamic resolution and multi-page handling.
What carries the argument
The GOT model: a high-compression encoder paired with a long-context decoder, which uses input prompts to select the output format and supports interactive region guidance.
Load-bearing premise
That one 580M-parameter end-to-end model using prompt-based formatting can sustain high accuracy across all character types and input styles without interference or the need for separate task-specific parts.
What would settle it
A controlled test showing that accuracy on mathematical formulas falls when the same model is also trained on tables and charts, or that the model requires separate fine-tuning for different character types to reach competitive performance.
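One way to phrase the settling test is as a per-task comparison between a jointly trained model and task-specific specialists. The sketch below is illustrative only: the function name, the score dictionaries, and the tolerance are assumptions introduced here, not part of the paper, and the scores would have to come from real evaluations.

```python
from typing import Dict


def negative_transfer(
    specialist: Dict[str, float],
    joint: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, float]:
    """Return the accuracy drop (specialist minus joint) for each task
    where the joint model trails the specialist by more than `tolerance`.

    Both arguments map a task name (e.g. "formula") to an accuracy in [0, 1].
    The task names and the tolerance are hypothetical, chosen by the evaluator.
    """
    drops: Dict[str, float] = {}
    for task, specialist_score in specialist.items():
        if task not in joint:
            raise KeyError(f"joint model has no score for task {task!r}")
        drop = specialist_score - joint[task]
        if drop > tolerance:
            drops[task] = drop
    return drops
```

An empty result would be consistent with the load-bearing premise; a non-empty result for formulas after adding table and chart data would be the kind of evidence described above.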
Original abstract
Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the General OCR Theory, redefining all artificial optical signals (plain texts, math/molecular formulas, tables, charts, sheet music, geometric shapes) as 'characters'. It introduces the GOT model: a 580M-parameter unified end-to-end system with a high-compression encoder and long-context decoder. GOT supports scene/document images (slice or whole-page), outputs plain or formatted results (markdown/tikz/smiles/kern) via prompts, enables interactive region-level recognition by coordinates or colors, and incorporates dynamic resolution and multi-page processing. The authors state that experiments provide sufficient results proving the model's superiority for advancing OCR-2.0.
Significance. If the empirical claims hold, this unified approach could reduce reliance on task-specific OCR pipelines and enable more flexible handling of diverse document and scene content. The prompt-based output formatting and interactive features are practical strengths. However, significance hinges on demonstrating that a single shared decoder maintains high accuracy across conflicting output formats without negative transfer, which is not yet substantiated by the available details.
major comments (2)
- [Architecture] Model description (high-compression encoder + long-context decoder): no inductive bias, regularization, or task routing is described to prevent negative transfer when jointly optimizing cross-entropy over plain text, table markdown, SMILES, and TikZ outputs. This directly undermines the central claim that one 580M-parameter model can handle all listed character types without interference or accuracy degradation on structured tasks.
- [Experiments] Experiments section: the abstract asserts 'sufficient results to prove the superiority of our model' yet supplies no per-task metrics, baseline comparisons, error bars, or analysis of interference between output vocabularies. Without these, the superiority claim for the unified end-to-end design cannot be evaluated and is load-bearing for the OCR-2.0 contribution.
minor comments (2)
- [Introduction] The introduction of 'General OCR Theory' as a named contribution would benefit from a concise formal statement or set of principles rather than solely descriptive text.
- [Model] Notation for output formats (markdown/tikz/smiles/kern) is introduced via prompt but lacks an example of the exact prompt template used for each.
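To make the last minor comment concrete, the sketch below shows what a documented prompt template could look like. Every prompt string, the `build_prompt` helper, and the bracketed region syntax are hypothetical placeholders, since the review does not quote the paper's actual templates.

```python
from typing import Dict, Optional, Tuple

# Hypothetical templates: the real prompt strings are not given in this review.
FORMAT_PROMPTS: Dict[str, str] = {
    "plain": "OCR: ",
    "markdown": "OCR with format (markdown): ",
    "tikz": "OCR with format (tikz): ",
    "smiles": "OCR with format (smiles): ",
    "kern": "OCR with format (kern): ",
}


def build_prompt(
    fmt: str,
    box: Optional[Tuple[int, int, int, int]] = None,
    color: Optional[str] = None,
) -> str:
    """Compose a prompt that selects an output format and, optionally,
    a region given either as a box (x1, y1, x2, y2) or as a color name."""
    if fmt not in FORMAT_PROMPTS:
        raise ValueError(f"unknown output format: {fmt!r}")
    if box is not None and color is not None:
        raise ValueError("give a box or a color, not both")

    prefix = ""
    if box is not None:
        x1, y1, x2, y2 = box
        if not (x1 < x2 and y1 < y2):
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")
        prefix = f"[{x1},{y1},{x2},{y2}] "
    elif color is not None:
        prefix = f"[{color}] "
    return prefix + FORMAT_PROMPTS[fmt]
```

A table like `FORMAT_PROMPTS` in the paper itself, with the real strings, would let readers reproduce each output mode.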
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our paper proposing the General OCR Theory and the GOT model. We address the major comments point by point below, providing clarifications and indicating revisions where appropriate.
- Referee: [Architecture] Model description (high-compression encoder + long-context decoder): no inductive bias, regularization, or task routing is described to prevent negative transfer when jointly optimizing cross-entropy over plain text, table markdown, SMILES, and TikZ outputs. This directly undermines the central claim that one 580M-parameter model can handle all listed character types without interference or accuracy degradation on structured tasks.
  Authors: We agree that the current manuscript does not provide a detailed description of mechanisms to mitigate negative transfer. The model relies on prompt engineering to specify the output format, which guides the decoder accordingly. The training data is curated to balance different tasks. To strengthen the paper, we will revise the architecture section to include more details on the training procedure, any regularization techniques employed, and how the shared decoder handles diverse outputs without significant interference.
  Revision: yes
- Referee: [Experiments] Experiments section: the abstract asserts 'sufficient results to prove the superiority of our model' yet supplies no per-task metrics, baseline comparisons, error bars, or analysis of interference between output vocabularies. Without these, the superiority claim for the unified end-to-end design cannot be evaluated and is load-bearing for the OCR-2.0 contribution.
  Authors: The experiments in the manuscript do include results across various OCR tasks demonstrating the model's capabilities. However, we acknowledge that a more granular breakdown with per-task metrics, explicit baseline comparisons, and analysis of potential negative transfer would strengthen the claims. In the revised version, we will expand the experiments section to include these details, including any available error bars from repeated evaluations and discussion on output vocabulary interference.
  Revision: yes
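The per-task metrics and error bars requested here could be reported along the following lines. This is a minimal sketch, assuming normalized edit distance as the per-sample metric and a percentile bootstrap for the interval; neither choice is confirmed by the review as the paper's protocol, and all function names are illustrative.

```python
import random
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer string's length (0.0 is a perfect match)."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over samples."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, lower, upper
```

Running `bootstrap_ci` separately on each task's per-sample scores (plain text, tables, formulas, and so on) would give the granular breakdown the referee asks for, and overlapping intervals between the unified model and a baseline would signal that a superiority claim is not yet supported.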
Circularity Check
No circularity: empirical model proposal with no derivation chain or self-referential reductions
The paper proposes a redefinition of 'characters' to encompass diverse artificial optical signals and introduces the GOT model as an end-to-end architecture for OCR-2.0 tasks. No equations, first-principles derivations, or load-bearing predictions appear in the provided text. Claims rest on architectural description (high-compression encoder + long-context decoder) and empirical results rather than any step that reduces by construction to fitted inputs or prior self-citations. The central premise is a practical unification via prompting and dynamic resolution, which is independently testable against benchmarks and does not invoke uniqueness theorems or ansatzes from the authors' prior work. This is a standard model paper whose validity hinges on external validation, not internal definitional loops.
Axiom & Free-Parameter Ledger
free parameters (1)
- Total model parameters
axioms (1)
- Domain assumption: a single unified end-to-end architecture with prompt control can effectively process all listed artificial optical signal types at high accuracy.
invented entities (1)
- General OCR Theory: no independent evidence
Forward citations
Cited by 17 Pith papers
- MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
  A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
- OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
  OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
- CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
  Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
- Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
  LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...
- FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR
  FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
  OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
- DocAtlas: Multilingual Document Understanding Across 80+ Languages
  DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
- TableSeq: Unified Generation of Structure, Content, and Layout
  TableSeq unifies table structure recognition, content extraction, and cell localization by generating an interleaved autoregressive sequence of HTML tags, cell text, and discretized coordinate tokens from an input image.
- InstructTable: Improving Table Structure Recognition Through Instructions
  InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...
- Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
  PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
- Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
  A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
- DeepSeek-OCR: Contexts Optical Compression
  DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
- MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
  MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
- Muon is Scalable for LLM Training
  Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
- SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making
  SKG-VLA models each complaint as a structured scene via a Scene Knowledge Graph to improve policy-grounded multimodal reasoning and decision accuracy.
- MinerU: An Open-Source Solution for Precise Document Content Extraction
  MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.
- Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
  A 7B-parameter domain-specific image captioning model for ICT, trained in three stages on synthesized and annotated data, outperforms 32B-parameter general models on BLEU and expert accuracy metrics.