pith. machine review for the scientific record.

arxiv: 2305.07895 · v7 · pith:7YTNYMNUnew · submitted 2023-05-13 · 💻 cs.CV · cs.CL

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

Pith reviewed 2026-05-17 09:49 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords OCR · Large Multimodal Models · Benchmark · Text Recognition · Visual Question Answering · Handwritten Text · Mathematical Expression Recognition · Multimodal Evaluation

The pith

OCRBench evaluates large multimodal models on 29 OCR datasets to expose their specific weaknesses in text recognition tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OCRBench as a standardized way to measure how well models like GPT-4V and Gemini read text from images across many scenarios. It compiles evaluations on tasks ranging from basic text recognition to document question answering and handwritten mathematical expression recognition. Results indicate that these models handle some clean printed text reasonably well but struggle with multilingual content, handwriting, non-semantic symbols, and complex mathematical notation. The benchmark supplies baseline numbers that future work can use to track improvements in zero-shot visual text understanding.

Core claim

Large multimodal models show uneven OCR performance, performing adequately on certain scene text and document tasks yet revealing clear limitations with handwritten, multilingual, non-semantic, and mathematical text; OCRBench supplies the 29-dataset testbed needed to quantify these gaps and guide targeted improvements in multimodal text handling.

What carries the argument

OCRBench, the evaluation benchmark assembled from 29 existing datasets spanning text recognition, scene text VQA, document VQA, key information extraction, and handwritten mathematical expression recognition.
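
As a rough illustration of how per-dataset results from a benchmark organized this way could be rolled up by task category, the sketch below groups scores by category and reports both a macro average (each category weighted equally) and a micro average (each dataset weighted equally). The function name `summarize_by_category`, the dataset names, and the scores are hypothetical placeholders; the paper's own aggregation rule is not stated in this summary and may differ.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_category(results):
    """Group per-dataset scores by task category.

    `results` is a list of (dataset_name, category, score) tuples.
    Returns (per_category_mean, macro_average, micro_average).
    This is an illustrative aggregation, not the paper's documented method.
    """
    by_category = defaultdict(list)
    for _name, category, score in results:
        by_category[category].append(score)

    # Mean score inside each category.
    per_category = {c: mean(scores) for c, scores in by_category.items()}
    # Macro: average of category means, so large categories do not dominate.
    macro = mean(per_category.values())
    # Micro: plain average over all datasets.
    micro = mean(score for _name, _category, score in results)
    return per_category, macro, micro


if __name__ == "__main__":
    # Placeholder entries only; these are not results from the paper.
    demo = [
        ("ds_1", "text_recognition", 0.5),
        ("ds_2", "text_recognition", 1.0),
        ("ds_3", "hmer", 0.25),
    ]
    print(summarize_by_category(demo))
```

Reporting both averages makes it visible when a category with many datasets, such as text recognition, is driving the headline number.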

If this is right

  • Baseline scores on OCRBench can serve as a reference for measuring whether new multimodal architectures improve text reading without task-specific fine-tuning.
  • Persistent low performance on handwritten and multilingual subsets points to the need for training data or model components that better handle varied scripts and writing styles.
  • The benchmark enables direct comparison of zero-shot OCR ability across models, clarifying which ones are currently most reliable for document and scene text applications.
  • Weak results on mathematical expression recognition suggest that general vision-language training leaves gaps that may require dedicated symbol-processing pathways.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of OCRBench could standardize reporting of text capabilities in new vision-language models, similar to how other benchmarks track general visual understanding.
  • The identified weaknesses may stem from training distributions that under-represent noisy, handwritten, or non-Latin text, implying that data curation strategies could close some gaps.
  • Future extensions might add video-based text or heavily degraded real-world images to test whether current model limitations persist under greater visual noise.
  • The work implies that purely end-to-end multimodal models may benefit from hybrid designs that incorporate explicit OCR modules for certain high-stakes text tasks.

Load-bearing premise

The selected 29 datasets together represent a balanced and sufficiently broad sample of all text-related visual challenges that multimodal models will face in practice.

What would settle it

Release of a new multimodal model that scores near the top on all 29 OCRBench datasets yet still fails on everyday text extraction from photos or videos outside those datasets.

Original abstract

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes OCRBench, a benchmark with 29 datasets for evaluating OCR capabilities of large multimodal models (e.g., GPT-4V, Gemini) across five task categories: Text Recognition, Scene Text-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. It reports model evaluations that highlight weaknesses in multilingual, handwritten, non-semantic, and mathematical text, supplies baseline results, and releases an evaluation pipeline and GitHub repository.

Significance. If the benchmark holds as a representative standard, the work is significant for filling a gap in systematic OCR assessment within LMMs, whose text-related visual performance remains underexplored despite their dominance in vision-language tasks. The public code release and direct evaluation on held-out datasets support reproducibility and provide a useful foundation for future zero-shot multimodal improvements.

major comments (1)
  1. [Abstract] The claim that OCRBench 'contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available' is load-bearing for the central contribution yet rests only on dataset count. The manuscript groups datasets into five categories but provides no coverage matrix, taxonomy, overlap metric, or redundancy analysis across dimensions such as script diversity, degradation, layout complexity, or semantic vs. non-semantic text. This leaves open the possibility of gaps (e.g., historical documents, low-resource scripts) or near-duplicates (e.g., among scene-text sets), directly affecting whether the benchmark can serve as a 'foundational framework'.
minor comments (2)
  1. [Experiments] The evaluation section would benefit from per-dataset error bars or statistical significance tests on model comparisons to strengthen reliability of relative performance claims.
  2. [OCRBench Construction] A summary table listing all 29 datasets with key attributes (script, degradation type, size) would improve clarity and allow readers to assess coverage at a glance.
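
The error bars suggested in the first minor comment could be obtained with a percentile bootstrap over per-sample outcomes. The sketch below is one minimal way to do that, assuming each prediction has been scored as 1 (correct) or 0 (incorrect); the function names, the 2000-resample default, and the 95% level are illustrative choices, not part of the paper's released pipeline.

```python
import random


def _percentile_interval(values, alpha):
    """Return the (alpha/2, 1 - alpha/2) percentile interval of `values`."""
    ordered = sorted(values)
    n = len(ordered)
    lo_idx = int((alpha / 2) * n)
    hi_idx = min(n - 1, int((1 - alpha / 2) * n))
    return ordered[lo_idx], ordered[hi_idx]


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Accuracy with a percentile-bootstrap confidence interval.

    `correct` is a list of 0/1 outcomes, one per test sample.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample n outcomes with replacement and record the accuracy.
        stats.append(sum(correct[rng.randrange(n)] for _ in range(n)) / n)
    lo, hi = _percentile_interval(stats, alpha)
    return sum(correct) / n, lo, hi


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000,
                          alpha=0.05, seed=0):
    """Accuracy difference (A minus B) on the same samples, with a bootstrap CI.

    If the interval excludes 0, the gap is unlikely to be resampling noise.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same samples")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    stats = []
    for _ in range(n_resamples):
        stats.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    lo, hi = _percentile_interval(stats, alpha)
    return sum(diffs) / n, lo, hi
```

The paired variant resamples the same test items for both models, which is usually the fairer comparison when two models are evaluated on an identical dataset.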

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comment on the abstract's claim regarding comprehensiveness is well-taken, and we address it directly below with a commitment to revision.

Point-by-point responses
  1. Referee: [Abstract] The claim that OCRBench 'contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available' is load-bearing for the central contribution yet rests only on dataset count. The manuscript groups datasets into five categories but provides no coverage matrix, taxonomy, overlap metric, or redundancy analysis across dimensions such as script diversity, degradation, layout complexity, or semantic vs. non-semantic text. This leaves open the possibility of gaps (e.g., historical documents, low-resource scripts) or near-duplicates (e.g., among scene-text sets), directly affecting whether the benchmark can serve as a 'foundational framework'.

    Authors: We agree that the claim would be more robust with explicit support beyond dataset count. In the revised manuscript, we will add a new table and accompanying text providing a taxonomy of the 29 datasets across key dimensions: language/script diversity (including English, Chinese, Japanese, and multilingual sets), text type (printed scene text, handwritten, mathematical expressions), domain and layout complexity (natural scenes, documents, forms), and semantic vs. non-semantic content. This will clarify coverage and reduce ambiguity about gaps such as historical documents or very low-resource scripts, which we will also note as limitations. While a full quantitative redundancy or overlap metric across all datasets is not feasible within the current scope without substantial new analysis, the categorical selection process already aimed to avoid direct duplicates by drawing from established but distinct sources in each of the five task areas. We will moderate the abstract wording to 'among the most comprehensive' and emphasize how the five categories together address OCR challenges in LMMs that prior benchmarks have not unified. These changes will better substantiate the benchmark as a foundational framework.

    Revision: yes
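
A coverage matrix of the kind the referee asks for can be sketched as a cross-tabulation of dataset attributes. The example below counts datasets per (row attribute, column attribute) cell and lists the empty cells as candidate gaps. The helper `coverage_matrix`, the attribute keys, and the dataset entries are hypothetical placeholders, not the actual OCRBench metadata.

```python
from collections import Counter
from itertools import product


def coverage_matrix(datasets, row_key, col_key):
    """Cross-tabulate datasets over two attributes.

    `datasets` is a list of dicts that each contain `row_key` and `col_key`.
    Returns (matrix, gaps), where matrix[row][col] is a dataset count and
    gaps lists the (row, col) combinations that no dataset covers.
    """
    counts = Counter((d[row_key], d[col_key]) for d in datasets)
    rows = sorted({d[row_key] for d in datasets})
    cols = sorted({d[col_key] for d in datasets})
    matrix = {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}
    gaps = [(r, c) for r, c in product(rows, cols) if matrix[r][c] == 0]
    return matrix, gaps


if __name__ == "__main__":
    # Placeholder metadata for illustration only.
    demo = [
        {"name": "ds_1", "script": "latin", "text_type": "printed"},
        {"name": "ds_2", "script": "latin", "text_type": "handwritten"},
        {"name": "ds_3", "script": "chinese", "text_type": "printed"},
    ]
    matrix, gaps = coverage_matrix(demo, "script", "text_type")
    print(matrix)
    print("uncovered combinations:", gaps)
```

Cells with high counts hint at possible redundancy, while empty cells point to combinations the benchmark does not exercise; attribute values that never appear at all (for example, historical documents) would still need to be added by hand.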

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct evaluations

This is an empirical benchmark paper that curates 29 existing datasets across five task categories and reports direct model outputs on them. No derivations, predictions, fitted parameters, or first-principles results are claimed. The statement that OCRBench is the most comprehensive rests on a count of the included datasets rather than on any constructed or self-referential result. No load-bearing steps reduce to inputs by construction, self-citation chains, or ansatzes. The work is self-contained against external datasets and model runs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper. No mathematical derivations, fitted parameters, or new postulated entities are introduced; the central contribution rests on dataset curation and model evaluation rather than new theory.

pith-pipeline@v0.9.0 · 5533 in / 1099 out tokens · 83827 ms · 2026-05-17T09:49:42.939904+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DimensionForcing · dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Paper passage: "we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER)."

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

    cs.CV 2024-12 accept novelty 7.0

    OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

  3. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    cs.CV 2024-06 unverdicted novelty 7.0

    Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...

  4. MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

  5. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  6. DocAtlas: Multilingual Document Understanding Across 80+ Languages

    cs.CL 2026-05 unverdicted novelty 6.0

    DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

  7. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  8. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  9. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  10. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    cs.CV 2024-12 unverdicted novelty 6.0

    VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

  11. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  12. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  13. BLINK: Multimodal Large Language Models Can See but Not Perceive

    cs.CV 2024-04 accept novelty 6.0

    BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

  14. NVIDIA Nemotron 3: Efficient and Open Intelligence

    cs.CL 2025-12 unverdicted novelty 5.0

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  15. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  16. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  17. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  18. VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    cs.CV 2025-01 conditional novelty 4.0

    VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.

  19. MinerU: An Open-Source Solution for Precise Document Content Extraction

    cs.CV 2024-09 conditional novelty 4.0

    MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.

  20. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · cited by 20 Pith papers · 12 internal anchors

  1. [1]

    OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2023

  2. [2]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  3. [3]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  4. [4]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/ tatsu-lab/stanford_alpaca, 2023

  5. [5]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Vicuna. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna. lmsys.org/, 2023

  6. [6]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023

  7. [7]

    Vision-language pre-training: Basics, recent advances, and future trends

    Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision-language pre-training: Basics, recent advances, and future trends. F oundations and Trends® in Computer Graphics and Vision, 2022

  8. [8]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 8

  9. [9]

    Florence: A New Foundation Model for Computer Vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021

  10. [10]

    Scaling up visual and vision-language representa- tion learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021

  11. [11]

    ELEV ATER: A benchmark and toolkit for evaluating language-augmented visual models

    Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. ELEV ATER: A benchmark and toolkit for evaluating language-augmented visual models. In NeurIPS Track on Datasets and Benchmarks , 2022

  12. [12]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  13. [13]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems , volume 35, pages 23716–23736, 2022

  14. [14]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022

  15. [15]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

  16. [16]

    Gemini: A family of highly capable multimodal models

    Google Gemini Team. Gemini: A family of highly capable multimodal models. 2023

  17. [17]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. 2023

  18. [18]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

  19. [19]

    Openflamingo, March 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023

  20. [20]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 26296–26306, 2024

  21. [21]

    MiniGPT-4: Enhancing vision- language understanding with advanced large language models, 2023

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision- language understanding with advanced large language models, 2023

  22. [22]

    mPLUG-Owl: Modularization empowers large language models with multimodality, 2023

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mPLUG-Owl: Modularization empowers large language models with multimodality, 2023

  23. [23]

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023

  24. [24]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding, 2023

    Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding, 2023

  25. [25]

    Bliva: A simple multimodal llm for better handling of text-rich visual questions, 2023

    Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions, 2023

  26. [26]

    Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

  27. [27]

    Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding, 2023

    Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding, 2023

  28. [28]

    Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding, 2023

    Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, and Can Huang. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding, 2023

  29. [29]

    Monkey: Image resolution and text label are important things for large multi-modal models, 2023

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models, 2023. 9

  30. [30]

    Textmonkey: An ocr-free large multimodal model for understanding document, 2024

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document, 2024

  31. [31]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024

  32. [32]

    Lmms-eval: Accelerating the development of large multimoal models, March 2024

    Bo Li*, Peiyuan Zhang*, Kaichen Zhang*, Fanyi Pu*, Yuhao Dong Xinrun Du, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimoal models, March 2024

  33. [33]

    Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023

  34. [34]

    Mmbench: Is your multi-modal model an all-around player?, 2023

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023

  35. [35]

    Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023

  36. [36]

    Ali Furkan Biten, Rubèn Pérez Tito, Andrés Mafla, Lluís Gómez, Marçal Rusiñol, Ernest Valveny, C. V . Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4290–4300, 2019

  37. [37]

    Top-down and bottom-up cues for scene text recognition

    Anand Mishra, Karteek Alahari, and CV Jawahar. Top-down and bottom-up cues for scene text recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition , pages 2687–2694. IEEE, 2012

  38. [38]

    End-to-end scene text recognition using tree-structured models

    Cunzhao Shi, Chunheng Wang, Baihua Xiao, Song Gao, and Jinlong Hu. End-to-end scene text recognition using tree-structured models. Pattern Recognition, 47:2853–2866, 2014

  39. [39]

    Iwamura, Lluís Gómez i Bigorda, Sergi Robles Mestre, Joan Mas Romeu, David Fernández Mota, Jon Almazán, and Lluís-Pere de las Heras

    Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, M. Iwamura, Lluís Gómez i Bigorda, Sergi Robles Mestre, Joan Mas Romeu, David Fernández Mota, Jon Almazán, and Lluís-Pere de las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1484–1493, 2013

  40. [40]

    Ghosh, Andrew D

    Dimosthenis Karatzas, Lluís Gómez i Bigorda, Anguelos Nicolaou, Suman K. Ghosh, Andrew D. Bagdanov, M. Iwamura, Jiri Matas, Lukás Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny. ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDA...

  41. [41]

    Recognizing text with perspective distortion in natural scenes

    Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, and Chew Lim Tan. Recognizing text with perspective distortion in natural scenes. In 2013 IEEE International Conference on Computer Vision , pages 569–576, 2013

  42. [42]

    A robust arbitrary text detection system for natural scene images

    Anhar Risnumawan, Palaiahnakote Shivakumara, Chee Seng Chan, and Chew Lim Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41:8027–8048, 2014

  43. [43]

    COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

    Andreas Veit, Tomas Matera, Lukás Neumann, Jiri Matas, and Serge J. Belongie. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images. ArXiv, abs/1601.07140, 2016

  44. [44]

    Curved scene text detection via transverse and longitudinal sequence connection

    Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognit., 90:337–345, 2019

  45. [45]

    Total-Text: A comprehensive dataset for scene text detection and recognition

    Chee-Kheng Chng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 935–942, 2017

  46. [46]

    From two to one: A new scene text recognizer with visual language modeling network

    Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, and Yongdong Zhang. From two to one: A new scene text recognizer with visual language modeling network. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14174–14183, 2021

  47. [47]

    Toward understanding WordArt: Corner-guided transformer for scene text recognition

    Xudong Xie, Ling Fu, Zhifei Zhang, Zhaowen Wang, and Xiang Bai. Toward understanding WordArt: Corner-guided transformer for scene text recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 303–321. Springer, 2022

  48. [48]

    The IAM-database: an English sentence database for offline handwriting recognition

    U-V Marti and Horst Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5:39–46, 2002

  49. [49]

    ICDAR 2019 robust reading challenge on reading Chinese text on signboard

    Rui Zhang, Yongsheng Zhou, Qianyi Jiang, Qi Song, Nan Li, Kai Zhou, Lei Wang, Dong Wang, Minghui Liao, Mingkun Yang, et al. ICDAR 2019 robust reading challenge on reading Chinese text on signboard. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1577–1581. IEEE, 2019

  50. [50]

    ICFHR 2014 competition on handwritten digit string recognition in challenging datasets (HDSRC 2014)

    Markus Diem, Stefan Fiel, Florian Kleber, Robert Sablatnig, Jose M. Saavedra, David Contreras, Juan Manuel Barrios, and Luiz S. Oliveira. ICFHR 2014 competition on handwritten digit string recognition in challenging datasets (HDSRC 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 779–784, 2014

  51. [51]

    ICDAR 2019 competition on scene text visual question answering

    Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, C. V. Jawahar, Ernest Valveny, and Dimosthenis Karatzas. ICDAR 2019 competition on scene text visual question answering. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1563–1570, 2019

  52. [52]

    Towards VQA models that can read, 2019

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read, 2019

  53. [53]

    OCR-VQA: visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–952, 2019

  54. [54]

    On the general value of evidence, and bilingual scene-text visual question answering

    Xinyu Wang, Yuliang Liu, Chunhua Shen, Chun Chet Ng, Canjie Luo, Lianwen Jin, Chee Seng Chan, Anton van den Hengel, and Liangwei Wang. On the general value of evidence, and bilingual scene-text visual question answering. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10123–10132, 2020

  55. [55]

    DocVQA: A dataset for VQA on document images, 2021

    Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images, 2021

  56. [56]

    InfographicVQA

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022

  57. [57]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. arXiv preprint arXiv:2203.10244, 2022

  58. [58]

    ICDAR2019 competition on scanned receipt OCR and information extraction

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep 2019

  59. [59]

    FUNSD: A dataset for form understanding in noisy scanned documents, 2019

    Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents, 2019

  60. [60]

    Visual information extraction in the wild: Practical dataset and end-to-end solution

    Jianfeng Kuang, Wei Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Yu Zhou, and Xiang Bai. Visual information extraction in the wild: Practical dataset and end-to-end solution. arXiv preprint arXiv:2305.07498, 2023

  61. [61]

    Syntax-aware network for handwritten mathematical expression recognition

    Ye Yuan, Xiao Liu, Wondimu Dikubab, Hui Liu, Zhilong Ji, Zhongqin Wu, and Xiang Bai. Syntax-aware network for handwritten mathematical expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4553–4562, 2022

  62. [62]

    Minicpm-v2.6

    OpenBMB. Minicpm-v2.6. https://huggingface.co/openbmb/MiniCPM-V-2_6, 2024

  63. [63]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024

  64. [64]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...

  65. [65]

    Paligemma: A versatile 3b vlm for transfer, 2024

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bau...

  66. [66]

    Congrong

    CloudWalk. Congrong. https://maas.cloudwalk.com/web/#/login, 2024

  67. [67]

    Cogvlm: Visual expert for pretrained language models, 2023

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2023

  68. [68]

    Minicpm-v-2

    OpenBMB. Minicpm-v-2. https://huggingface.co/openbmb/MiniCPM-V-2, 2024

  69. [69]

    Mini-monkey: Alleviate the sawtooth effect by multi-scale adaptive cropping, 2024

    Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, and Xiang Bai. Mini-monkey: Alleviate the sawtooth effect by multi-scale adaptive cropping, 2024

  70. [70]

    Claude3.5-sonnet

    Anthropic. Claude3.5-sonnet. https://docs.anthropic.com/en/docs/build-with-claude/vision, 2024

  71. [71]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  72. [72]

    Gpt-4o-mini-20240718

    OpenAI. Gpt-4o-mini-20240718. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024

  73. [73]

    Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model, 2024

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehe...

  74. [74]

    Rekaflash

    Reka AI. Rekaflash. https://www.reka.ai/, 2024

  75. [75]

    Gemini models

    Google. Gemini models. https://deepmind.google/technologies/gemini/, 2024

  76. [76]

    Xverse-v

    XVERSE. Xverse-v. https://github.com/xverse-ai/XVERSE-V-13B, 2024

  77. [77]

    Ovis: Structural embedding alignment for multimodal large language model, 2024

    Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model, 2024

  78. [78]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  79. [79]

    Minicpm-llama3-v-2.5

    OpenBMB. Minicpm-llama3-v-2.5. https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5, 2024

  80. [80]

    Generative multimodal models are in-context learners, 2024

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners, 2024

Showing first 80 references.