arxiv: 2305.07895 · v7 · pith:7YTNYMNUnew · submitted 2023-05-13 · 💻 cs.CV · cs.CL

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai

Pith reviewed 2026-05-17 09:49 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords OCR · Large Multimodal Models · Benchmark · Text Recognition · Visual Question Answering · Handwritten Text · Mathematical Expression Recognition · Multimodal Evaluation

The pith

OCRBench evaluates large multimodal models on 29 OCR datasets to expose their specific weaknesses in text recognition tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OCRBench as a standardized way to measure how well models like GPT-4V and Gemini read text from images across many scenarios. It compiles evaluations on tasks ranging from basic text recognition to document question answering and handwritten math expressions. Results indicate these models handle some clean printed text reasonably but struggle with multilingual content, handwriting, non-semantic symbols, and complex mathematical notation. The benchmark supplies baseline numbers that future work can use to track improvements in zero-shot visual text understanding.

Core claim

Large multimodal models show uneven OCR performance, performing adequately on certain scene text and document tasks yet revealing clear limitations with handwritten, multilingual, non-semantic, and mathematical text; OCRBench supplies the 29-dataset testbed needed to quantify these gaps and guide targeted improvements in multimodal text handling.

What carries the argument

OCRBench, the evaluation benchmark assembled from 29 existing datasets spanning text recognition, scene text VQA, document VQA, key information extraction, and handwritten mathematical expression recognition.

If this is right

Baseline scores on OCRBench can serve as a reference for measuring whether new multimodal architectures improve text reading without task-specific fine-tuning.
Persistent low performance on handwritten and multilingual subsets points to the need for training data or model components that better handle varied scripts and writing styles.
The benchmark enables direct comparison of zero-shot OCR ability across models, clarifying which ones are currently most reliable for document and scene text applications.
Weak results on mathematical expression recognition suggest that general vision-language training leaves gaps that may require dedicated symbol-processing pathways.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of OCRBench could standardize reporting of text capabilities in new vision-language models, similar to how other benchmarks track general visual understanding.
The identified weaknesses may stem from training distributions that under-represent noisy, handwritten, or non-Latin text, implying that data curation strategies could close some gaps.
Future extensions might add video-based text or heavily degraded real-world images to test whether current model limitations persist under greater visual noise.
The work implies that purely end-to-end multimodal models may benefit from hybrid designs that incorporate explicit OCR modules for certain high-stakes text tasks.

Load-bearing premise

The selected 29 datasets together represent a balanced and sufficiently broad sample of all text-related visual challenges that multimodal models will face in practice.

What would settle it

Release of a new multimodal model that scores near the top on all 29 OCRBench datasets yet still fails on everyday text extraction from photos or videos outside those datasets.

Original abstract

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.
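As a rough illustration of what a zero-shot scoring loop over several task categories can look like, the sketch below counts a prediction as correct when the normalized ground-truth answer appears inside the model output, then averages per task. The containment rule, the function names, and the sample records are assumptions made for illustration; this page does not describe the scoring used by the released pipeline.

```python
from collections import defaultdict


def contains_answer(prediction: str, answer: str) -> bool:
    """Return True if the normalized answer occurs inside the normalized prediction."""
    def norm(s: str) -> str:
        # Lowercase and collapse runs of whitespace to single spaces.
        return " ".join(s.lower().split())
    return norm(answer) in norm(prediction)


def accuracy_by_task(records):
    """Compute per-task accuracy from (task, prediction, answer) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, prediction, answer in records:
        totals[task] += 1
        hits[task] += contains_answer(prediction, answer)
    return {task: hits[task] / totals[task] for task in totals}


# Hypothetical records, not taken from any real dataset.
records = [
    ("Text Recognition", "The sign says OPEN", "open"),
    ("Text Recognition", "I cannot read it", "exit"),
    ("HMER", "x^2 + 1", "x^2+1"),
]
scores = accuracy_by_task(records)
```

The last record shows a limitation of plain containment matching: spacing differences in a mathematical expression cause a miss, so a real pipeline would likely need task-specific normalization.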

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OCRBench unifies 29 datasets for OCR testing in multimodal models and gives the first broad zero-shot numbers on GPT-4V and Gemini, but the dataset selection lacks any coverage or overlap analysis.

OCRBench is a benchmark paper that unifies 29 datasets for OCR evaluation in large multimodal models and provides initial zero-shot results on GPT-4V and Gemini. The main contribution is the assembled test suite and the reported performance gaps in areas like multilingual and mathematical text.

The paper does well by clearly defining the evaluation tasks and making the full pipeline available on GitHub. This allows straightforward reproduction and extension. The results highlight real weaknesses without overclaiming, and the protocol appears free of obvious biases or selective reporting. The absence of per-dataset statistical tests or error bars is minor and does not affect the overall picture much, given the scale of the evaluation.

The main soft spot is the lack of a systematic look at the 29 datasets themselves. The authors categorize them but supply no coverage matrix, redundancy check, or discussion of potential gaps in script types or degradation levels. This leaves the representativeness claim open to the concern that some important regimes might be underrepresented while others are repeated. It is a noticeable but not central flaw.

This work targets researchers focused on document understanding and scene text in vision-language models. A practitioner or benchmark user will find immediate value in the released resources and the comparative data. It has enough substance and transparency to merit serious peer review. I recommend sending it out for review, with the expectation that the authors can address the dataset selection details in a revision.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes OCRBench, a benchmark with 29 datasets for evaluating OCR capabilities of large multimodal models (e.g., GPT-4V, Gemini) across five task categories: Text Recognition, Scene Text-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. It reports model evaluations that highlight weaknesses in multilingual, handwritten, non-semantic, and mathematical text, supplies baseline results, and releases an evaluation pipeline and GitHub repository.

Significance. If the benchmark holds as a representative standard, the work is significant for filling a gap in systematic OCR assessment within LMMs, whose text-related visual performance remains underexplored despite their dominance in vision-language tasks. The public code release and direct evaluation on held-out datasets support reproducibility and provide a useful foundation for future zero-shot multimodal improvements.

major comments (1)

[Abstract] Abstract: the claim that OCRBench 'contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available' is load-bearing for the central contribution yet rests only on dataset count. The manuscript groups datasets into five categories but provides no coverage matrix, taxonomy, overlap metric, or redundancy analysis across dimensions such as script diversity, degradation, layout complexity, or semantic vs. non-semantic text. This leaves open the possibility of gaps (e.g., historical documents, low-resource scripts) or near-duplicates (e.g., among scene-text sets), directly affecting whether the benchmark can serve as a 'foundational framework'.

minor comments (2)

[Experiments] The evaluation section would benefit from per-dataset error bars or statistical significance tests on model comparisons to strengthen reliability of relative performance claims.
[OCRBench Construction] A summary table listing all 29 datasets with key attributes (script, degradation type, size) would improve clarity and allow readers to assess coverage at a glance.
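The first minor comment asks for per-dataset error bars. One common way to obtain them, sketched here as an illustration and not taken from the paper, is a percentile bootstrap over per-sample correctness. The function name bootstrap_ci and the 70/30 outcome list are hypothetical.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.

    outcomes: list of 0/1 values, one per evaluated sample.
    Returns (mean, lower, upper).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample n outcomes with replacement and record the mean.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical dataset: 70 correct answers out of 100 samples.
outcomes = [1] * 70 + [0] * 30
mean, lower, upper = bootstrap_ci(outcomes)
```

Reporting the interval alongside each dataset score would show whether a gap between two models exceeds resampling noise, which matters most for the smaller datasets.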

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comment on the abstract's claim regarding comprehensiveness is well-taken, and we address it directly below with a commitment to revision.

Referee: [Abstract] Abstract: the claim that OCRBench 'contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available' is load-bearing for the central contribution yet rests only on dataset count. The manuscript groups datasets into five categories but provides no coverage matrix, taxonomy, overlap metric, or redundancy analysis across dimensions such as script diversity, degradation, layout complexity, or semantic vs. non-semantic text. This leaves open the possibility of gaps (e.g., historical documents, low-resource scripts) or near-duplicates (e.g., among scene-text sets), directly affecting whether the benchmark can serve as a 'foundational framework'.

Authors: We agree that the claim would be more robust with explicit support beyond dataset count. In the revised manuscript, we will add a new table and accompanying text providing a taxonomy of the 29 datasets across key dimensions: language/script diversity (including English, Chinese, Japanese, and multilingual sets), text type (printed scene text, handwritten, mathematical expressions), domain and layout complexity (natural scenes, documents, forms), and semantic vs. non-semantic content. This will clarify coverage and reduce ambiguity about gaps such as historical documents or very low-resource scripts, which we will also note as limitations. While a full quantitative redundancy or overlap metric across all datasets is not feasible within the current scope without substantial new analysis, the categorical selection process already aimed to avoid direct duplicates by drawing from established but distinct sources in each of the five task areas. We will moderate the abstract wording to 'among the most comprehensive' and emphasize how the five categories together address OCR challenges in LMMs that prior benchmarks have not unified. These changes will better substantiate the benchmark as a foundational framework.

revision: yes
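The coverage matrix requested by the referee and the taxonomy promised in the rebuttal can be pictured as a cross-tabulation of datasets over two attributes. The sketch below is a minimal illustration under stated assumptions: the dataset names and attribute values are placeholders, not the actual OCRBench datasets or their properties.

```python
from collections import Counter
from itertools import product

# Placeholder entries; real attributes would come from the dataset papers.
datasets = [
    {"name": "dataset_a", "script": "English", "text_type": "printed"},
    {"name": "dataset_b", "script": "English", "text_type": "handwritten"},
    {"name": "dataset_c", "script": "Chinese", "text_type": "printed"},
    {"name": "dataset_d", "script": "English", "text_type": "printed"},
]


def coverage_matrix(rows, dim_a, dim_b):
    """Count datasets in every (dim_a value, dim_b value) cell, including empty cells."""
    counts = Counter((r[dim_a], r[dim_b]) for r in rows)
    values_a = sorted({r[dim_a] for r in rows})
    values_b = sorted({r[dim_b] for r in rows})
    return {(a, b): counts.get((a, b), 0) for a, b in product(values_a, values_b)}


matrix = coverage_matrix(datasets, "script", "text_type")
gaps = [cell for cell, n in matrix.items() if n == 0]      # uncovered combinations
crowded = [cell for cell, n in matrix.items() if n > 1]    # possible redundancy
```

Empty cells point to candidate gaps and cells with several datasets point to candidate redundancy. A shared cell does not by itself show that two datasets overlap, so it is only a starting point for the overlap analysis the referee asks for.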

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct evaluations

This is an empirical benchmark paper that curates 29 existing datasets across five task categories and reports direct model outputs on them. No derivations, predictions, fitted parameters, or first-principles results are claimed. The statement that OCRBench is the most comprehensive is a factual count of included datasets rather than any constructed or self-referential result. No load-bearing steps reduce to inputs by construction, self-citation chains, or ansatzes. The work is self-contained against external datasets and model runs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper. No mathematical derivations, fitted parameters, or new postulated entities are introduced; the central contribution rests on dataset curation and model evaluation rather than new theory.

pith-pipeline@v0.9.0 · 5533 in / 1099 out tokens · 83827 ms · 2026-05-17T09:49:42.939904+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
cs.CV 2024-12 accept novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
cs.CV 2024-06 unverdicted novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
cs.CV 2024-06 conditional novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
DocAtlas: Multilingual Document Understanding Across 80+ Languages
cs.CL 2026-05 unverdicted novelty 6.0

DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
cs.AI 2026-05 unverdicted novelty 6.0

LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
cs.CV 2024-12 unverdicted novelty 6.0

VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
BLINK: Multimodal Large Language Models Can See but Not Perceive
cs.CV 2024-04 accept novelty 6.0

BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
cs.CV 2024-12 accept novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
cs.CV 2025-01 conditional novelty 4.0

VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
MinerU: An Open-Source Solution for Precise Document Content Extraction
cs.CV 2024-09 conditional novelty 4.0

MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · cited by 20 Pith papers · 12 internal anchors

[1]

OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2023

work page 2023
[2]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[3]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/ tatsu-lab/stanford_alpaca, 2023

work page 2023
[5]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Vicuna. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna. lmsys.org/, 2023

work page 2023
[6]

Instruction Tuning with GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Vision-language pre-training: Basics, recent advances, and future trends

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision-language pre-training: Basics, recent advances, and future trends. F oundations and Trends® in Computer Graphics and Vision, 2022

work page 2022
[8]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 8

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Florence: A New Foundation Model for Computer Vision

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021

work page arXiv 2021
[11]

ELEV ATER: A benchmark and toolkit for evaluating language-augmented visual models

Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. ELEV ATER: A benchmark and toolkit for evaluating language-augmented visual models. In NeurIPS Track on Datasets and Benchmarks , 2022

work page 2022
[12]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems , volume 35, pages 23716–23736, 2022

work page 2022
[14]

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

work page 2024
[16]

Gemini: A family of highly capable multimodal models

Google Gemini Team. Gemini: A family of highly capable multimodal models. 2023

work page 2023
[17]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. 2023

work page 2023
[18]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

work page 2023
[19]

Openflamingo, March 2023

Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023

work page 2023
[20]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 26296–26306, 2024

work page 2024
[21]

MiniGPT-4: Enhancing vision- language understanding with advanced large language models, 2023

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision- language understanding with advanced large language models, 2023

work page 2023
[22]

mPLUG-Owl: Modularization empowers large language models with multimodality, 2023

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mPLUG-Owl: Modularization empowers large language models with multimodality, 2023

work page 2023
[23]

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023

work page 2023
[24]

Llavar: Enhanced visual instruction tuning for text-rich image understanding, 2023

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding, 2023

work page 2023
[25]

Bliva: A simple multimodal llm for better handling of text-rich visual questions, 2023

Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions, 2023

work page 2023
[26]

Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

work page 2023
[27]

Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding, 2023

Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding, 2023

work page 2023
[28]

Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding, 2023

Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, and Can Huang. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding, 2023

work page 2023
[29]

Monkey: Image resolution and text label are important things for large multi-modal models, 2023

Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models, 2023. 9

work page 2023
[30]

Textmonkey: An ocr-free large multimodal model for understanding document, 2024

Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document, 2024

work page 2024
[31]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024

work page 2024
[32]

Lmms-eval: Accelerating the development of large multimoal models, March 2024

Bo Li*, Peiyuan Zhang*, Kaichen Zhang*, Fanyi Pu*, Yuhao Dong Xinrun Du, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimoal models, March 2024

work page 2024
[33]

Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023

work page 2023
[34]

Mmbench: Is your multi-modal model an all-around player?, 2023

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023

work page 2023
[35]

Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023

work page 2023
[36]

Ali Furkan Biten, Rubèn Pérez Tito, Andrés Mafla, Lluís Gómez, Marçal Rusiñol, Ernest Valveny, C. V . Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4290–4300, 2019

work page 2019
[37]

Top-down and bottom-up cues for scene text recognition

Anand Mishra, Karteek Alahari, and CV Jawahar. Top-down and bottom-up cues for scene text recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition , pages 2687–2694. IEEE, 2012

work page 2012
[38]

End-to-end scene text recognition using tree-structured models

Cunzhao Shi, Chunheng Wang, Baihua Xiao, Song Gao, and Jinlong Hu. End-to-end scene text recognition using tree-structured models. Pattern Recognition, 47:2853–2866, 2014

work page 2014
[39]

Iwamura, Lluís Gómez i Bigorda, Sergi Robles Mestre, Joan Mas Romeu, David Fernández Mota, Jon Almazán, and Lluís-Pere de las Heras

Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, M. Iwamura, Lluís Gómez i Bigorda, Sergi Robles Mestre, Joan Mas Romeu, David Fernández Mota, Jon Almazán, and Lluís-Pere de las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1484–1493, 2013

[40]

ICDAR 2015 competition on robust reading

Dimosthenis Karatzas, Lluís Gómez i Bigorda, Anguelos Nicolaou, Suman K. Ghosh, Andrew D. Bagdanov, M. Iwamura, Jiri Matas, Lukás Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny. ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDA...

[41]

Recognizing text with perspective distortion in natural scenes

Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, and Chew Lim Tan. Recognizing text with perspective distortion in natural scenes. In 2013 IEEE International Conference on Computer Vision, pages 569–576, 2013

[42]

A robust arbitrary text detection system for natural scene images

Anhar Risnumawan, Palaiahnakote Shivakumara, Chee Seng Chan, and Chew Lim Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41:8027–8048, 2014

[43]

COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

Andreas Veit, Tomas Matera, Lukás Neumann, Jiri Matas, and Serge J. Belongie. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images. ArXiv, abs/1601.07140, 2016

[44]

Curved scene text detection via transverse and longitudinal sequence connection

Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognit., 90:337–345, 2019

[45]

Total-Text: A comprehensive dataset for scene text detection and recognition

Chee-Kheng Chng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 935–942, 2017

[46]

From two to one: A new scene text recognizer with visual language modeling network

Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, and Yongdong Zhang. From two to one: A new scene text recognizer with visual language modeling network. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14174–14183, 2021

[47]

Toward understanding WordArt: Corner-guided transformer for scene text recognition

Xudong Xie, Ling Fu, Zhifei Zhang, Zhaowen Wang, and Xiang Bai. Toward understanding WordArt: Corner-guided transformer for scene text recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 303–321. Springer, 2022

[48]

The IAM-database: an English sentence database for offline handwriting recognition

U-V Marti and Horst Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5:39–46, 2002

[49]

ICDAR 2019 robust reading challenge on reading Chinese text on signboard

Rui Zhang, Yongsheng Zhou, Qianyi Jiang, Qi Song, Nan Li, Kai Zhou, Lei Wang, Dong Wang, Minghui Liao, Mingkun Yang, et al. ICDAR 2019 robust reading challenge on reading Chinese text on signboard. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1577–1581. IEEE, 2019

[50]

ICFHR 2014 competition on handwritten digit string recognition in challenging datasets (HDSRC 2014)

Markus Diem, Stefan Fiel, Florian Kleber, Robert Sablatnig, Jose M. Saavedra, David Contreras, Juan Manuel Barrios, and Luiz S. Oliveira. ICFHR 2014 competition on handwritten digit string recognition in challenging datasets (HDSRC 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 779–784, 2014

[51]

ICDAR 2019 competition on scene text visual question answering

Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, C. V. Jawahar, Ernest Valveny, and Dimosthenis Karatzas. ICDAR 2019 competition on scene text visual question answering. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1563–1570, 2019

[52]

Towards VQA models that can read, 2019

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read, 2019

[53]

OCR-VQA: visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–952, 2019

[54]

On the general value of evidence, and bilingual scene-text visual question answering

Xinyu Wang, Yuliang Liu, Chunhua Shen, Chun Chet Ng, Canjie Luo, Lianwen Jin, Chee Seng Chan, Anton van den Hengel, and Liangwei Wang. On the general value of evidence, and bilingual scene-text visual question answering. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10123–10132, 2020

[55]

Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images, 2021

[56]

InfographicVQA

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022

[57]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022

[58]

Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019

[59]

FUNSD: A dataset for form understanding in noisy scanned documents, 2019

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents, 2019

[60]

Visual information extraction in the wild: Practical dataset and end-to-end solution

Jianfeng Kuang, Wei Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Yu Zhou, and Xiang Bai. Visual information extraction in the wild: Practical dataset and end-to-end solution. arXiv preprint arXiv:2305.07498, 2023

[61]

Syntax-aware network for handwritten mathematical expression recognition

Ye Yuan, Xiao Liu, Wondimu Dikubab, Hui Liu, Zhilong Ji, Zhongqin Wu, and Xiang Bai. Syntax-aware network for handwritten mathematical expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4553–4562, 2022

[62]

Minicpm-v2.6

OpenBMB. Minicpm-v2.6. https://huggingface.co/openbmb/MiniCPM-V-2_6, 2024

[63]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024

[64]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...

[65]

Paligemma: A versatile 3b vlm for transfer, 2024

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bau...

[66]

Congrong

CloudWalk. Congrong. https://maas.cloudwalk.com/web/#/login, 2024

[67]

Cogvlm: Visual expert for pretrained language models, 2023

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2023

[68]

Minicpm-v-2

OpenBMB. Minicpm-v-2. https://huggingface.co/openbmb/MiniCPM-V-2, 2024

[69]

Mini-monkey: Alleviate the sawtooth effect by multi-scale adaptive cropping, 2024

Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, and Xiang Bai. Mini-monkey: Alleviate the sawtooth effect by multi-scale adaptive cropping, 2024

[70]

Claude3.5-sonnet

Anthropic. Claude3.5-sonnet. https://docs.anthropic.com/en/docs/build-with-claude/vision, 2024

[71]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

[72]

Gpt-4o-mini-20240718

OpenAI. Gpt-4o-mini-20240718. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024

[73]

Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model, 2024

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehe...

[74]

Rekaflash

Reka AI. Rekaflash. https://www.reka.ai/, 2024

[75]

Gemini models

Google. Gemini models. https://deepmind.google/technologies/gemini/, 2024

[76]

Xverse-v

XVERSE. Xverse-v. https://github.com/xverse-ai/XVERSE-V-13B, 2024

[77]

Ovis: Structural embedding alignment for multimodal large language model, 2024

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model, 2024

[78]

Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

[79]

Minicpm-llama3-v-2.5

OpenBMB. Minicpm-llama3-v-2.5. https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5, 2024

[80]

Generative multimodal models are in-context learners, 2024

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners, 2024


Showing first 80 references.