OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
Pith reviewed 2026-05-17 09:49 UTC · model grok-4.3
The pith
OCRBench evaluates large multimodal models on 29 OCR datasets to expose their specific weaknesses in text recognition tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large multimodal models perform unevenly on OCR: they handle certain scene text and document tasks adequately but show clear limitations with handwritten, multilingual, non-semantic, and mathematical text. OCRBench supplies the 29-dataset testbed needed to quantify these gaps and guide targeted improvements in multimodal text handling.
What carries the argument
OCRBench, the evaluation benchmark assembled from 29 existing datasets spanning text recognition, scene text VQA, document VQA, key information extraction, and handwritten mathematical expression recognition.
If this is right
- Baseline scores on OCRBench can serve as a reference for measuring whether new multimodal architectures improve text reading without task-specific fine-tuning.
- Persistent low performance on handwritten and multilingual subsets points to the need for training data or model components that better handle varied scripts and writing styles.
- The benchmark enables direct comparison of zero-shot OCR ability across models, clarifying which ones are currently most reliable for document and scene text applications.
- Weak results on mathematical expression recognition suggest that general vision-language training leaves gaps that may require dedicated symbol-processing pathways.
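To make the comparison idea in the points above concrete, here is a minimal sketch of how per-category scores might be aggregated for one model. The category names come from the paper's abstract; the function names, the 0.5 threshold, and the toy results are hypothetical and do not reproduce the benchmark's actual scoring pipeline.

```python
from collections import defaultdict


def per_category_accuracy(results):
    """Aggregate (category, is_correct) pairs into accuracy per category."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {category: hits[category] / totals[category] for category in totals}


def weakest_categories(accuracy, threshold):
    """Return the categories whose accuracy falls below the threshold, sorted by name."""
    return sorted(c for c, score in accuracy.items() if score < threshold)


# Hypothetical toy results for a single model (not real benchmark numbers).
toy_results = [
    ("Text Recognition", True),
    ("Text Recognition", True),
    ("Text Recognition", False),
    ("Text Recognition", True),
    ("HMER", False),
    ("HMER", False),
    ("HMER", True),
    ("HMER", False),
]

toy_accuracy = per_category_accuracy(toy_results)
weak = weakest_categories(toy_accuracy, threshold=0.5)
```

Reporting one number per task category, rather than a single pooled score, keeps a strong result on scene text from masking a weak one on handwritten mathematical expressions.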
Where Pith is reading between the lines
- Widespread use of OCRBench could standardize reporting of text capabilities in new vision-language models, similar to how other benchmarks track general visual understanding.
- The identified weaknesses may stem from training distributions that under-represent noisy, handwritten, or non-Latin text, implying that data curation strategies could close some gaps.
- Future extensions might add video-based text or heavily degraded real-world images to test whether current model limitations persist under greater visual noise.
- The work implies that purely end-to-end multimodal models may benefit from hybrid designs that incorporate explicit OCR modules for certain high-stakes text tasks.
Load-bearing premise
The selected 29 datasets together represent a balanced and sufficiently broad sample of all text-related visual challenges that multimodal models will face in practice.
What would settle it
Release of a new multimodal model that scores near the top on all 29 OCRBench datasets yet still fails on everyday text extraction from photos or videos outside those datasets.
Original abstract
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes OCRBench, a benchmark with 29 datasets for evaluating OCR capabilities of large multimodal models (e.g., GPT-4V, Gemini) across five task categories: Text Recognition, Scene Text-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. It reports model evaluations that highlight weaknesses in multilingual, handwritten, non-semantic, and mathematical text, supplies baseline results, and releases an evaluation pipeline and GitHub repository.
Significance. If the benchmark holds as a representative standard, the work is significant for filling a gap in systematic OCR assessment within LMMs, whose text-related visual performance remains underexplored despite their dominance in vision-language tasks. The public code release and direct evaluation on held-out datasets support reproducibility and provide a useful foundation for future zero-shot multimodal improvements.
major comments (1)
- [Abstract] The claim that OCRBench 'contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available' is load-bearing for the central contribution yet rests only on dataset count. The manuscript groups datasets into five categories but provides no coverage matrix, taxonomy, overlap metric, or redundancy analysis across dimensions such as script diversity, degradation, layout complexity, or semantic vs. non-semantic text. This leaves open the possibility of gaps (e.g., historical documents, low-resource scripts) or near-duplicates (e.g., among scene-text sets), directly affecting whether the benchmark can serve as a 'foundational framework'.
minor comments (2)
- [Experiments] The evaluation section would benefit from per-dataset error bars or statistical significance tests on model comparisons to strengthen reliability of relative performance claims.
- [OCRBench Construction] A summary table listing all 29 datasets with key attributes (script, degradation type, size) would improve clarity and allow readers to assess coverage at a glance.
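The first minor comment asks for error bars and significance tests. One common way to obtain them is a percentile bootstrap over per-item outcomes, sketched below. This is an illustrative assumption about how such an analysis could be run, not the paper's method; all function names, the resample count, and the toy outcome vectors are hypothetical.

```python
import random


def accuracy(outcomes):
    """Mean of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy on one dataset."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        accuracy([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_bootstrap_win_rate(outcomes_a, outcomes_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which model A strictly beats model B on the same items."""
    assert len(outcomes_a) == len(outcomes_b)
    rng = random.Random(seed)
    n = len(outcomes_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(outcomes_a[i] - outcomes_b[i] for i in idx)
        wins += diff > 0
    return wins / n_resamples


# Hypothetical per-item outcomes on one 100-item dataset (1 = correct, 0 = wrong).
model_a = [1] * 70 + [0] * 30
model_b = [1] * 50 + [0] * 50

ci_low, ci_high = bootstrap_ci(model_a)
win_rate = paired_bootstrap_win_rate(model_a, model_b)
```

Resampling the same item indices for both models keeps the comparison paired, so differences in item difficulty do not inflate the apparent gap between models.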
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comment on the abstract's claim regarding comprehensiveness is well-taken, and we address it directly below with a commitment to revision.
Point-by-point responses
-
Referee: [Abstract] The claim that OCRBench 'contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available' is load-bearing for the central contribution yet rests only on dataset count. The manuscript groups datasets into five categories but provides no coverage matrix, taxonomy, overlap metric, or redundancy analysis across dimensions such as script diversity, degradation, layout complexity, or semantic vs. non-semantic text. This leaves open the possibility of gaps (e.g., historical documents, low-resource scripts) or near-duplicates (e.g., among scene-text sets), directly affecting whether the benchmark can serve as a 'foundational framework'.
Authors: We agree that the claim would be more robust with explicit support beyond dataset count. In the revised manuscript, we will add a new table and accompanying text providing a taxonomy of the 29 datasets across key dimensions: language/script diversity (including English, Chinese, Japanese, and multilingual sets), text type (printed scene text, handwritten, mathematical expressions), domain and layout complexity (natural scenes, documents, forms), and semantic vs. non-semantic content. This will clarify coverage and reduce ambiguity about gaps such as historical documents or very low-resource scripts, which we will also note as limitations. While a full quantitative redundancy or overlap metric across all datasets is not feasible within the current scope without substantial new analysis, the categorical selection process already aimed to avoid direct duplicates by drawing from established but distinct sources in each of the five task areas. We will moderate the abstract wording to 'among the most comprehensive' and emphasize how the five categories together address OCR challenges in LMMs that prior benchmarks have not unified. These changes will better substantiate the benchmark as a foundational framework.
Revision: yes
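The coverage matrix requested by the referee and the taxonomy promised by the authors can be sketched as a simple cross-tabulation of dataset tags. The dataset names and tag values below are placeholders, not the actual 29 datasets or their real attributes, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product

# Placeholder annotations: names and tags are illustrative assumptions only.
DATASETS = {
    "dataset_a": {"script": "latin", "text_type": "printed"},
    "dataset_b": {"script": "latin", "text_type": "handwritten"},
    "dataset_c": {"script": "chinese", "text_type": "printed"},
    "dataset_d": {"script": "latin", "text_type": "printed"},
}


def coverage_matrix(datasets, row_key, col_key):
    """Count how many datasets fall into each (row, column) cell."""
    return Counter((tags[row_key], tags[col_key]) for tags in datasets.values())


def empty_cells(matrix, rows, cols):
    """List the (row, column) combinations that no dataset covers."""
    return [(r, c) for r, c in product(rows, cols) if matrix[(r, c)] == 0]


def crowded_cells(matrix, min_count=2):
    """List cells holding several datasets, which are candidates for a redundancy check."""
    return sorted(cell for cell, count in matrix.items() if count >= min_count)


matrix = coverage_matrix(DATASETS, "script", "text_type")
gaps = empty_cells(matrix, ["latin", "chinese"], ["printed", "handwritten"])
crowded = crowded_cells(matrix)
```

Empty cells point to possible coverage gaps, while crowded cells only flag where near-duplicates might exist; confirming actual overlap would still require inspecting the datasets themselves.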
Circularity Check
No circularity: empirical benchmark with direct evaluations
Full rationale
This is an empirical benchmark paper that curates 29 existing datasets across five task categories and reports direct model outputs on them. It claims no derivations, predictions, fitted parameters, or first-principles results. The statement that OCRBench is the most comprehensive rests on a count of the included datasets, not on any constructed or self-referential result. No load-bearing step reduces to its inputs by construction, self-citation chains, or ansatzes. The work is self-contained against external datasets and model runs.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing, dimension_forced (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
-
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
DocAtlas: Multilingual Document Understanding Across 80+ Languages
DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
-
NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
-
MinerU: An Open-Source Solution for Precise Document Content Extraction
MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Reference graph
Works this paper leans on
-
[1]
OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2023
work page 2023
- [2]
-
[3]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [4]
-
[5]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
Vicuna. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna. lmsys.org/, 2023
work page 2023
-
[6]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Vision-language pre-training: Basics, recent advances, and future trends
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision-language pre-training: Basics, recent advances, and future trends. F oundations and Trends® in Computer Graphics and Vision, 2022
work page 2022
-
[8]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 8
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
Florence: A New Foundation Model for Computer Vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[10]
Scaling up visual and vision-language representa- tion learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021
-
[11]
ELEV ATER: A benchmark and toolkit for evaluating language-augmented visual models
Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. ELEV ATER: A benchmark and toolkit for evaluating language-augmented visual models. In NeurIPS Track on Datasets and Benchmarks , 2022
work page 2022
-
[12]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems , volume 35, pages 23716–23736, 2022
work page 2022
-
[14]
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[15]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024
work page 2024
-
[16]
Gemini: A family of highly capable multimodal models
Google Gemini Team. Gemini: A family of highly capable multimodal models. 2023
work page 2023
- [17]
-
[18]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023
work page 2023
-
[19]
Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023
work page 2023
-
[20]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 26296–26306, 2024
work page 2024
-
[21]
MiniGPT-4: Enhancing vision- language understanding with advanced large language models, 2023
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision- language understanding with advanced large language models, 2023
work page 2023
-
[22]
mPLUG-Owl: Modularization empowers large language models with multimodality, 2023
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mPLUG-Owl: Modularization empowers large language models with multimodality, 2023
work page 2023
-
[23]
mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023
work page 2023
-
[24]
Llavar: Enhanced visual instruction tuning for text-rich image understanding, 2023
Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding, 2023
work page 2023
-
[25]
Bliva: A simple multimodal llm for better handling of text-rich visual questions, 2023
Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions, 2023
work page 2023
-
[26]
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023
work page 2023
-
[27]
Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding, 2023
work page 2023
-
[28]
Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, and Can Huang. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding, 2023
work page 2023
-
[29]
Monkey: Image resolution and text label are important things for large multi-modal models, 2023
Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models, 2023. 9
work page 2023
-
[30]
Textmonkey: An ocr-free large multimodal model for understanding document, 2024
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document, 2024
work page 2024
-
[31]
Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024
Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024
work page 2024
-
[32]
Lmms-eval: Accelerating the development of large multimoal models, March 2024
Bo Li*, Peiyuan Zhang*, Kaichen Zhang*, Fanyi Pu*, Yuhao Dong Xinrun Du, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimoal models, March 2024
work page 2024
-
[33]
Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023
work page 2023
-
[34]
Mmbench: Is your multi-modal model an all-around player?, 2023
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023
work page 2023
-
[35]
Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023
work page 2023
-
[36]
Ali Furkan Biten, Rubèn Pérez Tito, Andrés Mafla, Lluís Gómez, Marçal Rusiñol, Ernest Valveny, C. V . Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4290–4300, 2019
work page 2019
-
[37]
Top-down and bottom-up cues for scene text recognition
Anand Mishra, Karteek Alahari, and CV Jawahar. Top-down and bottom-up cues for scene text recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition , pages 2687–2694. IEEE, 2012
work page 2012
-
[38]
End-to-end scene text recognition using tree-structured models
Cunzhao Shi, Chunheng Wang, Baihua Xiao, Song Gao, and Jinlong Hu. End-to-end scene text recognition using tree-structured models. Pattern Recognition, 47:2853–2866, 2014
work page 2014
-
[39]
Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, M. Iwamura, Lluís Gómez i Bigorda, Sergi Robles Mestre, Joan Mas Romeu, David Fernández Mota, Jon Almazán, and Lluís-Pere de las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1484–1493, 2013
work page 2013
-
[40]
Dimosthenis Karatzas, Lluís Gómez i Bigorda, Anguelos Nicolaou, Suman K. Ghosh, Andrew D. Bagdanov, M. Iwamura, Jiri Matas, Lukás Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny. ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDA...
work page 2015
-
[41]
Recognizing text with perspective distortion in natural scenes
Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, and Chew Lim Tan. Recognizing text with perspective distortion in natural scenes. In 2013 IEEE International Conference on Computer Vision , pages 569–576, 2013
work page 2013
-
[42]
A robust arbitrary text detection system for natural scene images
Anhar Risnumawan, Palaiahnakote Shivakumara, Chee Seng Chan, and Chew Lim Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41:8027–8048, 2014
work page 2014
-
[43]
COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images
Andreas Veit, Tomas Matera, Lukás Neumann, Jiri Matas, and Serge J. Belongie. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images. ArXiv, abs/1601.07140, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[44]
Curved scene text detection via transverse and longitudinal sequence connection
Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognit., 90:337–345, 2019
work page 2019
-
[45]
Total-Text: A comprehensive dataset for scene text detection and recognition
Chee-Kheng Chng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 935–942, 2017
work page 2017
-
[46]
From two to one: A new scene text recognizer with visual language modeling network
Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, and Yongdong Zhang. From two to one: A new scene text recognizer with visual language modeling network. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14174–14183, 2021
work page 2021
-
[47]
Toward understanding WordArt: Corner- guided transformer for scene text recognition
Xudong Xie, Ling Fu, Zhifei Zhang, Zhaowen Wang, and Xiang Bai. Toward understanding WordArt: Corner- guided transformer for scene text recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 303–321. Springer, 2022
work page 2022
-
[48]
The iam-database: an english sentence database for offline handwriting recognition
U-V Marti and Horst Bunke. The iam-database: an english sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition , 5:39–46, 2002. 10
work page 2002
-
[49]
Icdar 2019 robust reading challenge on reading chinese text on signboard
Rui Zhang, Yongsheng Zhou, Qianyi Jiang, Qi Song, Nan Li, Kai Zhou, Lei Wang, Dong Wang, Minghui Liao, Mingkun Yang, et al. Icdar 2019 robust reading challenge on reading chinese text on signboard. In 2019 international conference on document analysis and recognition (ICDAR) , pages 1577–1581. IEEE, 2019
work page 2019
-
[50]
Saavedra, David Contreras, Juan Manuel Barrios, and Luiz S
Markus Diem, Stefan Fiel, Florian Kleber, Robert Sablatnig, Jose M. Saavedra, David Contreras, Juan Manuel Barrios, and Luiz S. Oliveira. Icfhr 2014 competition on handwritten digit string recognition in challenging datasets (hdsrc 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition , pages 779–784, 2014
work page 2014
-
[51]
Jawahar, Ernest Valveny, and Dimosthenis Karatzas
Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, C.V . Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Icdar 2019 competition on scene text visual question answering. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1563–1570, 2019
work page 2019
-
[52]
Towards VQA models that can read, 2019
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read, 2019
work page 2019
-
[53]
OCR-VQA: visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–952, 2019
work page 2019
-
[54]
On the general value of evidence, and bilingual scene-text visual question answering
Xinyu Wang, Yuliang Liu, Chunhua Shen, Chun Chet Ng, Canjie Luo, Lianwen Jin, Chee Seng Chan, Anton van den Hengel, and Liangwei Wang. On the general value of evidence, and bilingual scene-text visual question answering. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 10123–10132, 2020
work page 2020
-
[55]
Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawahar. DocVQA: A dataset for vqa on document images, 2021
work page 2021
-
[56]
Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Info- graphicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages 1697–1706, 2022
work page 2022
-
[57]
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022
-
[58]
Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, September 2019
-
[59]
Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents, 2019
-
[60]
Jianfeng Kuang, Wei Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Yu Zhou, and Xiang Bai. Visual information extraction in the wild: Practical dataset and end-to-end solution. arXiv preprint arXiv:2305.07498, 2023
-
[61]
Ye Yuan, Xiao Liu, Wondimu Dikubab, Hui Liu, Zhilong Ji, Zhongqin Wu, and Xiang Bai. Syntax-aware network for handwritten mathematical expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4553–4562, 2022
-
[62]
OpenBMB. Minicpm-v2.6. https://huggingface.co/openbmb/MiniCPM-V-2_6, 2024
-
[63]
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs, 2024
-
[64]
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...
-
[65]
PaliGemma: A versatile 3B VLM for transfer, 2024
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bau...
- [66]
-
[67]
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual expert for pretrained language models, 2023
- [68]
-
[69]
Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, and Xiang Bai. Mini-Monkey: Alleviate the sawtooth effect by multi-scale adaptive cropping, 2024
-
[70]
Anthropic. Claude3.5-sonnet. https://docs.anthropic.com/en/docs/build-with-claude/vision, 2024
-
[71]
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024
-
[72]
OpenAI. Gpt-4o-mini-20240718. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024
-
[73]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. InternLM-XComposer2: Mastering free-form text-image composition and comprehe...
- [74]
-
[75]
Google. Gemini models. https://deepmind.google/technologies/gemini/, 2024
- [76]
-
[77]
Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model, 2024
-
[78]
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023
-
[79]
OpenBMB. Minicpm-llama3-v-2.5. https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5, 2024
-
[80]
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners, 2024