OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Pith reviewed 2026-05-17 20:29 UTC · model grok-4.3
The pith
A new benchmark shows most large multimodal models score below 50 out of 100 on visual text tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OCRBench v2 provides the widest coverage yet for text-centric visual understanding, with 31 scenarios, thorough metrics, 10,000 verified pairs, and a private test set; benchmarking reveals that current LMMs generally score below 50 out of 100 and exhibit five recurring limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning.
What carries the argument
OCRBench v2, the expanded benchmark consisting of diverse scenarios, human-verified QA pairs, and separate public and private test sets used to measure LMM performance on text localization and reasoning.
If this is right
- Models require targeted gains in recognizing uncommon or handwritten text.
- Better spatial and layout understanding is needed to parse document structure.
- Reasoning capabilities must advance to connect information across text elements.
- Fine-grained visual detail extraction remains a bottleneck for complex scenes.
Where Pith is reading between the lines
- Training pipelines could prioritize data that stresses the five identified weaknesses to accelerate progress.
- Real-world systems for document processing or scene text analysis may still need supplementary rule-based components.
- Extending the benchmark to additional scripts or domains could expose further model gaps not visible in the current 31 scenarios.
Load-bearing premise
The chosen 31 scenarios and 10,000 question-answer pairs together with the private test set give an unbiased picture of model limits without selection effects that favor particular failure modes.
What would settle it
A new model that scores above 70 on both the public and private sets while showing none of the five listed limitations would contradict the claim that most current LMMs suffer from those specific weaknesses.
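The settling condition above can be written down as a simple check. The following is a minimal sketch, not part of the paper or of Pith: the `ModelResult` type and the `settles_it` function are hypothetical names, and how a limitation is judged to be "observed" for a model is left to the caller as an assumption.

```python
from dataclasses import dataclass, field

# The five limitation categories named in the core claim.
LIMITATIONS = frozenset({
    "less frequently encountered text recognition",
    "fine-grained perception",
    "layout perception",
    "complex element parsing",
    "logical reasoning",
})


@dataclass
class ModelResult:
    """Hypothetical container for one model's benchmark outcome."""
    public_score: float   # score out of 100 on the public set
    private_score: float  # score out of 100 on the private set
    observed_limitations: frozenset = field(default_factory=frozenset)


def settles_it(result: ModelResult, threshold: float = 70.0) -> bool:
    """Return True if the result meets the stated counterexample condition:
    above the threshold on both sets and none of the five limitations."""
    clears_both = (result.public_score > threshold
                   and result.private_score > threshold)
    shows_none = not (result.observed_limitations & LIMITATIONS)
    return clears_both and shows_none
```

A result that clears the threshold on only one set, or that still shows any one of the five limitations, does not meet the condition.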
read the original abstract
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (out of 100) and suffer from five types of limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The project website is at: https://99franklin.github.io/ocrbench_v2/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OCRBench v2, a large-scale bilingual benchmark for evaluating large multimodal models (LMMs) on visual text localization and reasoning. It expands prior work with 4x more tasks than OCRBench, 31 diverse scenarios, 10,000 human-verified QA pairs featuring a high proportion of difficult samples, and a private test set of 1,500 manually annotated images. The authors benchmark state-of-the-art LMMs, reporting that most models score below 50/100 and exhibit five specific limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning. Reliability is supported by consistent performance trends between the public and private test sets.
Significance. If the benchmark construction and human verification hold, this work provides a valuable, more comprehensive tool for diagnosing OCR weaknesses in LMMs beyond basic text recognition. The explicit human verification of 10,000 pairs and the private test set with consistent trends are clear strengths that enhance reproducibility and reduce selection bias concerns. The findings can usefully direct future research toward the five identified limitation categories.
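One way to make "consistent trends" between the public and private sets concrete is a rank correlation over per-model scores. The sketch below computes Spearman's rank correlation in plain Python; it is an illustration only, the paper's own consistency analysis is not described in this report, and the score lists in the usage example are placeholder numbers, not results from the paper.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder per-model scores (hypothetical, one entry per model).
public_scores = [48.2, 41.5, 39.0, 30.1]
private_scores = [45.0, 42.3, 35.7, 28.9]
rho = spearman(public_scores, private_scores)
```

A value of `rho` close to 1 would mean the two sets rank models in nearly the same order, which is the kind of agreement the reliability argument relies on.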
minor comments (2)
- [Abstract] The claims of 'thorough evaluation metrics' and a 'high proportion of difficult samples' would benefit from one additional sentence summarizing the scoring rubric (e.g., exact criteria for partial credit on localization or reasoning tasks) and the selection process for difficult samples.
- [Benchmark Construction] The manuscript would be strengthened by a short table or paragraph in the benchmark construction section explicitly mapping the 31 scenarios to the five limitation types, to make the categorization less implicit.
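To illustrate what a partial-credit criterion for localization could look like, the sketch below computes intersection over union (IoU) between a predicted and a ground-truth box, a widely used measure for box overlap. This report does not state which localization metric OCRBench v2 uses, so treating IoU as the rubric is an assumption made here for illustration; the function name and the `(x1, y1, x2, y2)` box format are likewise hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).

    Returns a value in [0, 1]; 0 when the boxes do not overlap.
    """
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Under such a rubric, a prediction that overlaps the target only partly earns a fractional score rather than zero, which is the kind of detail the first comment asks the authors to spell out.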
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the benchmark's strengths in scale, human verification, and private test set, and their recommendation to accept the manuscript. No major comments were raised that require point-by-point rebuttal.
Circularity Check
No significant circularity
full rationale
The paper introduces OCRBench v2 as a new benchmark built from freshly collected images, 31 scenarios, and 10,000 human-verified QA pairs plus a private test set. Its central claims consist of empirical performance numbers and observed limitation categories obtained by running existing LMMs on this dataset. No equations, first-principles derivations, fitted parameters, or self-citation chains are used to generate the reported scores or limitations; the results are direct measurements on independently annotated data and therefore do not reduce to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human verification produces accurate ground-truth labels for the 10,000 QA pairs.
Forward citations
Cited by 17 Pith papers
- How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
  PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
- Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
  Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
- The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
  SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
- ParseBench: A Document Parsing Benchmark for AI Agents
  ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.
- From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models
  A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.
- FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR
  FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
  LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
- CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
  CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
- Discovering Failure Modes in Vision-Language Models using RL
  An RL-based questioner agent adaptively generates queries to discover novel failure modes in VLMs without human intervention.
- MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
  MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion
  CPIFNet decomposes non-homogeneous dehazing into multiple homogeneous sub-problems via specialized IENet branches trained on different haze concentrations, then uses IFNet to fuse advantageous regions through deep fea...
- Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection
  FPFNet reports state-of-the-art AUROC scores on MVTec-AD and VisA for unified multi-class defect detection by adding feature perturbation and hierarchical fusion to UniAD with no extra parameters.
- Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
  Linear probing reveals a gap between internal representations and responses in LVLMs for visual document understanding, with task information encoded more linearly in intermediate layers than the final layer, and fine...
- Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
  A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduc...
- Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
  Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...