pith. sign in

Ocr-reasoning benchmark: Unveiling the true capabilities of mllms in complex text-rich image reasoning

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it
abstract

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50\% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

citation-role summary

background 2 baseline 1

citation-polarity summary

years

2026 4 2025 2

representative citing papers

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

cs.LG · 2026-04-27 · unverdicted · novelty 5.0 · 2 refs

Nemotron 3 Nano Omni is an efficient open multimodal model supporting audio, text, images, and video with reported accuracy gains and leading results on document understanding and long audio-video tasks.

citing papers explorer

Showing 6 of 6 citing papers.