Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
DocLayLLM: An efficient and effective multi-modal extension of large language models for text-rich document understanding
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV 4
roles: background 1
polarities: background 1
representative citing papers
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
DocVAL transfers spatial reasoning via validated CoT distillation from large teachers to compact student VLMs, delivering up to 6-7 ANLS gains and strong mAP localization on document VQA benchmarks.
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.
citing papers explorer
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning