Doclayllm: An efficient and effective multi-modal extension of large language models for text-rich document understanding

Wenhui Liao, Jiapeng Wang, Hongliang Li, Chengyu Wang, Jun Huang, Lianwen Jin · 2024 · arXiv 2408.15045

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

cs.CV · 2026-03-31 · unverdicted · novelty 7.0

Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

cs.CV · 2024-12-31 · accept · novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

cs.CV · 2025-11-27 · unverdicted · novelty 6.0

DocVAL transfers spatial reasoning via validated CoT distillation from large teachers to compact student VLMs, delivering up to 6-7 ANLS gains and strong mAP localization on document VQA benchmarks.

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

cs.CV · 2025-07-14 · unverdicted · novelty 3.0

A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.

citing papers explorer

Showing 4 of 4 citing papers.

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models cs.CV · 2026-03-31 · unverdicted · none · ref 19
Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning cs.CV · 2024-12-31 · accept · none · ref 37
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA cs.CV · 2025-11-27 · unverdicted · none · ref 15
DocVAL transfers spatial reasoning via validated CoT distillation from large teachers to compact student VLMs, delivering up to 6-7 ANLS gains and strong mAP localization on document VQA benchmarks.
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends cs.CV · 2025-07-14 · unverdicted · none · ref 33
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.

Doclayllm: An efficient and effective multi-modal extension of large language models for text-rich document understanding

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer