Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Bo Zhang; Gang Huang; Hangdi Xing; Huan Zhou; Jiajun Bu; Kai Ye; Kehan Chen; Sheng Zhou; Xianwei Mao; Ye Mo

arxiv: 2505.18603 · v2 · pith:SGK536E6new · submitted 2025-05-24 · 💻 cs.AI · cs.CV


Ye Mo, Kai Ye, Xianwei Mao, Zirui Shao, Gang Huang, Bo Zhang, Hangdi Xing, Kehan Chen, Huan Zhou, Zixu Yan, Jiajun Bu, Sheng Zhou



keywords: visual, doc-cob, document, reasoning, information, layout, regions, understanding



Document understanding aims to perform question answering and information extraction over document images, where the visual content is highly information-dense and most queries rely on only a few relevant layout regions. However, existing methods either adopt a one-pass strategy that implicitly assumes all layouts are equally important, or focus excessively on small regions at the cost of losing critical layout information. To address these limitations, we introduce Doc-CoB (Chain-of-Boxes), a simple-yet-effective framework that integrates coarse-to-fine layout-aware visual reasoning into multimodal large language models. Instead of directly zooming into small regions, Doc-CoB progressively focuses on query-relevant layouts while preserving global document information. Specifically, it first selects key layout boxes and then focuses on them for further understanding with visual prompting. To support this paradigm, we introduce two reasoning tasks for box recognition and box reasoning, with an automatic pipeline that constructs 249k training samples with intermediate visual supervision. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability.
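The two-stage flow described in the abstract (first select key layout boxes, then focus on them while keeping the whole page in view) can be illustrated with a minimal sketch. This is not the authors' implementation: `LayoutBox`, `select_boxes`, and `build_prompt` are hypothetical names, and the word-overlap scoring is only a stand-in for the box recognition step, which in Doc-CoB is performed by a multimodal large language model.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutBox:
    """A hypothetical layout region: an id, pixel coordinates, and its text."""
    box_id: int
    bbox: tuple  # (x0, y0, x1, y1)
    text: str


def _words(s):
    """Lowercase word tokens with punctuation stripped."""
    return set(re.findall(r"\w+", s.lower()))


def select_boxes(query, boxes, top_k=2):
    """Stage 1 stand-in: keep up to top_k boxes sharing words with the query.

    Assumption: word overlap replaces the model's learned box recognition.
    """
    query_words = _words(query)

    def score(box):
        return len(query_words & _words(box.text))

    ranked = sorted(boxes, key=score, reverse=True)
    return [b for b in ranked[:top_k] if score(b) > 0]


def build_prompt(query, selected):
    """Stage 2 stand-in: point at the selected boxes without cropping the page."""
    marks = "; ".join(f"box {b.box_id} at {b.bbox}" for b in selected)
    return f"Question: {query}\nFocus on: {marks}\nThe full page remains visible."
```

The sketch keeps the global page in the prompt rather than cropping to the selected boxes, mirroring the abstract's contrast with methods that zoom directly into small regions.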



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
cs.CL · 2026-05 · unverdicted · novelty 6.0

CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.