Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

Catyana Heyne; Filippo Riccio; Jürgen Frikel

arXiv: 2606.02162 · v1 · pith:DFLWSGKDnew · submitted 2026-06-01 · 💻 cs.CV · cs.AI · cs.CL · cs.IR


Pith reviewed 2026-06-28 15:00 UTC · model grok-4.3


keywords: multimodal document classification · visually rich documents · LayoutLMv3 · Donut · Qwen3 · RVL-CDIP · OCR-free · transformer vs LLM


The pith

Specialized multimodal Transformers outperform LLM-based models on visually rich documents, with image information as the strongest contributor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled comparison of four models on the RVL-CDIP benchmark for document type classification. Two specialized multimodal Transformers (LayoutLMv3 and Donut) are tested against two LLM-based systems (Qwen3-VL-32B-Instruct and Qwen3-32B) in a single experimental setup. The evaluation isolates how image, OCR-derived text, and layout each affect accuracy. The results show that the Transformer models perform better, driven mainly by visual features, while OCR-derived text adds only secondary value. The work concludes that dedicated multimodal processing remains necessary when documents carry pronounced layout structure.

Core claim

Specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure.

What carries the argument

Unified evaluation of LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B on RVL-CDIP to isolate the separate contributions of image, OCR text, and layout modalities.
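To make the idea of isolating modality contributions concrete, the following is a minimal sketch of how an ablation grid over the three named inputs could be enumerated. It is an illustration only: the function name `ablation_grid` is hypothetical, this page does not list the exact configurations the authors ran, and not every model can accept every subset (Qwen3-32B, for instance, is described here as text-only).

```python
from itertools import combinations

# The three input modalities named in the review.
MODALITIES = ("image", "text", "layout")


def ablation_grid(modalities=MODALITIES):
    """Return every non-empty subset of modalities, smallest subsets first.

    Hypothetical helper: the paper's actual ablation grid may differ.
    """
    grid = []
    for size in range(1, len(modalities) + 1):
        for subset in combinations(modalities, size):
            grid.append(subset)
    return grid


if __name__ == "__main__":
    # Print each configuration as a "+"-joined label, e.g. "image+text".
    for config in ablation_grid():
        print("+".join(config))
```

Three modalities give seven non-empty configurations. Comparing the score of a full configuration against the same configuration with one input removed is the usual way to attribute a gain to that input.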

Load-bearing premise

The four selected models and the RVL-CDIP benchmark together provide a representative and unbiased test of multimodal design strategies across transformer and LLM architectures.

What would settle it

An LLM-based model achieving higher accuracy than LayoutLMv3 and Donut on RVL-CDIP under the same controlled training and evaluation conditions would falsify the main performance claim.

Original abstract

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Narrow head-to-head of four models on RVL-CDIP finds the two specialized document transformers beat the two LLMs, driven mostly by image features, but the model set is too small to support broad claims about the families.


This paper runs LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B on RVL-CDIP in what it calls a unified framework and reports that the first two win, with image information carrying most of the signal and OCR text adding less.

It does a useful job of putting OCR-dependent and OCR-free models into the same experimental conditions so the modality contributions can be compared directly. That kind of controlled side-by-side is still rare enough that the tables will probably get cited by people who just need to pick a model for document classification.

The soft spot is exactly the one the stress-test note flags: four models is a thin basis for statements about “specialized multimodal Transformers” versus “LLM-based approaches.” The Qwen models are large and the paper gives no detail on whether fine-tuning budgets or data were matched, so the ordering could easily be an artifact of the specific choices rather than a general architectural fact. The abstract does not show additional models or ablation on the selection, which keeps the result narrow.

This is the sort of paper that belongs in a reading group when the group is looking at practical model selection for layout-heavy tasks. It does not introduce new methods or derivations, so I would not cite it for a technical claim, but the empirical comparison is honest enough to deserve referee time. Send it to review.

Referee Report

2 major / 0 minor

Summary. The paper provides a comparative analysis of multimodal strategies for document type classification on visually rich documents. It evaluates four models—LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B—on the RVL-CDIP benchmark within a unified experimental framework, contrasting OCR-dependent and OCR-free approaches. The central claim is that specialized multimodal Transformers outperform LLM-based methods on layout-intensive documents, with image information contributing most strongly to performance and OCR text providing secondary support.

Significance. If the empirical comparison is shown to be fair and representative, the work supplies concrete guidance on feature contributions (image vs. text vs. layout) and architectural families for document classification tasks. The use of a public benchmark and focus on multimodal necessity for layout-heavy documents could inform practical model selection in document AI.

Major comments (2)

[Abstract] The headline claim that 'specialized multimodal Transformers outperform LLM-based approaches' rests on the representativeness of exactly these four models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, Qwen3-32B). No justification is given for why the chosen LLMs are the strongest or most comparable representatives of the LLM family, nor that fine-tuning budgets and protocols were equalized; this makes the performance ordering vulnerable to selection artifacts rather than a general architectural property.

[Abstract] The abstract asserts a 'unified experimental framework' and reports that image information contributes most strongly, yet provides no details on data splits, statistical testing, or ablation controls. Without these, it is impossible to verify whether the reported ordering and modality contributions are robust or could be explained by implementation differences.
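As one illustration of the kind of statistical testing the second comment asks for, the sketch below implements an exact McNemar test for two classifiers scored on the same test items. This is an assumption about what a suitable check could look like, not a method the paper is known to use; the function name `mcnemar_exact` and its inputs are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two models scored on the same items.

    correct_a, correct_b: equal-length sequences of booleans, one per test
    item, True where the respective model predicted the right class.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same test items")
    # Discordant pairs: items that exactly one of the two models gets right.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower-tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The test looks only at items where the two models disagree, which suits a shared benchmark split: a small p-value suggests the accuracy gap is unlikely to come from chance disagreement alone.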

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments point by point below and will revise the manuscript accordingly where appropriate.


Referee: [Abstract] The headline claim that 'specialized multimodal Transformers outperform LLM-based approaches' rests on the representativeness of exactly these four models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B). No justification is given for why the chosen LLMs are the strongest or most comparable representatives of the LLM family, nor that fine-tuning budgets and protocols were equalized; this makes the performance ordering vulnerable to selection artifacts rather than a general architectural property.

Authors: These four models were deliberately chosen as representatives of distinct architectural families relevant to the paper's focus: LayoutLMv3 (OCR-dependent layout-aware transformer), Donut (OCR-free document transformer), Qwen3-VL-32B-Instruct (multimodal LLM), and Qwen3-32B (text-only LLM). This selection enables a controlled contrast between specialized multimodal transformers and LLM-based approaches while highlighting OCR-dependent vs. OCR-free strategies. We will add an explicit justification paragraph in the Introduction or Experimental Setup section detailing the selection criteria (public availability, relevance to layout-intensive documents, and coverage of modality handling). On fine-tuning, we followed each model's standard recommended protocols on identical hardware and data to reflect typical usage; exact equalization of compute budgets is inherently limited by architectural differences (e.g., vision encoder sizes). We acknowledge this as a limitation of any cross-family comparison but maintain that the results indicate architectural trends rather than artifacts, as the ordering aligns with modality ablation findings.

Revision: partial

Referee: [Abstract] The abstract asserts a 'unified experimental framework' and reports that image information contributes most strongly, yet provides no details on data splits, statistical testing, or ablation controls. Without these, it is impossible to verify whether the reported ordering and modality contributions are robust or could be explained by implementation differences.

Authors: The full manuscript details the unified framework, including standard RVL-CDIP train/validation/test splits, repeated runs with different seeds for statistical assessment, and controlled modality ablations (image-only, text-only, layout-only, and combinations). The abstract's length constraint prevented inclusion of these specifics. We will revise the abstract to include a concise clause such as 'within a unified framework on standard RVL-CDIP splits with modality ablations and multi-run validation' to make the claims verifiable from the abstract alone.

Revision: yes
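The "repeated runs with different seeds" mentioned in the simulated response are usually reported as a mean and a spread per configuration. A minimal sketch of that aggregation follows; the helper name `summarize_runs` is hypothetical, and no accuracy values from the paper are reproduced or implied.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies.

    accuracies: one test accuracy per training seed for a single
    model and modality configuration.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(accuracies), stdev(accuracies)
```

Reporting the spread alongside the mean lets a reader judge whether a gap between two models is larger than the run-to-run variation of either one.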

Circularity Check

0 steps flagged

No circularity: empirical comparison on public benchmark


The paper conducts a controlled empirical evaluation of four models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, Qwen3-32B) on the RVL-CDIP benchmark under a unified experimental framework. No equations, fitted parameters, derivations, or self-citations are present that reduce any result to prior definitions by construction. The performance claims rest on direct experimental outcomes rather than any of the enumerated circular patterns. This is a standard self-contained empirical study against an external public benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical comparative study; contains no mathematical derivations, free parameters, axioms, or invented entities.



[1] [1]

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

Y. Ding, S. Luo, Y. Dai, Y. Jiang, Z. Li, G. Martin, and Y. Peng, “A survey on mllm-based visually rich document understanding: Methods, challenges, and emerging trends.” [Online]. Available: https://arxiv.org/pdf/2507.09861

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Deep learning based visually rich document content understanding: a survey,

Y. Ding, S. C. Han, J. Lee, and E. Hovy, “Deep learning based visually rich document content understanding: a survey,”Artificial Intelligence Review, vol. 59, no. 4, p. 114, 2026

2026

[3] [3]

A table detection method for pdf documents based on convolutional neural networks,

L. Hao, L. Gao, X. Yi, and Z. Tang, “A table detection method for pdf documents based on convolutional neural networks,” in2016 12th IAPR Workshop on Document Analysis Systems (DAS). IEEE, 2016, pp. 287–292

2016

[4] [4]

Mask R-CNN

K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn.” [Online]. Available: https://arxiv.org/pdf/1703.06870

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Faster r-cnn: Towards real-time object detection with region proposal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” inAdvances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28. Curran Associates, Inc, 2015. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/ 2015/...

2015

[6] [6]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds....

2019

[7] [7]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach.” [Online]. Available: https://arxiv.org/pdf/1907.11692

work page internal anchor Pith review Pith/arXiv arXiv 1907

[8] [8]

Deep learning based key information extraction from business documents: Systematic literature review,

A. M. Rombach and P. Fettke, “Deep learning based key information extraction from business documents: Systematic literature review,”ACM Computing Surveys, vol. 58, no. 2, 2025

2025

[9] [9]

Layoutlmv3: Pre-training for document ai with unified text and image masking,

Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei, “Layoutlmv3: Pre-training for document ai with unified text and image masking,” inProceedings of the 30th ACM International Conference on Multimedia, ser. ACM Digital Library, J. Magalhães, Ed. New York, NY, United States: Association for Computing Machinery, 2022, pp. 4083–4091. [Online]. Available: http://arxi...

work page arXiv 2022

[10] [10]

Ocr-free document understanding transformer,

G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park, “Ocr-free document understanding transformer,” inComputer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds. Cham: Springer Nature Switzerland, 2022, pp. 498–517. [Online]. Available: http://arxiv.org/pdf/2111.15664

work page arXiv 2022

[11] [11]

Qwen3-VL Technical Report

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, Q....

work page internal anchor Pith review Pith/arXiv arXiv

[12]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, S. Duan, W. Wang, Y. Wang, Y. Cheng, Z. He, Z. Su, Z. Yang, Z. Pan, A. Zeng, B. Wang, B. Chen, B. Shi, C. Pang, C. Zhang, Da Yin, F. Yang, G. Chen, H. Li, J. Zhu, J. Chen, J. Xu, J. Xu, J. Chen, J. Lin, J. Chen, J. Wang, J. Chen, L. Lei, L. Gong, L. Pan, M. Liu, M. Xu, M. Zha...


[13]

Deep learning approaches for information extraction from visually rich documents: datasets, challenges and methods

H. Gbada, K. Kalti, and M. A. Mahjoub, “Deep learning approaches for information extraction from visually rich documents: datasets, challenges and methods,” International Journal on Document Analysis and Recognition (IJDAR), vol. 28, no. 1, pp. 121–142, 2025

2025

[14]

Visually-rich document understanding: Concepts, taxonomy and challenges

A. Sassioui, R. Benouini, Y. El Ouargui, M. El Kamili, M. Chergui, and M. Ouzzif, “Visually-rich document understanding: Concepts, taxonomy and challenges,” in 2023 10th International Conference on Wireless Networks and Mobile Communications (WINCOM), 2023, pp. 1–7

2023

[15]

Are layout analysis and ocr still useful for document information extraction using foundation models?

A. Scius-Bertrand, A. Fakhari, L. Vögtlin, D. R. Cabral, and A. Fischer, “Are layout analysis and ocr still useful for document information extraction using foundation models?” in Document analysis and recognition - ICDAR 2024, ser. Lecture Notes in Computer Science, E. B. Smith, M. Liwicki, and L. Peng, Eds. Cham: Springer, 2024, vol. 14807, pp. 175–191

2024

[16]

Due: End-to-end document understanding benchmark

Ł. Borchmann, M. Pietruszka, T. Stanislawek, D. Jurkiewicz, M. Turski, K. Szyndler, and F. Graliński, “Due: End-to-end document understanding benchmark,” in Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) Datasets and Benchmarks Track (Round 2), 2021. [Online]. Available: https://openreview.net/forum?id=rNs2FvJGDK

2021

[17]

On evaluation of document classification with rvl-cdip

S. Larson, G. Lim, and K. Leach, “On evaluation of document classification with rvl-cdip,” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, A. Vlachos and I. Augenstein, Eds. Dubrovnik, Croatia: Association for Computational Linguistics, 2023, pp. 2665–2678. [Online]. Available: https://aclanth...

2023

[18]

Qwen3 Technical Report

K. Bao, Z. Cui, K. Dang, L. Deng, Y. Fan, R. Gao, C. Gao, H. Ge, F. Hu, C. Huang, F. Huang, B. Hui, Le Yu, A. Li, M. Li, M. Li, T. Li, H. Lin, J. Lin, D. Liu, S. Liu, Y. Liu, S. Luo, C. Lv, R. Men, Z. Qiu, X. Ren, X. Ren, Y. Su, J. Tang, T. Tang, J. Tu, Y. Wan, X. Wang, P. Wang, Z. Wang, H. Wei, M. Xue, K. Yang, A. Yang, B. Yang, J. Yang, J. Yang, J. Yang...


[19]

Evaluation of deep convolutional nets for document image classification and retrieval

A. W. Harley, A. Ufkes, and K. G. Derpanis, “Evaluation of deep convolutional nets for document image classification and retrieval,” in 13th International Conference on Document Analysis and Recognition (ICDAR 2015). Piscataway, NJ: IEEE, 2015, pp. 991–995

2015

[20]

Docformerv2: Local features for document understanding

S. Appalaraju, P. Tang, Q. Dong, N. Sankaran, Y. Zhou, and R. Manmatha, “Docformerv2: Local features for document understanding,” Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 2, pp. 709–718, 2024.

2024

[21]

Industry documents library

University of California, San Francisco, “Industry documents library.” [Online]. Available: https://www.industrydocuments.ucsf.edu/

[22]

Building a test collection for complex document information processing

D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, “Building a test collection for complex document information processing,” in Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, S. Dumais, Ed. New York, NY: ACM Press, 2006, pp. 665–666

2006

[23]

Cord: A consolidated receipt dataset for post-ocr parsing

S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, and H. Lee, “Cord: A consolidated receipt dataset for post-ocr parsing,” in Document Intelligence Workshop at Neural Information Processing Systems (NeurIPS), 2019. [Online]. Available: https://openreview.net/forum?id=SJl3z659UH

2019

[24]

Funsd: A dataset for form understanding in noisy scanned documents

G. Jaume, H. Ekenel, and J.-P. Thiran, “Funsd: A dataset for form understanding in noisy scanned documents,” in Accepted to ICDAR-OST, 2019. [Online]. Available: https://arxiv.org/pdf/1905.13538

2019

[25]

Information extraction from visually rich documents using llm-based organization of documents into independent textual segments

A. Bhattacharyya, A. Tripathi, U. Das, A. Karmakar, A. Pathak, and M. Gupta, “Information extraction from visually rich documents using llm-based organization of documents into independent textual segments,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, ...

2025

[26]

Layoutllm: Layout instruction tuning with large language models for document understanding

C. Luo, Y. Shen, Z. Zhu, Q. Zheng, Z. Yu, and C. Yao, “Layoutllm: Layout instruction tuning with large language models for document understanding,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 15630–15640

2024

[27]

Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model

J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, G. Xu, C. Li, J. Tian, Q. Qian, J. Zhang, Q. Jin, L. He, X. Lin, and F. Huang, “Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Associatio...

2023

[28]

Ocr quality: Key to enhanced data mining

A. Jääskeläinen, M. Lipsanen, A. Föhr, and T. Räisänen, “Ocr quality: Key to enhanced data mining,” in 2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), 2023, pp. 1–6

2023

[29]

Vlcdoc: Vision-language contrastive pre-training model for cross-modal document classification

S. Bakkali, Z. Ming, M. Coustaty, M. Rusiñol, and O. Ramos Terrades, “Vlcdoc: Vision-language contrastive pre-training model for cross-modal document classification,” Pattern Recognition, vol. 139, p. 109419, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0031320323001206

2023

[30]

Business document information extraction: Towards practical benchmarks

M. Skalický, Š. Šimsa, M. Uřičář, and M. Šulc, “Business document information extraction: Towards practical benchmarks,” in Experimental IR Meets Multilinguality, Multimodality, and Interaction, ser. Lecture Notes in Computer Science, A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Fa...

2022

[31]

Doclaynet: A large human-annotated dataset for document-layout segmentation

B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar, “Doclaynet: A large human-annotated dataset for document-layout segmentation,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. ACM Digital Library, A. Zhang, Ed. New York, United States: Association for Computing Machinery, 2022, pp. 3743–3751.

2022

[32]

A survey of recent approaches to form understanding in scanned documents

A. Abdallah, D. Eberharter, Z. Pfister, and A. Jatowt, “A survey of recent approaches to form understanding in scanned documents,” Artificial Intelligence Review, vol. 57, no. 12, p. 342, 2024

2024

[33]

Xfund: A benchmark dataset for multilingual visually rich form understanding

Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, and F. Wei, “Xfund: A benchmark dataset for multilingual visually rich form understanding,” in Findings of the Association for Computational Linguistics: ACL 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 3214–3224. [O...

2022

[34]

Swin transformer: Hierarchical vision transformer using shifted windows

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002

2021

[35]

Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreau...

2020

[36]

Multilingual denoising pre-training for neural machine translation

Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, L. Zettlemoyer, M. Johnson, B. Roark, and A. Nenkova, “Multilingual denoising pre-training for neural machine translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.47/

2020

[37]

Dit: Self-supervised pre-training for document image transformer

J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang, and F. Wei, “Dit: Self-supervised pre-training for document image transformer,” in Proceedings of the 30th ACM International Conference on Multimedia, ser. ACM Digital Library, J. Magalhães, Ed. New York, NY, United States: Association for Computing Machinery, 2022, pp. 3530–3539

2022

[38]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. Hénaff, J. Harmsen, A. Steiner, and X. Zhai, “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.” [Online]. Available: https://arxiv.org/pdf/2502.14786


[39]

Qwen2 Technical Report

A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T...


[40]

Huggingface transformers: State-of-the-art natural language processing

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Huggingface transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empiri...

2020