pith. machine review for the scientific record.

arxiv: 2604.00161 · v2 · submitted 2026-03-31 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

Anan Du, Feng Feng, Hailong Yu, Hang Li, Jian Luan, Longwei Xu, Pei Fu, Shaojie Zhang, Xin Chen, Zhenbo Luo

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords Q-Mask · causal masks · text anchoring · OCR · vision-language models · mask decoder · text grounding · visual chain-of-thought

The pith

Q-Mask generates query-conditioned visual masks before OCR output to create stable text anchors in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models often fail to ground specific queried text to its exact location in an image, even when they can read the text correctly. Q-Mask addresses this by inserting a causal decoder step that first produces a visual mask tied to the query, then performs recognition. This sequence separates locating the text from reading its content, following a visual chain-of-thought pattern. A new dataset of 26 million image-text pairs with fine-grained masks supplies the training signal for these correspondences. The result is improved text anchoring and understanding across diverse visual scenes.

Core claim

The central claim is that a causal query-driven mask decoder enables precise text anchoring by sequentially generating query-conditioned visual masks prior to OCR recognition, thereby disentangling spatial location from textual content and enforcing grounded evidence acquisition before final output.
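Read as a decoding factorization, the ordering is easy to state. With notation that is ours rather than the paper's (image x, query q, visual mask m, recognized text y), causal mask-first decoding commits to the mask before any text token is emitted:

    p(y, m \mid x, q) \;=\; \underbrace{p(m \mid x, q)}_{\text{where the text is}} \cdot \underbrace{p(y \mid x, q, m)}_{\text{what the text is}}

A joint or non-causal decoder would model the left-hand side without that commitment; the claim is that factoring out p(m | x, q) first is what yields stable anchors.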

What carries the argument

The causal query-driven mask decoder (CQMD) that produces query-specific visual masks to guide subsequent OCR recognition.
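A minimal sketch of that two-stage control flow, assuming a generic VLM interface; every name here (encode_image, decode_mask, decode_text) is an illustrative placeholder, not the paper's actual API:

    def q_mask_inference(model, image, query):
        """Two-stage visual chain-of-thought: locate first, then read."""
        visual_tokens = model.encode_image(image)

        # Stage 1 ("where"): commit to a query-conditioned visual mask
        # before emitting any text, fixing a spatial anchor.
        mask = model.decode_mask(visual_tokens, query)

        # Stage 2 ("what"): recognition is conditioned on that mask, so
        # the emitted text is tied to the anchored region.
        text = model.decode_text(visual_tokens, query, mask)

        return mask, text  # the mask doubles as an explicit text anchor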

Load-bearing premise

Sequentially generating query-conditioned visual masks before recognition will enforce grounded evidence acquisition and produce stable text anchors without introducing new biases or failing to generalize beyond the TextAnchor-26M training distribution.

What would settle it

A held-out test set of images outside the TextAnchor-26M distribution where Q-Mask produces lower text-region grounding accuracy than a baseline VLM or generates masks that point to incorrect regions.
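One concrete form the test could take: score query-conditioned masks against ground-truth regions on a held-out, out-of-distribution set and compare against a baseline VLM. A sketch under assumed inputs (binary HxW numpy masks); the 0.5 IoU threshold is our choice, not the paper's protocol:

    import numpy as np

    def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
        """IoU between two binary masks of identical HxW shape."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            return 0.0
        return float(np.logical_and(pred, gt).sum() / union)

    def grounding_accuracy(pred_masks, gt_masks, iou_threshold=0.5):
        """Fraction of queries whose predicted mask hits the true region."""
        hits = [mask_iou(p, g) >= iou_threshold
                for p, g in zip(pred_masks, gt_masks)]
        return float(np.mean(hits))

    # The claim would be undermined if, on held-out OOD images,
    # grounding_accuracy(qmask_masks, gt_masks) were no better than
    # grounding_accuracy(baseline_masks, gt_masks).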

read the original abstract

Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles where the text is from what the text is, enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training. Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Q-Mask, an OCR framework for vision-language models that employs a causal query-driven mask decoder (CQMD) to sequentially generate query-conditioned visual masks before producing the final OCR output. This visual chain-of-thought approach is intended to disentangle spatial grounding ('where the text is') from recognition ('what the text is'), thereby enforcing grounded evidence acquisition and enabling explicit text anchors. The work introduces TextAnchor-Bench for evaluating fine-grained text-region grounding and TextAnchor-26M, a large-scale dataset of image-text pairs with mask annotations, and claims that extensive experiments show substantial improvements in text anchoring and understanding across diverse scenes.

Significance. If the central claims hold with rigorous validation, the causal mask-decoding paradigm could meaningfully advance reliable text grounding in VLMs for downstream VQA tasks. The new benchmark and dataset would also provide reusable resources for studying spatial priors in OCR-oriented models.

major comments (3)
  1. [§3] §3 (CQMD architecture): The core claim that sequential causal mask generation enforces stable anchors and mitigates error propagation is load-bearing, yet the manuscript provides no ablation comparing the proposed order to joint mask-OCR modeling or non-causal alternatives; without this, it remains unclear whether early mask inaccuracies cascade into worse recognition performance.
  2. [§4] §4 (Experiments): The abstract states that 'extensive experiments demonstrate substantial improvements,' but the text supplies no quantitative metrics, baseline comparisons (e.g., against standard VLM attention or non-causal OCR models), error bars, or controls for dataset scale, leaving the magnitude and reliability of gains unverifiable.
  3. [§4.3] §4.3 (TextAnchor-26M): The dataset is presented as injecting strong spatial priors, but the manuscript does not report annotation methodology, inter-annotator agreement, or out-of-distribution generalization tests; this is critical because the central assumption that query-conditioned masks produce stable anchors may fail outside the 26M training distribution.
minor comments (1)
  1. [Abstract] The acronym CQMD is used in the abstract without immediate expansion, which reduces readability on first encounter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We have reviewed each major comment carefully and provide point-by-point responses below. All requested clarifications and additions will be incorporated into the revised manuscript to strengthen the presentation of the CQMD architecture, experimental results, and dataset details.

read point-by-point responses
  1. Referee: [§3] §3 (CQMD architecture): The core claim that sequential causal mask generation enforces stable anchors and mitigates error propagation is load-bearing, yet the manuscript provides no ablation comparing the proposed order to joint mask-OCR modeling or non-causal alternatives; without this, it remains unclear whether early mask inaccuracies cascade into worse recognition performance.

    Authors: We agree that direct ablations are essential to substantiate the causal ordering. In the revision we will add a dedicated ablation subsection comparing the proposed sequential CQMD against (i) a joint mask-OCR decoder and (ii) a non-causal bidirectional mask decoder. These experiments will quantify error propagation by measuring recognition accuracy when early masks are intentionally perturbed, thereby testing whether the causal constraint indeed stabilizes anchors. We will also include a brief discussion of potential failure modes (a sketch of one such perturbation test follows these responses). revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract states that 'extensive experiments demonstrate substantial improvements,' but the text supplies no quantitative metrics, baseline comparisons (e.g., against standard VLM attention or non-causal OCR models), error bars, or controls for dataset scale, leaving the magnitude and reliability of gains unverifiable.

    Authors: We acknowledge that the current draft does not present the numerical results with sufficient detail. The revised §4 will be expanded to include: (1) full quantitative tables reporting accuracy, grounding IoU, and downstream VQA gains on TextAnchor-Bench; (2) explicit comparisons against standard VLM attention baselines and non-causal OCR variants; (3) error bars computed over five independent runs; and (4) controlled experiments that vary training set size to isolate the contribution of TextAnchor-26M. These additions will make the reported improvements directly verifiable. revision: yes

  3. Referee: [§4.3] §4.3 (TextAnchor-26M): The dataset is presented as injecting strong spatial priors, but the manuscript does not report annotation methodology, inter-annotator agreement, or out-of-distribution generalization tests; this is critical because the central assumption that query-conditioned masks produce stable anchors may fail outside the 26M training distribution.

    Authors: We will add a new subsection detailing the annotation pipeline for TextAnchor-26M, including the query-to-mask generation protocol and quality-control steps. Inter-annotator agreement (Cohen’s kappa) will be reported. In addition, we will include OOD generalization experiments on held-out scene types and datasets not seen during training to evaluate whether the learned spatial priors transfer beyond the 26M distribution. revision: yes
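The perturbation experiment promised in the first response is straightforward to sketch. The translation-based perturbation, shift grid, and exact-match metric below are our choices, not the authors'; model.encode_image and model.decode_text reuse the placeholder interface from the sketch further above:

    import numpy as np

    def perturb_mask(mask: np.ndarray, shift: int, rng) -> np.ndarray:
        """Translate a binary mask by a random offset of up to `shift` pixels."""
        dy, dx = rng.integers(-shift, shift + 1, size=2)
        return np.roll(np.roll(mask, dy, axis=0), dx, axis=1)

    def error_propagation_curve(model, samples, shifts=(0, 2, 4, 8, 16), seed=0):
        """Recognition accuracy as a function of mask perturbation strength."""
        rng = np.random.default_rng(seed)
        curve = {}
        for s in shifts:
            correct = 0
            for image, query, gt_mask, gt_text in samples:
                noisy = perturb_mask(gt_mask, s, rng)
                # Stage 2 only: recognition conditioned on the degraded mask.
                pred = model.decode_text(model.encode_image(image), query, noisy)
                correct += int(pred == gt_text)
            curve[s] = correct / len(samples)
        return curve  # a steep drop would indicate cascading mask errors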

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces new components including the CQMD decoder, TextAnchor-Bench benchmark, and TextAnchor-26M dataset. Claims of improvement rest on empirical training and evaluation results rather than reducing any prediction or central result to fitted inputs, self-citations, or definitional equivalences by construction. No equations or load-bearing steps in the provided text collapse the output to the input via the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the new CQMD architecture providing independent grounding and the TextAnchor-26M dataset supplying spatial priors; no explicit free parameters are named, but training scale implies data-driven fitting of the decoder.

axioms (1)
  • domain assumption · Chain-of-thought-style sequential decoding improves grounding in visual tasks
    The paper invokes inspiration from CoT reasoning to justify generating masks before OCR output.
invented entities (3)
  • Causal Query-driven Mask Decoder (CQMD) · no independent evidence
    purpose: To sequentially generate query-conditioned visual masks prior to OCR recognition
    New decoder component introduced to disentangle location from content.
  • TextAnchor-Bench · no independent evidence
    purpose: Benchmark for evaluating fine-grained text-region grounding
    New evaluation resource to measure text anchoring capability.
  • TextAnchor-26M · no independent evidence
    purpose: Large-scale training dataset with fine-grained masks for text elements
    New data resource to inject spatial priors into VLM training.

pith-pipeline@v0.9.0 · 5572 in / 1383 out tokens · 81817 ms · 2026-05-13T23:27:35.316001+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
