pith. machine review for the scientific record.

arxiv: 2604.00161 · v2 · submitted 2026-03-31 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

Anan Du, Feng Feng, Hailong Yu, Hang Li, Jian Luan, Longwei Xu, Pei Fu, Shaojie Zhang, Xin Chen, Zhenbo Luo

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords Q-Mask · causal masks · text anchoring · OCR · vision-language models · mask decoder · text grounding · visual chain-of-thought

The pith

Q-Mask generates query-conditioned visual masks before OCR output to create stable text anchors in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models often fail to ground specific queried text to its exact location in an image, even when they can read the text correctly. Q-Mask addresses this by inserting a causal decoder step that first produces a visual mask tied to the query, then performs recognition. This sequence separates locating the text from reading its content, following a visual chain-of-thought pattern. A new dataset of 26 million image-text pairs with fine-grained masks supplies the training signal for these correspondences. The result is improved text anchoring and understanding across diverse visual scenes.

Core claim

The central claim is that a causal query-driven mask decoder enables precise text anchoring by sequentially generating query-conditioned visual masks prior to OCR recognition, thereby disentangling spatial location from textual content and enforcing grounded evidence acquisition before final output.
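Read as a decoding factorization, the ordering is easy to state. With notation that is ours rather than the paper's (image x, query q, visual mask m, recognized text y), causal mask-first decoding commits to the mask before any text token is emitted:

    p(y, m \mid x, q) \;=\; \underbrace{p(m \mid x, q)}_{\text{where the text is}} \cdot \underbrace{p(y \mid x, q, m)}_{\text{what the text is}}

A joint or non-causal decoder would model the left-hand side without that commitment; the claim is that factoring out p(m | x, q) first is what yields stable anchors.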

What carries the argument

The causal query-driven mask decoder (CQMD) that produces query-specific visual masks to guide subsequent OCR recognition.
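A minimal sketch of that two-stage control flow, assuming a generic VLM interface; every name here (encode_image, decode_mask, decode_text) is an illustrative placeholder, not the paper's actual API:

    def q_mask_inference(model, image, query):
        """Two-stage visual chain-of-thought: locate first, then read."""
        visual_tokens = model.encode_image(image)

        # Stage 1 ("where"): commit to a query-conditioned visual mask
        # before emitting any text, fixing a spatial anchor.
        mask = model.decode_mask(visual_tokens, query)

        # Stage 2 ("what"): recognition is conditioned on that mask, so
        # the emitted text is tied to the anchored region.
        text = model.decode_text(visual_tokens, query, mask)

        return mask, text  # the mask doubles as an explicit text anchor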

Load-bearing premise

Sequentially generating query-conditioned visual masks before recognition will enforce grounded evidence acquisition and produce stable text anchors without introducing new biases or failing to generalize beyond the TextAnchor-26M training distribution.

What would settle it

A held-out test set of images outside the TextAnchor-26M distribution where Q-Mask produces lower text-region grounding accuracy than a baseline VLM or generates masks that point to incorrect regions.
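One concrete form the test could take: score query-conditioned masks against ground-truth regions on a held-out, out-of-distribution set and compare against a baseline VLM. A sketch under assumed inputs (binary HxW numpy masks); the 0.5 IoU threshold is our choice, not the paper's protocol:

    import numpy as np

    def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
        """IoU between two binary masks of identical HxW shape."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            return 0.0
        return float(np.logical_and(pred, gt).sum() / union)

    def grounding_accuracy(pred_masks, gt_masks, iou_threshold=0.5):
        """Fraction of queries whose predicted mask hits the true region."""
        hits = [mask_iou(p, g) >= iou_threshold
                for p, g in zip(pred_masks, gt_masks)]
        return float(np.mean(hits))

    # The claim would be undermined if, on held-out OOD images,
    # grounding_accuracy(qmask_masks, gt_masks) were no better than
    # grounding_accuracy(baseline_masks, gt_masks).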

read the original abstract

Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles where the text is from what the text is, enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training. Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Q-Mask, an OCR framework for vision-language models that employs a causal query-driven mask decoder (CQMD) to sequentially generate query-conditioned visual masks before producing the final OCR output. This visual chain-of-thought approach is intended to disentangle spatial grounding ('where the text is') from recognition ('what the text is'), thereby enforcing grounded evidence acquisition and enabling explicit text anchors. The work introduces TextAnchor-Bench for evaluating fine-grained text-region grounding and TextAnchor-26M, a large-scale dataset of image-text pairs with mask annotations, and claims that extensive experiments show substantial improvements in text anchoring and understanding across diverse scenes.

Significance. If the central claims hold with rigorous validation, the causal mask-decoding paradigm could meaningfully advance reliable text grounding in VLMs for downstream VQA tasks. The new benchmark and dataset would also provide reusable resources for studying spatial priors in OCR-oriented models.

major comments (3)
  1. [§3] §3 (CQMD architecture): The core claim that sequential causal mask generation enforces stable anchors and mitigates error propagation is load-bearing, yet the manuscript provides no ablation comparing the proposed order to joint mask-OCR modeling or non-causal alternatives; without this, it remains unclear whether early mask inaccuracies cascade into worse recognition performance.
  2. [§4] §4 (Experiments): The abstract states that 'extensive experiments demonstrate substantial improvements,' but the text supplies no quantitative metrics, baseline comparisons (e.g., against standard VLM attention or non-causal OCR models), error bars, or controls for dataset scale, leaving the magnitude and reliability of gains unverifiable.
  3. [§4.3] §4.3 (TextAnchor-26M): The dataset is presented as injecting strong spatial priors, but the manuscript does not report annotation methodology, inter-annotator agreement, or out-of-distribution generalization tests; this is critical because the central assumption that query-conditioned masks produce stable anchors may fail outside the 26M training distribution.
minor comments (1)
  1. [Abstract] The acronym CQMD is used in the abstract without immediate expansion, which reduces readability on first encounter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We have reviewed each major comment carefully and provide point-by-point responses below. All requested clarifications and additions will be incorporated into the revised manuscript to strengthen the presentation of the CQMD architecture, experimental results, and dataset details.

read point-by-point responses
  1. Referee: [§3] §3 (CQMD architecture): The core claim that sequential causal mask generation enforces stable anchors and mitigates error propagation is load-bearing, yet the manuscript provides no ablation comparing the proposed order to joint mask-OCR modeling or non-causal alternatives; without this, it remains unclear whether early mask inaccuracies cascade into worse recognition performance.

    Authors: We agree that direct ablations are essential to substantiate the causal ordering. In the revision we will add a dedicated ablation subsection comparing the proposed sequential CQMD against (i) a joint mask-OCR decoder and (ii) a non-causal bidirectional mask decoder. These experiments will quantify error propagation by measuring recognition accuracy when early masks are intentionally perturbed, thereby testing whether the causal constraint indeed stabilizes anchors. We will also include a brief discussion of potential failure modes (a sketch of one such perturbation test follows these responses). revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract states that 'extensive experiments demonstrate substantial improvements,' but the text supplies no quantitative metrics, baseline comparisons (e.g., against standard VLM attention or non-causal OCR models), error bars, or controls for dataset scale, leaving the magnitude and reliability of gains unverifiable.

    Authors: We acknowledge that the current draft does not present the numerical results with sufficient detail. The revised §4 will be expanded to include: (1) full quantitative tables reporting accuracy, grounding IoU, and downstream VQA gains on TextAnchor-Bench; (2) explicit comparisons against standard VLM attention baselines and non-causal OCR variants; (3) error bars computed over five independent runs; and (4) controlled experiments that vary training set size to isolate the contribution of TextAnchor-26M. These additions will make the reported improvements directly verifiable. revision: yes

  3. Referee: [§4.3] §4.3 (TextAnchor-26M): The dataset is presented as injecting strong spatial priors, but the manuscript does not report annotation methodology, inter-annotator agreement, or out-of-distribution generalization tests; this is critical because the central assumption that query-conditioned masks produce stable anchors may fail outside the 26M training distribution.

    Authors: We will add a new subsection detailing the annotation pipeline for TextAnchor-26M, including the query-to-mask generation protocol and quality-control steps. Inter-annotator agreement (Cohen’s kappa) will be reported. In addition, we will include OOD generalization experiments on held-out scene types and datasets not seen during training to evaluate whether the learned spatial priors transfer beyond the 26M distribution. revision: yes
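The perturbation experiment promised in the first response is straightforward to sketch. The translation-based perturbation, shift grid, and exact-match metric below are our choices, not the authors'; model.encode_image and model.decode_text reuse the placeholder interface from the sketch further above:

    import numpy as np

    def perturb_mask(mask: np.ndarray, shift: int, rng) -> np.ndarray:
        """Translate a binary mask by a random offset of up to `shift` pixels."""
        dy, dx = rng.integers(-shift, shift + 1, size=2)
        return np.roll(np.roll(mask, dy, axis=0), dx, axis=1)

    def error_propagation_curve(model, samples, shifts=(0, 2, 4, 8, 16), seed=0):
        """Recognition accuracy as a function of mask perturbation strength."""
        rng = np.random.default_rng(seed)
        curve = {}
        for s in shifts:
            correct = 0
            for image, query, gt_mask, gt_text in samples:
                noisy = perturb_mask(gt_mask, s, rng)
                # Stage 2 only: recognition conditioned on the degraded mask.
                pred = model.decode_text(model.encode_image(image), query, noisy)
                correct += int(pred == gt_text)
            curve[s] = correct / len(samples)
        return curve  # a steep drop would indicate cascading mask errors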

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces new components including the CQMD decoder, TextAnchor-Bench benchmark, and TextAnchor-26M dataset. Claims of improvement rest on empirical training and evaluation results rather than reducing any prediction or central result to fitted inputs, self-citations, or definitional equivalences by construction. No equations or load-bearing steps in the provided text collapse the output to the input via the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the new CQMD architecture providing independent grounding and the TextAnchor-26M dataset supplying spatial priors; no explicit free parameters are named, but training scale implies data-driven fitting of the decoder.

axioms (1)
  • domain assumption · Chain-of-thought-style sequential decoding improves grounding in visual tasks
    The paper invokes inspiration from CoT reasoning to justify generating masks before OCR output.
invented entities (3)
  • Causal Query-driven Mask Decoder (CQMD) · no independent evidence
    purpose: To sequentially generate query-conditioned visual masks prior to OCR recognition
    New decoder component introduced to disentangle location from content.
  • TextAnchor-Bench · no independent evidence
    purpose: Benchmark for evaluating fine-grained text-region grounding
    New evaluation resource to measure text anchoring capability.
  • TextAnchor-26M · no independent evidence
    purpose: Large-scale training dataset with fine-grained masks for text elements
    New data resource to inject spatial priors into VLM training.

pith-pipeline@v0.9.0 · 5572 in / 1383 out tokens · 81817 ms · 2026-05-13T23:27:35.316001+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
