pith. machine review for the scientific record.

arxiv: 2604.08538 · v3 · submitted 2026-04-09 · 💻 cs.CV


ParseBench: A Document Parsing Benchmark for AI Agents

Boyang Zhang, Daniel B. Ospina, Pierre-Loïc Doulcet, Preston Carlson, Sacha Bron, Sebastián G. Acosta, Simon Suo

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords document parsing · AI agents · benchmark · semantic correctness · enterprise documents · tables · charts · visual grounding
0 comments

The pith

Current document parsing methods show inconsistent performance across dimensions critical for AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ParseBench to assess how well document parsers support AI agents in making autonomous decisions from enterprise documents. It argues that semantic correctness, rather than simple text matching, is essential for preserving structure in tables, charts, formatting, and visual elements. By testing multiple methods on real-world pages from insurance, finance, and government sources, the work shows inconsistent performance where no single approach excels in all required areas. A sympathetic reader would care because unreliable parsing can lead to flawed decisions by AI systems in practical applications.

Core claim

ParseBench is introduced as a benchmark spanning roughly 2,000 human-verified pages from insurance, finance, and government documents. Organized into five capability dimensions, it evaluates 14 parsing methods and establishes that capabilities remain fragmented, with persistent gaps for agent applications.
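
The fragmentation claim is, in effect, a predicate over per-dimension scores. A minimal sketch, assuming an unweighted mean for the aggregate and an illustrative 80-point bar for "consistently strong" (neither the weighting nor the threshold is stated by the paper):

```python
# The five dimension names follow the paper; the aggregation rule and
# the 80.0 threshold are our assumptions, for illustration only.
DIMENSIONS = ("tables", "charts", "content_faithfulness",
              "semantic_formatting", "visual_grounding")

def overall_score(per_dim):
    """Unweighted mean over the five dimensions (assumed; the paper
    only reports the 84.9% aggregate, not how it is computed)."""
    return sum(per_dim[d] for d in DIMENSIONS) / len(DIMENSIONS)

def is_fragmented(methods, threshold=80.0):
    """'Fragmented landscape': no method clears the threshold on
    every one of the five dimensions."""
    return not any(all(scores[d] >= threshold for d in DIMENSIONS)
                   for scores in methods.values())
```

A method that is excellent on tables but weak on charts can still post a high aggregate, which is exactly why the referee below asks for per-dimension tables.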

What carries the argument

ParseBench benchmark organized around five dimensions of semantic correctness for document parsing.

If this is right

  • Improved parsers are needed that balance performance across all dimensions rather than specializing in one.
  • Benchmarks for document parsing should incorporate semantic and structural evaluations instead of relying solely on text similarity.
  • AI agent systems can use such benchmarks to select or fine-tune parsers for enterprise automation tasks.
  • The identified gaps indicate specific areas like chart interpretation where further research is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designers might implement fallback mechanisms or verification steps when using current parsers.
  • Extending the benchmark to include more diverse document types could reveal additional challenges.
  • The emphasis on semantic correctness may influence how future vision-language models are trained for parsing tasks.

Load-bearing premise

That the five dimensions chosen adequately cover the semantic correctness requirements for AI agents operating on enterprise documents.

What would settle it

Finding or developing a parsing method that maintains high accuracy in all five dimensions on the benchmark's documents would falsify the observation of a fragmented capability landscape.

Figures

Figures reproduced from arXiv: 2604.08538 by Boyang Zhang, Daniel B. Ospina, Pierre-Lo\"ic Doulcet, Preston Carlson, Sacha Bron, Sebasti\'an G. Acosta, Simon Suo.

Figure 1
Figure 1. ParseBench comprises ∼2,000 human-verified enterprise document pages and more than 169K test rules, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Each dimension targets a distinct class of agent-critical parsing failures.
Figure 2
Figure 2. Illustration of the TABLERECORDMATCH metric. Predicted records and columns are matched to those in the ground truth. Each matched record is scored by binary cell-level agreement. ParseBench is designed to reflect real-world conditions: every table remains embedded in its original PDF page, preserving the full visual and structural context a production parser must navigate.
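
The TABLERECORDMATCH idea from Figure 2 — match predicted records to ground-truth records, then score each pair by binary cell-level agreement — can be sketched as follows. The greedy matching and the max-cardinality normalizations are our assumptions; the paper's exact matching procedure and RecordSim denominator may differ:

```python
def record_sim(g, p):
    """Binary cell-level agreement between a gold record g and a
    predicted record p (both dicts keyed by column). Dividing by the
    larger key set, so missing or extra columns are penalized, is an
    assumption: the paper's RecordSim definition is truncated at source."""
    if not g and not p:
        return 1.0
    shared = set(g) & set(p)
    agree = sum(1 for k in shared if g[k] == p[k])
    return agree / max(len(g), len(p))

def table_record_match(gold, pred):
    """Greedy stand-in for the paper's record matching: pair each gold
    record with its best-scoring unused predicted record, then average
    over max(|G|, |P|) as in the paper's Eq. (1)."""
    used, total = set(), 0.0
    for g in gold:
        best, best_j = 0.0, None
        for j, p in enumerate(pred):
            if j in used:
                continue
            s = record_sim(g, p)
            if s > best:
                best, best_j = s, j
        if best_j is not None:
            used.add(best_j)
            total += best
    denom = max(len(gold), len(pred))
    return total / denom if denom else 1.0
```

Note how a spurious extra predicted record lowers the score through the max(|G|, |P|) denominator even when every gold record is matched perfectly.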
Figure 3
Figure 3. Illustration of the CHARTDATAPOINTMATCH evaluation: a chart is parsed into a table, and annotated data points are verified by matching expected values and labels against the table's rows and columns. Let K(r) denote a record's column keys and r[k] its value at key k; then TableRecordMatch(G, P) = Σ_{(g,p)∈M} RecordSim(g, p) / max(|G|, |P|) (1), where RecordSim(g, p) = Σ_{k∈K(g)∩K(p)} 1[…].
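
The CHARTDATAPOINTMATCH evaluation from Figure 3 reduces to checking each annotated (label, value) pair against the rows of the parsed table. A hedged sketch — the function name mirrors the metric, but the lookup rule (label and value co-occurring in a row) is our simplification:

```python
def chart_datapoint_match(expected, table):
    """Fraction of annotated data points recovered from a parsed chart.

    expected: list of (label, value) pairs from the annotation.
    table: parsed chart as a list of row dicts.
    A point counts as found when its label and value appear together
    in some row (an assumed matching rule, not the paper's exact one).
    """
    def found(label, value):
        return any(label in row.values() and value in row.values()
                   for row in table)
    hits = sum(found(label, value) for label, value in expected)
    return hits / len(expected) if expected else 1.0
```

On the OECD example of Figure 3.7, the pair ("Sweden", 1) would be verified against the row containing both the country label and the unadjusted value.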
Figure 4
Figure 4. Formatting loss changes document semantics.
Figure 5
Figure 5. Quality vs. cost for all evaluated providers.
Figure 6
Figure 6. Distribution of chart characteristics in the benchmark dataset. Per-chart dimensions are computed over all …
Figure 7
Figure 7. Tables: quality vs. cost.
Figure 8
Figure 8. Charts: quality vs. cost. Here the relevant cost question is not how much to pay for a few extra points, but whether a system supports the capability at all.
Figure 9
Figure 9. Content Faithfulness: quality vs. cost.
Figure 10
Figure 10. Semantic Formatting: quality vs. cost. D.3 Fine-Grained Layout Metric Breakdown: Tables 9 and 10 unpack the headline visual-grounding score into finer-grained layout metrics for the same comparison set used in the main results. The Pages column reports how many pages each provider returns usable grounding output for, out of the nominal 500-page layout slice. Metric values are aggregated over the pages count …
Figure 11
Figure 11. Visual Grounding: quality vs. cost.
Figure 12
Figure 12. Difficulty-stratified layout element pass rate on two reported buckets.
Figure 13
Figure 13. Two contrasting predictions where GriTS and TEDS disagree with the table's downstream usefulness.
Figure 3.7
Figure 3.7. Change in literacy proficiency between cycles, by educational attainment. Columns: Educational attainment, Country, Round/Cycle, Unadjusted, Adjusted; rows (Below upper secondary): Sweden 1, 1, 10; Finland 1, -1, -1; Spain 1, -4, -4; Denmark 1, -5, -5; Canada 1, -5, -5. (b) LlamaParse Agentic (8/10).
Figure 14
Figure 14. Source chart and provider outputs for the OECD chart example. (a) Source chart with Sweden / Below …
Figure 15
Figure 15. Source page and provider outputs for the Federal Register example. (a) Source 3-column page. (b)– …
Figure 16
Figure 16. Source page and provider outputs for the infographic example. (a) Source page with 3-level heading …
Figure 17
Figure 17. Ground truth and predicted layout overlays for the Sappi annual report page. (a) Ground truth with 44 …
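
The visual-grounding dimension compares predicted layout boxes against ground-truth overlays of the kind shown in Figure 17. A standard way to score such overlays, sketched here under an assumed COCO-style IoU threshold of 0.5 with greedy matching (the paper reports finer-grained layout metrics than this):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_pass_rate(gt_boxes, pred_boxes, thresh=0.5):
    """Fraction of ground-truth layout boxes matched by some unused
    prediction at IoU >= thresh. Threshold and greedy matching are
    assumptions, not the paper's exact protocol."""
    used, hits = set(), 0
    for g in gt_boxes:
        best, best_j = 0.0, None
        for j, p in enumerate(pred_boxes):
            if j in used:
                continue
            s = iou(g, p)
            if s > best:
                best, best_j = s, j
        if best_j is not None and best >= thresh:
            used.add(best_j)
            hits += 1
    return hits / len(gt_boxes) if gt_boxes else 1.0
```

A provider that returns no usable grounding output scores zero here, which matches the paper's point that for visual grounding the question is whether a system supports the capability at all.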
original abstract

AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ${\sim}2{,}000$ human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at 84.9%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on https://huggingface.co/datasets/llamaindex/ParseBench and https://github.com/run-llama/ParseBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ParseBench, a benchmark of ~2,000 human-verified pages drawn from enterprise documents in insurance, finance, and government. It organizes evaluation around five capability dimensions (tables, charts, content faithfulness, semantic formatting, visual grounding) and reports results for 14 methods spanning VLMs, specialized parsers, and LlamaParse. The central finding is a fragmented capability landscape in which no method is consistently strong across all dimensions, with LlamaParse Agentic attaining the highest overall score of 84.9%. Dataset and evaluation code are released publicly.

Significance. If the benchmark construction and metrics prove reliable, ParseBench supplies a needed resource for assessing document parsers under the semantic-correctness requirements of autonomous agents. The public release of the dataset and code is a clear strength that supports reproducibility and community follow-up.

major comments (2)
  1. [Abstract and benchmark construction] The abstract and benchmark-construction description provide no explicit definitions of the per-dimension metrics, no page-selection criteria, and no inter-annotator agreement statistics for the human verification step. These omissions directly limit independent verification of the reported scores and of the claim that capabilities are fragmented.
  2. [Evaluation results] The headline result that 'no method is consistently strong across all five dimensions' is asserted without accompanying per-dimension score tables or breakdowns in the provided summary; such tables are required to substantiate the fragmentation conclusion and to allow readers to assess whether LlamaParse Agentic's 84.9% overall score masks specific weaknesses.
minor comments (2)
  1. [Introduction] The five dimensions and the chosen document domains are presented as capturing agent-critical semantic needs, yet the manuscript does not discuss potential gaps such as multi-page table continuity or cross-reference resolution; a short paragraph acknowledging these scope limitations would strengthen the paper.
  2. [Dataset and code availability] Ensure that the released evaluation code on GitHub includes the exact scripts used to compute the 84.9% aggregate score so that future comparisons remain reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation for minor revision. We address each of the major comments below and commit to revisions that will improve the manuscript's clarity and verifiability.

point-by-point responses
  1. Referee: [Abstract and benchmark construction] The abstract and benchmark-construction description provide no explicit definitions of the per-dimension metrics, no page-selection criteria, and no inter-annotator agreement statistics for the human verification step. These omissions directly limit independent verification of the reported scores and of the claim that capabilities are fragmented.

    Authors: We acknowledge that the abstract, being a high-level summary, does not contain these specifics. The full paper describes the benchmark construction in detail, but to directly address this comment, we will revise the manuscript to include explicit definitions of the per-dimension metrics early in the benchmark section, specify the page-selection criteria, and report the inter-annotator agreement statistics for the human verification. These additions will be made in a new or expanded subsection to support independent verification. revision: yes

  2. Referee: [Evaluation results] The headline result that 'no method is consistently strong across all five dimensions' is asserted without accompanying per-dimension score tables or breakdowns in the provided summary; such tables are required to substantiate the fragmentation conclusion and to allow readers to assess whether LlamaParse Agentic's 84.9% overall score masks specific weaknesses.

    Authors: The manuscript's results section presents the overall scores and discusses the fragmentation, but we agree that dedicated per-dimension tables are essential for full substantiation. We will add or expand the per-dimension score tables and breakdowns in the evaluation results to clearly show performance across all five dimensions for each method. This will allow readers to verify the claim and evaluate LlamaParse Agentic's specific profile. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct ground-truth evaluation

full rationale

The paper introduces ParseBench as a new empirical benchmark consisting of ~2,000 human-verified enterprise document pages scored across five author-defined capability dimensions. It reports direct evaluation results for 14 methods (including LlamaParse Agentic) against this ground truth, with no equations, derivations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. The central claims (fragmented landscape, highest score of 84.9%) follow immediately from the tabulated scores on the held-out verified pages. The evaluation is grounded directly in external human annotations and contains none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that human verification reliably captures semantic correctness and that the five dimensions plus the chosen document distribution represent the needs of AI agents in enterprise settings.

axioms (1)
  • domain assumption Human verification of parsed output provides reliable ground truth for semantic correctness in tables, charts, formatting, faithfulness, and visual grounding.
    The benchmark is built on ~2000 human-verified pages presented as the reference standard.

pith-pipeline@v0.9.0 · 5518 in / 1250 out tokens · 49906 ms · 2026-05-10T17:15:48.479759+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    Amazon Textract pricing

    Amazon Web Services. Amazon Textract pricing. https://aws.amazon.com/textract/pricing/, 2026. Accessed: 2026-04-01

  2. [2]

    Claude Haiku 4.5 System Card

    Anthropic. Claude Haiku 4.5 System Card. Technical report, October 2025. URL https://www-cdn.anthropic.com/7aad69bf12627d42234e01ee7c36305dc2f6a970.pdf

  3. [3]

    Claude Opus 4.5 System Card

    Anthropic. Claude Opus 4.5 System Card. Technical report, November 2025. URL https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf

  4. [4]

    Claude pricing. https://platform.claude.com/docs/en/about-claude/pricing, 2026

    Anthropic. Claude pricing. https://platform.claude.com/docs/en/about-claude/pricing, 2026. Accessed: 2026-04-01

  5. [5]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Sh...

  6. [6]

    PaddleOCR 3.0 Technical Report

    Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. PaddleOCR 3.0 technical report, 2025. URL https://arxiv.org/abs/2507.05595

  7. [7]

    How credits work – Extend AI

    Extend AI. How credits work – Extend AI. https://docs.extend.ai/2026-02-09/product/general/how-credits-work,

  8. [9]

    OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. OCRBench v2: An improved benchmark for evaluating large multimodal models on v...

  9. [10]

    Gemini API pricing. https://ai.google.dev/gemini-api/docs/pricing, 2026

    Google. Gemini API pricing. https://ai.google.dev/gemini-api/docs/pricing,

  10. [11]

    Accessed: 2026-04-01

  11. [12]

    Google Cloud Document AI pricing. https://cloud.google.com/document-ai/pricing, 2026

    Google Cloud. Google Cloud Document AI pricing. https://cloud.google.com/document-ai/pricing, 2026. Accessed: 2026-04-01

  12. [13]

    Gemini 3 flash model card

    Google DeepMind. Gemini 3 flash model card. Technical report, Google DeepMind, December 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf

  13. [14]

    Granite vision: a lightweight, open-source multimodal model for enterprise intelligence. arXiv preprint arXiv:2502.09927, 2025

    Granite Vision Team, Leonid Karlinsky, Assaf Arbelle, Abraham Daniels, Ahmed Nassar, et al. Granite vision: A lightweight, open-source multimodal model for enterprise intelligence, 2025. URL https://arxiv.org/abs/2502.09927

  14. [15]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626. ACM, 2023. doi: 10.1145/3600006.3613165

  15. [16]

    LandingAI Document Extraction pricing. https://docs.landing.ai/ade/ade-pricing, 2026

    LandingAI. LandingAI Document Extraction pricing. https://docs.landing.ai/ade/ade-pricing, 2026. Accessed: 2026-04-01

  16. [17]

    dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model. arXiv preprint arXiv:2512.02498, 2025

    Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025. URL https://arxiv.org/abs/2512.02498

  17. [18]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.

  18. [19]

    Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, and Peter W. J. Staar. Docling: An efficient open-source toolkit for AI-driven document con...

  19. [20]

    LlamaCloud pricing. https://developers.llamaindex.ai/python/cloud/general/pricing/, 2026

    LlamaIndex. LlamaCloud pricing. https://developers.llamaindex.ai/python/cloud/general/pricing/, 2026. Accessed: 2026-04-01

  20. [21]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279. Association for Computational Linguistics, 2022

  21. [22]

    Azure AI Document Intelligence pricing. https://azure.microsoft.com/en-us/pricing/details/document-intelligence/, 2026

    Microsoft. Azure AI Document Intelligence pricing. https://azure.microsoft.com/en-us/pricing/details/document-intelligence/, 2026. Accessed: 2026-04-01

  22. [23]

    Modal: AI infrastructure that developers love. https://modal.com, 2026

    Modal Labs. Modal: AI infrastructure that developers love. https://modal.com, 2026. Accessed: 2026-04-01

  23. [24]

    OpenAI GPT-5 System Card

    OpenAI. OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267, 2025

  24. [25]

    OpenAI API pricing. https://developers.openai.com/api/docs/pricing, 2026

    OpenAI. OpenAI API pricing. https://developers.openai.com/api/docs/pricing, 2026. Accessed: 2026-04-01

  25. [26]

    OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations

    Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, and Conghui He. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In Proceedings of the IEEE/CVF...

  26. [27]

    DocLayNet: A large human-annotated dataset for document-layout segmentation

    Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. DocLayNet: A large human-annotated dataset for document-layout segmentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3743–3751. ACM, August 2022. doi: 10.1145/3534678.3539043. URL http://dx.doi.org/10.1145/3534678.3539043

  27. [28]

    olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models. arXiv preprint arXiv:2502.18443, 2025

    Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmOCR: Unlocking trillions of tokens in PDFs with vision language models, 2025. URL https://arxiv.org/abs/2502.18443

  28. [29]

    Credit usage – Reducto. https://docs.reducto.ai/reference/credit-usage, 2026

    Reducto. Credit usage – Reducto. https://docs.reducto.ai/reference/credit-usage, 2026. Accessed: 2026-04-01

  29. [30]

    GriTS: Grid table similarity metric for table structure recognition, 2023

    Brandon Smock, Rohith Pesala, and Robin Abraham. GriTS: Grid table similarity metric for table structure recognition, 2023. URL https://arxiv.org/abs/2203.12555

  30. [31]

    MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024

    Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, and Conghui He. Mineru: An open-source solution for precise document content extraction, 2024. URL https://arxiv.org/abs/2409.18839

  31. [32]

    Global table extractor (GTE): A framework for joint table identification and cell structure recognition using visual context

    Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong, and Nancy Xin Ru Wang. Global table extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 697–706, 2021

  32. [33]

    Image-based table recognition: Data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: Data, model, and evaluation. In European Conference on Computer Vision (ECCV), pages 564–580. Springer, 2020

  34. [35]

    A VLM (Gemini 3.0 Flash) transcribes the source document into Markdown

  35. [36]

    A second VLM pass analyzes the Markdown against the original document, generating a quality report with suggested patches for any errors or omissions

  36. [37]

    A human annotator manually reviews the Markdown and approves or rejects patches in a custom-built annotation interface

  37. [38]

    Primary 2015

    If modifications were made in step 3, we return to step 2 and re-generate patches for human review. The loop terminates when the annotator accepts the transcription without further edits. This VLM-in-the-loop approach substantially reduces annotation time compared to fully manual transcription, while the human review step ensures ground-truth quality. A.2...

  38. [39]

    Bbox format: [x1, y1, x2, y2]

  39. [40]

    Layout Categories: The possible categories are [’Caption’, ’Footnote’, ’Formula’, ’List-item’, ’Page-footer’, ’Page-header’, ’Picture’, ’Section-header’, ’Table’, ’Text’, ’Title’]

  40. [41]

    For non-chart pictures, the text field should be omitted

    Text Extraction & Formatting Rules: - Picture: If the picture is a chart or graph, extract all data points and format as an HTML table with flat combined column headers. For non-chart pictures, the text field should be omitted. - Formula: Format its text as LaTeX. - Table: Format its text as HTML. - All Others: Format their text as Markdown

  41. [42]

    All layout elements must be sorted according to human reading order

    Constraints: The output text must be the original text from the image, with no translation. All layout elements must be sorted according to human reading order

  42. [43]

    Final Output: The entire output must be a single JSON object. Qwen 3 VL (8B).As a general-purpose VLM not specifically trained for document parsing or layout detection, Qwen 3 VL performs significantly worse when asked to produce both parsed content and layout annotations in a single prompt. We therefore use two separate pipelines: one for content parsing...

  43. [44]

    Bbox format: [x1, y1, x2, y2] using normalized 0-1000 coordinates

  44. [45]

    Layout Categories: [’Caption’, ’Footnote’, ’Formula’, ’List-item’, ’Page-footer’, ’Page-header’, ’Picture’, ’Section-header’, ’Table’, ’Text’, ’Title’]

  45. [46]

    Text Extraction & Formatting Rules: Picture charts as HTML tables, Formula as LaTeX, Table as HTML, all others as Markdown

  46. [47]

    Reading order

    Constraints: Original text only, no translation. Reading order

  47. [48]

    502–6522

    Final Output: Return ONLY a JSON array. C.3 Infrastructure Details Open-weight VLMs and the open-source Docling pipeline are deployed on Modal’s [21] serverless GPU infrastructure. Table 7 summarizes the hardware and software stack for each self-hosted model. Model GPU Serving Framework Version Notes Qwen 3 VL (8B) NVIDIA H100 vLLM 0.11.2 Full precision (...
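
The VLM-in-the-loop annotation pipeline described in the appendix excerpts above — a VLM transcribes the page to Markdown, a second VLM pass proposes patches, a human accepts or rejects them, and the loop repeats until the annotator accepts the transcription without further edits — can be sketched as a simple fixed-point loop. The callables and the `max_rounds` safety cap are stand-ins for the paper's actual models and annotation interface:

```python
def annotate_page(transcribe, critique, human_review, max_rounds=5):
    """VLM-in-the-loop annotation sketch.

    transcribe()              -> initial Markdown draft (VLM pass 1)
    critique(markdown)        -> list of suggested patches (VLM pass 2)
    human_review(md, patches) -> (possibly edited markdown, edited?)

    The loop terminates when the human makes no further edits, i.e.
    when the transcription is accepted as ground truth.
    """
    markdown = transcribe()
    for _ in range(max_rounds):
        patches = critique(markdown)
        markdown, edited = human_review(markdown, patches)
        if not edited:
            break
    return markdown
```

Because re-critique happens after every human edit, the accepted transcription has always survived one patch-free review round, which is what lets the paper present the pages as human-verified ground truth.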