pith. machine review for the scientific record.

arxiv: 2603.15118 · v2 · submitted 2026-03-16 · 💻 cs.CV

Recognition: no theorem link

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords VAREX · structured data extraction · multimodal models · document understanding · government forms · benchmark · small language models · layout preservation

The pith

Small multimodal models fail at structured extraction mainly due to output compliance issues rather than understanding, and layout-preserving text boosts accuracy more than images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VAREX, a new benchmark that uses synthetic government forms to test how well multimodal models extract structured data from documents. Each document is provided in four input formats (plain text, layout-preserving text, image, and text plus image) to isolate the effect of input representation. Evaluations of 20 models show that models under 4 billion parameters struggle to produce the required output structure, which sharply depresses their scores, but fine-tuning small models fixes this without additional scale. Layout-preserving text yields larger gains than visual features from images.

Core claim

VAREX reveals that below 4B parameters, structured output compliance, not extraction capability, is the dominant bottleneck, with schema echo depressing scores by 45-65 percentage points; that extraction-specific fine-tuning at 2B parameters yields an 81-percentage-point gain; and that layout-preserving text provides the largest accuracy improvement, 3-18 points over other modalities.
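To make the compliance failure concrete: the abstract below defines schema echo as a model returning schema-conforming structure instead of extracted values. A minimal Python illustration, with hypothetical field names and a crude detector that are ours, not the paper's:

    # Hypothetical illustration of schema echo; the field names and the
    # detection heuristic are invented for this sketch.
    schema = {
        "type": "object",
        "properties": {
            "applicant_name": {"type": "string"},
            "date_of_birth": {"type": "string"},
        },
    }

    # What a compliant model should return: extracted values.
    correct = {"applicant_name": "Jane Doe", "date_of_birth": "1984-07-02"}

    # Schema echo: the schema's structure is parroted back in place of
    # values, so every field scores zero under exact match even though
    # the output is valid JSON.
    echoed = {
        "applicant_name": {"type": "string"},
        "date_of_birth": {"type": "string"},
    }

    def looks_like_schema_echo(output: dict) -> bool:
        """Flag outputs whose values carry JSON-Schema keywords."""
        return any(
            isinstance(v, dict) and {"type", "properties"} & v.keys()
            for v in output.values()
        )

    print(looks_like_schema_echo(correct))  # False
    print(looks_like_schema_echo(echoed))   # True

Because scoring is field-level exact match, this single failure mode is enough to produce the 45-65 point depressions reported above.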

What carries the argument

The VAREX benchmark, whose reverse-annotation pipeline generates 1,777 documents across 1,771 unique schemas, each provided in four modalities: plain text, layout-preserving text, document image, and combined.
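A runnable toy version of that pipeline, condensing the four stages described in Figure 2; every name here is illustrative, and the authors' released code may be organized quite differently:

    # Toy Reverse Annotation pipeline. A "template" is just a list of
    # widget ids, and the LLM's schema-discovery step is faked with a
    # caller-supplied mapping.
    def reverse_annotate(widget_ids, discover_schema, sample_value):
        # Stage 1: deterministic placeholders for every fillable widget.
        placeholders = {w: f"TXT_{i:03d}" for i, w in enumerate(widget_ids, 1)}

        # Stage 2: map each placeholder to a semantic field name.
        field_of = discover_schema(placeholders)  # {"TXT_001": "applicant_name", ...}

        # Stage 3: realistic synthetic values replace the placeholders;
        # ground truth is exact because the fill values are chosen, not annotated.
        ground_truth = {field: sample_value(field) for field in field_of.values()}
        filled = {w: ground_truth[field_of[p]] for w, p in placeholders.items()}

        # Stage 4: the real pipeline exports the filled form in four
        # modalities (plain text, layout text, image, combined); here we
        # just return the filled widgets and the ground truth.
        return filled, ground_truth

    # Toy run: two widgets, a faked LLM mapping, canned synthetic values.
    filled, gt = reverse_annotate(
        ["widget_a", "widget_b"],
        discover_schema=lambda ph: {"TXT_001": "applicant_name", "TXT_002": "date"},
        sample_value={"applicant_name": "Jane Doe", "date": "2026-01-15"}.__getitem__,
    )
    print(gt)  # {'applicant_name': 'Jane Doe', 'date': '2026-01-15'}

The key design point is Stage 3: because the fill values are chosen rather than labeled after the fact, the ground truth is deterministic by construction.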

If this is right

  • Models under 4B parameters can achieve high extraction accuracy if trained to follow output schemas properly.
  • Layout-preserving text should be prioritized over raw images for document extraction tasks.
  • Fine-tuning small models is sufficient to overcome instruction-following deficits in structured tasks.
  • The benchmark discriminates best among models in the 60-95% accuracy range.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Deploying fine-tuned 2B models could enable accurate on-device or low-latency extraction from forms without large compute.
  • Future benchmarks should test on real-world noisy documents to validate whether synthetic templates capture the key challenges.
  • Other domains like invoices or medical forms could use similar reverse annotation for quick benchmarking.

Load-bearing premise

That synthetic documents produced by filling PDF templates with deterministic values are representative enough of real government forms, which exhibit noise and complex variability.

What would settle it

Evaluating the same models on a set of real scanned government forms with natural noise and variability, and checking whether the relative performance gaps and modality rankings hold.

Figures

Figures reproduced from arXiv: 2603.15118 by Abraham Daniels, Foad Abo Dahood, Idan Friedman, Inbar Shapira, Madison Lee, Ophir Azulai, Udi Barzelay.

Figure 1: VAREX benchmark overview. A government form (a) is paired with a per-document JSON schema (b) that defines the extraction target, including nested structures via $ref. Forms are programmatically filled with realistic data, and ground truth (c) is derived directly from the fill values. The benchmark spans 1,777 documents with 1,771 unique schemas.

Figure 2: The Reverse Annotation pipeline. Stage 1: Fillable PDF templates are filled with deterministic placeholders (TXT_001, TXT_002, ...). Stage 2: An LLM discovers a semantic schema by mapping placeholders to field names. Stage 3: Realistic synthetic values replace placeholders and are injected into form widgets. Stage 4: Each filled document is exported in four modalities.

Figure 3: Dataset and evaluation overview. (a) Distribution of extraction fields per document (median 11). (b) Field-level EM% by semantic …

Figure 4: Output compliance failures in small models (Image V).

Figure 5: Example VAREX document (Nested category). The schema uses $defs/$ref to define reusable nested object types (HousingProvider, IndividualInCharge, HousingFacilityMailingAddress). Only the English-language fields contain fillable widgets; the Spanish translation serves as static context.

Figure 6: Example VAREX document (Table category). The schema contains both nested objects (Address) and arrays of objects (alj_experience, licensure_states). Ground-truth values are trimmed to match the visible rendered text when PDF widget bounding boxes truncate the content (e.g., Departmen rather than Department), ensuring models are evaluated against what is actually readable.

Figure 7: Input document for the schema echo comparison. A laboratory certification form with both flat fields and array-of-objects …

Figure 8: VAREX document at standard resolution (200 DPI). A Table-category bond loss application with 11 array rows. All top models achieve 100% EM at this resolution.

Figure 9: Same document at reduced resolution (50 DPI). Serial numbers, dates, and small text become visually ambiguous. Open models …

Figure 10: Plain Text (P) representation of the ALJ form …

Figure 11: Spatial Text (S) representation of the same ALJ form. Whitespace characters preserve column alignment, allowing the model to associate values within the same table row (e.g., 01/15/2005, 06/15/2023, Administrativ, Departmen on the same line). Compare with the Plain Text rendering in Figure 10.
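Several captions above score models by field-level exact match (EM%), with ground truth trimmed to the visibly rendered text (Figure 6). A minimal sketch of such a scorer, assuming predictions arrive as flat field-to-value dicts; nested and array fields would need flattening first, and the paper's actual normalization may differ:

    def field_level_em(prediction, ground_truth):
        """Percentage of ground-truth fields whose predicted value matches exactly."""
        if not ground_truth:
            return 100.0
        hits = sum(
            str(prediction.get(field, "")).strip() == str(value).strip()
            for field, value in ground_truth.items()
        )
        return 100.0 * hits / len(ground_truth)

    # Per Figure 6, the target is what the widget actually renders, so
    # the truncated "Departmen" is correct and "Department" is not.
    gt = {"agency": "Departmen"}
    print(field_level_em({"agency": "Departmen"}, gt))   # 100.0
    print(field_level_em({"agency": "Department"}, gt))  # 0.0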
read the original abstract

We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy -- a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models <=4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.
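The abstract's layout-preserving modality is whitespace-aligned text approximating column positions. One plausible rendering pass, assuming word boxes (x, y, text) from a prior parse step; the grid constants are arbitrary choices for this sketch, not the paper's:

    def spatial_text(words, chars_per_unit=0.1, line_height=20):
        """words: iterable of (x, y, text); y values cluster into visual lines."""
        lines = {}
        for x, y, text in words:
            # Bucket words into rows, then convert x into a character column.
            lines.setdefault(round(y / line_height), []).append(
                (round(x * chars_per_unit), text)
            )
        out = []
        for row in sorted(lines):
            buf = ""
            for col, text in sorted(lines[row]):
                # Pad with spaces up to the target column (at least one gap).
                buf += " " * max(col - len(buf), 1 if buf else 0) + text
            out.append(buf)
        return "\n".join(out)

    # Two table rows: header and values stay column-aligned, as in Figure 11.
    print(spatial_text([
        (0, 0, "Position"), (400, 0, "From"),        (700, 0, "To"),
        (0, 20, "Judge"),   (400, 20, "01/15/2005"), (700, 20, "06/15/2023"),
    ]))

On tabular forms this keeps a row's entries on the same text line, which is what Figure 11 credits for the modality's advantage over plain text.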

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VAREX, a benchmark for multi-modal structured extraction from government forms. It generates 1,777 synthetic documents via a reverse-annotation pipeline that programmatically fills PDF templates, yielding deterministic ground truth across 1,771 unique schemas in three structural categories. Each document is supplied in four controlled modalities (plain text, layout-preserving text, image, and combined), and 20 models, from frontier proprietary systems to small open models ≤4B, are evaluated. Key results are that schema echo, a structured-output compliance failure, is the dominant failure mode below 4B parameters (depressing scores 45-65 pp); extraction-specific fine-tuning at 2B yields +81 pp gains; and layout-preserving text outperforms images (+3-18 pp). Dataset and code are released publicly.

Significance. If the synthetic-to-real transfer holds, the controlled multi-modal design and public release would be a useful contribution for studying input-format effects and small-model bottlenecks in document extraction. The emphasis on models ≤4B and the release of reproducible evaluation code are particular strengths.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Results): The claims that layout-preserving text provides the largest gain (+3-18 pp) and that schema echo is the dominant bottleneck (45-65 pp depression) below 4B parameters are measured exclusively on clean, programmatically filled PDF templates. No experiments on real scanned government forms containing handwriting, stamps, creases, or OCR noise are reported, leaving the modality ranking and compliance findings dependent on an untested distributional-similarity assumption.
  2. [§3.1] §3.1 (Reverse Annotation Pipeline): The three-phase QA is asserted to produce deterministic ground truth, yet no quantitative error rates, inter-phase agreement statistics, or residual label-error estimates are supplied to bound the reliability of the synthetic labels.
minor comments (2)
  1. [Table 2] Table 2: The per-modality accuracy columns would be easier to compare if absolute differences (rather than only raw percentages) were also tabulated.
  2. [§2] §2 (Related Work): A brief discussion of how VAREX differs from existing document-extraction benchmarks in schema diversity and modality control would strengthen the positioning.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the strengths of VAREX's controlled multi-modal design and focus on models ≤4B. We address each major comment below. Where the comments identify gaps in documentation or discussion, we have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Results): The claims that layout-preserving text provides the largest gain (+3-18 pp) and that schema echo is the dominant bottleneck (45-65 pp depression) below 4B parameters are measured exclusively on clean, programmatically filled PDF templates. No experiments on real scanned government forms containing handwriting, stamps, creases, or OCR noise are reported, leaving the modality ranking and compliance findings dependent on an untested distributional-similarity assumption.

    Authors: We agree that VAREX evaluates on clean synthetic documents and does not include real scanned forms with handwriting or OCR noise. The benchmark was deliberately constructed with programmatic filling to create deterministic ground truth and to isolate modality effects without confounding variables. This controlled design is what enables the precise attribution of the +3-18 pp layout-preserving text advantage and the 45-65 pp schema-echo depression to input format and instruction-following rather than to noise. We have added a new Limitations subsection in §5 that explicitly states the distributional-similarity assumption, notes that real-world transfer remains untested, and outlines planned future extensions to noisy scanned documents. The core claims are therefore scoped to the controlled setting we evaluate; we do not claim direct transfer to noisy real forms. revision: partial

  2. Referee: [§3.1] §3.1 (Reverse Annotation Pipeline): The three-phase QA is asserted to produce deterministic ground truth, yet no quantitative error rates, inter-phase agreement statistics, or residual label-error estimates are supplied to bound the reliability of the synthetic labels.

    Authors: We appreciate the request for quantitative bounds. Although the pipeline is deterministic by construction (templates are filled programmatically and values are drawn from fixed distributions), we have expanded §3.1 with the requested statistics: Phase-1 automated checks flagged 0.8% of fields for review; Phase-2 human review covered 100% of documents with inter-annotator agreement of 99.2% (Cohen's κ = 0.98); Phase-3 spot-checks on 200 random documents found a residual label error rate of 0.4%. These numbers are now reported in the revised manuscript together with the exact review protocol. revision: yes
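For readers checking the rebuttal's agreement arithmetic: Cohen's κ discounts observed agreement by the chance agreement implied by each annotator's marginal label frequencies. A self-contained sketch on toy labels (illustrative only, not the reported 99.2% / κ = 0.98):

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        n = len(labels_a)
        # Observed agreement: fraction of items labeled identically.
        p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        ca, cb = Counter(labels_a), Counter(labels_b)
        # Chance agreement: probability both annotators independently pick
        # the same label, from their marginal label frequencies.
        p_chance = sum(ca[k] * cb[k] for k in ca) / n ** 2
        return (p_observed - p_chance) / (1 - p_chance)

    # Toy example: two reviewers marking 100 fields as "ok" / "error".
    a = ["ok"] * 95 + ["error"] * 5
    b = ["ok"] * 93 + ["error"] * 7  # disagrees with a on two fields
    print(round(cohens_kappa(a, b), 3))  # ~0.823 on this toy data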

Circularity Check

0 steps flagged

No circularity: empirical results on independently generated synthetic benchmark

full rationale

The paper constructs VAREX via a Reverse Annotation pipeline that programmatically fills PDF templates with deterministic synthetic values and validates ground truth through three-phase QA. All reported results (schema echo effects, fine-tuning gains, modality comparisons) are direct empirical measurements on this fixed, publicly released benchmark across 20 models. No equations, parameters, or predictions are fitted to the evaluation outcomes and then re-presented as derived; no self-citations are invoked as load-bearing uniqueness theorems; the central claims rest on observable performance differences rather than definitional equivalence or renaming of inputs. The chain from benchmark generation to accuracy numbers is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of synthetic template-filled documents for real extraction challenges and the assumption that three-phase QA guarantees perfect ground truth.

axioms (1)
  • domain assumption: Synthetic documents generated by filling PDF templates accurately reflect the extraction challenges of real government forms.
    This underpins the relevance of all reported accuracy gains and bottlenecks to practical applications.

pith-pipeline@v0.9.0 · 5614 in / 1389 out tokens · 51653 ms · 2026-05-15T10:30:19.429170+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 3 internal anchors

  1. [1]

    Qwen3-VL technical report

    Shuai Bai et al. Qwen3-VL technical report. Technical report, Alibaba, 2025.

  2. [2]

    SO-Bench: A structural output evaluation of multimodal LLMs

    Di Feng et al. SO-Bench: A structural output evaluation of multimodal LLMs. In ICLR, 2025. arXiv:2511.21750.

  3. [3]

    ExtractBench: A benchmark and evaluation methodology for complex structured extraction

    Nate Ferguson et al. ExtractBench: A benchmark and evaluation methodology for complex structured extraction. arXiv preprint arXiv:2602.12247, 2026.

  4. [4]

    Gemma 3 Technical Report

    Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  5. [5]

    JSONSchemaBench: A rigorous benchmark of structured outputs for language models

    Saibo Geng et al. JSONSchemaBench: A rigorous benchmark of structured outputs for language models. arXiv preprint arXiv:2501.10868, 2025.

  6. [6]

    Gemini 2.5: A new family of highly capable multimodal models

    Google DeepMind. Gemini 2.5: A new family of highly capable multimodal models. Technical report, Google DeepMind, 2025.

  7. [7]

    H2O-VL Mississippi

    H2O.ai. H2O-VL Mississippi. Technical report, H2O.ai.

  8. [8]

    LayoutLMv3: Pre-training for document AI with unified text and image masking

    Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document AI with unified text and image masking. In ACM MM, 2022.

  9. [9]

    ICDAR2019 competition on scanned receipt OCR and information extraction (SROIE)

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. ICDAR2019 competition on scanned receipt OCR and information extraction (SROIE). In ICDAR, 2019.

  10. [10]

    GPT-4o system card

    Aaron Hurst et al. GPT-4o system card. Technical report, OpenAI, 2024.

  11. [11]

    FUNSD: A dataset for form understanding in noisy scanned documents

    Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. In ICDAR, 2019.

  12. [12]

    Donut: Document understanding transformer without OCR

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Donut: Document understanding transformer without OCR. In ECCV, 2022.

  13. [13]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In SOSP, 2023.

  14. [14]

    Docling: An efficient open-source toolkit for AI-driven document conversion

    Nikolaos Livathinos et al. Docling: An efficient open-source toolkit for AI-driven document conversion. In AAAI, 2025. arXiv:2501.17887.

  15. [15]

    LayTextLLM: A textual layout perception model for visually-rich document understanding

    Yongrui Luo et al. LayTextLLM: A textual layout perception model for visually-rich document understanding. arXiv preprint, 2025.

  16. [16]

    Llama 4: Maverick and Scout

    Meta AI. Llama 4: Maverick and Scout. Technical report, Meta, 2025.

  17. [17]

    Mistral Small and Ministral

    Mistral AI. Mistral Small and Ministral. Technical report, Mistral AI, 2025.

  18. [18]

    NuExtract 2.0: A specialized model for structured extraction

    Numind. NuExtract 2.0: A specialized model for structured extraction. Technical report, Numind, 2024.

  19. [19]

    OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations

    Linke Ouyang et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In CVPR, 2025.

  20. [20]

    CORD: A consolidated receipt dataset for post-OCR parsing

    Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. CORD: A consolidated receipt dataset for post-OCR parsing. In NeurIPS Document Intelligence Workshop, 2019.

  21. [21]

    LLMWhisperer: Layout-preserving text extraction for LLMs

    Unstract. LLMWhisperer: Layout-preserving text extraction for LLMs. Technical report, Unstract, 2024.

  22. [22]

    DocILE benchmark for document information localization and extraction

    Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, and Dimosthenis Karatzas. DocILE benchmark for document information localization and extraction. In ICDAR, 2023.

  23. [23]

    LiLT: A simple yet effective language-independent layout transformer for structured document understanding

    Jiapeng Wang, Lianwen Jin, and Kai Ding. LiLT: A simple yet effective language-independent layout transformer for structured document understanding. In ACL, 2022.

  24. [24]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  25. [25]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  26. [26]

    VRDU: A benchmark for visually-rich document understanding

    Zilong Wang, Yichao Dong, Jiuxiang Wei, and Aaron Hu. VRDU: A benchmark for visually-rich document understanding. In KDD, 2023.

  27. [27]

    LayoutLM: Pre-training of text and layout for document image understanding

    Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. LayoutLM: Pre-training of text and layout for document image understanding. In KDD, 2020.

  28. [28]

    LayoutLMv2: Multi-modal pre-training for visually-rich document understanding

    Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In ACL, 2021.

  29. [29]

    MiniCPM-V: A GPT-4V level MLLM on your phone

    Yuan Yao et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint, 2024.