pith. machine review for the scientific record.

arxiv: 2605.01018 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal foundation models · table understanding · benchmark · real-world images · question answering · structural perception · numerical reasoning · model evaluation

The pith

Current multimodal models largely fail to understand tables in real-world images, with only one of 21 exceeding 50 percent accuracy on a new benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WildTableBench to test multimodal foundation models on question answering over naturally occurring table images rather than clean or synthetic ones. It assembles 402 high-density table images from online forums and websites along with 928 verified questions that span layout interpretation and numerical reasoning. In the evaluation, all but one of the 21 models score between 4.1 and 49.9 percent, revealing repeated failures in perceiving table structure and drawing conclusions from the data. The results matter because many real consumer and enterprise tasks require reliable extraction from photos or screenshots of tables. The benchmark therefore supplies a concrete diagnostic for where current capabilities fall short in uncontrolled visual settings.
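As a rough illustration of the evaluation setup described above, the sketch below scores a model's exact-match accuracy over a set of image-question-answer items. This is a minimal, generic sketch: the `Item`, `normalize`, and `accuracy` names are placeholders, and the paper's actual prompting and answer-judging protocol may differ (for example, it may use an LLM judge rather than string matching).

```python
# Hypothetical sketch of benchmark-style accuracy scoring; WildTableBench's
# actual prompts, answer normalization, and judging protocol are not given here.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    image_path: str   # wild table image (photo or screenshot)
    question: str     # e.g. "Which row has the largest Q3 revenue?"
    answer: str       # manually verified gold answer

def normalize(text: str) -> str:
    # Loose string normalization; the paper may instead rely on a model judge.
    return " ".join(text.lower().strip().split())

def accuracy(model: Callable[[str, str], str], items: list[Item]) -> float:
    # Fraction of items whose normalized prediction matches the gold answer.
    correct = sum(
        normalize(model(it.image_path, it.question)) == normalize(it.answer)
        for it in items
    )
    return correct / len(items)

# Usage: accuracy(my_vlm_fn, items) returns a value in [0, 1]; the paper reports
# 4.1%-49.9% for 20 of the 21 evaluated models under its own protocol.
```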

Core claim

We introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning.

What carries the argument

WildTableBench, a dataset of 402 real-world table images paired with 928 questions that require both structural perception of varied layouts and numerical reasoning over the contained data.

If this is right

  • Evaluations that rely on clean rendered tables overestimate the capabilities of current multimodal models for practical use.
  • Persistent gaps in structural perception limit reliability for extracting information from photographed or screenshot tables.
  • Numerical reasoning over data embedded in complex visual layouts remains a shared weakness across proprietary and open-source systems.
  • The benchmark supplies a repeatable test to track whether future models close the identified failure modes.
  • Applications that depend on table understanding in consumer or enterprise documents would currently encounter low accuracy in real conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks focused on other unstructured visual documents such as forms or receipts could expose parallel limitations.
  • Models might improve on these images if training data explicitly includes noisy layouts and varied domains rather than only synthetic tables.
  • Integrating dedicated table-parsing modules with general vision-language reasoning could be tested as a way to raise scores on this benchmark; a minimal pipeline sketch follows this list.
  • Extending the question set to include more languages or additional image sources would clarify whether the observed weaknesses generalize.
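The sketch below shows one way such a parse-then-reason pipeline could be wired up. It is an assumption-laden illustration, not anything from the paper: `parse_table_to_markdown` and `ask_llm` are hypothetical callables standing in for whichever table-structure model and language model one chooses.

```python
# Hypothetical two-stage pipeline: a dedicated table parser converts the wild
# image into structured text, then a general language model reasons over it.
# Both callables are placeholders, not components described in the paper.
from typing import Callable

def answer_with_parser(
    image_path: str,
    question: str,
    parse_table_to_markdown: Callable[[str], str],  # e.g. an OCR/table-structure model
    ask_llm: Callable[[str], str],                  # any text or multimodal LLM call
) -> str:
    table_md = parse_table_to_markdown(image_path)  # recover structure first ...
    prompt = (
        "You are given a table extracted from an image.\n\n"
        f"{table_md}\n\n"
        f"Question: {question}\nAnswer concisely."
    )
    return ask_llm(prompt)                          # ... then reason over text
```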

Load-bearing premise

The 402 collected table images and 928 questions are representative of the visual complexity, layout diversity, and reasoning demands found in naturally occurring real-world tables.

What would settle it

An independent set of several hundred wild table images on which the same 21 models produce accuracy distributions that differ substantially from the reported range (4.1 to 49.9 percent, with a single model above 50 percent) would indicate that the original collection does not capture typical difficulty.
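One way to operationalize "differ substantially" is a per-model comparison of accuracy on the original 928 questions against accuracy on the independent set. The sketch below uses a two-proportion z-test with a normal approximation; all counts are illustrative and the function names are placeholders, not part of the paper.

```python
# Hypothetical check of whether a model's accuracy on an independent wild-table
# set differs from its accuracy on the original 928 questions. Normal-
# approximation two-proportion z-test; sample sizes and counts are illustrative.
from math import sqrt, erf

def norm_cdf(x: float) -> float:
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int):
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - norm_cdf(abs(z)))  # two-sided
    return z, p_value

# Illustrative numbers only: ~43% on the original 928 questions vs. 35% on a
# hypothetical independent set of 600 questions.
z, p = two_proportion_z(correct_a=399, n_a=928, correct_b=210, n_b=600)
print(f"z = {z:.2f}, p = {p:.3f}")
```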

Figures

Figures reproduced from arXiv: 2605.01018 by Hehe Fan, Junzhe Huang, Ruotian Zhang, Serena Yeung-Levy, Sirui Li, Xiaoxiao Sun, Xin Yu, Yan Yang, Yuxuan Hou.

Figure 1: WildTableBench overview. (a) A benchmark example requiring multi-hop reasoning over a real-world train schedule. All three frontier models answer incorrectly. (b) Overall accuracy of 14 representative models; most fall below 50%. (c) Category-level comparison of the top-4 models from different providers. Performance varies markedly across question types, with Color-related questions proving most challengi…
Figure 2: Overview of WildTableBench. (a) Data construction pipeline. (b) Domain distribution: WildTableBench covers diverse real-world scenarios, incorporating both high-fidelity digital screenshots and natural photographs across various professional and daily domains.
Figure 3: Cell-retrieval accuracy across a 10×10 (row × column) grid, based on 2,489 needles from 50 real-world spreadsheet images. Each subplot is normalised by its own min–max range to highlight within-model positional sensitivity, and the overall accuracy of each model is shown in the subplot title. Results for eight additional models are provided in Appendix E.
Figure 4: Reasoning budget. (a) Accuracy vs. average reasoning tokens; (b) accuracy vs. per-query cost ($) for Gemini-3-Pro (low/high), Gemini-3-Flash (minimal/low/medium/high), and Kimi-K2.5 (disabled/enabled).
Figure 5: Error analysis. (a) Error-type breakdown (locating, recognition, reasoning, comprehension) across representative models. Each horizontal stacked bar reports absolute error counts. Perception-related errors (Locating + Recognition) dominate across most model families. (b) Examples of questions that all models answer incorrectly.
Figure 13: Cell-retrieval accuracy heatmaps for eight additional evaluated models.
Figure 6: Overview of keyword-based image collection. Keyword schema used for candidate retrieval. Each entry denotes a representative table type and shows example spreadsheet-style and scenario-grounded queries used to retrieve candidate table images from the open web. Together, these query families expand coverage across diverse real-world table scenarios.
Figure 7: Business & Management.
Figure 8: Finance & Accounting.
Figure 9: Sports & Health.
Figure 10: Education & Science.
Figure 11: Transportation.
Figure 12: Society & Media.
Original abstract

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces WildTableBench, the first QA benchmark for naturally occurring table images. It consists of 402 high-information-density table images collected from online forums and websites across diverse domains, paired with 928 manually annotated and verified questions spanning 17 subtypes in five categories. The authors evaluate 21 proprietary and open-source multimodal foundation models, finding that only one exceeds 50% accuracy (with the rest ranging from 4.1% to 49.9%), and provide diagnostic analyses highlighting persistent weaknesses in structural perception and numerical reasoning on real-world tables.

Significance. If the dataset construction is shown to be representative, this work addresses an important gap by moving beyond clean rendered or structured-text tables to evaluate multimodal models on in-the-wild images that reflect consumer and enterprise use cases. The scale of the evaluation (21 models) and the diagnostic failure analysis provide concrete insights into current limitations. The direct empirical nature of the benchmark creation, with manual annotation and verification, is a strength that supports its potential as a diagnostic tool.

major comments (2)
  1. [Benchmark Construction] Benchmark construction (as described in the abstract and implied methods): the claim that the 402 images and 928 questions are representative of naturally occurring real-world tables with varied layouts and reasoning demands rests on an unverified collection and annotation process. No explicit sampling protocol, quantitative diversity metrics (e.g., layout complexity, domain coverage), inter-annotator agreement scores, or human performance baseline are reported. This is load-bearing for the central interpretation that low model accuracies demonstrate 'persistent weaknesses' rather than benchmark-specific artifacts.
  2. [Evaluation and Analysis] Evaluation section: the headline result (only one model >50% accuracy) is presented without sufficient controls for question difficulty or ambiguity. Without reporting how questions were generated, verified for correctness, or balanced across the 17 subtypes, it is difficult to assess whether the performance gap reflects model limitations or properties of the annotation process.
minor comments (1)
  1. [Abstract] The abstract uses the term 'high-information-density' without providing a definition or quantitative measure (e.g., average cells per table or OCR error rates).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's significance. Below we provide point-by-point responses to the major comments and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Benchmark Construction] the claim that the 402 images and 928 questions are representative of naturally occurring real-world tables with varied layouts and reasoning demands rests on an unverified collection and annotation process. No explicit sampling protocol, quantitative diversity metrics (e.g., layout complexity, domain coverage), inter-annotator agreement scores, or human performance baseline are reported. This is load-bearing for the central interpretation that low model accuracies demonstrate 'persistent weaknesses' rather than benchmark-specific artifacts.

    Authors: We agree that additional transparency on the collection and annotation process would strengthen the paper. The images were selected manually by the authors from online forums and websites to capture high-information-density tables across diverse domains, but without a formal probabilistic sampling protocol or pre-computed quantitative diversity metrics. In the revised manuscript we will expand the methods section with a detailed description of the selection criteria, domain distribution statistics, and layout complexity indicators. We will also report inter-annotator agreement from the multi-round verification process (a minimal illustration of such an agreement computation follows these responses). A small-scale human performance baseline will be added to provide context for interpreting model results. Revision: yes.

  2. Referee: [Evaluation and Analysis] the headline result (only one model >50% accuracy) is presented without sufficient controls for question difficulty or ambiguity. Without reporting how questions were generated, verified for correctness, or balanced across the 17 subtypes, it is difficult to assess whether the performance gap reflects model limitations or properties of the annotation process.

    Authors: We appreciate the concern and will clarify these aspects. Questions were created manually by annotators who inspected each table and wrote items spanning the five categories and 17 subtypes, followed by independent verification for factual correctness and ambiguity reduction. The paper already reports the subtype counts; we will add the annotation guidelines, verification protocol, and explicit subtype distribution table in the revision. These additions should demonstrate that the observed performance gaps arise from model limitations on real-world tables rather than annotation artifacts. Revision: yes.
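As a rough illustration of the agreement statistic promised in the first response, the sketch below computes Cohen's kappa over two annotators' accept/reject verification labels. The labels and counts are invented for illustration; the paper does not specify its agreement measure or annotation interface.

```python
# Hypothetical inter-annotator agreement sketch: Cohen's kappa over two
# annotators' accept/reject labels for candidate questions. Toy data only.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    # Chance agreement under independent annotators with these marginals.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Toy example with accept/reject verification labels.
a = ["accept", "accept", "reject", "accept", "reject", "accept"]
b = ["accept", "reject", "reject", "accept", "reject", "accept"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```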

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no derivations or self-referential reductions

Full rationale

The paper introduces WildTableBench via direct collection of 402 table images from online sources and manual annotation of 928 questions, followed by straightforward model evaluation on 21 multimodal foundation models. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The central claims rest on empirical results (e.g., accuracy ranges) rather than any derivation chain that reduces to its own definitions or prior self-work by construction. This is a standard non-circular benchmark paper; the representativeness concern raised in the skeptic note pertains to external validity, not internal circularity of reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5527 in / 1138 out tokens · 46823 ms · 2026-05-09T19:25:18.791564+00:00 · methodology

