WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
Pith reviewed 2026-05-09 19:25 UTC · model grok-4.3
The pith
Current multimodal models largely fail to understand tables in real-world images, with only one of 21 exceeding 50 percent accuracy on a new benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning.
What carries the argument
WildTableBench, a dataset of 402 real-world table images paired with 928 questions that require both structural perception of varied layouts and numerical reasoning over the contained data.
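To make the evaluation protocol concrete: a benchmark of this shape reduces to scoring model answers against gold answers over (image, question) pairs. The sketch below assumes a hypothetical JSON layout and `model.answer` interface; the paper's actual harness and scoring rule (exact match versus judge-based grading, for instance) are not specified in the excerpt.

```python
# Minimal sketch of an accuracy evaluation loop for a WildTableBench-style
# benchmark. The JSON layout, field names, and `model.answer` interface are
# illustrative assumptions, not the paper's released harness.
import json

def evaluate(model, benchmark_path: str) -> float:
    """Return exact-match accuracy over (image, question, answer) triples."""
    with open(benchmark_path) as f:
        items = json.load(f)  # assumed: list of {"image_path", "question", "answer"}

    correct = 0
    for item in items:
        prediction = model.answer(item["image_path"], item["question"])
        # Naive normalized exact match; the paper may instead use a judge
        # model or answer-type-specific scoring.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(items)
```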
If this is right
- Evaluations that rely on clean rendered tables overestimate the capabilities of current multimodal models for practical use.
- Persistent gaps in structural perception limit reliability for extracting information from photographed or screenshot tables.
- Numerical reasoning over data embedded in complex visual layouts remains a shared weakness across proprietary and open-source systems.
- The benchmark supplies a repeatable test to track whether future models close the identified failure modes.
- Applications that depend on table understanding in consumer or enterprise documents would currently encounter low accuracy in real conditions.
Where Pith is reading between the lines
- Similar benchmarks focused on other unstructured visual documents such as forms or receipts could expose parallel limitations.
- Models might improve on these images if training data explicitly includes noisy layouts and varied domains rather than only synthetic tables.
- Integrating dedicated table-parsing modules with general vision-language reasoning could be tested as a way to raise scores on this benchmark.
- Extending the question set to include more languages or additional image sources would clarify whether the observed weaknesses generalize.
Load-bearing premise
The 402 collected table images and 928 questions are representative of the visual complexity, layout diversity, and reasoning demands found in naturally occurring real-world tables.
What would settle it
An independent set of several hundred wild table images on which the same 21 models produce accuracy distributions that differ substantially from the reported 4.1 to 50+ percent range would indicate the original collection does not capture typical difficulty.
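The comparison described here can be made precise with a per-model significance test on accuracy proportions between the original and replication sets. A minimal sketch, using placeholder counts rather than the paper's actual numbers:

```python
# Two-proportion z-test for whether a model's accuracy on an independent
# image set differs from its reported accuracy. Counts are placeholders.
from math import erf, sqrt

def two_proportion_p(c1: int, n1: int, c2: int, n2: int) -> float:
    """Two-sided p-value for a difference between two accuracy proportions."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (p1 - p2) / se
    # Normal approximation to the two-sided tail probability.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# e.g. 463/928 correct on WildTableBench vs. 410/900 on a hypothetical new wild set
print(two_proportion_p(463, 928, 410, 900))
```

A substantial shift would mean many models landing outside the original accuracy range, not one or two marginal p-values.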
Original abstract
Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WildTableBench, the first QA benchmark for naturally occurring table images. It consists of 402 high-information-density table images collected from online forums and websites across diverse domains, paired with 928 manually annotated and verified questions spanning 17 subtypes in five categories. The authors evaluate 21 proprietary and open-source multimodal foundation models, finding that only one exceeds 50% accuracy (with the rest ranging from 4.1% to 49.9%), and provide diagnostic analyses highlighting persistent weaknesses in structural perception and numerical reasoning on real-world tables.
Significance. If the dataset construction is shown to be representative, this work addresses an important gap by moving beyond clean rendered or structured-text tables to evaluate multimodal models on in-the-wild images that reflect consumer and enterprise use cases. The scale of the evaluation (21 models) and the diagnostic failure analysis provide concrete insights into current limitations. The direct empirical nature of the benchmark creation, with manual annotation and verification, is a strength that supports its potential as a diagnostic tool.
Major comments (2)
- [Benchmark Construction] Benchmark construction (as described in the abstract and implied methods): the claim that the 402 images and 928 questions are representative of naturally occurring real-world tables with varied layouts and reasoning demands rests on an unverified collection and annotation process. No explicit sampling protocol, quantitative diversity metrics (e.g., layout complexity, domain coverage), inter-annotator agreement scores, or human performance baseline are reported. This is load-bearing for the central interpretation that low model accuracies demonstrate 'persistent weaknesses' rather than benchmark-specific artifacts.
- [Evaluation and Analysis] Evaluation section: the headline result (only one model >50% accuracy) is presented without sufficient controls for question difficulty or ambiguity. Without reporting how questions were generated, verified for correctness, or balanced across the 17 subtypes, it is difficult to assess whether the performance gap reflects model limitations or properties of the annotation process.
Minor comments (1)
- [Abstract] The abstract uses the term 'high-information-density' without providing a definition or quantitative measure (e.g., average cells per table or OCR error rates); a minimal version of such a measure is sketched below.
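For illustration, one such measure (average cells per table) is trivial to compute from structural annotations. The field names below are assumptions, since the paper's annotation schema is not given in the excerpt:

```python
# One concrete operationalization of "high-information-density": mean cell
# count per table. The n_rows/n_cols annotation format is hypothetical.
def average_cells(tables: list[dict]) -> float:
    """Mean number of cells across annotated tables."""
    counts = [t["n_rows"] * t["n_cols"] for t in tables]
    return sum(counts) / len(counts)

print(average_cells([{"n_rows": 20, "n_cols": 8}, {"n_rows": 35, "n_cols": 12}]))
```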
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the work's significance. Below we provide point-by-point responses to the major comments and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Benchmark Construction] the claim that the 402 images and 928 questions are representative of naturally occurring real-world tables with varied layouts and reasoning demands rests on an unverified collection and annotation process. No explicit sampling protocol, quantitative diversity metrics (e.g., layout complexity, domain coverage), inter-annotator agreement scores, or human performance baseline are reported. This is load-bearing for the central interpretation that low model accuracies demonstrate 'persistent weaknesses' rather than benchmark-specific artifacts.
  Authors: We agree that additional transparency on the collection and annotation process would strengthen the paper. The images were selected manually by the authors from online forums and websites to capture high-information-density tables across diverse domains, but without a formal probabilistic sampling protocol or pre-computed quantitative diversity metrics. In the revised manuscript we will expand the methods section with a detailed description of the selection criteria, domain distribution statistics, and layout complexity indicators. We will also report inter-annotator agreement from the multi-round verification process (one way to compute such an agreement score is sketched after these responses). A small-scale human performance baseline will be added to provide context for interpreting model results. Revision: yes.
- Referee: [Evaluation and Analysis] the headline result (only one model >50% accuracy) is presented without sufficient controls for question difficulty or ambiguity. Without reporting how questions were generated, verified for correctness, or balanced across the 17 subtypes, it is difficult to assess whether the performance gap reflects model limitations or properties of the annotation process.
  Authors: We appreciate the concern and will clarify these aspects. Questions were created manually by annotators who inspected each table and wrote items spanning the five categories and 17 subtypes, followed by independent verification for factual correctness and ambiguity reduction. The paper already reports the subtype counts; we will add the annotation guidelines, verification protocol, and explicit subtype distribution table in the revision. These additions should demonstrate that the observed performance gaps arise from model limitations on real-world tables rather than annotation artifacts. Revision: yes.
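As a concrete version of the agreement reporting promised above, Cohen's kappa for two annotators can be computed from parallel label lists. This is a generic sketch, not the paper's protocol (which may involve more than two annotators or multi-round adjudication):

```python
# Minimal Cohen's kappa for two annotators. The label lists are hypothetical.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # independently, given their marginal label frequencies.
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

print(cohens_kappa(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
```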
Circularity Check
No circularity: empirical benchmark construction with no derivations or self-referential reductions.
Full rationale
The paper introduces WildTableBench via direct collection of 402 table images from online sources and manual annotation of 928 questions, followed by straightforward model evaluation on 21 multimodal foundation models. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The central claims rest on empirical results (e.g., accuracy ranges) rather than any derivation chain that reduces to its own definitions or prior self-work by construction. This is a standard non-circular benchmark paper; the representativeness concern raised in the skeptic note pertains to external validity, not internal circularity of reasoning.