pith. machine review for the scientific record

arxiv: 2604.16099 · v1 · submitted 2026-04-17 · 💻 cs.CV

Recognition: unknown

DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords table recognition · table visual question answering · dental estimates · vision-language models · dataset · OCR · arithmetic reasoning

The pith

Dental estimate tables reveal that vision-language models recover structure accurately but still fail at multi-step arithmetic and consistency reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DenTab, a dataset of 2,000 real-world table images from dental estimates annotated with HTML structure. It includes 2,208 questions across eleven categories testing retrieval, aggregation, and logic. Benchmarks of 16 systems show that strong table structure recovery does not ensure reliable answers on arithmetic and consistency questions; these failures persist even when models are given ground-truth HTML tables instead of images. The authors also introduce a Table Router Pipeline that routes arithmetic questions to a deterministic executor for better reliability.
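The dispatch step is simple enough to sketch. Below is a minimal Python illustration using the eleven category labels visible in Figure 5; which categories get routed to the executor, and the vlm_answer/execute_program callables, are our assumptions, not the authors' code.

    # Hedged sketch of the Table Router Pipeline's dispatch step.
    # Category labels are from Figure 5; the routing set and the two
    # callables are illustrative assumptions.
    from typing import Callable

    ARITHMETIC = {
        "aggregation_sum", "aggregation_sum_conditional",
        "consistency_diff_total", "count_equals",
    }

    def route(question: str, category: str,
              vlm_answer: Callable[[str], str],
              execute_program: Callable[[str], str]) -> str:
        """Send arithmetic categories to deterministic execution; keep
        retrieval questions (lookup_by_header, kth_row_value, ...) with the VLM."""
        if category in ARITHMETIC:
            return execute_program(question)  # exact computation over the parsed table
        return vlm_answer(question)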

Core claim

DenTab demonstrates that on noisy administrative tables from dental estimates, current vision-language models and OCR systems achieve strong structure recovery yet exhibit persistent errors on multi-step arithmetic and consistency checks, and these errors continue even when provided with perfect ground-truth table representations in HTML.

What carries the argument

The DenTab dataset of cropped table images with HTML annotations and 2,208 questions in eleven categories, used to benchmark table recognition and visual QA.
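To make the annotation format concrete, here is a hypothetical shape for one DenTab record; the field names are illustrative (the released schema may differ), while the category label and question text are taken from the paper's figures.

    # Illustrative record; keys are assumptions, values are drawn from
    # Figure 5 (category labels) and Figure 11 (question and answer text).
    record = {
        "image": "sample1.png",              # cropped table image
        "table_html": "<table>...</table>",  # ground-truth HTML structure
        "category": "lookup_by_header",      # one of the eleven categories
        "question": ("What is the “Réservé à l'organisme complémentaire” value "
                     "for the row where “Code CCAM Ou pour l'orthodontie "
                     "cotation NGAP” = TO 75?"),
        "answer": "N/A",
    }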

If this is right

  • Improving table structure recognition alone is insufficient for reliable TableVQA on real-world noisy tables.
  • Arithmetic and logic reasoning capabilities in VLMs need targeted improvements beyond visual input processing.
  • The Table Router Pipeline shows that combining VLMs with rule-based execution can enhance performance on calculation tasks without additional training.
  • Real-world administrative tables like dental estimates provide a more challenging testbed than synthetic or clean datasets for evaluating table understanding systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar reasoning gaps may exist in other domains involving financial or medical administrative documents.
  • Future datasets could focus more on consistency checks and multi-step logic to better expose model limitations.
  • Integrating symbolic computation modules with VLMs could be a general strategy for improving reliability on quantitative table tasks.
  • Testing the pipeline on other table domains would confirm whether the approach generalizes beyond dental estimates.

Load-bearing premise

That dental estimates are representative of real-world administrative tables and that the eleven question categories cover the key reasoning failures.

What would settle it

Observing whether the same structure-reasoning discrepancy appears when testing the same models on a different collection of noisy administrative tables from another domain.

Figures

Figures reproduced from arXiv: 2604.16099 by Amine Tamasna, Laziz Hamdi, Thierry Paquet.

Figure 1
Figure 1: Samples from DenTab showing diversity in table structures and real acquisition artifacts.
Figure 2
Figure 2: Overview statistics of DenTab: distributions of rows per table and columns per table, image size (width vs. height), and prevalence of spanning (merged) cells.
Figure 3
Figure 3: Table Router Pipeline for arithmetic TableVQA.
Figure 4
Figure 4: S-TEDS under different image degradations (Gaussian blur, downscaling, JPEG compression, Gaussian noise, occlusion, rotation); dashed lines indicate each model's clean baseline. A reconstruction sketch of these degradations follows the figure list.
Figure 5
Figure 5: (a) Runtime breakdown; the remaining panel axes list the eleven question categories (aggregation_sum, aggregation_sum_conditional, comparison_argmax, comparison_argmax_rows, consistency_diff_total, count_equals, kth_row_value, lookup_by_header, lookup_list_by_header, total_row_value, other).
Figure 8
Figure 8: Web UI of our built-in HTML annotator.
Figure 9
Figure 9: Input image with VLM outputs rendered in HTML (sample 1).
Figure 10
Figure 10: Input image with VLM outputs rendered in HTML (sample 2).
Figure 11
Figure 11: Sample 1 (input image with example questions, ground-truth answers, and outputs from Nanonets-OCR-s and Qwen3-VL-32B).
Figure 12
Figure 12: Sample 2.
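The robustness study behind Figure 4 re-evaluates selected TSR systems under controlled acquisition artifacts. The degradation families are recoverable from the figure axes; the Pillow-based implementations below are a plausible reconstruction, not the authors' exact code, with parameter ranges read off the axis labels.

    # Plausible reconstruction of the Figure 4 degradations (assumed, not
    # verified against the authors' implementation). Requires Pillow and NumPy.
    import io
    import numpy as np
    from PIL import Image, ImageFilter

    def gaussian_blur(img: Image.Image, radius: float) -> Image.Image:
        return img.filter(ImageFilter.GaussianBlur(radius))        # e.g. x2, x3, x4

    def downscale(img: Image.Image, percent: int) -> Image.Image:
        w, h = img.size
        small = img.resize((max(1, w * percent // 100), max(1, h * percent // 100)))
        return small.resize((w, h))            # back to original size; detail is lost

    def jpeg_compress(img: Image.Image, quality: int) -> Image.Image:
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)  # e.g. 5, 10, 20
        buf.seek(0)
        return Image.open(buf)

    def gaussian_noise(img: Image.Image, sigma: float) -> Image.Image:
        arr = np.asarray(img).astype(np.float32) / 255.0
        arr = np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0.0, 1.0)
        return Image.fromarray((arr * 255).astype(np.uint8))     # e.g. sigma 0.03, 0.07

    def occlude(img: Image.Image, frac: float) -> Image.Image:
        w, h = img.size
        side = frac ** 0.5                     # white patch covering `frac` of the area
        out = img.copy()                       # placement of the patch is a guess
        out.paste("white", (w // 4, h // 4, w // 4 + int(w * side), h // 4 + int(h * side)))
        return out

    def rotate(img: Image.Image, degrees: float) -> Image.Image:
        return img.rotate(degrees, expand=True, fillcolor="white")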
Original abstract

Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision-language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs. To improve arithmetic reliability without training, we propose the Table Router Pipeline, which routes arithmetic questions to deterministic execution. The pipeline combines (i) a VLM that produces a baseline answer, a structured table representation, and a constrained table program with (ii) a rule-based executor that performs exact computation over the parsed table. The source code and dataset will be made publicly available at https://github.com/hamdilaziz/DenTab.
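As a concrete reading of step (ii), the executor only needs a parsed table and a small program vocabulary to make arithmetic exact. A minimal sketch for one program type, a column sum over ground-truth HTML, follows; the program format, header handling, and number normalization are assumptions, since the abstract does not specify them.

    # Minimal rule-based executor sketch: exact aggregation_sum over a flat
    # HTML table. Spanning (merged) cells, which Figure 2 shows are common,
    # would need extra handling.
    from html.parser import HTMLParser

    class TableParser(HTMLParser):
        """Collect a <table> into a list of rows of cell strings."""
        def __init__(self):
            super().__init__()
            self.rows = []
            self._row, self._cell, self._in_cell = [], [], False

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th"):
                self._in_cell, self._cell = True, []

        def handle_endtag(self, tag):
            if tag in ("td", "th"):
                self._in_cell = False
                self._row.append("".join(self._cell).strip())
            elif tag == "tr" and self._row:
                self.rows.append(self._row)

        def handle_data(self, data):
            if self._in_cell:
                self._cell.append(data)

    def column_sum(table_html: str, header: str) -> float:
        """Exact sum over the column named `header` (first row = headers)."""
        parser = TableParser()
        parser.feed(table_html)
        col = parser.rows[0].index(header)
        total = 0.0
        for row in parser.rows[1:]:
            if col >= len(row):
                continue                      # ragged row, e.g. a totals line
            text = row[col].replace("€", "").replace(",", ".").strip()
            try:
                total += float(text)          # French decimal commas normalized above
            except ValueError:
                pass                          # skip empty or non-numeric cells
        return total

Because the executor is deterministic, two runs over the same parsed table cannot disagree, which is exactly the reliability property the abstract claims for arithmetic questions.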

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DenTab, a dataset of 2,000 cropped real-world dental estimate table images with high-quality HTML annotations for table structure recognition and TableVQA. It benchmarks 16 systems (14 VLMs and 2 OCR baselines) on 2,208 questions across 11 categories covering retrieval, aggregation, and logic/consistency. The central observation is that strong structure recovery does not reliably translate to good performance on multi-step arithmetic and consistency questions, with these failures persisting even when ground-truth HTML table inputs are provided to the models. The authors also propose the Table Router Pipeline, which routes arithmetic questions to a VLM for structured representation plus a rule-based executor for exact computation, and commit to releasing the dataset and code.

Significance. If the empirical observations hold under detailed scrutiny, the work provides a useful new resource for evaluating table understanding under realistic administrative capture conditions and highlights a persistent gap between structure recovery and multi-step reasoning in current VLMs. The public release of data and code, together with the practical Table Router Pipeline that avoids additional training, are clear strengths that could support follow-on research on hybrid VLM-plus-executor systems for table tasks.

major comments (2)
  1. [Experimental evaluation of VQA with ground-truth HTML] The central claim that reasoning failures on arithmetic and consistency questions persist even with ground-truth HTML inputs (stated in the abstract and evaluated in the experimental section) is load-bearing but rests on an unspecified prompting strategy for supplying the HTML to the VLMs. Because the models are vision-language systems, their ability to perform exact arithmetic over serialized HTML depends on prompt wording, serialization format, few-shot examples, and any constraints provided; without these details it is impossible to distinguish intrinsic reasoning deficits from input-processing limitations.
  2. [Table Router Pipeline and results] The paper reports benchmark results and proposes the Table Router Pipeline to improve arithmetic reliability, yet provides no quantitative metrics, error bars, statistical significance tests, or ablation numbers comparing the pipeline against the 14 VLMs and 2 OCR baselines on the arithmetic and consistency question subsets. This absence makes it difficult to assess the magnitude or reliability of the claimed improvement.
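To make the first objection concrete: everything in a ground-truth-HTML prompt is a free choice the paper would need to pin down. A hypothetical builder, with every choice an assumption, not the paper's protocol:

    # Hypothetical prompt builder for the ground-truth-HTML condition.
    # Instruction wording, answer-format constraint, and whether the HTML
    # is raw or simplified are all unspecified in the paper.
    def build_gt_html_prompt(table_html: str, question: str) -> str:
        return (
            "The following HTML encodes a table from a dental estimate.\n\n"
            f"{table_html}\n\n"
            f"Question: {question}\n"
            "Answer with the value only."
        )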
minor comments (2)
  1. [Dataset construction] The eleven question categories are described at a high level; a table or appendix listing example questions per category with their expected reasoning type would improve reproducibility and allow readers to judge coverage of practical failure modes.
  2. [Conclusion] The manuscript states that source code and dataset will be released at a GitHub URL but does not include a data statement or license details in the current version; adding these would strengthen the reproducibility claim.

Circularity Check

0 steps flagged

No circularity: purely empirical dataset introduction and benchmarking

full rationale

The paper introduces a new dataset (DenTab) of real-world dental estimate tables with HTML annotations and evaluates 16 external systems (VLMs and OCR baselines) on table recognition and TableVQA tasks. The central claim—that strong structure recovery does not reliably yield good multi-step arithmetic/consistency performance, even with ground-truth HTML—is supported solely by reported benchmark metrics on held-out questions. The proposed Table Router Pipeline is a practical engineering combination of a VLM plus a deterministic executor, not a derived result. No equations, fitted parameters, self-citations, or ansatzes appear in the provided text; all claims rest on external model performance against the new data. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset-and-benchmark paper with no mathematical derivations, free parameters, or postulated entities. All components rely on standard computer-vision and language-model techniques applied to a new domain.

pith-pipeline@v0.9.0 · 5581 in / 1245 out tokens · 36953 ms · 2026-05-10T08:30:18.940864+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 16 canonical work pages · 7 internal anchors

  1. [1]

    Better patching using LLM prompting, via self-consistency

    Toufique Ahmed and Premkumar Devanbu. Better patching using LLM prompting, via self-consistency. In IEEE/ACM, 2023

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai et al. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai et al. Qwen2.5-VL technical report. arXiv:2502.13923, 2025

  4. [4]

    TabFact: A large-scale dataset for table-based fact verification

    Wenhu Chen et al. TabFact: A large-scale dataset for table-based fact verification. In ICLR, 2020

  5. [5]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen et al. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv:2211.12588, 2022

  6. [6]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024

  7. [7]

    HiTab: A hierarchical table dataset for question answering and natural language generation

    Zhoujun Cheng et al. HiTab: A hierarchical table dataset for question answering and natural language generation. In ACL, pages 1094–1110, 2022

  8. [8]

    Complicated table structure recognition

    Zewen Chi et al. Complicated table structure recognition. arXiv:1908.04729, 2019

  9. [9]

    Dolphin: Document image parsing via heterogeneous anchor prompting

    Hao Feng et al. Dolphin: Document image parsing via heterogeneous anchor prompting. arXiv:2505.14059, 2025

  10. [10]

    ICDAR 2019 competition on table detection and recognition (cTDaR)

    Liangcai Gao et al. ICDAR 2019 competition on table detection and recognition (cTDaR). In ICDAR, pages 1510–1515, 2019

  11. [11]

    PAL: Program-aided language models

    Luyu Gao et al. PAL: Program-aided language models. In ICML, pages 10764–10799, 2023

  12. [12]

    Gemma 3 Technical Report

    Gemma Team et al. Gemma 3 technical report. arXiv:2503.19786, 2025

  13. [13]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In CVPR, pages 14953–14962, 2023

  14. [14]

    OCR-free document understanding transformer

    Geewook Kim et al. OCR-free document understanding transformer. In ECCV, 2022

  15. [15]

    TableVQA-Bench: A visual question answering benchmark on multiple table domains

    Yoonsik Kim, Moonbin Yim, and Ka Yeon Song. TableVQA-Bench: A visual question answering benchmark on multiple table domains. arXiv:2404.19205, 2024

  16. [16]

    Building and better understanding vision-language models: insights and future directions

    Hugo Laurençon et al. Building and better understanding vision-language models: insights and future directions. arXiv:2408.12637, 2024

  17. [17]

    TableBank: Table benchmark for image-based table detection and recognition

    Minghao Li et al. TableBank: Table benchmark for image-based table detection and recognition. In LREC, pages 1918–1925, 2020

  18. [18]

    DePlot: One-shot visual language reasoning by plot-to-table translation

    Fangyu Liu et al. DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of ACL, 2023

  19. [19]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  20. [20]

    DocVQA: A dataset for VQA on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In WACV, pages 2200–2209, 2021

  21. [21]

    Mistral 3: Open-weight language model release

    Mistral AI. Mistral 3: Open-weight language model release, 2025

  22. [22]

    Nanonets-OCR-s: Image-to-markdown OCR model

    Nanonets. Nanonets-OCR-s: Image-to-markdown OCR model, 2025. https://huggingface.co/nanonets/Nanonets-OCR-s

  23. [23]

    Nanonets-OCR2-3B: Document to structured markdown

    Nanonets. Nanonets-OCR2-3B: Document to structured markdown, 2025. https://huggingface.co/nanonets/Nanonets-OCR2-3B

  24. [24]

    TableNet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images

    Shubham Paliwal, Dheeraj Sharma, and Lovekesh Vig. TableNet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. arXiv:2001.01469, 2020

  25. [25]

    Compositional semantic parsing on semi-structured tables

    Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. In ACL, 2015

  26. [26]

    olmOCR: Unlocking trillions of tokens in PDFs with vision language models

    J. Poznanski et al. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv:2502.18443, 2025

  27. [27]

    CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents

    Devashish Prasad et al. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents. arXiv:2004.12629, 2020

  28. [28]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick et al. Toolformer: Language models can teach themselves to use tools. In NeurIPS, volume 36, pages 68539–68551, 2023

  29. [29]

    DeepDeSRT: Deep learning for detection and structure recognition of tables in document images

    Sebastian Schreiber et al. DeepDeSRT: Deep learning for detection and structure recognition of tables in document images. In ICDAR, 2017

  30. [30]

    WikiDT: Visual-based table recognition and question answering dataset

    Hui Shi et al. WikiDT: Visual-based table recognition and question answering dataset. In ICDAR, pages 406–437, 2024

  31. [31]

    GriTS: Grid table similarity metric for table structure recognition

    Brandon Smock, Rohith Pesala, and Robin Abraham. GriTS: Grid table similarity metric for table structure recognition. arXiv:2203.12555, 2022

  32. [32]

    PubTables-1M: Towards comprehensive table extraction from unstructured documents

    Brandon Smock, Rohith Pesala, and Robin Abraham. PubTables-1M: Towards comprehensive table extraction from unstructured documents. In CVPR, 2022

  33. [33]

    ViperGPT: Visual inference via Python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. In ICCV, pages 11888–11898, 2023

  34. [34]

    General OCR theory: Towards OCR-2.0 via a unified end-to-end model

    Haoran Wei et al. General OCR theory: Towards OCR-2.0 via a unified end-to-end model. In CVPR, 2024

  35. [35]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022

  36. [36]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv:2412.10302, 2024

  37. [37]

    Qwen3 Technical Report

    A. Yang et al. Qwen3 technical report. arXiv:2505.09388, 2025

  38. [38]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang et al. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv:2303.11381, 2023

  39. [39]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao et al. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023

  40. [40]

    Multimodal table understanding

    Mingyu Zheng et al. Multimodal table understanding. arXiv:2406.08100, 2024

  41. [41]

    Global Table Extractor (GTE): A framework for joint table identification and cell structure recognition using visual context

    Xinyi Zheng et al. Global Table Extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In WACV, 2021

  42. [42]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. In ECCV, 2020

  43. [43]

    TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance

    Feng Zhu et al. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In ACL, 2021