pith. sign in

arxiv: 2606.09788 · v1 · pith:3A5LSQSBnew · submitted 2026-06-08 · 💻 cs.CV

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

Pith reviewed 2026-06-27 17:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords table extractiondocument processingimage-to-graphlightweight modelPubTables-v2bounding boxespage-level extraction
0
0 comments X

The pith

A 29M-parameter image-to-graph model extracts tables from document pages with higher accuracy than frontier systems while running over 130 times faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents POTATR as a lightweight model designed for page-level table extraction that handles context across an entire document page. It extends the Table Transformer into an image-to-graph converter with 29 million parameters, avoiding the need for billions of parameters or hundreds of autoregressive steps. On the PubTables-v2 Single Pages benchmark it records a GriTS_Con score of 0.964, exceeding the results of all tested models including multimodal large language models. The model produces spatially grounded outputs with bounding boxes for every element, supporting direct visual verification and integration with external tools. This design enables efficient composition with OCR for scanned inputs and cross-page techniques for full documents.

Core claim

POTATR is a 29 million parameter image-to-graph model that extends the Table Transformer for contextualized page-level table extraction, achieving a GriTS_Con of 0.964 on the PubTables-v2 Single Pages benchmark while running over 130 times faster and at roughly 300 times lower cost than the compared models.

What carries the argument

The image-to-graph architecture that maps a page image to a graph of table elements, each with an associated bounding box, extending TATR to incorporate page-level context.

If this is right

  • Enables unified page-level table extraction with direct visual verification through bounding boxes.
  • Supports composition with external OCR systems to handle scanned documents.
  • Extends to full-document extraction by applying cross-page merging techniques to its outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The speed and cost profile could allow table extraction to run on-device rather than through cloud APIs.
  • Spatially grounded graph outputs may simplify downstream tasks such as layout analysis or information retrieval.
  • The same lightweight image-to-graph pattern might apply to other structured elements like forms or diagrams.

Load-bearing premise

The PubTables-v2 Single Pages benchmark together with the chosen comparison models provide a fair and representative test of contextual page-level table extraction under consistent conditions.

What would settle it

A controlled re-evaluation on the same benchmark where POTATR fails to exceed the accuracy of the tested frontier models or loses its reported speed and cost advantage.

Figures

Figures reproduced from arXiv: 2606.09788 by Amrit Ramesh, Brandon Smock, Libin Liang, Maury Courtland, Max Sokolov, Tayyibah Khanam, Valerie Faucon-Morin.

Figure 1
Figure 1. Figure 1: Accuracy vs. efficiency on PubTables-v2 Single Pages. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: POTATR architecture. POTATR extends TATR with 125 additional object queries, 10 new page-level object classes, and a relation head that predicts parent￾child relationships between detected objects. Same (gray): modules unchanged from pre-trained TATR. Expanded (orange, dashed): modules with pre-trained components plus additional randomly initialized parts. New (teal): modules added in POTATR and randomly i… view at source ↗
Figure 3
Figure 3. Figure 3: Generation cost vs. page table cell volume. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy vs. page table cell count. GriTSCon for each model broken down by cells-per-page on PubTables-v2 Single Pages. Appendix A Licenses We plan to open source models and code under the MIT license. B Accuracy vs. Page Table Cell Count We break down GriTSCon by the number of table cells per page ( [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
read the original abstract

Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark -- including frontier MLLMs -- achieving $\textrm{GriTS}_\textrm{Con}$ of 0.964 while running over 130$\times$ faster at roughly 300$\times$ lower cost. Further, POTATR's output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces POTATR, a 29M-parameter image-to-graph model extending the Table Transformer (TATR) for contextual page-level table extraction from document images. It claims state-of-the-art performance on the PubTables-v2 Single Pages benchmark, achieving GriTS_Con of 0.964 while outperforming all tested models including frontier MLLMs, with reported speedups of over 130× and cost reductions of roughly 300×. The model outputs spatially grounded elements with bounding boxes, supports composition with external OCR or cross-page merging, and is positioned as a lightweight alternative for large-scale document processing.

Significance. If the benchmark comparisons hold under identical conditions, the result would be significant for efficient document AI: it demonstrates that a compact, non-autoregressive image-to-graph architecture can match or exceed much larger MLLMs on contextual table extraction while enabling visual verification and modular extension to scanned or multi-page documents. The release of code and models would further support reproducibility.

major comments (2)
  1. [Experiments] Experiments section (benchmark comparison): The central claim of outperforming frontier MLLMs on GriTS_Con (0.964) and the associated 130× speed / 300× cost advantages rests on the PubTables-v2 Single Pages evaluation being performed under fully consistent conditions. The manuscript must explicitly document the MLLM evaluation protocol, including input representation (raw page images), prompting strategy, output parsing into the required graph format, exact test-set splits, and the GriTS_Con implementation used, to substantiate direct numerical comparisons.
  2. [Methods / Experiments] Methods and Experiments: Training details (data splits, augmentation, loss weighting for the graph prediction head, and hyperparameter choices) are referenced only at a high level. Because the performance numbers are the primary evidence for the efficiency-accuracy tradeoff, these details are load-bearing and must be provided to allow verification that the reported gains are not artifacts of differential training or post-processing.
minor comments (2)
  1. [Abstract] Abstract and introduction: The phrase 'outperforms all models tested' should be qualified to 'all models evaluated in this study' to avoid implying exhaustive coverage of the literature.
  2. [Introduction] Notation: The distinction between GriTS_Con and other GriTS variants should be defined on first use with a brief equation or reference to the original metric definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to provide the requested documentation and details.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (benchmark comparison): The central claim of outperforming frontier MLLMs on GriTS_Con (0.964) and the associated 130× speed / 300× cost advantages rests on the PubTables-v2 Single Pages evaluation being performed under fully consistent conditions. The manuscript must explicitly document the MLLM evaluation protocol, including input representation (raw page images), prompting strategy, output parsing into the required graph format, exact test-set splits, and the GriTS_Con implementation used, to substantiate direct numerical comparisons.

    Authors: We agree that explicit documentation of the MLLM evaluation protocol is required to substantiate the comparisons. In the revised manuscript we will add a dedicated subsection in Experiments that specifies input representation (raw page images), prompting strategy, output parsing into graph format, exact test-set splits, and the GriTS_Con implementation used for every model, including the frontier MLLMs. revision: yes

  2. Referee: [Methods / Experiments] Methods and Experiments: Training details (data splits, augmentation, loss weighting for the graph prediction head, and hyperparameter choices) are referenced only at a high level. Because the performance numbers are the primary evidence for the efficiency-accuracy tradeoff, these details are load-bearing and must be provided to allow verification that the reported gains are not artifacts of differential training or post-processing.

    Authors: We acknowledge that training details are currently high-level. In the revised manuscript we will expand the Methods and Experiments sections with full specifications of data splits, augmentation, loss weighting for the graph prediction head, and hyperparameter choices so that the reported performance can be verified and shown not to result from differential training or post-processing. revision: yes

Circularity Check

0 steps flagged

No circularity; performance claims rest on external benchmark measurements

full rationale

The paper introduces POTATR as an extension of prior TATR work and reports direct empirical results (GriTS_Con, speed, cost) on the external PubTables-v2 Single Pages benchmark against other models. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the central claims to tautologies or self-definitions. All key metrics are stated as measured quantities under the benchmark protocol, with no renaming of known results or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities beyond the high-level model description. The 29M parameter count is stated but not analyzed as a fitted value here.

pith-pipeline@v0.9.1-grok · 5736 in / 1101 out tokens · 27668 ms · 2026-06-27T17:05:03.018278+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    Advances in neural information processing systems35, 23716– 23736 (2022)

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716– 23736 (2022)

  2. [2]

    In: International Conference on Document Analysis and Recognition

    Baek, Y., Nam, D., Surh, J., Shin, S., Kim, S.: TRACE: table reconstruction aligned to corner and edges. In: International Conference on Document Analysis and Recognition. pp. 472–489. Springer (2023)

  3. [3]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

  4. [4]

    In: ECCV

    Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End- to-end object detection with transformers. In: ECCV. pp. 213–229 (2020)

  5. [5]

    In: International Conference on Document Analysis and Recognition

    Choi, Y.Y., Kim, T., Kim, N., Lee, T., Joe, S.: End to end table transformer. In: International Conference on Document Analysis and Recognition. pp. 331–345. Springer (2024)

  6. [6]

    PaddleOCR 3.0 Technical Report

    Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., et al.: PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595 (2025)

  7. [7]

    IBM-Granite: Granite Docling.https://huggingface.co/ibm-granite/granite- docling-258M(2025)

  8. [8]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Im, J., Nam, J., Park, N., Lee, H., Park, S.: EGTR: Extracting graph from trans- former for scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24229–24238 (2024)

  9. [9]

    JaidedAI: EasyOCR (09 2024),https://github.com/JaidedAI/EasyOCR

  10. [10]

    dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025

    Li, Y., Yang, G., Liu, H., Wang, B., Zhang, C.: dots.ocr: Multilingual document layout parsing in a single vision-language model. arXiv preprint arXiv:2512.02498 (2025)

  11. [11]

    Luccioni,A.S.,Jernite,Y.,Strubell,E.:Powerhungryprocessing:Wattsdrivingthe cost of ai deployment? In: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT). pp. 85–99. ACM (2024)

  12. [12]

    Smock et al

    Mindee: docTR: Document text recognition.https://github.com/mindee/doctr (2021) 12 B. Smock et al

  13. [13]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Nassar, A., Livathinos, N., Lysak, M., Staar, P.: Tableformer: Table structure un- derstanding with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4614–4623 (2022)

  14. [14]

    arXiv preprint arXiv:2503.11576 (2025)

    Nassar, A., Marafioti, A., Omenetti, M., Lysak, M., Livathinos, N., Auer, C., Morin, L., de Lima, R.T., Kim, Y., Gurbuz, A.S., et al.: SmolDocling: An ultra- compact vision-language model for end-to-end multi-modal document conversion. arXiv preprint arXiv:2503.11576 (2025)

  15. [15]

    In: 2019 international conference on document analysis and recognition (ICDAR)

    Paliwal, S.S., Vishwanath, D., Rahul, R., Sharma, M., Vig, L.: Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In: 2019 international conference on document analysis and recognition (ICDAR). pp. 128–133. IEEE (2019)

  16. [16]

    Poznanski, A

    Poznanski, J., Rangapur, A., Borchardt, J., Dunkelberger, J., Huff, R., Lin, D., Wilhelm, C., Lo, K., Soldaini, L.: olmocr: Unlocking trillions of tokens in pdfs with vision language models. arXiv preprint arXiv:2502.18443 (2025)

  17. [17]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops

    Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet: An approach for end to end table detection and structure recognition from image- based documents. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. pp. 572–573 (2020)

  18. [18]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Rausch, J., Martinez, O., Bissig, F., Zhang, C., Feuerriegel, S.: Docparser: Hierar- chical document structure parsing from renderings. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 4328–4338 (2021)

  19. [19]

    In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR)

    Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: DeepDeSTR: Deep learning for detection and structure recognition of tables in document images. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR). vol. 1, pp. 1162–1167. IEEE (2017)

  20. [20]

    In: European Conference on Computer Vision

    Shit, S., Koner, R., Wittmann, B., Paetzold, J., Ezhov, I., Li, H., Pan, J., Shar- ifzadeh, S., Kaissis, G., Tresp, V., et al.: Relationformer: A unified framework for image-to-graph generation. In: European Conference on Computer Vision. pp. 422–439. Springer (2022)

  21. [21]

    PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

    Smock, B., Faucon-Morin, V., Sokolov, M., Liang, L., Khanam, T., Ramesh, A., Courtland, M.: PubTables-v2: A new large-scale dataset for full-page and multi- page table extraction. arXiv preprint arXiv:2512.10888 (2025)

  22. [22]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Smock, B., Pesala, R., Abraham, R.: PubTables-1M: Towards comprehensive ta- ble extraction from unstructured documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4634–4642 (June 2022)

  23. [23]

    In: International Conference on Document Analysis and Recog- nition

    Smock, B., Pesala, R., Abraham, R.: Aligning benchmark datasets for table struc- ture recognition. In: International Conference on Document Analysis and Recog- nition. pp. 371–386. Springer (2023)

  24. [24]

    In: International Conference on Document Analysis and Recognition

    Smock, B., Pesala, R., Abraham, R.: GriTS: Grid table similarity metric for ta- ble structure recognition. In: International Conference on Document Analysis and Recognition. pp. 535–549. Springer (2023)

  25. [25]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  26. [26]

    arXiv preprint arXiv:2502.09927 (2025)

    Team, G.V., Karlinsky, L., Arbelle, A., Daniels, A., Nassar, A., Alfassi, A., Wu, B., Schwartz, E., Joshi, D., Kondic, J., et al.: Granite vision: a lightweight, open-source multimodal model for enterprise intelligence. arXiv preprint arXiv:2502.09927 (2025)

  27. [27]

    DeepSeek-OCR: Contexts Optical Compression

    Wei, H., Sun, Y., Li, Y.: Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234 (2025) POTATR for Page-Level Table Extraction 13

  28. [28]

    Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552, 2026

    Wei, H., Sun, Y., Li, Y.: Deepseek-ocr 2: Visual causal flow. arXiv preprint arXiv:2601.20552 (2026)

  29. [29]

    In: Proceedings of the IEEE/CVF winter conference on applications of computer vision

    Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 697–706 (2021)

  30. [30]

    In: European conference on computer vision

    Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: data, model, and evaluation. In: European conference on computer vision. pp. 564–580. Springer (2020)

  31. [31]

    In: ICLR (2021) 14 B

    Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. In: ICLR (2021) 14 B. Smock et al. 1 20 21 40 41 70 71 100 101 150 151 200 201 250 250+ Cells per page 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00GriTSCon Accuracy vs. Page T able Cell Volume POTATR-29M Gemini 3.1 Pro Claude ...