pith. machine review for the scientific record.

arxiv: 2605.06021 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.DL

Recognition: unknown

PlotPick: AI-powered batch extraction of numerical data from scientific figures

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:17 UTC · model grok-4.3

classification 💻 cs.CV cs.DL
keywords plot extraction · vision-language models · scientific figures · data digitization · chart-to-table · batch processing · meta-analysis · computer vision

The pith

Vision-language models extract numerical data from scientific figures more reliably than dedicated chart models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PlotPick, an open-source tool that applies vision-language models to convert images of scientific plots into structured tables in batch. Evaluation on the ChartX and PlotQA benchmarks shows all six tested VLMs exceeding the specialized DePlot model, with recall of 88-96 percent on ChartX versus DePlot's 71 percent and RMSF1 scores of 86-99 percent on PlotQA versus DePlot's 94 percent. The advantage grows on chart types, such as box plots, that were absent from the dedicated model's training. This matters because systematic reviews and meta-analyses often need to recover numbers that authors publish only in figures, and manual digitization does not scale. The work supplies a practical, web-accessible implementation that can be applied directly to published papers.

Core claim

PlotPick shows that general vision-language models achieve higher accuracy than the dedicated DePlot model at converting bar charts, line charts, box plots, and histograms into tabular data on the ChartX (n=300) and PlotQA (n=529) benchmarks. On ChartX the VLMs reach 88-96 percent recall while DePlot reaches 71 percent; on PlotQA the VLMs reach 86-99 percent RMSF1 while DePlot reaches 94 percent. The gap is largest for box plots, where DePlot scores 24 percent RMSF1 and the VLMs score 83-97 percent. The tool is released at a public Streamlit address for immediate use on batches of figures.
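To make the accuracy figures above concrete, here is a minimal Python sketch of a tolerance-based recall: the fraction of ground-truth values matched by some extracted value within a relative tolerance. This is an illustrative simplification, not the exact ChartX scoring or the RMSF1 metric the paper reports.

```python
def recall_within_tolerance(truth, predicted, rel_tol=0.05):
    """Fraction of ground-truth values matched by some predicted value
    within a relative tolerance, with greedy one-to-one matching."""
    remaining = list(predicted)
    hits = 0
    for t in truth:
        for i, p in enumerate(remaining):
            # Relative error against the ground-truth magnitude;
            # require an exact match when the truth value is zero.
            ok = (p == 0) if t == 0 else abs(p - t) / abs(t) <= rel_tol
            if ok:
                hits += 1
                del remaining[i]  # each prediction can match only once
                break
    return hits / len(truth) if truth else 1.0

# Example: 3 of 4 bar heights recovered within 5 percent
print(recall_within_tolerance([10.0, 20.0, 30.0, 40.0],
                              [10.2, 19.5, 55.0, 40.0]))  # 0.75
```

A stricter scorer would also require row and column labels to match; this sketch checks numeric values only.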

What carries the argument

The PlotPick tool, which feeds figure images to vision-language models to produce structured tabular output without requiring chart-type-specific training.
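At its core, the pipeline described above reduces to one prompted VLM call per figure. Below is a minimal sketch of such a request in an OpenAI-style chat format; the model name, prompt wording, and wire format are assumptions for illustration, not PlotPick's published internals.

```python
import base64

def build_chart_to_table_request(image_path, model="gpt-4o"):
    """Build an OpenAI-style chat payload asking a VLM to transcribe a
    chart image into CSV. Model name and prompt text are illustrative."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Extract the underlying data from this chart. "
                          "Return only a CSV table with a header row; "
                          "use the axis labels as column names.")},
                # Image sent inline as a base64 data URL
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0,  # deterministic-as-possible transcription
    }
```

Batch processing is then just a loop over extracted figure images, with the CSV responses parsed into tables.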

If this is right

  • Systematic reviews and meta-analyses can process larger numbers of papers by automating the recovery of data shown only in figures.
  • Verification of reported results becomes faster when readers can extract the underlying numbers directly from published plots.
  • Open release of the tool lowers the barrier for researchers to test the approach on their own collections of papers.
  • General models gain an edge over narrowly trained chart parsers precisely on chart types not seen during specialized training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same batch workflow could be combined with literature search tools to create end-to-end data pipelines for entire research topics.
  • Extensions to figures that combine multiple chart types or heavy annotation would test the limits of current general models.
  • Error rates on real journal figures could be lowered further by post-processing steps that enforce consistency with axis labels and legends.
  • Adoption might reduce transcription mistakes that currently affect secondary analyses in fields that rely on visual data presentation.
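The post-processing idea raised above — enforcing consistency with axis labels — could start with something as simple as a range check against the parsed axis limits. A hypothetical sketch, not a feature of PlotPick as described:

```python
def flag_out_of_range(values, axis_min, axis_max, slack=0.02):
    """Flag extracted values that fall outside the plotted axis range.
    `slack` widens the range slightly to tolerate points drawn at the
    axis limits. Hypothetical post-processing, not part of PlotPick."""
    span = axis_max - axis_min
    lo, hi = axis_min - slack * span, axis_max + slack * span
    return [v for v in values if not (lo <= v <= hi)]

# With a y-axis from 0 to 100, a value of 460 is almost certainly a
# transcription error (e.g., a misread decimal point).
print(flag_out_of_range([12.5, 88.0, 460.0], 0, 100))  # [460.0]
```

Flagged values could be re-queried with a corrective prompt or routed to a human for spot-checking.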

Load-bearing premise

Performance measured on the ChartX and PlotQA benchmarks will transfer to the varied layouts, annotations, and visual styles that appear in real published scientific papers.

What would settle it

Ground-truth numerical values recovered from a new set of actual peer-reviewed journal figures and compared against PlotPick outputs would show whether the benchmark gains persist outside the test distributions.

Figures

Figures reproduced from arXiv: 2605.06021 by Tommy Carstensen.

Figure 1. Chart-to-table extraction accuracy by method and chart type.
Figure 2. RMSF1 (%) by model and chart type. All VLMs outperform DePlot.
Original abstract

Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models' training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at https://plotpick.streamlit.app.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PlotPick, an open-source Streamlit-based tool that leverages vision-language models (VLMs) for batch extraction of structured tabular data from scientific figures. The authors evaluate six VLMs from three providers on two public chart-to-table benchmarks—ChartX (restricted to bar, line, box, and histogram charts; n=300) and PlotQA (n=529)—and compare them against the dedicated DePlot model. They claim that all six VLMs outperform DePlot on both benchmarks, with recall of 88-96% (vs. 71%) on ChartX and RMSF1 of 86-99% (vs. 94%) on PlotQA, and highlight larger gains on chart types absent from DePlot's training data such as box plots (83-97% vs. 24%). The tool is made publicly available at https://plotpick.streamlit.app.

Significance. If the performance claims hold after correction, PlotPick would provide a practical, scalable solution for systematic reviews and meta-analyses that require digitizing numerical data from figures, an otherwise labor-intensive task. The use of established public benchmarks (ChartX, PlotQA), direct comparison to a dedicated baseline (DePlot), and release of an open-source implementation are clear strengths that support reproducibility and allow community extension. However, the reported results do not yet demonstrate generalization to the complex, annotated, and stylistically varied figures typical of published scientific literature.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'All six VLMs outperform DePlot on both benchmarks' is internally inconsistent with the reported PlotQA results. The VLM RMSF1 range of 86-99% includes values below DePlot's 94%, which directly falsifies the 'all outperform' assertion if the range represents the min-max across the six models (as the parallel ChartX phrasing 88-96% suggests). This is a load-bearing reporting error for the paper's primary empirical contribution and requires either per-model scores or a corrected statement.
  2. [Abstract and §4] Abstract and §4 (Evaluation): The manuscript provides aggregate ranges but no per-model breakdown, error analysis, statistical significance tests, or details on prompting strategy and post-processing. Without these, it is impossible to verify whether the claimed superiority is robust or driven by specific VLMs, chart subsets, or implementation choices.
minor comments (2)
  1. [Abstract] Abstract: The ChartX evaluation is restricted to four chart types; the paper should explicitly state the total size of ChartX and the fraction retained after filtering to allow readers to assess selection bias.
  2. The manuscript should include a limitations section discussing failure modes on real-world figures (e.g., multi-panel plots, heavy annotations, non-standard color schemes) that are not represented in the chosen benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review. We address each major comment below and will revise the manuscript accordingly to improve accuracy and transparency.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'All six VLMs outperform DePlot on both benchmarks' is internally inconsistent with the reported PlotQA results. The VLM RMSF1 range of 86-99% includes values below DePlot's 94%, which directly falsifies the 'all outperform' assertion if the range represents the min-max across the six models (as the parallel ChartX phrasing 88-96% suggests). This is a load-bearing reporting error for the paper's primary empirical contribution and requires either per-model scores or a corrected statement.

    Authors: We acknowledge the inconsistency. The abstract's statement that all six VLMs outperform DePlot cannot be reconciled with a reported RMSF1 range of 86-99% on PlotQA when the lower bound falls below DePlot's 94%. This was an error in the abstract's generalization of the aggregate results. We will revise the abstract to remove the absolute claim and instead report that VLMs achieve RMSF1 scores ranging from 86-99% (versus 94% for DePlot), with the largest gains on underrepresented chart types. We will also add per-model scores in the revised evaluation section. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4 (Evaluation): The manuscript provides aggregate ranges but no per-model breakdown, error analysis, statistical significance tests, or details on prompting strategy and post-processing. Without these, it is impossible to verify whether the claimed superiority is robust or driven by specific VLMs, chart subsets, or implementation choices.

    Authors: We agree that aggregate ranges alone limit the ability to assess robustness. The current manuscript emphasizes overall trends across models and benchmarks, but we recognize the need for greater detail. In the revision we will include a per-model performance table for both ChartX and PlotQA, describe the prompting templates and any post-processing rules applied, and add an error analysis section highlighting failure modes by chart type. Statistical significance testing will be incorporated for the key comparisons where sample sizes permit. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation on independent public datasets

full rationale

The paper reports direct empirical results from running six VLMs and DePlot on the public ChartX (n=300) and PlotQA (n=529) benchmarks. No equations, parameter fitting, self-referential predictions, or derivation chains appear. Performance claims rest on external, pre-existing benchmarks and an external baseline model; the evaluation is self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical tool and benchmark study. It relies on the established capabilities of vision-language models and public chart datasets rather than new mathematical axioms, fitted parameters, or postulated entities.

axioms (1)
  • domain assumption Vision-language models can interpret and extract numerical information from chart images when properly prompted.
    The central claim depends on this capability of existing VLMs; the abstract provides no theoretical derivation or proof.

pith-pipeline@v0.9.0 · 5486 in / 1395 out tokens · 76892 ms · 2026-05-08T14:17:01.759971+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

5 extracted references · 1 canonical work page

  1. [1]

    Extracting data from figures with software was faster, with higher interrater reliability than manual extraction

    Antonia Jelicic Kadic, Katarina Vucic, Svjetlana Dosenovic, Damir Sapunar, and Livia Puljak. Extracting data from figures with software was faster, with higher interrater reliability than manual extraction. Journal of Clinical Epidemiology, 74:119–123, 2016.

  2. [2]

    DePlot: One-shot visual language reasoning by plot-to-table translation

    Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of ACL, 2023.

  3. [3]

    TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning

    Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning. In EMNLP, 2024.

  4. [4]

    ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning

    Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Peng Ye, Min Dou, Botian Shi, Junchi Yan, and Yu Qiao. ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning. arXiv preprint arXiv:2402.12185, 2024.

  5. [5]

    PyMuPDF: Python bindings for MuPDF

    Artifex Software. PyMuPDF: Python bindings for MuPDF.