PlotPick: AI-powered batch extraction of numerical data from scientific figures
Pith reviewed 2026-05-08 14:17 UTC · model grok-4.3
The pith
Vision-language models extract numerical data from scientific figures more reliably than dedicated chart models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PlotPick shows that general vision-language models achieve higher accuracy than the dedicated DePlot model at converting bar charts, line charts, box plots, and histograms into tabular data on the ChartX (n=300) and PlotQA (n=529) benchmarks. On ChartX the VLMs reach 88-96 percent recall while DePlot reaches 71 percent; on PlotQA the VLMs reach 86-99 percent RMSF1 while DePlot reaches 94 percent. The gap is largest for box plots, where DePlot scores 24 percent RMSF1 and the VLMs score 83-97 percent. The tool is released at a public Streamlit address for immediate use on batches of figures.
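The recall and RMSF1 figures compare extracted tables against ground-truth values. A minimal sketch of an RMS-F1-style score, greedy-matching predicted to gold values within a relative tolerance, can make the metric concrete; this is a simplification for illustration, not the exact metric defined in the DePlot paper (which also matches row/column headers and weights by relative error):

```python
def rms_f1(pred, gold, rel_tol=0.05):
    """Greedy-match predicted values to gold values within a relative
    tolerance, then report the F1 of the matched entries.
    Simplified illustration of an RMS-F1-style score."""
    unmatched = list(gold)
    hits = 0
    for p in pred:
        for g in unmatched:
            if abs(p - g) <= rel_tol * max(abs(g), 1e-9):
                unmatched.remove(g)  # each gold value matches at most once
                hits += 1
                break
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this scheme, a model that recovers two of three gold values exactly scores precision 1.0, recall 2/3, and F1 0.8.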
What carries the argument
The PlotPick tool, which feeds figure images to vision-language models to produce structured tabular output without requiring chart-type-specific training.
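The pipeline hinges on turning a free-text model reply into structured rows. A hedged sketch of that step, assuming the VLM is prompted to answer with a markdown table (the prompt format and this parser are assumptions for illustration, not PlotPick's actual implementation):

```python
def parse_markdown_table(reply: str) -> list[dict]:
    """Parse the first markdown table in a VLM reply into a list of
    {header: cell} dicts. Hypothetical helper; PlotPick's actual
    output format may differ."""
    rows = [line.strip() for line in reply.splitlines()
            if line.strip().startswith("|")]
    # Drop the |---|---| separator row (contains only |, -, :, spaces).
    rows = [r for r in rows if not set(r) <= set("|-: ")]
    if not rows:
        return []
    split = lambda r: [c.strip() for c in r.strip("|").split("|")]
    header = split(rows[0])
    return [dict(zip(header, split(r))) for r in rows[1:]]
```

Because no chart-type-specific training is involved, the same prompt-and-parse loop applies unchanged to bar charts, line charts, box plots, and histograms.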
If this is right
- Systematic reviews and meta-analyses can process larger numbers of papers by automating the recovery of data shown only in figures.
- Verification of reported results becomes faster when readers can extract the underlying numbers directly from published plots.
- Open release of the tool lowers the barrier for researchers to test the approach on their own collections of papers.
- General models gain an edge over narrowly trained chart parsers precisely on chart types not seen during specialized training.
Where Pith is reading between the lines
- The same batch workflow could be combined with literature search tools to create end-to-end data pipelines for entire research topics.
- Extensions to figures that combine multiple chart types or heavy annotation would test the limits of current general models.
- Error rates on real journal figures could be lowered further by post-processing steps that enforce consistency with axis labels and legends.
- Adoption might reduce transcription mistakes that currently affect secondary analyses in fields that rely on visual data presentation.
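The post-processing idea above can be sketched as a simple sanity filter: flag any extracted value that falls outside the y-axis range read from the figure. The axis bounds, slack factor, and values here are hypothetical; this is one possible consistency check, not a feature of PlotPick:

```python
def flag_out_of_range(values, y_min, y_max, slack=0.05):
    """Flag extracted values outside the y-axis range, allowing a
    small slack for marks drawn slightly past the top tick.
    Sketch of a hypothetical consistency check."""
    span = y_max - y_min
    lo, hi = y_min - slack * span, y_max + slack * span
    return [v for v in values if not (lo <= v <= hi)]
```

Flagged values could then be re-queried or dropped before the table is written out.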
Load-bearing premise
Performance measured on the ChartX and PlotQA benchmarks will transfer to the varied layouts, annotations, and visual styles that appear in real published scientific papers.
What would settle it
Ground-truth numerical values recovered from a new set of actual peer-reviewed journal figures and compared against PlotPick outputs would show whether the benchmark gains persist outside the test distributions.
Original abstract
Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models' training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at https://plotpick.streamlit.app.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PlotPick, an open-source Streamlit-based tool that leverages vision-language models (VLMs) for batch extraction of structured tabular data from scientific figures. The authors evaluate six VLMs from three providers on two public chart-to-table benchmarks—ChartX (restricted to bar, line, box, and histogram charts; n=300) and PlotQA (n=529)—and compare them against the dedicated DePlot model. They claim that all six VLMs outperform DePlot on both benchmarks, with recall of 88-96% (vs. 71%) on ChartX and RMSF1 of 86-99% (vs. 94%) on PlotQA, and highlight larger gains on chart types absent from DePlot's training data such as box plots (83-97% vs. 24%). The tool is made publicly available at https://plotpick.streamlit.app.
Significance. If the performance claims hold after correction, PlotPick would provide a practical, scalable solution for systematic reviews and meta-analyses that require digitizing numerical data from figures, an otherwise labor-intensive task. The use of established public benchmarks (ChartX, PlotQA), direct comparison to a dedicated baseline (DePlot), and release of an open-source implementation are clear strengths that support reproducibility and allow community extension. However, the reported results do not yet demonstrate generalization to the complex, annotated, and stylistically varied figures typical of published scientific literature.
major comments (2)
- [Abstract] The central claim that 'All six VLMs outperform DePlot on both benchmarks' is internally inconsistent with the reported PlotQA results. The VLM RMSF1 range of 86-99% includes values below DePlot's 94%, which directly falsifies the 'all outperform' assertion if the range represents the min-max across the six models (as the parallel ChartX phrasing 88-96% suggests). This is a load-bearing reporting error for the paper's primary empirical contribution and requires either per-model scores or a corrected statement.
- [Abstract and §4 (Evaluation)] The manuscript provides aggregate ranges but no per-model breakdown, error analysis, statistical significance tests, or details on prompting strategy and post-processing. Without these, it is impossible to verify whether the claimed superiority is robust or driven by specific VLMs, chart subsets, or implementation choices.
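The logic of the first comment reduces to a one-line check: if the reported range is the min-max over the six models, a lower bound below the baseline contradicts "all outperform". With hypothetical per-model scores spanning the reported 86-99% range (the paper gives no per-model breakdown):

```python
baseline = 94  # DePlot RMSF1 on PlotQA (%)
vlm_scores = [86, 91, 95, 96, 98, 99]  # hypothetical per-model scores
assert min(vlm_scores) == 86 and max(vlm_scores) == 99
# Any score below the baseline falsifies the 'all outperform' claim.
all_outperform = all(s > baseline for s in vlm_scores)
```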
minor comments (2)
- [Abstract] The ChartX evaluation is restricted to four chart types; the paper should explicitly state the total size of ChartX and the fraction retained after filtering to allow readers to assess selection bias.
- The manuscript should include a limitations section discussing failure modes on real-world figures (e.g., multi-panel plots, heavy annotations, non-standard color schemes) that are not represented in the chosen benchmarks.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. We address each major comment below and will revise the manuscript accordingly to improve accuracy and transparency.
Point-by-point responses
- Referee: [Abstract] The central claim that 'All six VLMs outperform DePlot on both benchmarks' is internally inconsistent with the reported PlotQA results. The VLM RMSF1 range of 86-99% includes values below DePlot's 94%, which directly falsifies the 'all outperform' assertion if the range represents the min-max across the six models (as the parallel ChartX phrasing 88-96% suggests). This is a load-bearing reporting error for the paper's primary empirical contribution and requires either per-model scores or a corrected statement.
  Authors: We acknowledge the inconsistency. The abstract's statement that all six VLMs outperform DePlot cannot be reconciled with a reported RMSF1 range of 86-99% on PlotQA when the lower bound falls below DePlot's 94%. This was an error in the abstract's generalization of the aggregate results. We will revise the abstract to remove the absolute claim and instead report that VLMs achieve RMSF1 scores ranging from 86-99% (versus 94% for DePlot), with the largest gains on underrepresented chart types. We will also add per-model scores in the revised evaluation section. revision: yes
- Referee: [Abstract and §4 (Evaluation)] The manuscript provides aggregate ranges but no per-model breakdown, error analysis, statistical significance tests, or details on prompting strategy and post-processing. Without these, it is impossible to verify whether the claimed superiority is robust or driven by specific VLMs, chart subsets, or implementation choices.
  Authors: We agree that aggregate ranges alone limit the ability to assess robustness. The current manuscript emphasizes overall trends across models and benchmarks, but we recognize the need for greater detail. In the revision we will include a per-model performance table for both ChartX and PlotQA, describe the prompting templates and any post-processing rules applied, and add an error analysis section highlighting failure modes by chart type. Statistical significance testing will be incorporated for the key comparisons where sample sizes permit. revision: yes
Circularity Check
No circularity: empirical benchmark evaluation on independent public datasets
Full rationale
The paper reports direct empirical results from running six VLMs and DePlot on the public ChartX (n=300) and PlotQA (n=529) benchmarks. No equations, parameter fitting, self-referential predictions, or derivation chains appear. Performance claims rest on external, pre-existing benchmarks and an external baseline model; the evaluation is self-contained and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Vision-language models can interpret and extract numerical information from chart images when properly prompted.
Reference graph
Works this paper leans on
- [1] Antonia Jelicic Kadic, Katarina Vucic, Svjetlana Dosenovic, Damir Sapunar, and Livia Puljak. Extracting data from figures with software was faster, with higher interrater reliability than manual extraction. Journal of Clinical Epidemiology, 74:119–123, 2016.
- [2] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of ACL, 2023.
- [3] Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning. In EMNLP, 2024.
- [4] Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Peng Ye, Min Dou, Botian Shi, Junchi Yan, and Yu Qiao. ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning. arXiv preprint arXiv:2402.12185, 2024.
- [5] Artifex Software. PyMuPDF: Python bindings for MuPDF.