pith. machine review for the scientific record.

arXiv: 2604.28061 · v1 · submitted 2026-04-30 · 💻 cs.DL · cs.CL

Recognition: unknown

Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results

Iain Hrynaszkiewicz, Lauren Cadwallader, Parth Sarin, Tim Vines

Pith reviewed 2026-05-07 06:27 UTC · model grok-4.3

classification 💻 cs.DL cs.CL
keywords research data reuse · open science indicators · large language models · generative AI · bibliometric analysis · data sharing impact

The pith

LLM-based classifier detects research data reuse in 43 percent of publications

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an indicator powered by generative AI to track when datasets from one study appear in later publications. Testing on a sample of 4475 PLOS articles produced a 43 percent reuse rate, above the levels found by citation-counting methods. This result implies that the downstream benefits of open data sharing are larger than current estimates suggest and that impact monitoring can now be done at scale rather than through manual or limited bibliometric checks.

Core claim

The authors built and applied an LLM classifier that scans full-text articles to identify instances of research data reuse. In the sampled publications the classifier reported a reuse rate of 43 percent. This figure is higher than rates produced by established bibliometric techniques that rely on explicit data citations or mentions. The work shows that generative AI makes it practical to measure the actual effects of data sharing across large numbers of papers rather than only recording whether data were shared.

What carries the argument

The LLM-based data-reuse classifier, which examines publication text to flag cases where a prior dataset is used in new research, beyond simple citation counts.
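
The paper does not disclose the model, prompts, or post-processing behind this classifier. As an illustration only, a minimal classifier of this kind might prompt a general-purpose LLM per article and parse a one-word label; the model name, prompt wording, and classify_reuse helper below are hypothetical, not the authors' pipeline.

```python
# Minimal sketch of an LLM-based data-reuse classifier (hypothetical; the
# paper does not specify its model, prompt, or post-processing).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You will be shown text from a research article. Decide whether the "
    "study REUSES an existing dataset produced by earlier research, as "
    "opposed to merely citing or mentioning it. Answer with exactly one "
    "word: 'reuse' or 'no-reuse'."
)

def classify_reuse(article_text: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the model judges the article to reuse existing data."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic labeling
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": article_text[:30000]},  # crude length cap
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("reuse")

# Corpus-level rate:
# reuse_rate = sum(classify_reuse(t) for t in full_texts) / len(full_texts)
```

The hard part, as the referee report below notes, is not the call but showing that this label tracks genuine reuse rather than mere data mentions.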

If this is right

  • Open science monitoring can move from counting data deposits to counting actual reuse events.
  • Estimates of the value of data sharing may need upward revision.
  • Scalable AI measurement becomes feasible for tracking other downstream effects of open practices.
  • Policy discussions on data-sharing requirements can be informed by reuse statistics rather than sharing rates alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same LLM approach could be extended to measure reuse of code, protocols, or other research outputs to give a broader view of open science impacts.
  • Running the classifier across disciplinary subsets might reveal whether reuse rates differ by field or data type.
  • Over time the indicator could be combined with citation data to produce composite metrics that better reflect real research influence.

Load-bearing premise

The generative model correctly labels true data-reuse events with few errors, and the sampled publications are representative of the wider body of scholarly literature.

What would settle it

A hand-coded review of several hundred model-labeled reuse cases that finds a substantial fraction are not genuine reuse, or a repeat run on a fresh representative sample that returns a markedly lower reuse rate.
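
In code, that audit reduces to a confirmed-positive proportion with a binomial interval. A minimal sketch, assuming a human coder re-labels a random sample of model-flagged cases; the counts are placeholders, not the paper's data.

```python
# Sketch of the hand-coded audit: of n model-flagged reuse cases, how many
# does a human coder confirm? Counts below are placeholders.
from statsmodels.stats.proportion import proportion_confint

n_flagged = 300      # model-labeled reuse cases sampled for review
n_confirmed = 214    # hypothetical number confirmed as genuine reuse

precision = n_confirmed / n_flagged
lo, hi = proportion_confint(n_confirmed, n_flagged, alpha=0.05, method="wilson")
print(f"precision = {precision:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")

# If precision is well below 1, the headline rate deflates roughly as
# true rate ≈ 43% × precision (ignoring false negatives, which push the
# other way).
```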

Original abstract

Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices when it is more important to understand the 'downstream' effects or impacts of open science. PLOS and DataSeer have developed a new LLM-based indicator to measure an important effect of open science: the reuse of research data. Our results show a data reuse rate of 43%, which is higher than established bibliometric techniques. We show that data reuse can be measured at scale using LLMs and generative artificial intelligence. The positive effects of research data sharing and reuse may currently be underestimated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript describes the development of an LLM-based Open Science Indicator by PLOS and DataSeer to measure research data reuse in scholarly publications. Preliminary results from applying the generative AI classifier report a 43% data reuse rate, which the authors claim exceeds rates from established bibliometric techniques. The work argues that LLMs enable scalable measurement of data reuse and that positive downstream effects of research data sharing may currently be underestimated.

Significance. If the LLM classifier can be shown to be reliable, the approach would offer a valuable scalable tool for metascience to quantify downstream impacts of open science beyond simple prevalence metrics or citation counts. It leverages generative AI in a novel way for an empirical measurement task in digital libraries and open science monitoring, potentially capturing reuse instances that bibliometrics miss. The preliminary nature of the results limits immediate impact, but successful validation could influence how reuse is tracked in policy and evaluation contexts.

major comments (3)
  1. [Abstract] The central claim of a 43% data reuse rate and superiority over bibliometric techniques is presented without any reported validation of the LLM classifier against human labels, including precision, recall, confusion matrix, inter-rater reliability, or details on the evaluation sample size. This validation is load-bearing for assessing whether the numerical result supports the superiority claim.
  2. [Methods] The description of the generative AI classifier provides no information on the specific model, prompt templates, few-shot examples, fine-tuning steps, or handling of edge cases such as data mentions versus actual reuse. These details are required to evaluate reproducibility and to rule out systematic labeling biases that could inflate or deflate the 43% figure.
  3. [Results] The comparison to bibliometric techniques lacks specification of the exact bibliometric baseline, whether it was applied to the identical corpus, and any statistical assessment of the difference. In addition, no analysis of corpus representativeness across disciplines or publication years is reported, undermining generalizability of the reuse rate.
minor comments (2)
  1. [Abstract] The abstract would benefit from stating the total number of publications analyzed and the time period covered to provide immediate context for the 43% rate.
  2. Consider adding a dedicated subsection or table in the Methods or Results that reports classifier performance metrics once validation is completed.
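
For scale, every metric this report asks for is a one-liner once a human-labeled evaluation set exists. A minimal sketch with scikit-learn; the label arrays are placeholders standing in for two human coders and the model's output on the same sample.

```python
# Sketch of the requested validation, assuming parallel label arrays
# (1 = reuse, 0 = no reuse) over a human-coded evaluation sample.
from sklearn.metrics import (
    cohen_kappa_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

y_coder1 = [1, 0, 1, 1, 0, 0, 1, 0]  # placeholder labels, coder 1
y_coder2 = [1, 0, 1, 0, 0, 0, 1, 0]  # placeholder labels, coder 2
y_model  = [1, 0, 1, 1, 0, 1, 1, 0]  # placeholder model predictions

# Inter-rater reliability between the two human coders
kappa = cohen_kappa_score(y_coder1, y_coder2)

# Model performance against coder 1 (or an adjudicated gold label)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_coder1, y_model, average="binary"
)
cm = confusion_matrix(y_coder1, y_model)  # rows: true, columns: predicted

print(f"kappa={kappa:.2f}  precision={prec:.2f}  recall={rec:.2f}  f1={f1:.2f}")
print(cm)
```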

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We appreciate the emphasis on validation, reproducibility, and generalizability, which will help strengthen the presentation of our LLM-based Open Science Indicator. We address each major comment below and commit to incorporating the requested clarifications and additions in the revised version.

Point-by-point responses
  1. Referee: [Abstract] The central claim of a 43% data reuse rate and superiority over bibliometric techniques is presented without any reported validation of the LLM classifier against human labels, including precision, recall, confusion matrix, inter-rater reliability, or details on the evaluation sample size. This validation is load-bearing for assessing whether the numerical result supports the superiority claim.

    Authors: We agree that the abstract's central claims require supporting validation details to be fully convincing. While the manuscript contains a preliminary validation component, we will expand this substantially in the revision. The updated abstract and a new dedicated validation subsection will report precision, recall, F1-score, a full confusion matrix, inter-rater reliability (Cohen's kappa), and the exact size and sampling method of the human-labeled evaluation set. These additions will allow readers to directly assess the reliability of the 43% reuse rate and the comparison to bibliometric methods. revision: yes

  2. Referee: [Methods] The description of the generative AI classifier provides no information on the specific model, prompt templates, few-shot examples, fine-tuning steps, or handling of edge cases such as data mentions versus actual reuse. These details are required to evaluate reproducibility and to rule out systematic labeling biases that could inflate or deflate the 43% figure.

    Authors: We acknowledge that the current methods description is insufficient for full reproducibility and bias assessment. In the revised manuscript we will add a detailed classifier subsection specifying the exact model (including version), the complete prompt templates, any few-shot examples, fine-tuning steps if used, and explicit procedures for distinguishing data mentions from actual reuse (including prompt engineering choices and any post-processing rules). We will also discuss potential sources of systematic bias and how they were addressed or quantified. revision: yes

  3. Referee: [Results] The comparison to bibliometric techniques lacks specification of the exact bibliometric baseline, whether it was applied to the identical corpus, and any statistical assessment of the difference. In addition, no analysis of corpus representativeness across disciplines or publication years is reported, undermining generalizability of the reuse rate.

    Authors: We agree that the results section needs greater specificity on the baseline comparison and corpus characteristics. The revision will explicitly name the bibliometric baseline, confirm it was run on the identical corpus, and include a statistical comparison of the two reuse rates (e.g., proportion tests with confidence intervals). We will also add a corpus description subsection reporting disciplinary and temporal distributions, together with a discussion of representativeness and the resulting limits on generalizability of the 43% figure. revision: yes
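
The statistical comparison promised in response 3 is likewise a standard two-sample proportion test. A minimal sketch with statsmodels, using the 4475-article corpus size the paper reports; the bibliometric count is a placeholder, since the manuscript does not state it.

```python
# Sketch of the promised comparison: LLM-measured reuse rate versus a
# bibliometric baseline on the identical corpus of 4475 articles.
import numpy as np
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

n = 4475
llm_reuse = round(0.43 * n)  # 43% reuse rate via the LLM classifier
biblio_reuse = 600           # placeholder: reuse detected via data citations

stat, pval = proportions_ztest(
    count=np.array([llm_reuse, biblio_reuse]),
    nobs=np.array([n, n]),
)
ci_llm = proportion_confint(llm_reuse, n, method="wilson")
ci_bib = proportion_confint(biblio_reuse, n, method="wilson")

print(f"z = {stat:.1f}, p = {pval:.1e}")
print(f"LLM 95% CI: {ci_llm}, bibliometric 95% CI: {ci_bib}")
```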

Circularity Check

0 steps flagged

No circularity: empirical measurement study with independent output

Full rationale

The paper is an empirical measurement exercise that applies an LLM-based classifier to a sample of scholarly publications to estimate a data reuse rate of 43%. This figure is produced by direct processing of external text data rather than by any derivation, equation, or self-referential definition internal to the paper. No load-bearing step reduces a claimed prediction or result to a fitted parameter, self-citation chain, or ansatz smuggled from prior work by the same authors. The method description and results section treat the classifier output as an observed quantity on the sampled corpus, with no indication that the 43% value is forced by construction from the paper's own inputs. The work is therefore grounded in external data rather than its own constructions and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This abstract-only review surfaces no explicit free parameters, axioms, or invented entities. The central claim rests on the unstated assumption that LLM classification corresponds to actual data reuse.

pith-pipeline@v0.9.0 · 5404 in / 1283 out tokens · 67553 ms · 2026-05-07T06:27:53.813883+00:00 · methodology


Reference graph

Works this paper leans on

2 extracted references · 1 canonical work page

  1. [1] accession numbers

     Results: The dataset includes the analysis of 4475 research articles published by PLOS between 1st January 2024 and 31st March 2024. Table 3 details the number and percentage of articles generating new data and/or reusing existing data. 59% of articles generated new data and 43% reused data (articles may generate new data, reuse data, or both) and 10% did ...

  2. [2] helicopter science

     Discussion: As a paper-in-progress reporting an LLM-driven approach still in development, we provide only minimal interpretation of our results to date. Overall we observe levels of data reuse, as determined by this LLM approach in our sample of 4475 PLOS publications, of 43%. This result is closer to the estimates of data reuse provided by researchers in ...