One in Eight OpenAlex Abstracts Has Integrity Issues

Seorin Kim; Vincent Ginis; Vincent Holst

arxiv: 2605.20168 · v1 · pith:7XOETLFGnew · submitted 2026-05-19 · 💻 cs.DL · cs.DB

Pith reviewed 2026-05-20 02:11 UTC · model grok-4.3

keywords: OpenAlex, scientific abstracts, data integrity, metascience, bibliographic databases, data quality, annotation protocol, failure modes

The pith

About 12% of English-language journal abstracts in OpenAlex show integrity problems such as insufficient content or misplaced metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper checks the quality of abstracts stored in the OpenAlex bibliographic database, which many computational metascience studies now treat as primary data. A random sample of 10,000 abstracts was reviewed with a two-stage process that combines human experts and large language model classification. The review uncovered seven distinct failure modes and showed that 12% of the abstracts contain integrity issues. Insufficient content and misplaced metadata turned out to be the most common problems. These findings matter because low-quality input data can distort any large-scale analysis that draws on the database.

Core claim

We assess the integrity of 10,000 randomly sampled English-language journal abstracts from OpenAlex using a two-stage annotation protocol combining human expert review and large language model classification. We identify seven distinct failure modes and find that 12% of abstracts have integrity issues, with insufficient content and misplaced metadata being the most prevalent.

What carries the argument

Two-stage human-plus-LLM annotation protocol that labels abstracts according to seven defined failure modes.

If this is right

Metascience studies that treat OpenAlex abstracts as primary data may produce biased or noisy results.
Researchers using the database for computational work should apply additional data-cleaning steps.
A forthcoming community portal will allow collective annotation to improve the resource over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar audits of other large bibliographic databases could reveal whether the 12% rate is widespread.
Automated detectors trained on the seven failure modes could be integrated into database ingestion pipelines.
The reported prevalence supplies a concrete baseline against which future improvements in abstract quality can be measured.

Load-bearing premise

The random sample of 10,000 abstracts represents all English-language journal abstracts in OpenAlex and the combined human-LLM protocol identifies integrity issues consistently and without systematic bias.

What would settle it

Re-annotating an independent new sample of 10,000 OpenAlex abstracts or manually verifying a random subset against the original journal PDFs to confirm whether the 12% rate holds.
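
As a rough illustration of how such a replication could be compared with the original estimate, the sketch below runs a pooled two-proportion z-test on the flagged counts from two samples. The function name `two_proportion_z` and every count in the usage lines are hypothetical; the paper reports only the rounded 12% figure, and nothing here is the authors' analysis.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test.

    x1, x2: number of abstracts flagged in each sample.
    n1, n2: sample sizes.
    Returns (z statistic, two-sided p-value).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1,200 of 10,000 flagged originally,
# 1,500 of 10,000 flagged in an imagined re-annotation.
z, p = two_proportion_z(1200, 10_000, 1500, 10_000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

A small p-value would suggest the two samples disagree about the prevalence; a large one would be consistent with the original rate holding, within sampling error.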

Figures

Figures reproduced from arXiv: 2605.20168 by Seorin Kim, Vincent Ginis, Vincent Holst.

**Figure 2.** Binary confusion matrix comparing Claude Opus 4.6 (calibrated prompt) against the human consensus ground ...

**Figure 3.** Rejection rate by publication period. Rejection rates declined consistently across publication periods for both ...

**Figure 4.** Distribution of abstract integrity failures across 10,000 OpenAlex abstracts. The inset shows the overall ...

**Figure 5.** Failure mode composition per period across 10,000 OpenAlex abstracts.

Original abstract

Scientific abstracts are increasingly used as primary data in computational metascience research, yet the quality of these abstracts in widely used bibliographic databases has not been systematically examined. We assess the integrity of 10,000 randomly sampled English-language journal abstracts from OpenAlex using a two-stage annotation protocol combining human expert review and large language model classification. We identify seven distinct failure modes and find that 12% of abstracts have integrity issues, with insufficient content and misplaced metadata being the most prevalent. We discuss implications for downstream research and describe a forthcoming community portal to support collective annotation efforts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports a 12% rate of integrity issues in OpenAlex abstracts from a 10k random sample, but the two-stage annotation lacks reported agreement metrics so the exact figure needs verification.

The main thing here is the 12% figure for abstracts with integrity problems in OpenAlex, drawn from a random sample of 10,000 English-language journal entries. They break it into seven failure modes, with insufficient content and misplaced metadata at the top. If the number holds, it is worth knowing for anyone running large-scale work on that database.

They appear to be the first to quantify this systematically, which is the real addition. The taxonomy and the direct count from annotation give a concrete baseline where before there were only suspicions. The two-stage human-plus-LLM setup is a reasonable way to handle volume without pure manual labor, and they connect it to risks in metascience and computational social science studies that treat abstracts as primary data. That part is straightforward and useful.

The soft spot is the annotation process itself. The abstract gives no inter-rater agreement numbers, no calibration details between humans and the LLM, and no clear rules for resolving disagreements. If the categories have any subjectivity, especially around what counts as insufficient content, the 12% and the mode rankings could move with different labelers or prompt tweaks. The sample is limited to English journals, which is fine for scope but means the result does not automatically cover the full database. These are fixable with more methods detail rather than fatal.

This paper is aimed at researchers who pull OpenAlex data for text analysis, citation studies, or trend detection. Anyone who has used those abstracts as clean input will see immediate practical value in the error rate. It deserves a serious referee because the finding is actionable for a widely used resource and the core design is empirical and transparent, even if the reliability checks need to be shown. I would send it to review and specifically ask for the agreement statistics and a few labeled examples of each mode.

Referee Report

2 major / 2 minor

Summary. The paper reports an empirical assessment of abstract integrity in the OpenAlex database. A random sample of 10,000 English-language journal abstracts is annotated via a two-stage protocol that combines human expert review with LLM classification. Seven failure modes are identified, and the authors conclude that 12% of abstracts exhibit integrity issues, with insufficient content and misplaced metadata as the most common. Implications for metascience research and plans for a community annotation portal are discussed.

Significance. If the 12% prevalence and failure-mode distribution are shown to be robust, the result would demonstrate that a substantial fraction of abstracts in a widely used open bibliographic database contain integrity problems. This would have immediate consequences for any computational metascience work that treats OpenAlex abstracts as primary input data and would support the value of the proposed community portal for ongoing quality improvement.

major comments (2)

[Methods] Methods section (Annotation Protocol): The two-stage human-plus-LLM protocol is outlined, yet no inter-annotator agreement statistics (Cohen’s kappa, percentage agreement), disagreement-resolution rules, or LLM calibration/validation performance on held-out labels are reported. Because the central 12% prevalence figure and the ranking of the seven failure modes are derived directly from these annotations, the lack of quantitative reliability metrics leaves the quantitative claims sensitive to potential subjectivity or systematic bias in labeling.
[Results] Results section: The manuscript should report exact counts, percentages, and confidence intervals for each of the seven failure modes (not only the two highlighted in the abstract) so that readers can assess the precision of the overall 12% estimate and the relative prevalence claims.
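
The agreement statistic named in the first major comment can be made concrete with a short sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the toy labels are illustrative assumptions; they are not the paper's labels, categories, or code.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical): "ok" versus "issue" for four abstracts.
annotator_1 = ["ok", "ok", "issue", "issue"]
annotator_2 = ["ok", "ok", "issue", "ok"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters when one label (here, an acceptable abstract) dominates the sample.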

minor comments (2)

[Methods] Clarify the exact filtering criteria used to select English-language journal abstracts from OpenAlex and any exclusion rules applied before sampling.
[Results] Consider adding a short table that maps each of the seven failure modes to concrete examples drawn from the annotated sample.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript assessing abstract integrity in OpenAlex. We provide point-by-point responses to the major comments below, indicating planned revisions where appropriate.

Referee: [Methods] Methods section (Annotation Protocol): The two-stage human-plus-LLM protocol is outlined, yet no inter-annotator agreement statistics (Cohen’s kappa, percentage agreement), disagreement-resolution rules, or LLM calibration/validation performance on held-out labels are reported. Because the central 12% prevalence figure and the ranking of the seven failure modes are derived directly from these annotations, the lack of quantitative reliability metrics leaves the quantitative claims sensitive to potential subjectivity or systematic bias in labeling.

Authors: We appreciate this comment and agree that transparency regarding annotation reliability is important. The protocol involved an initial LLM classification followed by human expert review for a subset to validate and refine. In the revised manuscript, we will add details on the disagreement-resolution process (human expert decisions take precedence) and report the LLM's accuracy on a held-out validation set of 500 abstracts manually labeled by the expert. However, as the human review was conducted by a single expert, inter-annotator agreement statistics such as Cohen’s kappa are not applicable in this context. We believe this additional information will sufficiently address concerns about potential bias.

Revision: partial

Referee: [Results] Results section: The manuscript should report exact counts, percentages, and confidence intervals for each of the seven failure modes (not only the two highlighted in the abstract) so that readers can assess the precision of the overall 12% estimate and the relative prevalence claims.

Authors: We agree with this suggestion. Although the abstract highlights the primary issues, the manuscript identifies all seven failure modes. In the revised version, we will include a comprehensive table in the Results section that reports the exact counts, percentages, and 95% Wilson score confidence intervals for each of the seven failure modes. This will enable readers to evaluate the precision of the 12% overall estimate and the relative frequencies.

Revision: yes
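
For readers unfamiliar with the Wilson score interval mentioned in this response, a minimal sketch follows. The function name `wilson_interval` is illustrative, and the count of 1,200 flagged abstracts is an assumption obtained by applying the rounded 12% to the 10,000-abstract sample; the exact counts are not given on this page.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a 95% interval.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed count: 1,200 flagged abstracts out of 10,000 (the rounded 12%).
low, high = wilson_interval(1200, 10_000)
print(f"{low:.4f} to {high:.4f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for rare categories, which is relevant for the less frequent failure modes.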

Circularity Check

0 steps flagged

No circularity: direct empirical prevalence from annotation counts

The paper performs a straightforward empirical measurement by randomly sampling 10,000 English-language journal abstracts from OpenAlex and applying a two-stage human-plus-LLM annotation protocol to identify seven failure modes. The reported 12% prevalence and mode rankings are obtained directly from the resulting annotation counts rather than any derivation, equation, fitted parameter, or self-referential definition. No load-bearing steps reduce by construction to the paper's own inputs, and the central claim remains independent of self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central prevalence claim rests on two domain assumptions: that the random sample represents the full population of English-language journal abstracts and that the human-LLM annotation protocol reliably detects integrity issues. No free parameters are fitted and no new entities are postulated.

axioms (2)

domain assumption: The 10,000 randomly sampled abstracts are statistically representative of all English-language journal abstracts stored in OpenAlex.
The study generalizes the observed 12% rate from the sample to the broader database without additional stratification or weighting details.

domain assumption: The two-stage human-expert plus LLM annotation protocol produces consistent and valid labels for integrity issues.
The paper relies on this protocol to define and count the seven failure modes but does not report agreement metrics or validation against an external gold standard in the abstract.

pith-pipeline@v0.9.0 · 5612 in / 1592 out tokens · 50350 ms · 2026-05-20T02:11:45.497054+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We assess the integrity of 10,000 randomly sampled English-language journal abstracts from OpenAlex using a two-stage annotation protocol combining human expert review and large language model classification. We identify seven distinct failure modes and find that 12% of abstracts have integrity issues

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: two-stage annotation protocol... calibrated classification prompt... 96% agreement

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1] Alonso-Álvarez, P., & van Eck, N. J. (2025). Coverage and metadata completeness and accuracy of African research publications in OpenAlex: A comparative analysis. Quantitative Science Studies, 6, 1336–1357. https://doi.org/10.1162/QSS.a.396

[2] Arts, S., Melluso, N., & Veugelers, R. (2025). Beyond citations: Measuring novel scientific ideas and their impact in publication text. The Review of Economics and Statistics, 1–33. https://doi.org/10.1162/rest_a_01561

[3] Culbert, J. H., Hobert, A., Jahn, N., Haupka, N., Schmidt, M., Donner, P., & Mayr, P. (2025). Reference coverage analysis of OpenAlex compared to Web of Science and Scopus. Scientometrics, 130(4), 2475–2492. https://doi.org/10.1007/s11192-025-05293-3

[4] Kim, S., Holst, V., & Ginis, V. (2026). Turning citation networks inside out: Studying science using content-based knowledge graphs from LLM-derived taxonomies. https://arxiv.org/abs/2601.15062

[5] Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. https://arxiv.org/abs/2205.01833

[6] Tosi, M. D. L., & dos Reis, J. C. (2021). SciKGraph: A knowledge graph approach to structure a scientific field. Journal of Informetrics, 15(1), 101109. https://doi.org/10.1016/j.joi.2020.101109

Appendix A: LLM Prompt for Classification by Failure Modes. The classification prompt was derived from the structured resolution of 196 disagreements ...
