pith. sign in

arxiv: 2605.20168 · v1 · pith:7XOETLFGnew · submitted 2026-05-19 · 💻 cs.DL · cs.DB

One in Eight OpenAlex Abstracts Has Integrity Issues

Pith reviewed 2026-05-20 02:11 UTC · model grok-4.3

classification 💻 cs.DL cs.DB
keywords OpenAlexscientific abstractsdata integritymetasciencebibliographic databasesdata qualityannotation protocolfailure modes
0
0 comments X

The pith

About 12% of English-language journal abstracts in OpenAlex show integrity problems such as insufficient content or misplaced metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper checks the quality of abstracts stored in the OpenAlex bibliographic database, which many computational metascience studies now treat as primary data. A random sample of 10,000 abstracts was reviewed with a two-stage process that combines human experts and large language model classification. The review uncovered seven distinct failure modes and showed that 12% of the abstracts contain integrity issues. Insufficient content and misplaced metadata turned out to be the most common problems. These findings matter because low-quality input data can distort any large-scale analysis that draws on the database.

Core claim

We assess the integrity of 10,000 randomly sampled English-language journal abstracts from OpenAlex using a two-stage annotation protocol combining human expert review and large language model classification. We identify seven distinct failure modes and find that 12% of abstracts have integrity issues, with insufficient content and misplaced metadata being the most prevalent.

What carries the argument

Two-stage human-plus-LLM annotation protocol that labels abstracts according to seven defined failure modes.

If this is right

  • Metascience studies that treat OpenAlex abstracts as primary data may produce biased or noisy results.
  • Researchers using the database for computational work should apply additional data-cleaning steps.
  • A forthcoming community portal will allow collective annotation to improve the resource over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar audits of other large bibliographic databases could reveal whether the 12% rate is widespread.
  • Automated detectors trained on the seven failure modes could be integrated into database ingestion pipelines.
  • The reported prevalence supplies a concrete baseline against which future improvements in abstract quality can be measured.

Load-bearing premise

The random sample of 10,000 abstracts represents all English-language journal abstracts in OpenAlex and the combined human-LLM protocol identifies integrity issues consistently and without systematic bias.

What would settle it

Re-annotating an independent new sample of 10,000 OpenAlex abstracts or manually verifying a random subset against the original journal PDFs to confirm whether the 12% rate holds.

Figures

Figures reproduced from arXiv: 2605.20168 by Seorin Kim, Vincent Ginis, Vincent Holst.

Figure 1
Figure 1. Figure 1: Inter-annotator agreement across four annotators (Human: A1 and A2; LLM: Claude and Codex). (a) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Binary confusion matrix comparing Claude Opus 4.6 (calibrated prompt) against the human consensus ground [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Rejection rate by publication period. Rejection rates declined consistently across publication periods for both [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of abstract integrity failures across 10,000 OpenAlex abstracts. The inset shows the overall [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Failure mode composition per period across 10,000 OpenAlex abstracts. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
read the original abstract

Scientific abstracts are increasingly used as primary data in computational metascience research, yet the quality of these abstracts in widely used bibliographic databases has not been systematically examined. We assess the integrity of 10,000 randomly sampled English-language journal abstracts from OpenAlex using a two-stage annotation protocol combining human expert review and large language model classification. We identify seven distinct failure modes and find that 12\% of abstracts have integrity issues, with insufficient content and misplaced metadata being the most prevalent. We discuss implications for downstream research and describe a forthcoming community portal to support collective annotation efforts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports an empirical assessment of abstract integrity in the OpenAlex database. A random sample of 10,000 English-language journal abstracts is annotated via a two-stage protocol that combines human expert review with LLM classification. Seven failure modes are identified, and the authors conclude that 12% of abstracts exhibit integrity issues, with insufficient content and misplaced metadata as the most common. Implications for metascience research and plans for a community annotation portal are discussed.

Significance. If the 12% prevalence and failure-mode distribution are shown to be robust, the result would demonstrate that a substantial fraction of abstracts in a widely used open bibliographic database contain integrity problems. This would have immediate consequences for any computational metascience work that treats OpenAlex abstracts as primary input data and would support the value of the proposed community portal for ongoing quality improvement.

major comments (2)
  1. [Methods] Methods section (Annotation Protocol): The two-stage human-plus-LLM protocol is outlined, yet no inter-annotator agreement statistics (Cohen’s kappa, percentage agreement), disagreement-resolution rules, or LLM calibration/validation performance on held-out labels are reported. Because the central 12% prevalence figure and the ranking of the seven failure modes are derived directly from these annotations, the lack of quantitative reliability metrics leaves the quantitative claims sensitive to potential subjectivity or systematic bias in labeling.
  2. [Results] Results section: The manuscript should report exact counts, percentages, and confidence intervals for each of the seven failure modes (not only the two highlighted in the abstract) so that readers can assess the precision of the overall 12% estimate and the relative prevalence claims.
minor comments (2)
  1. [Methods] Clarify the exact filtering criteria used to select English-language journal abstracts from OpenAlex and any exclusion rules applied before sampling.
  2. [Results] Consider adding a short table that maps each of the seven failure modes to concrete examples drawn from the annotated sample.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript assessing abstract integrity in OpenAlex. We provide point-by-point responses to the major comments below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Methods] Methods section (Annotation Protocol): The two-stage human-plus-LLM protocol is outlined, yet no inter-annotator agreement statistics (Cohen’s kappa, percentage agreement), disagreement-resolution rules, or LLM calibration/validation performance on held-out labels are reported. Because the central 12% prevalence figure and the ranking of the seven failure modes are derived directly from these annotations, the lack of quantitative reliability metrics leaves the quantitative claims sensitive to potential subjectivity or systematic bias in labeling.

    Authors: We appreciate this comment and agree that transparency regarding annotation reliability is important. The protocol involved an initial LLM classification followed by human expert review for a subset to validate and refine. In the revised manuscript, we will add details on the disagreement-resolution process (human expert decisions take precedence) and report the LLM's accuracy on a held-out validation set of 500 abstracts manually labeled by the expert. However, as the human review was conducted by a single expert, inter-annotator agreement statistics such as Cohen’s kappa are not applicable in this context. We believe this additional information will sufficiently address concerns about potential bias. revision: partial

  2. Referee: [Results] Results section: The manuscript should report exact counts, percentages, and confidence intervals for each of the seven failure modes (not only the two highlighted in the abstract) so that readers can assess the precision of the overall 12% estimate and the relative prevalence claims.

    Authors: We agree with this suggestion. Although the abstract highlights the primary issues, the manuscript identifies all seven failure modes. In the revised version, we will include a comprehensive table in the Results section that reports the exact counts, percentages, and 95% Wilson score confidence intervals for each of the seven failure modes. This will enable readers to evaluate the precision of the 12% overall estimate and the relative frequencies. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical prevalence from annotation counts

full rationale

The paper performs a straightforward empirical measurement by randomly sampling 10,000 English-language journal abstracts from OpenAlex and applying a two-stage human-plus-LLM annotation protocol to identify seven failure modes. The reported 12% prevalence and mode rankings are obtained directly from the resulting annotation counts rather than any derivation, equation, fitted parameter, or self-referential definition. No load-bearing steps reduce by construction to the paper's own inputs, and the central claim remains independent of self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central prevalence claim rests on two domain assumptions: that the random sample represents the full population of English-language journal abstracts and that the human-LLM annotation protocol reliably detects integrity issues. No free parameters are fitted and no new entities are postulated.

axioms (2)
  • domain assumption The 10,000 randomly sampled abstracts are statistically representative of all English-language journal abstracts stored in OpenAlex.
    The study generalizes the observed 12% rate from the sample to the broader database without additional stratification or weighting details.
  • domain assumption The two-stage human-expert plus LLM annotation protocol produces consistent and valid labels for integrity issues.
    The paper relies on this protocol to define and count the seven failure modes but does not report agreement metrics or validation against an external gold standard in the abstract.

pith-pipeline@v0.9.0 · 5612 in / 1592 out tokens · 50350 ms · 2026-05-20T02:11:45.497054+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Alonso-Álvarez, P., & van Eck, N. J. (2025). Coverage and metadata completeness and accuracy of african research publications in openalex: A comparative analysis.Quantitative Science Studies,6, 1336–1357. https://doi.org/ 10.1162/QSS.a.396

  2. [2]

    Arts, S., Melluso, N., & Veugelers, R. (2025). Beyond citations: Measuring novel scientific ideas and their impact in publication text.The Review of Economics and Statistics, 1–33. https://doi.org/10.1162/rest_a_01561

  3. [3]

    H., Hobert, A., Jahn, N., Haupka, N., Schmidt, M., Donner, P., & Mayr, P

    Culbert, J. H., Hobert, A., Jahn, N., Haupka, N., Schmidt, M., Donner, P., & Mayr, P. (2025). Reference coverage analysis of openalex compared to web of science and scopus.Scientometrics,130(4), 2475–2492. https: //doi.org/10.1007/s11192-025-05293-3

  4. [4]

    Kim, S., Holst, V ., & Ginis, V . (2026). Turning citation networks inside out: Studying science using content-based knowledge graphs from llm-derived taxonomies. https://arxiv.org/abs/2601.15062

  5. [5]

    Priem, J., Piwowar, H., & Orr, R. (2022). Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. https://arxiv.org/abs/2205.01833

  6. [6]

    no abstract available

    Tosi, M. D. L., & dos Reis, J. C. (2021). Scikgraph: A knowledge graph approach to structure a scientific field.Journal of Informetrics,15(1), 101109. https://doi.org/https://doi.org/10.1016/j.joi.2020.101109 Appendix A LLM Prompt for Classification by Failure Modes The classification prompt was derived from the structured resolution of 196 disagreements ...