pith. machine review for the scientific record.

arxiv: 2205.01833 · v2 · submitted 2022-05-04 · 💻 cs.DL

Recognition: 2 theorem links

· Lean Theorem

OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts

Pith reviewed 2026-05-16 06:37 UTC · model grok-4.3

classification 💻 cs.DL
keywords OpenAlex · scientific knowledge graph · scholarly metadata · open data · Microsoft Academic Graph · citation networks · research indexing · disambiguation

The pith

OpenAlex supplies a free, fully open scientific knowledge graph with metadata on 209 million works to replace the discontinued Microsoft Academic Graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenAlex as a new scientific knowledge graph built to continue the work of the discontinued Microsoft Academic Graph. It compiles metadata on 209 million scholarly works, "2013M" disambiguated authors (the abstract's own notation), 124 thousand venues, 109 thousand institutions, and 65 thousand linked concepts. All of this data is released without restrictions through a web interface, complete data dumps, and a high-volume API. The authors note that the system remains under active development to refine parsing and coverage. A sympathetic reader would see this as the foundation for unrestricted large-scale study of research output and connections.

Core claim

OpenAlex is a new, fully-open scientific knowledge graph launched to replace the discontinued Microsoft Academic Graph. It contains metadata for 209M works, 2013M disambiguated authors, 124k venues, 109k institutions, and 65k Wikidata concepts linked to works via an automated hierarchical multi-tag classifier. The dataset is available via a web-based GUI, a full data dump, and a high-volume REST API, with ongoing work to improve citation information and entity disambiguation.
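
To make the REST API access channel concrete, here is a minimal sketch of how a query URL for the works collection might be assembled. The base URL, the `/works` path, and the `search`, `filter`, and `per-page` parameter names are assumptions taken from outside the abstract, which names none of them; the helper `build_works_query` is hypothetical. The sketch only builds the URL and makes no network call.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: neither the base URL nor the parameter names appear in the abstract.
BASE_URL = "https://api.openalex.org"


def build_works_query(search, year, per_page=25):
    """Return a URL asking the works collection for records matching a search."""
    params = {
        "search": search,                      # free-text search term
        "filter": f"publication_year:{year}",  # restrict to one publication year
        "per-page": per_page,                  # page size
    }
    return f"{BASE_URL}/works?{urlencode(params)}"


url = build_works_query("citation networks", 2021)
print(url)

# Round-trip check: the query string decodes back to the inputs.
query = parse_qs(urlsplit(url).query)
print(query["filter"][0])
```

Keeping URL construction separate from the request itself makes the query easy to inspect or log before any traffic is sent.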

What carries the argument

The OpenAlex knowledge graph, which connects works to disambiguated authors and institutions, venues, and Wikidata concepts through an automated hierarchical multi-tag classifier.
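
The phrase "hierarchical multi-tag" can be illustrated with a toy sketch. This is not the OpenAlex classifier, whose design the abstract does not describe. It assumes a tree of concepts, a per-concept score from some upstream model, and a rule that a retained concept also tags all of its ancestors. The hierarchy, scores, threshold, and function names are all hypothetical.

```python
# Hypothetical toy hierarchy: child concept -> parent concept.
PARENT = {
    "Citation analysis": "Bibliometrics",
    "Bibliometrics": "Library science",
    "Library science": "Computer science",
    "Knowledge graph": "Computer science",
}


def with_ancestors(concept):
    """Return the concept followed by its chain of ancestors up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def tag_work(scores, threshold=0.5):
    """Keep concepts scoring at or above the threshold, plus their ancestors.

    Each ancestor receives the highest score among its retained descendants.
    """
    tags = {}
    for concept, score in scores.items():
        if score < threshold:
            continue  # below-threshold concepts are dropped
        for name in with_ancestors(concept):
            tags[name] = max(tags.get(name, 0.0), score)
    return tags


# Hypothetical classifier scores for a single work.
scores = {"Citation analysis": 0.8, "Knowledge graph": 0.6, "Bibliometrics": 0.3}
tags = tag_work(scores)
print(sorted(tags.items()))
```

Here "Bibliometrics" scores below the threshold on its own but is still assigned, because it is an ancestor of a retained concept. That is the sense in which a work carries several tags at several levels of the hierarchy.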

If this is right

  • Any researcher can download or query the full citation network and author records without licenses or fees.
  • New tools for science mapping and impact measurement can be built directly on the public data.
  • Institutions gain the ability to track their publication output using open rather than proprietary sources.
  • Analyses of research trends across disciplines become feasible at the scale previously limited to closed datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Sustained community contributions could expand the graph beyond its current automated tagging to include more fine-grained topic links.
  • If the API remains stable and high-volume, it could support real-time dashboards that monitor emerging research fields.
  • The explicit link to Wikidata concepts opens the possibility of cross-walking OpenAlex records with other open knowledge bases for richer semantic queries.

Load-bearing premise

The automated classifier and disambiguation routines produce data accurate and complete enough to serve as a practical replacement for the discontinued Microsoft Academic Graph.

What would settle it

A side-by-side audit that finds OpenAlex omits a large share of known works or shows substantially higher error rates in author and institution matching than the prior graph would show the replacement claim does not hold.
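
As a rough sketch of what such an audit would compute, the two quantities named above can be written as a coverage ratio over known works and a mismatch rate over known author assignments. The identifiers and sample data below are invented for illustration; no real audit results are implied.

```python
def coverage(reference_ids, candidate_ids):
    """Share of reference works that also appear in the candidate index."""
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference set is empty")
    return len(reference & set(candidate_ids)) / len(reference)


def mismatch_rate(gold, predicted):
    """Share of gold work -> author assignments the candidate gets wrong or omits."""
    if not gold:
        raise ValueError("gold assignments are empty")
    wrong = sum(1 for work, author in gold.items() if predicted.get(work) != author)
    return wrong / len(gold)


# Hypothetical sample: four known works, three hand-checked author assignments.
known_works = ["w1", "w2", "w3", "w4"]
indexed_works = ["w1", "w2", "w4", "w9"]
gold_authors = {"w1": "a1", "w2": "a2", "w4": "a3"}
indexed_authors = {"w1": "a1", "w2": "a7", "w4": "a3"}

print(coverage(known_works, indexed_works))          # 3 of 4 known works found
print(mismatch_rate(gold_authors, indexed_authors))  # 1 of 3 assignments differs
```

Running the same two functions against both graphs on the same reference sample would give the side-by-side comparison the test calls for.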

Original abstract

OpenAlex is a new, fully-open scientific knowledge graph (SKG), launched to replace the discontinued Microsoft Academic Graph (MAG). It contains metadata for 209M works (journal articles, books, etc); 2013M disambiguated authors; 124k venues (places that host works, such as journals and online repositories); 109k institutions; and 65k Wikidata concepts (linked to works via an automated hierarchical multi-tag classifier). The dataset is fully and freely available via a web-based GUI, a full data dump, and high-volume REST API. The resource is under active development and future work will improve accuracy and coverage of citation information and author/institution parsing and deduplication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents OpenAlex as a new, fully-open scientific knowledge graph (SKG) launched to replace the discontinued Microsoft Academic Graph (MAG). It reports the dataset contents including metadata for 209M works, 2013M disambiguated authors, 124k venues, 109k institutions, and 65k Wikidata concepts linked via an automated hierarchical multi-tag classifier, with access provided through a web-based GUI, full data dump, and high-volume REST API. The resource is described as under active development, with future work planned to improve accuracy and coverage of citation information and author/institution parsing and deduplication.

Significance. If the underlying data processing achieves usable quality levels, OpenAlex would constitute a valuable large-scale open alternative to proprietary or discontinued scholarly indexes, supporting research in scientometrics, digital libraries, and related areas. The explicit provision of multiple access channels and the transparent note on ongoing development are strengths that increase the resource's practical utility and long-term potential impact.

minor comments (2)
  1. [Abstract] The notation '2013M' for authors is ambiguous (it reads literally as 2013 million, i.e. 2.013 billion); state the figure in standard billion notation or give the exact count.
  2. The automated hierarchical multi-tag classifier and the disambiguation steps would benefit from a brief overview of the approach and the data sources used, so that readers can see how the reported counts were obtained.
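
The arithmetic behind the first comment can be checked with a short sketch that expands the abstract's magnitude suffixes into full integers. The helper names are hypothetical, and the suffix table assumes the usual reading of k as thousand and M as million.

```python
# Assumed reading of the suffixes used in the abstract.
SUFFIXES = {"k": 10**3, "M": 10**6}


def parse_count(text):
    """Expand a count such as '209M' or '65k' into an integer."""
    number, suffix = text[:-1], text[-1]
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown magnitude suffix in {text!r}")
    return int(number) * SUFFIXES[suffix]


# Counts exactly as printed in the abstract.
reported = {
    "works": "209M",
    "authors": "2013M",
    "venues": "124k",
    "institutions": "109k",
    "concepts": "65k",
}

for label, raw in reported.items():
    print(f"{label}: {raw} = {parse_count(raw):,}")
```

Read literally, '2013M' expands to 2,013,000,000, roughly ten authors per indexed work, which is why the comment asks for an unambiguous figure.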

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript, the recognition of OpenAlex's potential value as a fully open scholarly knowledge graph, and the recommendation for minor revision. We note that the report contains no specific major comments requiring point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a descriptive announcement of a constructed open dataset (OpenAlex) with stated counts and access methods. It contains no mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems. All core claims rest on external data sources (e.g., Wikidata concepts, prior MAG data) and processing pipelines whose accuracy is explicitly noted as future work rather than asserted by self-reference. No load-bearing step reduces to a self-citation chain or input-by-construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the successful construction and ongoing maintenance of the dataset. The main unverified element is the accuracy of the automated classifier and disambiguation pipelines, which are treated as domain-standard techniques without quantified error rates supplied in the abstract.

axioms (1)
  • domain assumption: An automated hierarchical multi-tag classifier can reliably link works to 65k Wikidata concepts.
    Invoked in the abstract as the mechanism for concept tagging; no performance metrics or validation details are given.

pith-pipeline@v0.9.0 · 5424 in / 1230 out tokens · 38822 ms · 2026-05-16T06:37:37.882433+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond coauthorship: semantic structure and phantom collaborators in transportation research, 1967--2025

    cs.DL 2026-04 unverdicted novelty 7.0

    Phantom collaborators—topically similar authors distant in the coauthor graph—become actual coauthors 16-33 times more often than baselines, with a 68-fold similarity gradient.

  2. A Large-Scale, Cross-Disciplinary Corpus of Systematic Reviews

    cs.IR 2026-04 accept novelty 7.0

    A new corpus of 301,871 systematic reviews across all sciences is released with extracted method artifacts to support retrieval benchmarking and meta-research.

  3. Market Dynamics, Governance and Open Research Metadata in the AI Era

    cs.DL 2026-04 unverdicted novelty 7.0

    The innovation annulus is a functional, persistent feature of scholarly metadata production whose width reflects production inefficiency, reshaped by AI and best managed through calibrated governance analogous to opti...

  4. AI scientists produce results without reasoning scientifically

    cs.AI 2026-04 conditional novelty 7.0

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  5. Camyla: Scaling Autonomous Research in Medical Image Segmentation

    cs.AI 2026-04 unverdicted novelty 7.0

    Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.

  6. Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins

    q-bio.QM 2025-12 unverdicted novelty 7.0

    StructBioReasoner is a scalable multi-agent system that designs IDP-targeting biologics, with over 50% of 787 candidates for Der f 21 showing better binding free energy than human-designed references.

  7. Faculty mobility reallocates research capacity within persistent institutional hierarchies

    cs.DL 2026-05 unverdicted novelty 6.0

    Faculty mobility follows a persistent institutional prestige hierarchy but yields little evidence of lasting improvements in movers' research productivity or citation impact.

  8. Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

    cs.AI 2026-04 unverdicted novelty 6.0

    Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-d...

  9. CiteRadar: A Citation Intelligence Platform for Researcher Profiling and Geographic Visualization

    cs.LG 2026-04 unverdicted novelty 6.0

    CiteRadar is a new open-source pipeline that enriches Google Scholar citations using five external data sources and produces ranked tables plus an offline interactive geographic map from a single command.

  10. AI-assisted writing and the reorganization of scientific knowledge

    cs.DL 2026-04 unverdicted novelty 6.0

    Post-2023, AI-assisted writing intensity positively associates with scientific disruption but shows weakened links to cross-field citation breadth and attenuated negative links to citation concentration.

  11. AgentSPEX: An Agent SPecification and EXecution Language

    cs.CL 2026-04 unverdicted novelty 6.0

    AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.

  12. Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics

    physics.comp-ph 2026-04 conditional novelty 6.0

    An LLM agent autonomously runs read-plan-compute-compare loops on 111 computational physics papers, raising substantive concerns in 42% of them (97.7% only after execution), and generates a full publishable Comment re...

  13. Structural Diversity Drives Disruptive Scientific Innovation

    cs.SI 2026-04 unverdicted novelty 6.0

    Structural diversity in a team's prior collaboration network predicts disruptive scientific innovation more strongly than team freshness or edge density and turns large team size from a liability into an advantage via...

  14. pAI/MSc: ML Theory Research with Humans on the Loop

    cs.AI 2026-04 unverdicted novelty 5.0

    pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...

  15. Scientific tools and Innovation: Big Science Facilities Yield More Novel and Interdisciplinary Knowledge

    cs.DL 2026-04 unverdicted novelty 5.0

    Big Science Facilities produce publications with greater recombinant novelty and interdisciplinary integration than matched controls, with stronger effects in fields outside their traditional physical-sciences focus.

  16. Polarization and Integration in Global AI Research

    physics.soc-ph 2026-04 unverdicted novelty 5.0

    Over three decades, global AI research has polarized into US and China poles, with UK/Germany aligning with US, some Europeans with both, and developing countries with China.

  17. Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era

    cs.CL 2026-03 unverdicted novelty 5.0

    NLI accuracy on research papers declined steadily over time, with Chinese and French showing unexpected resistance while Japanese and Korean declined more sharply in the post-LLM era.

  18. Mapping the Landscape of Open Access Dashboards -- A Dataset for Research and Infrastructure Development

    cs.DL 2025-12 unverdicted novelty 5.0

    A survey identifies nearly 60 open access dashboards and supplies a structured metadata dataset plus community contribution process for open science research.

  19. Construction of a Battery Research Knowledge Graph using a Global Open Catalog

    cs.CL 2026-04 unverdicted novelty 4.0

    A pipeline builds a battery research knowledge graph from 189k OpenAlex papers using author vectors weighted by OpenAlex concepts, KeyBERT/ChatGPT keyphrases, authorship position, and recency, then serializes it as RD...

  20. Auditing automated research assessment: an interpretable machine learning approach to validate funding criteria

    cs.DL 2026-04 unverdicted novelty 4.0

    ML models show Brazilian PQ grant levels are predicted well by a small set of bibliographic and supervision features but not by the full set of official criteria.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · cited by 20 Pith papers

  1. [1]

    The OpenAlex project was created to address this concern

    OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts Jason Priem*, Heather Piwowar*, Richard Orr* *jason@ourresearch.org; heather@ourresearch.org; richard@ourresearch.org OurResearch, 500 Westover Dr #8234, Sanford, NC, 27330 (USA) Introduction In May 2021, Microsoft announced that it was discontinuing support for Mic...

  2. [2]

    first-class citizens

    Although still in its nascency, as a fully-open (100% open data, open API, open-source code) source of scholarly metadata, OpenAlex has potential to improve the transparency of research evaluation, navigation, representation, and discovery, adding to the growing list of other open and partly-open SKGs such as OpenCitations (Peroni, Shotton, & Vitali, 2017), ...

  3. [3]

    which provide guidance for sustainably open development. STI Conference 2022 · Granada Limitations and future work The OpenAlex project is still quite young, and there are many areas for improvement. Foremost is continued improvement in the parsing, normalisation, and disambiguation of entities, especially authors and institutions. This is particularly import...