Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Pith reviewed 2026-05-20 22:24 UTC · model grok-4.3
The pith
PubMed papers can be autonomously turned into larger, more nuanced and accurate structured biomedical datasets than manually curated ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present three components: an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies and applied to 22.5M papers; hybrid retrieval over the tagged corpus; and Starling, a multi-agent system that designs retrieval filters and extracts nuanced records. Across six tasks, including blood-brain barrier permeability and gene-disease associations, Starling yields ~6.3M records with frontier-model rejection rates of 0.6-7.7%, compared with error rates of 7.3-16.5% measured on curated counterparts, plus nuance-rich fields.
What carries the argument
Starling, the multi-agent deep research system that, given a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with supporting passages.
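To make "structured records with supporting passages" concrete, here is a minimal sketch of what one such record might look like. The class name, field names, and the `is_grounded` check are all hypothetical illustrations, not the schema Starling actually induces, which the abstract does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ExtractedRecord:
    """Hypothetical shape of one extracted record; all field names are illustrative."""
    task: str                 # e.g. "oral_bioavailability"
    entity: str               # compound, gene, or protein the record is about
    value: str                # extracted value, kept as text to preserve units
    source_id: str            # identifier of the source paper
    supporting_passage: str   # passage from the paper that justifies the value
    context: Dict[str, str] = field(default_factory=dict)  # nuance fields


def is_grounded(record: ExtractedRecord) -> bool:
    """Cheap sanity check: the passage is non-empty and mentions the entity."""
    passage = record.supporting_passage.lower()
    return bool(passage.strip()) and record.entity.lower() in passage


# Made-up example record (not taken from the paper's datasets).
example = ExtractedRecord(
    task="oral_bioavailability",
    entity="compound A",
    value="42%",
    source_id="PLACEHOLDER",
    supporting_passage="Oral bioavailability of compound A was 42% in the fasted state.",
    context={"prandial_state": "fasted"},
)
```

Keeping the passage and a free-form `context` mapping next to the value is one way to retain experimental conditions that a flat table would drop.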
If this is right
- Produces up to millions of records per task, including some of the largest public datasets for properties like oral bioavailability.
- Retains experimental context such as fed versus fasted state in bioavailability measurements that tabular databases typically discard.
- Establishes a scalable foundation for AI-driven therapeutic design using autonomously generated knowledge.
- Lowers the cost and lag of maintaining biomedical repositories compared to manual curation.
Where Pith is reading between the lines
- Similar autonomous extraction could be applied to other large scientific corpora beyond PubMed to build structured knowledge in physics or chemistry.
- Integrating feedback loops where extracted data informs new queries might further improve coverage and accuracy over time.
- The approach opens the possibility of real-time updating of datasets as new papers are published.
Load-bearing premise
Frontier model rejection rates provide a reliable proxy for the actual accuracy of the extracted structured records.
What would settle it
Independent manual review or experimental replication of a sample of the generated records versus the source papers would confirm whether the reported rejection rates correspond to true improvements in data quality.
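A manual review of this kind would produce an error count on a sample, and the uncertainty of that estimate matters when comparing it with a model-reported rejection rate. The sketch below computes a Wilson score interval for the sampled error rate. The function name and the sample numbers are assumptions for illustration; the paper does not describe such a procedure.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must be between 0 and n")
    p = errors / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit: 10 records judged wrong by human reviewers out of 500 sampled.
low, high = wilson_interval(errors=10, n=500)
```

If the model-reported rejection rate for a task falls outside the interval obtained from human review, that would suggest the proxy and the human judgment disagree for that task.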
Original abstract
Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.
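The abstract does not say how the sparse and dense rankings are combined or how the entity filter is applied. The sketch below therefore swaps in reciprocal rank fusion (RRF), a common fusion method, and a simple set-containment filter, purely to illustrate what "entity-filtered hybrid retrieval" could mean. All function names, document ids, and entity labels are made up.

```python
from typing import Dict, List, Set


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def entity_filtered_search(
    sparse_ranking: List[str],
    dense_ranking: List[str],
    doc_entities: Dict[str, Set[str]],
    required: Set[str],
) -> List[str]:
    """Keep only documents tagged with every required entity, then fuse."""
    allowed = {d for d, ents in doc_entities.items() if required <= ents}
    filtered = [[d for d in r if d in allowed] for r in (sparse_ranking, dense_ranking)]
    return rrf_fuse(filtered)


# Toy inputs: ranked ids from a keyword index and from an embedding index.
sparse = ["p1", "p2", "p3"]
dense = ["p1", "p3", "p4"]
tags = {
    "p1": {"chemical", "bbb"},
    "p2": {"gene"},
    "p3": {"chemical", "bbb"},
    "p4": {"chemical"},
}
hits = entity_filtered_search(sparse, dense, tags, required={"chemical"})
```

In this sketch the filter runs before fusion, so ranks are recomputed over the surviving documents. Filtering after fusion is an equally plausible design and would give different scores.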
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that PubMed can be autonomously processed via an LLM-based entity-tagging pipeline over 22.5M papers and a multi-agent system called Starling to produce ~6.3M structured records across six biomedical tasks (e.g., blood-brain barrier permeability, oral bioavailability, gene-disease associations). These datasets are asserted to be larger, more nuanced (via supporting passages), and more accurate than existing manually curated databases, with evidence consisting of frontier-model rejection rates of 0.6-7.7% versus 7.3-16.5% on counterparts such as BBB_Martins and Bioavailability_Ma. The work also contributes hybrid retrieval over a 4.5B-entity tagged corpus and releases code and datasets.
Significance. If the accuracy and nuance claims hold, the approach could enable scalable, cost-effective alternatives to manual curation, preserving experimental context that tabular databases often discard and supporting AI-driven therapeutic design. The public release of code and datasets on GitHub is a clear strength that facilitates reproducibility and community extension of the extraction pipeline.
major comments (2)
- [Abstract] Abstract: The headline accuracy claim (rejection rates 0.6-7.7% for Starling extractions versus 7.3-16.5% on curated databases) is load-bearing for the central thesis that the new datasets are 'more accurate.' This comparison uses frontier-model rejection as a proxy for both the new records and the error rates on existing databases, yet the manuscript provides no description of independent human expert adjudication, inter-annotator agreement, or cross-validation against gold-standard sources for the ~6.3M records. Without such grounding, the proxy risks circularity and bias with respect to extraction style and domain correctness.
- [Starling system] Starling system description: The multi-agent workflow for designing retrieval filters, inducing schemas, and emitting nuance-rich fields is presented at a high level. Specifics on how precision/recall targets are operationalized, how edge cases in the 4.5B-entity tagging across nine ontologies are handled, and the exact prompting or agent coordination mechanisms are not detailed enough to allow independent reproduction or assessment of whether the low rejection rates reflect genuine correctness or model self-consistency.
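The inter-annotator agreement raised in the first major comment is commonly quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below compares a model judge's accept/reject labels with a human reviewer's labels. The labels are made up, and nothing here reflects an analysis the authors report.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels on four records.
model_judge = ["accept", "accept", "reject", "reject"]
human = ["accept", "reject", "reject", "reject"]
kappa = cohens_kappa(model_judge, human)
```

A kappa near 1 on a human-reviewed sample would support using the model's rejection rate as a proxy. A low kappa would indicate that the two are measuring different things.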
minor comments (2)
- [Abstract] The abstract states that 'several' datasets are the largest public ones for their property but does not identify which tasks achieve this or provide direct size comparisons to prior work; adding a table with record counts versus existing resources would improve clarity.
- [Methods] Notation for the hybrid sparse-dense retrieval and the nine ontologies is introduced without an explicit list or reference; including these details in the methods section would aid readers unfamiliar with the specific biomedical resources.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.
- Referee: [Abstract] Abstract: The headline accuracy claim (rejection rates 0.6-7.7% for Starling extractions versus 7.3-16.5% on curated databases) is load-bearing for the central thesis that the new datasets are 'more accurate.' This comparison uses frontier-model rejection as a proxy for both the new records and the error rates on existing databases, yet the manuscript provides no description of independent human expert adjudication, inter-annotator agreement, or cross-validation against gold-standard sources for the ~6.3M records. Without such grounding, the proxy risks circularity and bias with respect to extraction style and domain correctness.
  Authors: We agree that the accuracy comparison relies on a model-based proxy and that independent human validation would be ideal. However, the proxy is applied consistently: the frontier model evaluates whether each record (our extraction or a curated database entry) is supported by its associated source text or passage. This uniform application reduces bias from differing extraction styles. Circularity is avoided because the judge model is not involved in the original extraction process for our records. We acknowledge the limitation and will revise the manuscript to explicitly describe the proxy methodology, its assumptions, and limitations, including a discussion of why full human adjudication at this scale is not practical.
  Revision: partial
- Referee: [Starling system] Starling system description: The multi-agent workflow for designing retrieval filters, inducing schemas, and emitting nuance-rich fields is presented at a high level. Specifics on how precision/recall targets are operationalized, how edge cases in the 4.5B-entity tagging across nine ontologies are handled, and the exact prompting or agent coordination mechanisms are not detailed enough to allow independent reproduction or assessment of whether the low rejection rates reflect genuine correctness or model self-consistency.
  Authors: We appreciate this feedback on the level of detail provided for the Starling system. To improve reproducibility, we will expand the relevant section in the revised manuscript to include more specifics on operationalizing precision and recall targets, strategies for handling edge cases in the large-scale entity tagging, and the prompting templates and coordination protocols among the agents.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper's derivation consists of an LLM entity-tagging pipeline over a PubMed corpus, hybrid retrieval, and the Starling multi-agent system that induces schemas and emits structured records. Accuracy claims rest on direct comparisons of frontier-model rejection rates (0.6-7.7%) against independently measured error rates on external published curated databases such as BBB_Martins and Bioavailability_Ma. These benchmarks are outside the paper's own fitted values or self-citations, and no load-bearing step reduces by construction to a self-definition, a renamed fit, or an imported uniqueness theorem from the authors' prior work. The methodology is self-contained against external references.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Frontier-model rejection of extractions is a reliable and unbiased measure of true extraction error.
- domain assumption: LLM entity tagging grounded in nine ontologies produces sufficiently accurate labels to support downstream structured extraction at 22.5M-paper scale.
invented entities (1)
- Starling multi-agent system: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Frontier-model rejection of our kept extractions is 0.6–7.7% across tasks, surprisingly far below the error rates we measure on the widely used, manually curated counterparts"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.