pith. machine review for the scientific record.

arxiv: 2604.14514 · v1 · submitted 2026-04-16 · 💻 cs.AI · cs.CE

Recognition: unknown

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

Pith reviewed 2026-05-10 11:47 UTC · model grok-4.3

classification 💻 cs.AI cs.CE
keywords biomedical AI · omics data bias · foundation models · healthcare disparities · ancestry reporting · data provenance · AI equity · population bias

The pith

Biases introduced during omics data collection get locked into biomedical foundation models and produce downstream healthcare inequities that later rules cannot fix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that most omics studies omit ancestry or ethnicity details, and the large public datasets used to pretrain models are overwhelmingly European in origin. Because foundation models are pretrained once on these collections and then reused across many tasks, any early skew in population representation spreads automatically to clinical tools and diagnostic aids. The authors argue this creates a form of bias that regulatory checks applied at the point of clinical deployment cannot undo. They therefore advocate shifting attention to three upstream practices: tracking data provenance, requiring demographic openness, and demanding transparent performance evaluation across groups.

Core claim

As biomedical foundation models become central to discovery through repeated reuse of models pretrained on large omics collections, the documented underreporting of ancestry and strong European dominance in those collections will be perpetuated and amplified, producing performance gaps and health inequities for non-European populations that regulatory interventions at later stages cannot fully reverse.

What carries the argument

The pretraining-and-reuse paradigm for foundation models, which transfers population skews present in source omics datasets into every downstream application.

If this is right

  • Regulatory interventions applied only at clinical deployment will leave early-stage data biases intact.
  • Community adoption of Provenance, Openness, and Evaluation Transparency practices would reduce the risk of irreversible inequities.
  • Biomedical AI tools will serve underserved populations more effectively once demographic composition of training data is routinely disclosed and evaluated.
  • Repeated reuse of the same biased base models across tasks will compound rather than dilute the initial population skew.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether adding even modest amounts of non-European omics data at the pretraining stage measurably improves equity metrics without harming overall accuracy.
  • The same logic may apply to other data modalities such as imaging or electronic health records that feed into shared foundation models.
  • Funding agencies could require ancestry reporting as a condition for dataset deposition to change collection incentives upstream.

Load-bearing premise

That the observed dominance of European-ancestry samples in omics datasets will produce measurable differences in model accuracy or clinical outcomes for other ancestry groups.

What would settle it

A controlled experiment that trains two otherwise identical foundation models, one on current European-heavy omics data and one on a version balanced across ancestries, then measures no difference in downstream task performance or fairness metrics on held-out non-European cohorts.
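
The comparison described above could be scored along the following lines. This is a minimal sketch, not the paper's protocol: the function names `per_group_accuracy` and `accuracy_gap`, the ancestry labels, and the toy predictions are all hypothetical, and plain accuracy stands in for whatever task metric a real study would choose.

```python
from collections import defaultdict

def per_group_accuracy(y_true, y_pred, groups):
    """Return {group: accuracy} for predictions stratified by group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

def accuracy_gap(per_group, reference_group):
    """Accuracy of the reference group minus each other group's accuracy."""
    ref = per_group[reference_group]
    return {g: ref - acc for g, acc in per_group.items() if g != reference_group}

# Toy held-out cohort (hypothetical labels and predictions, for illustration only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["EUR", "EUR", "EUR", "EUR", "AFR", "AFR", "EAS", "EAS"]
pred_skewed = [1, 0, 1, 1, 1, 1, 0, 1]    # stand-in for the European-heavy model
pred_balanced = [1, 0, 1, 1, 0, 1, 0, 1]  # stand-in for the ancestry-balanced model

acc_skewed = per_group_accuracy(y_true, pred_skewed, groups)
acc_balanced = per_group_accuracy(y_true, pred_balanced, groups)
gap_skewed = accuracy_gap(acc_skewed, "EUR")
gap_balanced = accuracy_gap(acc_balanced, "EUR")
```

If the two models showed statistically indistinguishable gaps on real held-out non-European cohorts, that would count against the load-bearing premise. A real test would also need adequately sized cohorts and confidence intervals around each gap, which this sketch omits.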

Original abstract

Healthcare disparities persist across socioeconomic boundaries, often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation in cases where the focus of the studies and the data that is collected is at the molecular level. A vast number of studies focus on collecting omics data but the demographic information associated with these datasets is often not reported in the studies, and when it is reported, it shows big biases. An automated analysis of 4719 PubMed-indexed omics publications from 2015 to 2024 reveals that only a small fraction report ancestry or ethnicity information, with ancestry reporting improving slightly. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias where European-ancestry data dominates. As biomedical foundation models become central to biomedical discovery with a paradigm in which base models are pretrained on large datasets and reusing them time and again for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles: Provenance, Openness, and Evaluation Transparency to improve equity and robustness in biomedical AI. This approach aims to foster biomedical innovation that more effectively serves underserved populations and improves health outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a perspective arguing that biases arise early in biomedical research during omics data collection and prioritization. An automated analysis of 4719 PubMed-indexed omics publications (2015-2024) finds low rates of ancestry/ethnicity reporting (with modest improvement over time), while inspection of CellxGene and GEO datasets shows strong European-ancestry dominance. The authors contend that, under the foundation-model paradigm of large-scale pretraining followed by repeated downstream reuse, these early biases will be perpetuated or amplified, producing cascading healthcare disparities that later regulatory interventions cannot fully reverse. They advocate three community principles—Provenance, Openness, and Evaluation Transparency—to improve equity and robustness.

Significance. If the causal pathway from dataset demographics to irreversible downstream disparities is substantiated, the perspective would usefully direct attention to upstream data practices in biomedical AI. The concrete counts from the 4719-publication corpus and the two large public repositories supply a tangible empirical anchor for the bias observation, which is a clear strength. The proposed principles offer a practical, non-regulatory framing that could influence data-sharing norms and model documentation standards.

major comments (2)
  1. [Abstract] Abstract and the paragraph introducing the foundation-model paradigm: the central claim that pretraining on ancestry-biased omics data will produce 'cascading inequities that regulatory interventions cannot fully reverse' is asserted without direct empirical support or simulation inside the manuscript. The 4719-publication counts and CellxGene/GEO inspections establish the existence of reporting gaps and population imbalance, but no biomedical-specific evidence, ablation, or outcome-linked analysis demonstrates that these translate into ancestry-linked performance gaps in foundation models or into health inequities immune to later mitigation.
  2. [Foundation-model risk discussion] Section discussing risks to downstream tasks: the mechanism by which European dominance in pretraining corpora is expected to propagate into measurable disparities for non-European populations in clinical AI applications is described at a high level but not instantiated with any concrete example, performance metric, or reference to a controlled study within the paper, leaving the load-bearing causal step untested.
minor comments (2)
  1. [Automated analysis of publications] The automated-analysis subsection would benefit from explicit reporting of the PubMed query string, the exact criteria or classifier used to flag ancestry mentions, and any validation steps (e.g., manual review of a sample), which are necessary for reproducibility of the 4719-paper statistics.
  2. [Conclusion] The manuscript would be strengthened by a brief discussion of how the three proposed principles (Provenance, Openness, Evaluation Transparency) could be operationalized in existing data repositories or model cards, moving from high-level recommendation to actionable guidance.
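
To make minor comment 1 concrete, the kind of flagging rule whose details the referee asks for might look like the sketch below. It is an assumption for illustration only: the manuscript's actual query string and classifier are not given here, and the keyword list, the function names `mentions_ancestry` and `reporting_rate_by_year`, and the example abstracts are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical keyword list; the paper's actual criteria are not reported here.
ANCESTRY_PATTERN = re.compile(
    r"\b(ancestr(?:y|ies|al)|ethnicit(?:y|ies)|ethnic|race|racial)\b",
    re.IGNORECASE,
)

def mentions_ancestry(abstract):
    """True if the abstract contains any ancestry or ethnicity keyword."""
    return ANCESTRY_PATTERN.search(abstract) is not None

def reporting_rate_by_year(records):
    """records: iterable of (year, abstract). Returns {year: fraction flagged}."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for year, abstract in records:
        total[year] += 1
        flagged[year] += int(mentions_ancestry(abstract))
    return {year: flagged[year] / total[year] for year in sorted(total)}

# Invented example abstracts, not drawn from the 4719-publication corpus.
records = [
    (2015, "Single-cell RNA-seq of human liver samples."),
    (2015, "Proteomic profiling in participants of European ancestry."),
    (2024, "Multi-omics atlas with self-reported ethnicity for all donors."),
    (2024, "Transcriptomic trace analysis of tumour biopsies."),
]
rates = reporting_rate_by_year(records)
```

The word boundaries keep "trace" from matching "race", but a keyword rule like this will still produce false positives and false negatives, which is why the manual review of a sample that the referee requests matters for reproducibility.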

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to qualify our claims more carefully, add supporting literature citations, and expand the discussion of mechanisms while preserving the perspective's focus on upstream data practices.

Point-by-point responses
  1. Referee: [Abstract] Abstract and the paragraph introducing the foundation-model paradigm: the central claim that pretraining on ancestry-biased omics data will produce 'cascading inequities that regulatory interventions cannot fully reverse' is asserted without direct empirical support or simulation inside the manuscript. The 4719-publication counts and CellxGene/GEO inspections establish the existence of reporting gaps and population imbalance, but no biomedical-specific evidence, ablation, or outcome-linked analysis demonstrates that these translate into ancestry-linked performance gaps in foundation models or into health inequities immune to later mitigation.

    Authors: We agree that the manuscript, as a perspective, does not include original empirical simulations, ablations, or outcome-linked analyses demonstrating the full causal translation from biased pretraining data to irreversible downstream disparities. Our contribution centers on documenting the upstream imbalances via the PubMed corpus analysis and repository inspections, then linking these to the foundation-model reuse paradigm. In revision, we have softened the abstract and introduction to describe a 'risk of perpetuating or amplifying biases, potentially leading to cascading inequities that may prove difficult to fully reverse through later interventions alone.' We have also added citations to studies documenting ancestry-linked performance gaps in genomic and single-cell AI models to provide indirect support for the mechanism.

    Revision: yes

  2. Referee: [Foundation-model risk discussion] Section discussing risks to downstream tasks: the mechanism by which European dominance in pretraining corpora is expected to propagate into measurable disparities for non-European populations in clinical AI applications is described at a high level but not instantiated with any concrete example, performance metric, or reference to a controlled study within the paper, leaving the load-bearing causal step untested.

    Authors: We acknowledge that the original discussion of propagation remained conceptual. The revised manuscript expands this section with concrete examples and references drawn from the literature, including documented reductions in accuracy for polygenic risk scores and variant interpretation models when applied to non-European ancestry groups after European-dominant pretraining, as well as ancestry biases observed in cell-type annotation from single-cell omics foundation models. These additions instantiate the mechanism with specific performance considerations while clarifying that the degree of irreversibility depends on the feasibility of downstream mitigation.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external dataset analysis and logical inference, not self-referential derivations.

Full rationale

The paper performs an automated count of ancestry reporting in 4719 PubMed omics papers (2015-2024) and inspects demographic composition in CellxGene and GEO. It then reasons that foundation-model pretraining on such data may perpetuate biases into downstream tasks. This is observational reporting plus perspective, with no equations, fitted parameters, self-defined terms, or load-bearing self-citations that reduce the central claim to its own inputs by construction. The causal extrapolation to irreversible inequities is an interpretive step, not a mathematical reduction. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central argument depends on the assumption that data biases propagate through foundation models and that regulation cannot reverse them; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption: Biases present at data collection will be perpetuated or amplified when models are pretrained on large omics datasets and reused for downstream tasks.
    Invoked to justify the risk of cascading inequities.
  • ad hoc to paper: Regulatory interventions cannot fully reverse early-stage biases once embedded in foundation models.
    Stated directly as a premise for why upstream focus is needed.

pith-pipeline@v0.9.0 · 5579 in / 1251 out tokens · 30610 ms · 2026-05-10T11:47:33.679749+00:00 · methodology

A Demographic Analysis Details

A.0.1 Study Design and Data Source

To quantify demographic reporting practices in omics research, we conducted a systematic analysis of PubMed abstracts published between January 2015 and December 2024. We queri...