Data-Centric Foundation Models in Computational Healthcare: A Survey

Dequan Wang; Jin Gao; Kexin Ding; Lingfeng Zhou; Mu Zhou; Shaoting Zhang; Yunkun Zhang; Zheling Tan

arxiv: 2401.02458 · v3 · submitted 2024-01-04 · 💻 cs.LG · cs.AI

Yunkun Zhang, Jin Gao, Zheling Tan, Lingfeng Zhou, Kexin Ding, Mu Zhou, Shaoting Zhang, Dequan Wang

Pith reviewed 2026-05-24 04:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords foundation models · computational healthcare · data-centric AI · healthcare workflow · data quality · patient privacy · survey

The pith

Foundation models ignite a data-centric AI paradigm in healthcare by prioritizing data characterization, quality, and scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys data-centric approaches across foundation model workflows in computational healthcare, from pre-training through inference. It frames these models' interactive design, shaped by pre-training data and human instructions, as the driver for addressing longstanding issues of data quantity, annotation, privacy, and ethics. The survey also covers security, assessment, and value alignment, then points to improved patient outcomes and clinical workflows as the practical payoff.

Core claim

The interactive nature of foundation models, guided by pre-training data and human instructions, has ignited a data-centric AI paradigm that emphasizes better data characterization, quality, and scale toward improving the healthcare workflow. In healthcare AI, obtaining and processing high-quality clinical data records has been a longstanding challenge, encompassing data quantity, annotation, patient privacy, and ethics. The survey organizes existing work on these data-centric methods and provides an outlook on their use in analytics to enhance patient outcomes and clinical workflows.

What carries the argument

The data-centric AI paradigm in foundation models, which organizes methods from pre-training to inference to improve data handling in healthcare.

Load-bearing premise

The body of published work on foundation models in healthcare is mature and representative enough for a survey to treat data-centric methods as the central response to data challenges.

What would settle it

A finding that most foundation-model papers in healthcare still treat architecture changes as the primary lever, with data improvements as secondary or absent, would falsify the claim that a data-centric paradigm now dominates.

Figures

Figures reproduced from arXiv: 2401.02458 by Dequan Wang, Jin Gao, Kexin Ding, Lingfeng Zhou, Mu Zhou, Shaoting Zhang, Yunkun Zhang, Zheling Tan.

**Figure 2.** An overview of healthcare data challenges and foundation model-based approaches mentioned in this …

**Figure 3.** Foundation model (FM) in healthcare.

**Figure 4.** Multi-modal fusion of healthcare data in the FM era. Conventional fusion approaches are enhanced …

**Figure 5.** Foundation models address data quantity and data annotation challenges.

**Figure 6.** Foundation model evaluation strategies.

Abstract

The advent of foundation models (FMs) as an emerging suite of AI techniques has struck a wave of opportunities in computational healthcare. The interactive nature of these models, guided by pre-training data and human instructions, has ignited a data-centric AI paradigm that emphasizes better data characterization, quality, and scale. In healthcare AI, obtaining and processing high-quality clinical data records has been a longstanding challenge, encompassing data quantity, annotation, patient privacy, and ethics. In this survey, we investigate a wide range of data-centric approaches in the FM era (from model pre-training to inference) towards improving the healthcare workflow. We discuss key perspectives in AI security, assessment, and alignment with human values. Finally, we offer a promising outlook on FM-based analytics to enhance patient outcomes and clinical workflows in the evolving landscape of healthcare and medicine. We provide an up-to-date list of healthcare-related foundation models and datasets at https://github.com/Yunkun-Zhang/Data-Centric-FM-Healthcare.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A conventional survey that collects data-centric FM papers in healthcare and ships a GitHub repo, but the paradigm-ignition claim rests on re-labeling older problems.

The paper is a literature survey that groups existing work on data-centric methods for foundation models in healthcare and includes a GitHub repository listing models and datasets. That repo is the clearest addition. The text walks through approaches from pre-training to inference, covering data quantity, annotation, privacy, ethics, security, assessment, and alignment. For someone entering the area, the organization and the curated list could reduce the time spent hunting for references. The abstract is internally consistent and the scope is broad enough to touch the main workflow stages.

The central framing that foundation models have ignited a data-centric paradigm is the soft spot. The listed challenges around clinical data are longstanding, and the survey does not supply evidence that FMs have created new solutions or reframed the problems in ways that were unavailable before the models. If the body simply assigns older papers to this heading without showing a clear shift attributable to FMs, the narrative adds little beyond the bibliography. No new derivations, experiments, or falsifiable claims appear, which is normal for a survey but means the technical contribution is limited to curation. The citation pattern looks typical for this genre.

This paper is for readers who need a starting map of the literature and datasets rather than for those seeking original technical results or resolved open questions. It deserves peer review because the repo and the categorization can be checked for completeness and accuracy, and a cleaned-up version would still be a usable reference even if the ignition language is toned down.

Referee Report

2 major / 2 minor

Summary. The manuscript is a survey of data-centric approaches in the foundation model (FM) era for computational healthcare. It claims that the interactive nature of FMs, guided by pre-training data and human instructions, has ignited a data-centric AI paradigm emphasizing data characterization, quality, and scale. The survey reviews methods spanning model pre-training through inference to address longstanding healthcare data challenges (quantity, annotation, privacy, ethics), discusses perspectives on AI security, assessment, and human-value alignment, offers an outlook on FM-based analytics, and supplies a GitHub repository listing healthcare FMs and datasets.

Significance. If the surveyed literature demonstrates approaches that are newly enabled or distinctly reframed by FMs (rather than re-categorized pre-existing data issues), the paper could usefully organize the field and highlight workflow improvements. The concrete GitHub deliverable is a clear strength that supports resource discovery and reproducibility.

major comments (2)

[Abstract, §1] Abstract and §1: The central claim that FMs have 'ignited a data-centric AI paradigm' is load-bearing for the survey's framing and organization, yet the manuscript provides no explicit contrast (e.g., via a dedicated subsection or table) between pre-FM data-centric healthcare methods and FM-era techniques. Without citing specific works that illustrate a qualitative shift attributable to FM properties such as instruction following or scale, the ignition narrative risks resting on re-labeling of longstanding issues.
[Methodology / survey scope] Survey methodology section (likely §2 or equivalent): The paper does not state inclusion/exclusion criteria, search strategy, or temporal scope for the cited literature. This omission undermines assessment of whether the body of work is sufficiently mature and representative to support the paradigm-ignition thesis as the organizing principle.

minor comments (2)

[Abstract / conclusion] The GitHub link is a valuable contribution; the paper should state the date of the most recent repository update and any curation criteria used for the listed models and datasets.
[Throughout] Notation for FM components (e.g., pre-training vs. fine-tuning stages) should be standardized across sections to improve readability for readers comparing data-centric interventions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments help clarify the survey's framing and methodology. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Referee: [Abstract, §1] Abstract and §1: The central claim that FMs have 'ignited a data-centric AI paradigm' is load-bearing for the survey's framing and organization, yet the manuscript provides no explicit contrast (e.g., via a dedicated subsection or table) between pre-FM data-centric healthcare methods and FM-era techniques. Without citing specific works that illustrate a qualitative shift attributable to FM properties such as instruction following or scale, the ignition narrative risks resting on re-labeling of longstanding issues.

Authors: We acknowledge the value of an explicit contrast to substantiate the framing. The survey is organized around FM-specific capabilities (instruction following, in-context learning, and scale-enabled synthetic data) that reframe longstanding healthcare data issues in new ways, as illustrated by examples like FM-based annotation and privacy-preserving generation. To address the concern directly, we will add a comparison table in Section 1 (or a new subsection) contrasting pre-FM methods (e.g., traditional active learning, rule-based augmentation) with FM-era techniques (e.g., prompt-based data synthesis, scalable instruction tuning), citing representative works. This will make the qualitative shifts attributable to FMs explicit while preserving the survey's focus. revision: partial
Referee: [Methodology / survey scope] Survey methodology section (likely §2 or equivalent): The paper does not state inclusion/exclusion criteria, search strategy, or temporal scope for the cited literature. This omission undermines assessment of whether the body of work is sufficiently mature and representative to support the paradigm-ignition thesis as the organizing principle.

Authors: We agree that explicit methodology details are necessary for rigor and reproducibility. In the revised version, we will insert a new subsection (likely in Section 2) that specifies the search strategy (databases: arXiv, PubMed, Google Scholar; keywords: 'foundation model' combined with 'healthcare' or 'medical' and data-centric terms), inclusion criteria (works on data characterization/quality/scale in FM healthcare applications, post-2020), exclusion criteria (non-English papers, purely model-architecture focused without data emphasis), and temporal scope (literature from 2018 onward to capture the FM emergence). This addition will support the survey's organizational thesis. revision: yes
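
The screening rules proposed in the response above could be applied mechanically, and a minimal sketch of that is given here. It is an illustration under stated assumptions, not the authors' procedure: the `Paper` record, the `DATA_TERMS` list, and the functions `include` and `screen` are hypothetical names, and simple substring matching stands in for a real database query. The response gives both "post-2020" and "2018 onward" as cut-offs, so the year is left as a parameter (`min_year`, defaulting to 2018 only for the sake of the example).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Paper:
    """Hypothetical record for one candidate paper."""
    title: str
    abstract: str
    year: int
    language: str = "en"


# Keyword groups taken from the rebuttal text; DATA_TERMS is an assumed
# expansion of "data-centric terms" and would need to be fixed by the authors.
FM_TERMS = ("foundation model",)
DOMAIN_TERMS = ("healthcare", "medical")
DATA_TERMS = ("data-centric", "data quality", "data scale",
              "annotation", "privacy", "synthetic data")


def _matches_any(text: str, terms: Iterable[str]) -> bool:
    """Case-insensitive substring match against any term."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def include(paper: Paper, min_year: int = 2018) -> bool:
    """Apply the stated inclusion and exclusion criteria to one paper."""
    text = f"{paper.title} {paper.abstract}"
    if paper.language != "en":          # exclusion: non-English papers
        return False
    if paper.year < min_year:           # temporal scope (assumed cut-off)
        return False
    if not _matches_any(text, FM_TERMS):
        return False
    if not _matches_any(text, DOMAIN_TERMS):
        return False
    # Exclusion of architecture-only work is approximated by requiring
    # at least one data-centric term.
    return _matches_any(text, DATA_TERMS)


def screen(papers: Iterable[Paper], min_year: int = 2018) -> List[Paper]:
    """Return the papers that pass screening, in input order."""
    return [p for p in papers if include(p, min_year)]
```

Calling `screen(candidates, min_year=2021)` would apply the stricter "post-2020" reading instead. Writing the criteria as a function makes the choice of cut-off and keyword list explicit, which is the point of the referee's request.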

Circularity Check

0 steps flagged

No circularity: literature survey without derivations or fitted results

This is a survey paper organizing existing literature on foundation models in healthcare under a data-centric framing. The abstract and provided text contain no equations, no parameter fitting, no predictions derived from inputs, and no uniqueness theorems or ansatzes. The central narrative (interactive FMs igniting a data-centric paradigm) is an interpretive organization of cited works rather than a self-referential derivation. No load-bearing self-citations reduce the claim to prior author work by construction; the survey cites external literature for support. Per rules, absence of any quotable reduction to inputs yields score 0. The work is self-contained as a review.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a survey and introduces no new free parameters, axioms, or invented entities; it reviews existing literature on foundation models and data challenges.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Clinical Note Bloat Reduction for Efficient LLM Use
cs.CY 2026-03 conditional novelty 6.0

TRACE removes 47.3% of text from clinical notes by targeting bloat and preserves performance on information extraction and outcome prediction tasks.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 1 Pith paper · 41 internal anchors

[1]

Asma Ben Abacha and Pierre Zweigenbaum. 2011. Medical entity recognition: a comparaison of semantic and statistical methods. In Proceedings of BioNLP 2011 workshop . 56–64

work page 2011
[2]

Charu C Aggarwal and Philip S Yu. 2008. A general survey of privacy-preserving data mining models and algorithms. Privacy-preserving data mining: models and algorithms (2008), 11–52

work page 2008
[3]

Ravi Aggarwal, Viknesh Sounderajah, Guy Martin, Daniel SW Ting, Alan Karthikesalingam, Dominic King, Hutan Ashrafian, and Ara Darzi. 2021. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ digital medicine 4, 1 (2021), 65

work page 2021
[4]

Mohamed Akrout, Bálint Gyepesi, Péter Holló, Adrienn Poór, Blága Kincső, Stephen Solis, Katrina Cirone, Jeremy Kawahara, Dekker Slade, Latif Abid, Máté Kovács, and István Fazekas. 2023. Diffusion-based Data Augmentation for Skin Disease Classification: Impact Across Original Medical Datasets to Fully Synthetic Images. arXiv:2301.04802 [cs.LG]

work page arXiv 2023
[5]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35 (2022), 23716–23736

work page 2022
[6]

Saghir Alfasly, Peyman Nejat, Sobhan Hemati, Jibran Khan, Isaiah Lahr, Areej Alsaafin, Abubakr Shafique, Nneka Comfere, Dennis Murphree, Chady Meroueh, Saba Yasir, Aaron Mangold, Lisa Boardman, Vijay Shah, Joaquin J. Garcia, and H. R. Tizhoosh. 2023. When is a Foundation Model a Foundation Model. arXiv:2309.11510 [cs.IR]

work page arXiv 2023
[7]

Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1904
[9]

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, et al. 2018. Construction of the literature graph in semantic scholar. arXiv preprint arXiv:1805.02262 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[10]

Abduladhim Ashtaiwi. 2022. Optimal histopathological magnification factors for deep learning-based breast cancer prediction. Applied System Innovation 5, 5 (2022), 87

work page 2022
[11]

Muhammad Adeel Azam, Khan Bahadar Khan, Sana Salahuddin, Eid Rehman, Sajid Ali Khan, Muhammad Attique Khan, Seifedine Kadry, and Amir H Gandomi. 2022. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Computers in biology and medicine 144 (2022), 105253

work page 2022
[12]

Shekoofeh Azizi, Laura Culp, Jan Freyberg, Basil Mustafa, Sebastien Baur, Simon Kornblith, Ting Chen, Nenad Tomasev, Jovana Mitrović, Patricia Strachan, et al. 2023. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nature Biomedical Engineering (2023), 1–24

work page 2023
[13]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. 2022. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems 35 (2022), 32897–32912

work page 2022
[15]

Ashwin Belle, Raghuram Thiagarajan, SM Soroushmehr, Fatemeh Navidi, Daniel A Beard, Kayvan Najarian, et al. 2015. Big data analytics in healthcare. BioMed research international 2015 (2015)

work page 2015
[17]

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: Pretrained Language Model for Scientific Text. In EMNLP. arXiv:1903.10676

work page arXiv 2019
[18]

Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. 2000. The protein data bank. Nucleic acids research 28, 1 (2000), 235–242

work page 2000
[19]

Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. 2022. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10925–10934

work page 2022
[20]

Isabelle Bichindaritz, Guanghui Liu, and Christopher Bartlett. 2021. Integrative survival analysis of breast cancer with gene expression and DNA methylation data. Bioinformatics 37, 17 (2021), 2601–2608

work page 2021
[21]

Aleksa Bisercic, Mladen Nikolic, Mihaela van der Schaar, Boris Delibasic, Pietro Lio, and Andrija Petrovic. 2023. Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models. arXiv preprint arXiv:2306.05052 (2023)

work page arXiv 2023
[22]

Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 32, suppl_1 (2004), D267–D270

work page 2004
[23]

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. 2022. Making the most of text semantics to improve biomedical vision–language processing. In European conference on computer vision . Springer, 1–21

work page 2022
[24]

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[25]

Nathaniel Braman, Jacob WH Gordon, Emery T Goossens, Caleb Willis, Martin C Stumpe, and Jagadish Venkataraman. 2021. Deep orthogonal fusion: multimodal prognostic biomarker discovery integrating radiology, pathology, genomic, and clinical data. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24. Springer, 667–677

work page 2021
[27]

Becky A Briesacher, Susan E Andrade, Hassan Fouayzi, and K Arnold Chan. 2008. Comparison of drug adherence rates among patients with seven different medical conditions. Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy 28, 4 (2008), 437–443

work page 2008
[28]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

work page 2020
[29]

Guy Brusselle, Ian D Pavord, Sarah Landis, Steven Pascoe, Sally Lettis, Nikhil Morjaria, Neil Barnes, and Emma Hilton. 2018. Blood eosinophil levels as a biomarker in COPD. Respiratory medicine 138 (2018), 21–31

work page 2018
[30]

Petra Budikova, Michal Batko, and Pavel Zezula. 2017. Fusion strategies for large-scale multi-modal image retrieval. Transactions on Large-Scale Data-and Knowledge-Centered Systems XXXIII (2017), 146–184

work page 2017
[31]

Carl A Burtis and David E Bruns. 2014. Tietz fundamentals of clinical chemistry and molecular diagnostics-e-book . Elsevier Health Sciences

work page 2014
[32]

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. 2024. Internlm2 technical report. arXiv preprint arXiv:2403.17297 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. 2023. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23). 5253–5270

work page 2023
[34]

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19) . 267–284

work page 2019
[36]

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) . 2633–2650

work page 2021
[37]

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision. 9650–9660

work page 2021
[38]

Pierre Chambon, Christian Bluethgen, Curtis P Langlotz, and Akshay Chaudhari. 2022. Adapting pretrained vision-language foundational models to medical imaging domains. arXiv preprint arXiv:2210.04133 (2022)

work page arXiv 2022
[39]

Kai Chieh Chang, Mark Hasegawa-Johnson, Nancy L McElwain, and Bashima Islam. 2023. Classification of Infant Sleep/Wake States: Cross-Attention among Large Scale Pretrained Transformer Networks using Audio, ECG, and IMU Data. arXiv preprint arXiv:2306.15808 (2023)

work page arXiv 2023
[40]

Qi Chang, Zhennan Yan, Mu Zhou, Hui Qu, Xiaoxiao He, Han Zhang, Lohendran Baskaran, Subhi Al’Aref, Hongsheng Li, Shaoting Zhang, et al. 2023. Mining multi-center heterogeneous medical data with distributed synthetic learning. Nature communications 14, 1 (2023), 5510

work page 2023
[41]

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 (2023)

work page arXiv 2023
[42]

Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. 2022. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 16144–16155

work page 2022
[43]

Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Bowen Chen, Andrew Zhang, Daniel Shao, Andrew H Song, Muhammad Shaban, et al. 2023. A general-purpose self-supervised model for computational pathology. arXiv preprint arXiv:2308.15474 (2023)

work page arXiv 2023
[44]

Richard J Chen, Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. 2021. Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering 5, 6 (2021), 493–497

work page 2021
[45]

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning . PMLR, 1597–1607

work page 2020
[46]

Wei Chen, Xuesong Liu, Sanyin Zhang, and Shilin Chen. 2023. Artificial intelligence for drug discovery: Resources, methods, and applications. Molecular Therapy-Nucleic Acids (2023)

work page 2023
[47]

Xinlei Chen, Saining Xie, and Kaiming He. 2021. An Empirical Study of Training Self-Supervised Vision Transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) . https://doi.org/10.1109/iccv48922.2021.00950

work page doi:10.1109/iccv48922.2021.00950 2021
[48]

Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, and Ruifeng Xu. 2023. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. arXiv preprint arXiv:2304.00723 (2023)

work page arXiv 2023
[49]

Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. 2023. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Zhiyu Chen, Yujie Lu, and William Yang Wang. 2023. Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting. arXiv preprint arXiv:2310.07146 (2023)

work page arXiv 2023
[51]

Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. 2024. Chexagent: Towards a foundation model for chest x-ray interpretation. arXiv preprint arXiv:2401.12208 (2024)

work page arXiv 2024
[52]

Dongjie Cheng, Ziyuan Qin, Zekun Jiang, Shaoting Zhang, Qicheng Lao, and Kang Li. 2023. Sam on medical images: A comprehensive study on three prompt modes. arXiv preprint arXiv:2305.00035 (2023)

work page arXiv 2023
[53]

Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, et al. 2023. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (2023), eadg7492

work page 2023
[54]

Cheng-Han Chiang and Hung-yi Lee. 2023. Can Large Language Models Be an Alternative to Human Evaluations? arXiv preprint arXiv:2305.01937 (2023)

work page arXiv 2023
[55]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[56]

Matthew Christensen, Milos Vukadinovic, Neal Yuan, and David Ouyang. 2024. Vision–language foundation model for echocardiogram interpretation. Nature Medicine (2024), 1–8

work page 2024
[57]

Evangelia Christodoulou, Jie Ma, Gary S Collins, Ewout W Steyerberg, Jan Y Verbakel, and Ben Van Calster. 2019. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of clinical epidemiology 110 (2019), 12–22

work page 2019
[58]

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[59]

Ellen Wright Clayton, Peter J Embí, and Bradley A Malin. 2023. Dobbs and the future of health data privacy for patients and healthcare organizations. Journal of the American Medical Informatics Association 30, 1 (2023), 155–160

work page 2023
[60]

1000 Genomes Project Consortium et al. 2015. A global reference for human genetic variation. Nature 526, 7571 (2015), 68

work page 2015
[61]

ENCODE Project Consortium et al. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 7414 (2012), 57

work page 2012
[62]

Alexander Craik, Yongtian He, and Jose L Contreras-Vidal. 2019. Deep learning for electroencephalogram (EEG) classification tasks: a review. Journal of neural engineering 16, 3 (2019), 031001

work page 2019
[63]

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. 2018. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[64]

Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, and Bo Wang. 2023. scgpt: Towards building a foundation model for single-cell multi-omics using generative ai. bioRxiv (2023), 2023–04

work page 2023
[65]

Wenhui Cui, Woojae Jeong, Philipp Thölke, Takfarinas Medani, Karim Jerbi, Anand A Joshi, and Richard M Leahy. 2023. Neuro-GPT: Developing A Foundation Model for EEG. arXiv preprint arXiv:2311.03764 (2023)

work page arXiv 2023
[67]

Peter-Paul de Wolf. 2012. Statistical disclosure control. Wiley & Sons, Chichester

work page 2012
[68]

Ruining Deng, Can Cui, Quan Liu, Tianyuan Yao, Lucas W Remedios, Shunxing Bao, Bennett A Landman, Lee E Wheless, Lori A Coburn, Keith T Wilson, et al. 2023. Segment anything model (sam) for digital pathology: Assess zero-shot segmentation on whole slide imaging. arXiv preprint arXiv:2304.04155 (2023)

work page arXiv 2023
[69]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[70]

Terrance DeVries and Graham W Taylor. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)

[71]

Yair Dgani, Hayit Greenspan, and Jacob Goldberger. 2018. Training a neural network based on unreliable human annotation of medical images. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 39–42

[72]

Kexin Ding, Mu Zhou, Dimitris N Metaxas, and Shaoting Zhang. 2023. Pathology-and-genomics multimodal transformer for survival outcome prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 622–631

[73]

Kexin Ding, Mu Zhou, He Wang, Olivier Gevaert, Dimitris Metaxas, and Shaoting Zhang. 2023. A large-scale synthetic pathological dataset for deep learning-enabled segmentation of breast cancer. Scientific Data 10, 1 (2023), 231

[74]

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. 2023. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence 5, 3 (2023), 220–235

[75]

Jonas Dippel, Barbara Feulner, Tobias Winterhoff, Simon Schallenberg, Gabriel Dernbach, Andreas Kunft, Stephan Tietz, Philipp Jurmeister, David Horst, Lukas Ruff, et al. 2024. RudolfV: a foundation model by pathologists for pathologists. arXiv preprint arXiv:2401.04079 (2024)

[76]

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234 (2022)

[77]

Tian Dong, Bo Zhao, and Lingjuan Lyu. 2022. Privacy for Free: How does Dataset Condensation Help Privacy? arXiv:2206.00240 [cs.CR]

[78]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

[79]

Raman Dutt, Linus Ericsson, Pedro Sanchez, Sotirios A Tsaftaris, and Timothy Hospedales. 2023. Parameter-Efficient Fine-Tuning for Medical Image Analysis: The Missed Opportunity. arXiv preprint arXiv:2305.08252 (2023)

[80]

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381 (2018)

