
arxiv: 2401.02458 · v3 · submitted 2024-01-04 · 💻 cs.LG · cs.AI

Data-Centric Foundation Models in Computational Healthcare: A Survey

Pith reviewed 2026-05-24 04:10 UTC · model grok-4.3

keywords foundation models · computational healthcare · data-centric AI · healthcare workflow · data quality · patient privacy · survey

The pith

Foundation models ignite a data-centric AI paradigm in healthcare by prioritizing data characterization, quality, and scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys data-centric approaches across foundation model workflows in computational healthcare, from pre-training through inference. It frames these models' interactive design, shaped by pre-training data and human instructions, as the driver for addressing longstanding issues of data quantity, annotation, privacy, and ethics. The survey also covers security, assessment, and value alignment, then points to improved patient outcomes and clinical workflows as the practical payoff.

Core claim

The interactive nature of foundation models, guided by pre-training data and human instructions, has ignited a data-centric AI paradigm that emphasizes better data characterization, quality, and scale toward improving the healthcare workflow. In healthcare AI, obtaining and processing high-quality clinical data records has been a longstanding challenge, encompassing data quantity, annotation, patient privacy, and ethics. The survey organizes existing work on these data-centric methods and provides an outlook on their use in analytics to enhance patient outcomes and clinical workflows.

What carries the argument

The data-centric AI paradigm in foundation models, which organizes methods from pre-training to inference to improve data handling in healthcare.

Load-bearing premise

The body of published work on foundation models in healthcare is mature and representative enough for a survey to treat data-centric methods as the central response to data challenges.

What would settle it

A finding that most foundation-model papers in healthcare still treat architecture changes as the primary lever, with data improvements as secondary or absent, would falsify the claim that a data-centric paradigm now dominates.
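One way to make this test concrete is a simple tally over a labeled sample of papers. The sketch below is illustrative only: the label values ("data", "architecture"), the function names, and the 0.5 majority threshold are assumptions, not something the paper or this review specifies.

```python
from collections import Counter


def primary_lever_shares(labels):
    """Return the fraction of papers carrying each primary-lever label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one labeled paper")
    return {label: n / total for label, n in counts.items()}


def paradigm_claim_falsified(labels, threshold=0.5):
    """True if architecture-first papers exceed the (assumed) majority threshold."""
    shares = primary_lever_shares(labels)
    return shares.get("architecture", 0.0) > threshold


# Hypothetical hand-labeled sample: one label per surveyed paper.
sample = ["architecture"] * 6 + ["data"] * 4
print(primary_lever_shares(sample))
print(paradigm_claim_falsified(sample))
```

How the sample is drawn and who assigns the labels would matter far more than the arithmetic; the code only fixes what "most papers" would mean once those choices are made.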

Figures

Figures reproduced from arXiv: 2401.02458 by Dequan Wang, Jin Gao, Kexin Ding, Lingfeng Zhou, Mu Zhou, Shaoting Zhang, Yunkun Zhang, Zheling Tan.

Figure 1: Data-centric foundation models in computational healthcare.
Figure 2: An overview of healthcare data challenges and foundation model-based approaches mentioned in this…
Figure 3: Foundation model (FM) in healthcare.
Figure 4: Multi-modal fusion of healthcare data in the FM era. Conventional fusion approaches are enhanced…
Figure 5: Foundation models address data quantity and data annotation challenges.
Figure 6: Foundation model evaluation strategies.
Original abstract

The advent of foundation models (FMs) as an emerging suite of AI techniques has struck a wave of opportunities in computational healthcare. The interactive nature of these models, guided by pre-training data and human instructions, has ignited a data-centric AI paradigm that emphasizes better data characterization, quality, and scale. In healthcare AI, obtaining and processing high-quality clinical data records has been a longstanding challenge, encompassing data quantity, annotation, patient privacy, and ethics. In this survey, we investigate a wide range of data-centric approaches in the FM era (from model pre-training to inference) towards improving the healthcare workflow. We discuss key perspectives in AI security, assessment, and alignment with human values. Finally, we offer a promising outlook on FM-based analytics to enhance patient outcomes and clinical workflows in the evolving landscape of healthcare and medicine. We provide an up-to-date list of healthcare-related foundation models and datasets at https://github.com/Yunkun-Zhang/Data-Centric-FM-Healthcare.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a survey of data-centric approaches in the foundation model (FM) era for computational healthcare. It claims that the interactive nature of FMs, guided by pre-training data and human instructions, has ignited a data-centric AI paradigm emphasizing data characterization, quality, and scale. The survey reviews methods spanning model pre-training through inference to address longstanding healthcare data challenges (quantity, annotation, privacy, ethics), discusses perspectives on AI security, assessment, and human-value alignment, offers an outlook on FM-based analytics, and supplies a GitHub repository listing healthcare FMs and datasets.

Significance. If the surveyed literature demonstrates approaches that are newly enabled or distinctly reframed by FMs (rather than re-categorized pre-existing data issues), the paper could usefully organize the field and highlight workflow improvements. The concrete GitHub deliverable is a clear strength that supports resource discovery and reproducibility.

major comments (2)
  1. [Abstract, §1] Abstract and §1: The central claim that FMs have 'ignited a data-centric AI paradigm' is load-bearing for the survey's framing and organization, yet the manuscript provides no explicit contrast (e.g., via a dedicated subsection or table) between pre-FM data-centric healthcare methods and FM-era techniques. Without citing specific works that illustrate a qualitative shift attributable to FM properties such as instruction following or scale, the ignition narrative risks resting on re-labeling of longstanding issues.
  2. [Methodology / survey scope] Survey methodology section (likely §2 or equivalent): The paper does not state inclusion/exclusion criteria, search strategy, or temporal scope for the cited literature. This omission undermines assessment of whether the body of work is sufficiently mature and representative to support the paradigm-ignition thesis as the organizing principle.
minor comments (2)
  1. [Abstract / conclusion] The GitHub link is a valuable contribution; the paper should state the date of the most recent repository update and any curation criteria used for the listed models and datasets.
  2. [Throughout] Notation for FM components (e.g., pre-training vs. fine-tuning stages) should be standardized across sections to improve readability for readers comparing data-centric interventions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments help clarify the survey's framing and methodology. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [Abstract, §1] Abstract and §1: The central claim that FMs have 'ignited a data-centric AI paradigm' is load-bearing for the survey's framing and organization, yet the manuscript provides no explicit contrast (e.g., via a dedicated subsection or table) between pre-FM data-centric healthcare methods and FM-era techniques. Without citing specific works that illustrate a qualitative shift attributable to FM properties such as instruction following or scale, the ignition narrative risks resting on re-labeling of longstanding issues.

    Authors: We acknowledge the value of an explicit contrast to substantiate the framing. The survey is organized around FM-specific capabilities (instruction following, in-context learning, and scale-enabled synthetic data) that reframe longstanding healthcare data issues in new ways, as illustrated by examples like FM-based annotation and privacy-preserving generation. To address the concern directly, we will add a comparison table in Section 1 (or a new subsection) contrasting pre-FM methods (e.g., traditional active learning, rule-based augmentation) with FM-era techniques (e.g., prompt-based data synthesis, scalable instruction tuning), citing representative works. This will make the qualitative shifts attributable to FMs explicit while preserving the survey's focus.

    revision: partial

  2. Referee: [Methodology / survey scope] Survey methodology section (likely §2 or equivalent): The paper does not state inclusion/exclusion criteria, search strategy, or temporal scope for the cited literature. This omission undermines assessment of whether the body of work is sufficiently mature and representative to support the paradigm-ignition thesis as the organizing principle.

    Authors: We agree that explicit methodology details are necessary for rigor and reproducibility. In the revised version, we will insert a new subsection (likely in Section 2) that specifies the search strategy (databases: arXiv, PubMed, Google Scholar; keywords: 'foundation model' combined with 'healthcare' or 'medical' and data-centric terms), inclusion criteria (works on data characterization/quality/scale in FM healthcare applications, post-2020), exclusion criteria (non-English papers, purely model-architecture focused without data emphasis), and temporal scope (literature from 2018 onward to capture the FM emergence). This addition will support the survey's organizational thesis.

    revision: yes
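The screening criteria in this response can be sketched as a small filter. This is a minimal illustration, not the authors' procedure: the `Paper` record, its field names, and the `DATA_TERMS` list are hypothetical, and substring matching is a crude stand-in for a real database query. The response names both a post-2020 inclusion criterion and a 2018-onward temporal scope, so the year cutoff is left as a parameter rather than fixed.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record; field names are illustrative only.
    title: str
    abstract: str
    year: int
    language: str


FM_TERMS = ("foundation model",)
DOMAIN_TERMS = ("healthcare", "medical")
# Assumed stand-ins for the "data-centric terms" the response mentions.
DATA_TERMS = ("data-centric", "data quality", "data characterization",
              "data scale", "annotation")


def passes_screen(paper, min_year=2018):
    """Apply the stated inclusion/exclusion criteria to one record."""
    if paper.language.lower() != "english":   # exclusion: non-English papers
        return False
    if paper.year < min_year:                 # temporal scope
        return False
    text = f"{paper.title} {paper.abstract}".lower()
    has_fm = any(term in text for term in FM_TERMS)
    has_domain = any(term in text for term in DOMAIN_TERMS)
    # Exclusion: architecture-only work with no data emphasis.
    has_data = any(term in text for term in DATA_TERMS)
    return has_fm and has_domain and has_data


candidates = [
    Paper("A foundation model for medical imaging",
          "We study data quality and annotation cost.", 2023, "English"),
    Paper("A foundation model for healthcare",
          "We propose a new attention layer.", 2023, "English"),
]
kept = [p for p in candidates if passes_screen(p, min_year=2018)]
print([p.title for p in kept])
```

Running the same filter with `min_year=2018` and `min_year=2021` would show how much of the candidate pool depends on which of the two stated cutoffs is applied.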

Circularity Check

0 steps flagged

No circularity: literature survey without derivations or fitted results

Full rationale

This is a survey paper organizing existing literature on foundation models in healthcare under a data-centric framing. The abstract and provided text contain no equations, no parameter fitting, no predictions derived from inputs, and no uniqueness theorems or ansatzes. The central narrative (interactive FMs igniting a data-centric paradigm) is an interpretive organization of cited works rather than a self-referential derivation. No load-bearing self-citations reduce the claim to prior author work by construction; the survey cites external literature for support. Per rules, absence of any quotable reduction to inputs yields score 0. The work is self-contained as a review.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a survey and introduces no new free parameters, axioms, or invented entities; it reviews existing literature on foundation models and data challenges.

pith-pipeline@v0.9.0 · 5717 in / 1065 out tokens · 20186 ms · 2026-05-24T04:10:43.810345+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Clinical Note Bloat Reduction for Efficient LLM Use

    cs.CY 2026-03 conditional novelty 6.0

    TRACE removes 47.3% of text from clinical notes by targeting bloat and preserves performance on information extraction and outcome prediction tasks.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 1 Pith paper · 41 internal anchors
