arxiv: 2606.08903 · v1 · pith:46CD6L4Enew · submitted 2026-06-08 · 💻 cs.LG

Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records

Pith reviewed 2026-06-27 17:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords synthetic EMR data, generative models, evaluation framework, descriptive fidelity, clinical utility, structural validity, subgroup structure, effect estimates

The pith

Generative models for structured EMRs reproduce marginal distributions but fail to preserve subgroup structure, effect estimates, and dependency structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an evaluation framework for synthetic electronic medical records, drawn from epidemiology, that separates descriptive fidelity, clinical utility, and structural validity. Applied to four generative approaches on a 50,000-person cohort with known ground truth, the tests show that every model captures marginal distributions, yet none preserves subgroup structure, effect estimates, and dependency structure at the same time. Models that score well on standard statistical checks can still show poor calibration and distorted relationships. This indicates that conventional similarity metrics can overestimate how usable synthetic EMR data are for real clinical or scientific work.

Core claim

While all models reproduce marginal distributions, none simultaneously preserve subgroup structure, effect estimates, and dependency structure. Models with strong distributional fidelity can exhibit poor calibration and distorted relationships, leading to unreliable inference. These results show that current evaluation practices can overestimate synthetic data quality and motivate domain-informed assessment based on the ability to support valid clinical and scientific conclusions.
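A toy illustration of why marginal agreement alone says little about dependency structure. The sketch below builds a hypothetical two-variable cohort (the names `age` and `sbp` and all coefficients are invented for illustration and are not taken from PRIME-CVD), then forms a "synthetic" copy by permuting one column. Every marginal is reproduced exactly, yet the correlation between the two variables collapses.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)

# Hypothetical "real" cohort: systolic blood pressure depends on age.
age = [rng.gauss(55, 10) for _ in range(5000)]
sbp = [100 + 0.6 * a + rng.gauss(0, 8) for a in age]

# Hypothetical "synthetic" cohort: same values, but sbp is shuffled
# independently of age, so only the marginals survive.
synth_age = age[:]
synth_sbp = sbp[:]
rng.shuffle(synth_sbp)

marginals_identical = sorted(sbp) == sorted(synth_sbp)
real_r = pearson(age, sbp)
synth_r = pearson(synth_age, synth_sbp)

print(f"marginals identical: {marginals_identical}")
print(f"real correlation:      {real_r:.2f}")
print(f"synthetic correlation: {synth_r:.2f}")
```

Any per-variable similarity score would rate this synthetic copy as perfect, which is the kind of blind spot the core claim describes.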

What carries the argument

Multi-dimensional evaluation framework grounded in epidemiology that measures descriptive fidelity, clinical utility, and structural validity on the PRIME-CVD cohort with known ground-truth structure.

If this is right

  • Generative models must be checked for their ability to answer predictive and causal questions, not only descriptive statistics.
  • High performance on marginal distributions does not ensure correct calibration or preserved relationships.
  • Standard evaluation practices risk overstating the quality of synthetic EMR data for downstream use.
  • Valid clinical and scientific conclusions from synthetic data require domain-specific structural checks.
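To make the calibration point concrete, here is a minimal sketch of a calibration-slope check, assuming a binary outcome and predicted probabilities. It fits a logistic regression of the outcome on the logit of the prediction by Newton-Raphson; a slope near 1 indicates good calibration, and a slope well below 1 indicates overconfident predictions. The data are simulated, and the function name `calibration_fit` is hypothetical; the paper's own calibration procedure is not described on this page.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_fit(probs, outcomes, iters=25):
    """Fit outcome ~ a + b * logit(prob) by Newton-Raphson.

    Returns (a, b); b is the calibration slope.
    """
    x = [math.log(p / (1.0 - p)) for p in probs]
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient w.r.t. intercept
            g1 += (yi - mu) * xi     # gradient w.r.t. slope
            h00 += w                 # entries of X^T W X
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

rng = random.Random(1)
true_logits = [rng.gauss(-1.0, 1.0) for _ in range(4000)]
outcomes = [1 if rng.random() < sigmoid(z) else 0 for z in true_logits]

# Predictions that match the true risk, and predictions whose logits
# are inflated by a factor of two (too extreme in both directions).
calibrated = [sigmoid(z) for z in true_logits]
overconfident = [sigmoid(2.0 * z) for z in true_logits]

_, slope_calibrated = calibration_fit(calibrated, outcomes)
_, slope_overconfident = calibration_fit(overconfident, outcomes)

print(f"slope, calibrated model:    {slope_calibrated:.2f}")
print(f"slope, overconfident model: {slope_overconfident:.2f}")
```

Both sets of predictions rank patients identically, so a discrimination metric such as AUC would not separate them; only the calibration slope exposes the difference.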

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Synthetic EMRs from these models may produce misleading effect estimates if used in place of real data for observational studies.
  • The observed gap between distributional match and structural preservation suggests future models should target dependency structure explicitly.
  • Applying the same three-part test to other high-stakes domains could reveal whether the same evaluation shortfall exists elsewhere.

Load-bearing premise

The epidemiology-grounded three-part framework plus the PRIME-CVD cohort's known ground-truth structure is enough to conclude that current generative models cannot support valid clinical conclusions in EMR settings in general.

What would settle it

A generative model that, when trained on the PRIME-CVD cohort, simultaneously preserves subgroup structure, effect estimates, and dependency structure at levels comparable to the real data.

Figures

Figures reproduced from arXiv: 2606.08903 by Blanca Gallego, Louisa Jorm, Nicholas I-Hsien Kuo.

Figure 1: Distributional Fidelity of Synthetic PRIME-CVD Cohorts Across Generative Models
Figure 2: Distributional comparisons across IRSD quintiles for numeric and categorical variables.
Figure 3: Evaluation of clinical utility for models trained on real and synthetic data.
Figure 4: Structural validity comparison using DAGs discovered through GES
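
Figure 4 compares DAGs recovered from real and synthetic data. One common way to score such a comparison is the structural Hamming distance (SHD), sketched below as an illustration. This is a swapped-in, simpler measure: the paper's reference list includes the structural intervention distance, and the page does not say which score the authors report. The variable names and edges are hypothetical.

```python
def structural_hamming_distance(true_edges, learned_edges):
    """Count missing, extra, and reversed edges between two DAGs.

    Edges are (parent, child) tuples. A reversed edge counts once.
    Assumes neither graph contains an edge in both directions.
    """
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    true_skeleton = {frozenset(e) for e in true_edges}
    learned_skeleton = {frozenset(e) for e in learned_edges}

    missing = len(true_skeleton - learned_skeleton)
    extra = len(learned_skeleton - true_skeleton)
    reversed_edges = sum(
        1
        for edge in learned_edges
        if frozenset(edge) in true_skeleton and edge not in true_edges
    )
    return missing + extra + reversed_edges

# Hypothetical ground-truth graph and a graph learned from synthetic data.
true_edges = {("age", "sbp"), ("sbp", "cvd"), ("smoking", "cvd")}
learned_edges = {("age", "sbp"), ("cvd", "sbp"), ("age", "cvd")}

shd = structural_hamming_distance(true_edges, learned_edges)
print(shd)  # 3: one missing, one extra, one reversed edge
```

An SHD of 0 means the learned graph matches the reference exactly; larger values indicate more structural disagreement.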
Original abstract

Synthetic healthcare data are widely proposed as privacy-preserving substitutes for real patient data, yet their evaluation remains dominated by statistical similarity and predictive performance that do not reflect clinical validity. We introduce a multi-dimensional evaluation framework grounded in epidemiology, assessing descriptive fidelity, clinical utility, and structural validity, corresponding to descriptive, predictive, and causal questions. We evaluate four representative generative paradigms - GAN-based, VAE-boosted, diffusion-based, and masked modelling - using PRIME-CVD, a 50,000-person cohort with known ground-truth structure. While all models reproduce marginal distributions, none simultaneously preserve subgroup structure, effect estimates, and dependency structure. Notably, models with strong distributional fidelity can exhibit poor calibration and distorted relationships, leading to unreliable inference. These results show that current evaluation practices can overestimate synthetic data quality and motivate domain-informed assessment based on the ability to support valid clinical and scientific conclusions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a multi-dimensional evaluation framework for synthetic EMR data, grounded in epidemiology, with three axes: descriptive fidelity, clinical utility, and structural validity. Using the PRIME-CVD cohort (50k persons with known ground-truth structure), it evaluates four generative paradigms (GAN-based, VAE-boosted, diffusion-based, masked modelling). The central empirical finding is that all models reproduce marginal distributions but none simultaneously preserve subgroup structure, effect estimates, and dependency structure; models with strong distributional fidelity can still show poor calibration and distorted relationships.

Significance. If the results hold, the work is significant for demonstrating that standard statistical-similarity metrics can overestimate the utility of synthetic EMR data for clinical inference. The framework's grounding in descriptive/predictive/causal questions and the use of a cohort with independent ground-truth structure provide a concrete, falsifiable basis for the critique of current evaluation practices. This could shift the field toward domain-informed assessment that better supports valid scientific conclusions.

major comments (2)
  1. [Abstract and conclusion] Abstract and conclusion: the claim that current generative models cannot support valid clinical conclusions rests on results from a single 50k-person cohort (PRIME-CVD). No additional datasets, representativeness argument, or sensitivity analysis across cohort characteristics is described. This is load-bearing for the generalization beyond the specific PRIME-CVD setting.
  2. [Methods] Methods (implied by abstract): the support for the central claim that 'none simultaneously preserve subgroup structure, effect estimates, and dependency structure' cannot be fully verified without the exact metrics, statistical tests, and full experimental protocol for each dimension of the framework.
minor comments (1)
  1. [Abstract] The abstract names the four paradigms evaluated but would benefit from stating the precise metrics used for each of the three evaluation axes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the work.

Point-by-point responses
  1. Referee: [Abstract and conclusion] Abstract and conclusion: the claim that current generative models cannot support valid clinical conclusions rests on results from a single 50k-person cohort (PRIME-CVD). No additional datasets, representativeness argument, or sensitivity analysis across cohort characteristics is described. This is load-bearing for the generalization beyond the specific PRIME-CVD setting.

    Authors: We agree that the empirical demonstration is limited to the PRIME-CVD cohort, which was selected specifically because it provides known ground-truth structure for evaluating structural validity. This choice enables the central falsifiable test but does constrain broad generalization claims. In revision we will modify the abstract and conclusion to state that the findings are demonstrated on PRIME-CVD and that the framework is intended to be applied to other cohorts with comparable ground-truth information. We will also add an explicit limitations paragraph discussing cohort-specific factors and the desirability of multi-cohort validation. These changes will be made without altering the core empirical result. revision: yes

  2. Referee: [Methods] Methods (implied by abstract): the support for the central claim that 'none simultaneously preserve subgroup structure, effect estimates, and dependency structure' cannot be fully verified without the exact metrics, statistical tests, and full experimental protocol for each dimension of the framework.

    Authors: The manuscript already specifies the three evaluation axes, the concrete metrics (e.g., subgroup-stratified Kolmogorov-Smirnov distances, calibration slopes for effect estimates, and mutual-information or partial-correlation measures for dependencies), and the statistical tests used to assess preservation. To improve verifiability we will expand the Methods section with a dedicated subsection that lists every metric formula, the exact hypothesis tests (including multiplicity corrections), and the full experimental protocol, and we will release the corresponding code repository upon acceptance. These additions will make every quantitative claim directly reproducible from the provided description. revision: yes
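
As an illustration of one metric named in the simulated response above, the following is a minimal sketch of a subgroup-stratified Kolmogorov-Smirnov distance. It is not the authors' code: the rebuttal is simulated, the exact metrics in the manuscript are unconfirmed here, and the group labels `Q1` and `Q5` and all values are hypothetical. The synthetic cohort keeps every value but scrambles subgroup membership, so the pooled KS distance is zero while the per-subgroup distances are large.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )

def stratified_ks(real, synth):
    """KS statistic per subgroup; inputs are lists of (group, value)."""
    groups = sorted({g for g, _ in real})
    return {
        g: ks_statistic(
            [v for gg, v in real if gg == g],
            [v for gg, v in synth if gg == g],
        )
        for g in groups
    }

rng = random.Random(2)

# Hypothetical real cohort: the two subgroups have different means.
real = [("Q1", rng.gauss(120, 10)) for _ in range(1000)] + \
       [("Q5", rng.gauss(135, 10)) for _ in range(1000)]

# Hypothetical synthetic cohort: same values, subgroup labels shuffled.
labels = [g for g, _ in real]
rng.shuffle(labels)
synth = [(g, v) for g, (_, v) in zip(labels, real)]

pooled = ks_statistic([v for _, v in real], [v for _, v in synth])
by_group = stratified_ks(real, synth)

print(f"pooled KS: {pooled:.3f}")
for group, distance in by_group.items():
    print(f"{group} KS: {distance:.3f}")
```

Reporting the per-subgroup values alongside the pooled value is what lets a reader see loss of subgroup structure that a single pooled score hides.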

Circularity Check

0 steps flagged

Empirical evaluation on external cohort with independent ground truth; no derivations or self-referential reductions

Full rationale

The paper introduces a multi-dimensional evaluation framework grounded in epidemiology and applies it to compare four generative model classes on the PRIME-CVD cohort, which supplies independent ground-truth structure. All claims rest on direct empirical measurements of marginals, subgroups, effect estimates, and dependencies rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations or ansatzes appear that could reduce outputs to inputs by construction. The central finding—that no model simultaneously preserves all three validity dimensions—is therefore an independent observation relative to the input data and framework definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, new axioms, or invented entities are introduced; the work relies on standard epidemiological distinctions between descriptive, predictive, and causal questions plus established generative model families.

pith-pipeline@v0.9.1-grok · 5685 in / 1061 out tokens · 19179 ms · 2026-06-27T17:17:27.142937+00:00 · methodology

Reference graph

Works this paper leans on

93 extracted references · 8 canonical work pages · 6 internal anchors

  1. [1]

    Enriching data science and health care education: Application and impact of synthetic data sets through the health gym project.JMIR Medical Education, 10(1):e51388, 2024

    Nicholas I-Hsien Kuo, Oscar Perez-Concha, Mark Hanly, Emmanuel Mnatzaganian, Brandon Hao, Marcus Di Sipio, Guolin Yu, Jash Vanjara, Ivy Cerelia Valerie, et al. Enriching data science and health care education: Application and impact of synthetic data sets through the health gym project.JMIR Medical Education, 10(1):e51388, 2024

  2. [2]

    Evaluating the impact of health care data completeness for deep generative models.Methods of Information in Medicine, 62 (01/02):031–039, 2023

    Benjamin Smith, Senne Van Steelandt, and Anahita Khojandi. Evaluating the impact of health care data completeness for deep generative models.Methods of Information in Medicine, 62 (01/02):031–039, 2023

  3. [3]

    Synthetic data in machine learning for medicine and healthcare.Nature Biomedical Engineering, 5(6):493–497, 2021

    Richard J Chen, Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. Synthetic data in machine learning for medicine and healthcare.Nature Biomedical Engineering, 5(6):493–497, 2021

  4. [4]

    Transformation of medical care through gene therapy and human rights to life and health–balancing risks and benefits.European Journal of Health Law, 29(3-5):359–380, 2022

    Anne Kjersti Befring. Transformation of medical care through gene therapy and human rights to life and health–balancing risks and benefits.European Journal of Health Law, 29(3-5):359–380, 2022

  5. [5]

    Survey on synthetic data generation, evaluation methods and gans.Mathematics, 10(15):2733, 2022

    Alvaro Figueira and Bruno Vaz. Survey on synthetic data generation, evaluation methods and gans.Mathematics, 10(15):2733, 2022

  6. [6]

    Synthetic data generation: State of the art in health care domain.Computer Science Review, 48:100546, 2023

    Hajra Murtaza, Musharif Ahmed, Naurin Farooq Khan, Ghulam Murtaza, Saad Zafar, and Ambreen Bano. Synthetic data generation: State of the art in health care domain.Computer Science Review, 48:100546, 2023

  7. [7]

    Can i trust my fake data–a comprehensive quality assessment framework for synthetic tabular data in healthcare.International Journal of Medical Informatics, 185:105413, 2024

    Vibeke Binz Vallevik, Aleksandar Babic, Serena E Marshall, Severin Elvatun, Helga MB Brøgger, Sharmini Alagaratnam, Bjørn Edwin, Narasimha R Veeraragavan, Anne Kjersti Befring, and Jan F Nygård. Can i trust my fake data–a comprehensive quality assessment framework for synthetic tabular data in healthcare.International Journal of Medical Informatics, 185:1...

  8. [8]

    How to use a subgroup analysis: users’ guide to the medical literature.Jama, 311(4), 2014

    Xin Sun, John PA Ioannidis, Thomas Agoritsas, Ana C Alba, and Gordon Guyatt. How to use a subgroup analysis: users’ guide to the medical literature.Jama, 311(4), 2014

  9. [9]

    Estimating causal effects from epidemiological data

    Miguel A Hernán and James M Robins. Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7):578–586, 2006

  10. [10]

    Use of directed acyclic graphs (dags) to identify confounders in applied health research: review and recommendations.International journal of epidemiology, 50(2):620–632, 2021

    Peter WG Tennant, Eleanor J Murray, Kellyn F Arnold, Laurie Berrie, Matthew P Fox, Sarah C Gadd, Wendy J Harrison, Claire Keeble, Lynsie R Ranker, Johannes Textor, et al. Use of directed acyclic graphs (dags) to identify confounders in applied health research: review and recommendations.International journal of epidemiology, 50(2):620–632, 2021

  11. [11]

    Prime- cvd: A parametrically rendered informatics medical environment for education in cardiovascular risk modelling.medRxiv, pages 2026–03, 2026

    Nicholas I-Hsien Kuo, Marzia Hoque Tania, Blanca Gallego Luxan, and Louisa Jorm. Prime- cvd: A parametrically rendered informatics medical environment for education in cardiovascular risk modelling.medRxiv, pages 2026–03, 2026

  12. [12]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

  13. [13]

    Generative adversarial nets.Advances in neural information processing systems, 27, 2014

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014

  14. [14]

    Wasserstein generative adversarial networks

    Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. InInternational conference on machine learning, pages 214–223. Pmlr, 2017

  15. [15]

    Improved training of wasserstein gans.Advances in neural information processing systems, 30, 2017

    Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans.Advances in neural information processing systems, 30, 2017

  16. [16]

    Deep unsuper- vised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015. 10

  17. [17]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  18. [18]

    Extracting and composing robust features with denoising autoencoders

    Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. InProceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008

  19. [19]

    Context encoders: Feature learning by inpainting

    Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016

  20. [20]

    Deep variational information bottleneck

    Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. InInternational Conference on Learning Representations, 2017

  21. [21]

    Autoencoding beyond pixels using a learned similarity metric

    Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. InInternational conference on machine learning, pages 1558–1566. PMLR, 2016

  22. [22]

    Generative adversarial networks are special cases of artificial curiosity (1990) and also closely related to predictability minimization (1991).Neural Networks, 127: 58–66, 2020

    Jürgen Schmidhuber. Generative adversarial networks are special cases of artificial curiosity (1990) and also closely related to predictability minimization (1991).Neural Networks, 127: 58–66, 2020

  23. [23]

    NIPS 2016 Tutorial: Generative Adversarial Networks

    Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks.arXiv preprint arXiv:1701.00160, 2016

  24. [24]

    The health gym: synthetic health-related datasets for the development of reinforcement learning algorithms

    Nicholas I-Hsien Kuo, Mark N Polizzotto, Simon Finfer, Federico Garcia, Anders Sönnerborg, Maurizio Zazzi, Michael Böhm, Rolf Kaiser, Louisa Jorm, and Sebastiano Barbieri. The health gym: synthetic health-related datasets for the development of reinforcement learning algorithms. Scientific data, 9(1):693, 2022

  25. [25]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  26. [26]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

  27. [27]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  28. [28]

    Behrt: transformer for electronic health records.Scientific reports, 10(1):7155, 2020

    Yikuan Li, Shishir Rao, José Roberto Ayala Solares, Abdelaali Hassaine, Rema Ramakrishnan, Dexter Canoy, Yajie Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. Behrt: transformer for electronic health records.Scientific reports, 10(1):7155, 2020

  29. [29]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  30. [30]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  31. [31]

    Weak baselines and reporting biases lead to overoptimism in machine learning for fluid-related partial differential equations.Nature machine intelligence, 6(10):1256–1269, 2024

    Nick McGreivy and Ammar Hakim. Weak baselines and reporting biases lead to overoptimism in machine learning for fluid-related partial differential equations.Nature machine intelligence, 6(10):1256–1269, 2024

  32. [32]

    Sulla determinazione empirica di una legge didistribuzione.Giorn Dell’inst Ital Degli Att, 4:89–91, 1933

    Kolmogorov An. Sulla determinazione empirica di una legge didistribuzione.Giorn Dell’inst Ital Degli Att, 4:89–91, 1933

  33. [33]

    Table for estimating the goodness of fit of empirical distributions.The annals of mathematical statistics, 19(2):279–281, 1948

    Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions.The annals of mathematical statistics, 19(2):279–281, 1948

  34. [34]

    On information and sufficiency.The annals of mathematical statistics, 22(1):79–86, 1951

    Solomon Kullback and Richard A Leibler. On information and sufficiency.The annals of mathematical statistics, 22(1):79–86, 1951. 11

  35. [35]

    Similarity-based methods for word sense disambiguation

    Ido Dagan, Lillian Lee, and Fernando Pereira. Similarity-based methods for word sense disambiguation. In35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 56–63, 1997

  36. [36]

    Learning via hilbert space embedding of distributions.University of Sydney (2008), 17, 2008

    Le Song. Learning via hilbert space embedding of distributions.University of Sydney (2008), 17, 2008

  37. [37]

    A class of wasserstein metrics for probability distribu- tions.Michigan Mathematical Journal, 31(2):231–240, 1984

    Clark R Givens and Rae Michael Shortt. A class of wasserstein metrics for probability distribu- tions.Michigan Mathematical Journal, 31(2):231–240, 1984

  38. [38]

    Principal component analysis: A natural approach to data exploration.ACM Computing Surveys (CSUR), 54(4):1–34, 2021

    Felipe L Gewers, Gustavo R Ferreira, Henrique F De Arruda, Filipi N Silva, Cesar H Comin, Diego R Amancio, and Luciano da F Costa. Principal component analysis: A natural approach to data exploration.ACM Computing Surveys (CSUR), 54(4):1–34, 2021

  39. [39]

    Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008

  40. [40]

    The central role of the propensity score in observational studies for causal effects.Biometrika, 70(1):41–55, 1983

    Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects.Biometrika, 70(1):41–55, 1983

  41. [41]

    Adrian Ybañez, Rosein Ancheta, Samantha Shane Evangelista, Joerabell Lourdes Aro, Fatima Maturan, Nadine May Atibing, Egberto Selerio, Kafferine Yamagishi, and Lanndon Ocampo. How can we use machine learning for characterizing organizational identification-a study using clustering with picture fuzzy datasets.International Journal of Information Management...

  42. [42]

    Comparison of tabular synthetic data generation techniques using propensity and cluster log metric.International Journal of Information Management Data Insights, 3(2): 100177, 2023

    Aryan Pathare, Ramchandra Mangrulkar, Kartik Suvarna, Aryan Parekh, Govind Thakur, and Aruna Gawade. Comparison of tabular synthetic data generation techniques using propensity and cluster log metric.International Journal of Information Management Data Insights, 3(2): 100177, 2023

  43. [43]

    Generating and evaluating cross-sectional synthetic electronic healthcare data: preserving data utility and patient privacy.Computational Intelligence, 37(2):819–851, 2021

    Zhenchen Wang, Puja Myles, and Allan Tucker. Generating and evaluating cross-sectional synthetic electronic healthcare data: preserving data utility and patient privacy.Computational Intelligence, 37(2):819–851, 2021

  44. [44]

    A multi-dimensional evaluation of synthetic data generators.IEEE Access, 10:11147–11158, 2022

    Fida K Dankar, Mahmoud K Ibrahim, and Leila Ismail. A multi-dimensional evaluation of synthetic data generators.IEEE Access, 10:11147–11158, 2022

  45. [45]

    Synthetic tab- ular data evaluation in the health domain covering resemblance, utility, and privacy dimensions

    Mikel Hernadez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. Synthetic tab- ular data evaluation in the health domain covering resemblance, utility, and privacy dimensions. Methods of information in medicine, 62(S 01):e19–e38, 2023

  46. [46]

    How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models

    Ahmed Alaa, Boris Van Breugel, Evgeny S Saveliev, and Mihaela Van Der Schaar. How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models. In International conference on machine learning, pages 290–306. PMLR, 2022

  47. [47]

    Evaluating identity disclosure risk in fully synthetic health data: model development and validation.Journal of medical Internet research, 22(11):e23139, 2020

    Khaled El Emam, Lucy Mosquera, and Jason Bass. Evaluating identity disclosure risk in fully synthetic health data: model development and validation.Journal of medical Internet research, 22(11):e23139, 2020

  48. [48]

    Ck4gen: A knowledge distillation framework for generating high-utility synthetic survival datasets in healthcare.arXiv preprint arXiv:2410.16872, 2024

    Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm, et al. Ck4gen: A knowledge distillation framework for generating high-utility synthetic survival datasets in healthcare.arXiv preprint arXiv:2410.16872, 2024

  49. [49]

    Estimating 5-year absolute risk of cardiovascular disease using routinely collected electronic medical records from australian general practices.Heart, 2025

    Nicholas I-Hsien Kuo, Sebastiano Barbieri, Clare Arnott, Blanca Gallego, Ziba Gandomkar, Shahana Ferdousi, Kirsty Douglas, Mark Woodward, and Louisa Jorm. Estimating 5-year absolute risk of cardiovascular disease using routinely collected electronic medical records from australian general practices.Heart, 2025

  50. [50]

    Ruth Walker and Janet E Hiller. The index of relative socio-economic disadvantage: general population views on indicators used to determine area-based disadvantage.Australian and New Zealand journal of public health, 29(5):442–447, 2005. 12

  51. [51]

    PRIME-CVD Data Asset 1: DAG-Simulated Cardiovascular Risk Cohort for Medical Informatics Education

    Nicholas I-Hsien Kuo. PRIME-CVD Data Asset 1: DAG-Simulated Cardiovascular Risk Cohort for Medical Informatics Education. https://figshare.com/articles/dataset/ PRIME-CVD_Data_Asset_1_DAG-Simulated_Cardiovascular_Risk_Cohort_for_ Medical_Informatics_Education/31395765, February 2026

  52. [52]

    Nicholas I-Hsien Kuo, Federico Garcia, Anders Sönnerborg, Michael Böhm, Rolf Kaiser, Maurizio Zazzi, Mark Polizzotto, Louisa Jorm, Sebastiano Barbieri, et al. Generating synthetic clinical data that capture class imbalanced distributions with generative adversarial networks: Example using antiretroviral therapy for hiv.Journal of Biomedical Informatics, 1...

  53. [53]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  54. [54]

    Synthetic health-related longitudinal data with mixed-type variables generated using diffusion models

    Nicholas I-Hsien Kuo, Federico Garcia, Anders Sonnerborg, Michael Bohm, Rolf Kaiser, Maurizio Zazzi, Louisa Jorm, and Sebastiano Barbieri. Synthetic health-related longitudinal data with mixed-type variables generated using diffusion models. InNeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI, 2023

  55. [55]

    Nicholas I-Hsien Kuo, Blanca Gallego, and Louisa Jorm. Attention-based synthetic data generation for calibration-enhanced survival analysis: a case study for chronic kidney disease using electronic health records.Journal of Biomedical Informatics, page 104928, 2025

  56. [56]

    Centrum voor Wiskunde en Informatica Amsterdam, 1995

    Guido Van Rossum, Fred L Drake, et al.Python reference manual, volume 111. Centrum voor Wiskunde en Informatica Amsterdam, 1995

  57. [57]

    identifying variables that independently predict

    John B Carlin. “identifying variables that independently predict. . . ” is not a well-defined research task.Journal of Clinical Epidemiology, 189, 2026

  58. [58]

    The distinction between causal, predictive, and descriptive research–there is still room for improvement.Journal of Clinical Epidemiology, page 111960, 2025

    Brett P Dyer. The distinction between causal, predictive, and descriptive research–there is still room for improvement.Journal of Clinical Epidemiology, page 111960, 2025

  59. [59]

    John Wiley & Sons, 2002

    John D Kalbfleisch and Ross L Prentice.The statistical analysis of failure time data. John Wiley & Sons, 2002

  60. [60]

    Regression models and life-tables.Journal of the royal statistical society: Series B (methodological), 34(2):187–202, 1972

    David R Cox. Regression models and life-tables.Journal of the royal statistical society: Series B (methodological), 34(2):187–202, 1972

  61. [61]

    Calibration: the achilles heel of predictive analytics.BMC medicine, 17(1):230, 2019

    Ben Van Calster, David J McLernon, Maarten Van Smeden, Laure Wynants, and Ewout W Steyerberg. Calibration: the achilles heel of predictive analytics.BMC medicine, 17(1):230, 2019

  62. [62]

    David Maxwell Chickering. Learning equivalence classes of bayesian-network structures. Journal of machine learning research, 2(Feb):445–498, 2002

  63. [63]

    Jonas Peters and Peter Bühlmann. Structural intervention distance (sid) for evaluating causal graphs. arXiv preprint arXiv:1306.1043, 2013

  64. [64]

    Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002

  65. [65]

    Cynthia Dwork. Differential privacy. In Encyclopedia of Cryptography, Security and Privacy, pages 649–652. Springer, 2025

  66. [66]

    Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. nature, 585(7825):357–362, 2020

  67. [67]

    Wes McKinney et al. Data structures for statistical computing in python. scipy, 445(1):51–56, 2010

  68. [68]

    Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, 17(3):261–272, 2020

  69. [69]

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011

  70. [70]

    John D Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(3):90–95, 2007

  71. [71]

    Michael L Waskom. Seaborn: statistical data visualization. Journal of open source software, 6(60):3021, 2021

  72. [72]

    Cameron Davidson-Pilon. lifelines: survival analysis in python. Journal of Open Source Software, 4(40):1317, 2019

  73. [73]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  74. [74]

    Yujia Zheng, Biwei Huang, Wei Chen, Joseph Ramsey, Mingming Gong, Ruichu Cai, Shohei Shimizu, Peter Spirtes, and Kun Zhang. Causal-learn: Causal discovery in python. Journal of Machine Learning Research, 25(60):1–8, 2024

  75. [75]

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

    Appendix: Additional Details to the Main Text

    Purpose of this Appendix. This appendix provides detailed methodological, implementation, and supplementary result information supporting the main text. It ensures reproducibility, clarifies evalu...

  76. [76]

    an auxiliary autoencoder update on embedded real samples,

  77. [77]

    multiple critic updates using the WGAN-GP objective,

  78. [78]

    one generator update using the adversarial and correlation-alignment losses. Formally, the model optimises $\min_{G,A}\,\max_{D}\; \mathbb{E}_{x_{\mathrm{real}}}[D(x_{\mathrm{real}})] - \mathbb{E}_{z}[D(G(z))] + \lambda_{\mathrm{GP}} \cdot \mathrm{GP} + \lambda_{\mathrm{corr}} \cdot \mathcal{L}_{\mathrm{corr}} + \mathcal{L}_{A}$, with the understanding that $\mathcal{L}_{A}$ is applied only to the auxiliary autoencoder parameters, while the adversarial objectives govern the generator and critic updates. In the imple...
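
The three-step update schedule and the adversarial terms of the objective can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes a toy linear critic D(x) = w · x (chosen so that the gradient penalty has a closed form), the standard WGAN-GP sign convention for the critic loss, and a hypothetical default of five critic updates per step. The autoencoder and correlation-alignment losses are omitted because their exact form is not given here, and all function names are illustrative.

```python
import math


def critic(w, x):
    """Toy linear critic D(x) = w . x (an assumption made for illustration)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient_penalty(w):
    """WGAN-GP penalty (||grad_x D(x)||_2 - 1)^2.

    For a linear critic the input gradient equals w at every point,
    so the penalty does not depend on the interpolated samples.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (norm - 1.0) ** 2


def critic_loss(w, real_batch, fake_batch, lambda_gp=10.0):
    """Loss minimised by the critic: E[D(fake)] - E[D(real)] + lambda_gp * GP."""
    mean_real = sum(critic(w, x) for x in real_batch) / len(real_batch)
    mean_fake = sum(critic(w, x) for x in fake_batch) / len(fake_batch)
    return mean_fake - mean_real + lambda_gp * gradient_penalty(w)


def generator_adversarial_loss(w, fake_batch):
    """Adversarial part of the generator loss: -E[D(G(z))]."""
    return -sum(critic(w, x) for x in fake_batch) / len(fake_batch)


def update_schedule(n_critic=5):
    """Order of updates in one training step (n_critic is a hypothetical default)."""
    return ["autoencoder"] + ["critic"] * n_critic + ["generator"]


if __name__ == "__main__":
    w = [0.6, 0.8]                      # unit-norm weights, so the penalty is ~0
    real = [[1.0, 0.0], [0.0, 1.0]]
    fake = [[0.0, 0.0], [0.0, 0.0]]
    print(update_schedule(n_critic=3))
    print(critic_loss(w, real, fake))
    print(generator_adversarial_loss(w, fake))
```

With a neural critic the input gradient is no longer constant, so the penalty would instead be evaluated by automatic differentiation at points interpolated between real and generated samples.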

  79. [79]

    sample $x_0$ from real data,

  80. [80]

    sample timestep $t \sim \mathcal{U}\{1, \dots, T\}$,
