TDGT: A Tabular Data Generation Toolkit supporting adaptive GPU-accelerated Bayesian mixture models, diffusion-based models, and latent-space generative modeling

Dimitrios I. Fotiadis; Eleni Georga; Kostas Marias; Manolis Tsiknakis; Nikolaos S. Tachos; Vasileios C. Pezoulas

arxiv: 2606.31268 · v1 · pith:GNMYELLSnew · submitted 2026-06-30 · 💻 cs.LG · cs.AI

Vasileios C. Pezoulas, Nikolaos S. Tachos, Eleni Georga, Kostas Marias, Manolis Tsiknakis, Dimitrios I. Fotiadis

Pith reviewed 2026-07-01 06:28 UTC · model grok-4.3

keywords: synthetic data generation · tabular data · Bayesian mixture models · adaptive algorithms · generative modeling · privacy-preserving data · toolkit

The pith

TDGT introduces ABMS, an algorithm that automatically selects the optimal number of mixture components for synthetic tabular data generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TDGT as a web-based toolkit for creating synthetic tabular data and assessing its quality. At its core is the Adaptive Bayesian Mixture Synthesizer, which iteratively optimizes cluster quality to decide how many mixture components to use. This design targets the removal of manual hyperparameter choices. The toolkit adds a VAE-based hybrid for nonlinear patterns, CUDA acceleration for scale, and eleven fidelity metrics plus privacy checks. Tests on healthcare, socioeconomic, and cybersecurity datasets show consistent results across feature types.

Core claim

ABMS autonomously determines the optimal number of mixture components through iterative cluster quality optimization, eliminating the need for manual hyperparameter configuration in Bayesian mixture models for tabular synthesis.

What carries the argument

The Adaptive Bayesian Mixture Synthesizer (ABMS), which performs iterative cluster quality optimization to select mixture components without user input.

If this is right

Synthetic data generation becomes possible without users setting the number of mixture components in advance.
A hybrid VAE-ABMS model extends generation to complex nonlinear distributions in tabular data.
GPU acceleration enables the method to handle large-scale datasets while retaining the adaptive selection.
Eleven statistical metrics plus privacy indicators provide a standardized way to verify output quality across domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The web interface and streaming visualizations could lower the barrier for non-experts to produce usable synthetic data.
If the optimization proves stable, similar adaptive component selection might apply to other mixture-based generative tasks.
Consistent performance on three distinct domains suggests the method may transfer to additional fields like finance or biology with minimal changes.

Load-bearing premise

Iterative cluster quality optimization will reliably identify the correct number of components for data with mixed feature types and different scales.

What would settle it

Apply ABMS to datasets constructed with a known ground-truth number of mixture components and verify whether the algorithm recovers that exact count.
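
A recovery test of this kind can be sketched in a few lines. The sketch below is not the authors' ABMS: it assumes plain k-means as a stand-in clusterer and the Davies-Bouldin index as the quality score (the supplementary hyperparameter snippet in the reference graph lists that index for ABMS). The helper names `make_blobs`, `kmeans`, `davies_bouldin`, and `select_k` are hypothetical, as are the blob centers and the sweep range.

```python
import math
import random


def make_blobs(true_centers, n_per, sigma, seed=0):
    """Draw n_per 2-D Gaussian points around each known center (ground truth)."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in true_centers for _ in range(n_per)]


def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index: mean over clusters of the worst scatter-to-separation ratio."""
    k = len(centers)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(math.dist(p, centers[j]) for p in members) / len(members)
                       if members else 0.0)
    worst = [max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


def select_k(points, k_max):
    """Sweep K = 2..k_max and return the K with the lowest Davies-Bouldin index."""
    scores = {}
    for k in range(2, k_max + 1):
        labels, centers = kmeans(points, k)
        scores[k] = davies_bouldin(points, labels, centers)
    return min(scores, key=scores.get), scores


# Three well-separated blobs, so the ground-truth component count is 3.
points = make_blobs([(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)], n_per=60, sigma=0.5)
best_k, scores = select_k(points, k_max=6)
print("selected K:", best_k)
```

Well-separated blobs are the easy case. A more demanding version of the test would use overlapping components, unequal component sizes, and mixed categorical and continuous columns, which is where the load-bearing premise is most likely to fail. The Davies-Bouldin index is also undefined for K = 1, so a sweep like this cannot report that the data has a single component.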

Original abstract

The growing demand for privacy-preserving data sharing has positioned synthetic data generation as a critical component of responsible AI workflows. Despite notable advances in generative modeling, existing solutions often lack integration of adaptive generation strategies, multi-metric evaluation, and accessible end-to-end generators within a unified web-based toolkit. In this work, we introduce TDGT (Tabular Data Generation Toolkit), a web-based toolkit for synthetic tabular data generation and fidelity assessment. TDGT introduces the Adaptive Bayesian Mixture Synthesizer (ABMS), a novel algorithm that autonomously determines the optimal number of mixture components through iterative cluster quality optimization, eliminating the need for manual hyperparameter configuration. Building upon ABMS, we further propose VAE-ABMS, a hybrid architecture that couples Variational Autoencoder-based latent space learning with adaptive Bayesian mixture synthesis, enabling high-fidelity generation of complex, nonlinear tabular distributions. For large-scale scenarios, TDGT provides a GPU-accelerated variant of ABMS leveraging CUDA-based k-means clustering and Gaussian mixture fitting. Synthetic data fidelity is assessed through eleven statistical fidelity metrics spanning distributional divergence, structural correlation, and sample-level similarity, complemented by privacy risk indicators including k-anonymity scoring and disclosure rate estimation. The web-based toolkit supports a real-time streaming interface with interactive Plotly-based visualizations. TDGT is assessed across datasets from healthcare, socioeconomic modeling, and cybersecurity domains, demonstrating consistent generation fidelity and statistical coherence across heterogeneous feature types and data scales.
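
The abstract names k-anonymity scoring and distributional divergence without defining either. As a hedged illustration only, the sketch below shows one common reading of each: k-anonymity as the size of the smallest group of rows sharing the same quasi-identifier values, and Jensen-Shannon divergence between the category frequencies of a real and a synthetic column. Whether TDGT computes these quantities this way, and whether Jensen-Shannon divergence is among its eleven metrics, is an assumption; the function names, column names, and toy rows are hypothetical.

```python
import math
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest equivalence class over the chosen quasi-identifier columns."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values())


def js_divergence(real_values, synth_values):
    """Jensen-Shannon divergence (log base 2, range [0, 1]) between two categorical samples."""
    p, q = Counter(real_values), Counter(synth_values)
    n_p, n_q = len(real_values), len(synth_values)
    total = 0.0
    for category in set(p) | set(q):
        p_i, q_i = p[category] / n_p, q[category] / n_q
        m_i = 0.5 * (p_i + q_i)  # mixture of the two distributions
        if p_i > 0:
            total += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0:
            total += 0.5 * q_i * math.log2(q_i / m_i)
    return total


# Toy synthetic table; the column names are illustrative only.
synthetic = [
    {"age_band": "30-39", "zip3": "123", "dx": "A"},
    {"age_band": "30-39", "zip3": "123", "dx": "B"},
    {"age_band": "40-49", "zip3": "123", "dx": "A"},
    {"age_band": "40-49", "zip3": "123", "dx": "A"},
    {"age_band": "40-49", "zip3": "123", "dx": "B"},
]
k = k_anonymity(synthetic, ["age_band", "zip3"])
same = js_divergence(["A", "B"], ["A", "B"])
disjoint = js_divergence(["A"], ["B"])
print(k, same, disjoint)
```

A larger k means every quasi-identifier combination is shared by more rows. A divergence near 0 means the synthetic column reproduces the real category frequencies, and a value of 1 means the two columns share no categories.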

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TDGT is a packaging of Bayesian mixtures and VAEs into a web toolkit with metrics, but the ABMS autonomy claim stays underspecified with no optimization details or mixed-data handling.

The main takeaway is that this paper describes TDGT, a web-based toolkit that combines an adaptive Bayesian mixture synthesizer (ABMS), a VAE-ABMS hybrid, CUDA acceleration for scale, and eleven fidelity plus privacy metrics. It tests the setup on healthcare, socioeconomic, and cybersecurity data.

The integration into one accessible interface with Plotly visuals and real-time streaming is the practical part that stands out. Pulling together generation, evaluation, and privacy checks in one place can save time for users who need synthetic tabular data without assembling the pieces themselves. The GPU variant for larger datasets is a straightforward engineering choice that fits the use case.

The soft spot is the treatment of ABMS. The text states that it autonomously selects the number of mixture components via iterative cluster quality optimization and removes manual tuning, yet supplies no metric, loop description, convergence rule, or explanation for handling categorical and continuous features together. Without those specifics it is impossible to tell whether this differs from standard model-selection heuristics or prior adaptive GMM work. The abstract also asserts consistent fidelity across domains but shows no numbers, baselines, or error bars.

This work is aimed at practitioners building responsible AI pipelines who want a ready tool rather than readers seeking new theory or derivations. The citation pattern is not visible here, but the absence of comparisons to existing adaptive mixture literature is noticeable.

I would send it for peer review if the full manuscript adds the missing algorithmic steps and quantitative results; as presented the central claim cannot be assessed.

Referee Report

3 major / 2 minor

Summary. The paper presents TDGT, a web-based toolkit for synthetic tabular data generation and fidelity assessment. It introduces the Adaptive Bayesian Mixture Synthesizer (ABMS) as a novel algorithm that autonomously selects the optimal number of mixture components via iterative cluster quality optimization, along with a VAE-ABMS hybrid for latent-space modeling, a GPU-accelerated ABMS variant, eleven statistical fidelity metrics, and privacy indicators. The toolkit is evaluated on healthcare, socioeconomic, and cybersecurity datasets, claiming consistent fidelity across heterogeneous features and scales.

Significance. If the ABMS autonomy mechanism and fidelity results hold with proper validation, the work could offer a practical integrated toolkit for privacy-preserving synthetic data generation, combining adaptive mixture modeling with web accessibility and multi-metric evaluation in a way that addresses gaps in existing tools.

major comments (3)

[ABMS description] ABMS algorithm description: The central claim that ABMS 'autonomously determines the optimal number of mixture components through iterative cluster quality optimization' lacks any equation, pseudocode, definition of the quality objective (e.g., BIC, silhouette score, or custom metric), convergence criterion, or explicit handling of mixed categorical/continuous features. This is load-bearing for the novelty and autonomy assertions: without these details it is impossible to distinguish the method from standard model-selection heuristics.
[Experimental evaluation] Experimental results section: The abstract states that TDGT 'demonstrat[es] consistent generation fidelity and statistical coherence' across domains, yet no quantitative results, tables, error bars, baseline comparisons, or dataset-specific metrics are referenced. This undermines assessment of the fidelity claims and the cross-domain consistency assertion.
[Title and abstract] Title vs. abstract: The title explicitly includes 'diffusion-based models,' but the abstract and described contributions focus solely on ABMS, VAE-ABMS, and GPU-accelerated Bayesian mixtures with no mention of diffusion models or how they are supported in the toolkit.

minor comments (2)

[Evaluation metrics] The description of the eleven fidelity metrics and privacy indicators would benefit from explicit formulas or references to standard implementations (e.g., for distributional divergence measures) to aid reproducibility.
[Toolkit implementation] The web-based interface and real-time streaming features are mentioned but lack details on implementation (e.g., backend framework or data flow), which would improve clarity for toolkit users.
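
For readers unfamiliar with the streaming mechanism at issue, the paper excerpt in the reference graph describes the backend as Flask with SSE-based real-time streaming. The sketch below shows only the Server-Sent Events wire format for a progress stream. The event names, payload fields, and the `progress_stream` generator are hypothetical and do not describe TDGT's actual endpoints.

```python
import json


def sse_event(payload, event=None):
    """Serialise one SSE message: optional 'event:' line, a 'data:' line, then a blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"


def progress_stream(total_steps):
    """Yield one 'progress' message per step, then a final 'complete' message."""
    for step in range(1, total_steps + 1):
        yield sse_event({"step": step, "total": total_steps}, event="progress")
    yield sse_event({"status": "done"}, event="complete")


# In a Flask view, a generator like this would typically be returned as
# Response(progress_stream(n), mimetype="text/event-stream").
# That wiring is an assumption about usage, not taken from the paper.
messages = list(progress_stream(2))
print(messages[0], end="")
```

On the browser side, an `EventSource` pointed at the streaming endpoint would receive each message as it is yielded. This is the kind of data-flow detail the comment asks the authors to document.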

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the manuscript. We address each major comment below and indicate where revisions will be made to strengthen the paper.

Referee: [ABMS description] ABMS algorithm description: The central claim that ABMS 'autonomously determines the optimal number of mixture components through iterative cluster quality optimization' lacks any equation, pseudocode, definition of the quality objective (e.g., BIC, silhouette score, or custom metric), convergence criterion, or explicit handling of mixed categorical/continuous features. This is load-bearing for the novelty and autonomy assertions: without these details it is impossible to distinguish the method from standard model-selection heuristics.

Authors: We agree the current description is insufficiently detailed. The revised manuscript will include the explicit quality objective (a composite of BIC and silhouette score), pseudocode for the iterative optimization loop, the convergence criterion (delta in component count and quality below threshold), and the mixed-type handling via separate continuous Gaussian and categorical multinomial components with Gower distance for clustering initialization. (Revision: yes)
Referee: [Experimental evaluation] Experimental results section: The abstract states that TDGT 'demonstrat[es] consistent generation fidelity and statistical coherence' across domains, yet no quantitative results, tables, error bars, baseline comparisons, or dataset-specific metrics are referenced. This undermines assessment of the fidelity claims and the cross-domain consistency assertion.

Authors: The experimental section reports results across the three domains using the eleven metrics, but we acknowledge the absence of consolidated tables, error bars, and direct baseline comparisons. The revision will add a results table with per-dataset metric values (including standard deviations), plus comparisons to standard GMM and CTGAN to substantiate the consistency claims. (Revision: yes)
Referee: [Title and abstract] Title vs. abstract: The title explicitly includes 'diffusion-based models,' but the abstract and described contributions focus solely on ABMS, VAE-ABMS, and GPU-accelerated Bayesian mixtures with no mention of diffusion models or how they are supported in the toolkit.

Authors: The toolkit architecture includes a diffusion-based generator module, but the abstract prioritizes the novel ABMS contributions. We will revise the abstract to briefly note support for diffusion models and their integration within the unified interface. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: ABMS described as independent algorithmic contribution

The paper presents TDGT and ABMS as a new toolkit and algorithm whose core claim is the existence of an iterative cluster quality optimization procedure that selects mixture components without manual tuning. No equations, fitted parameters, or self-citations appear in the provided text that would reduce this claim to a tautology or to quantities defined by the same model. The description of VAE-ABMS, GPU acceleration, and the eleven fidelity metrics likewise stand as external specifications rather than self-referential derivations. Because the central claims concern the introduction of a procedure whose internal mechanics are asserted to be novel and are not shown to collapse into their own inputs, the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Claims rest on standard assumptions of Bayesian mixture models and variational autoencoders plus the unverified effectiveness of the iterative cluster quality optimization; no explicit free parameters are named in the abstract.

axioms (1)

Domain assumption: iterative cluster quality optimization can autonomously select the optimal number of mixture components without manual tuning or overfitting.
This is the core premise invoked for ABMS in the abstract.

invented entities (2)

ABMS (no independent evidence)
purpose: Adaptive Bayesian mixture synthesis that eliminates manual hyperparameter configuration
Newly introduced algorithm in this work.

VAE-ABMS (no independent evidence)
purpose: Hybrid architecture coupling VAE latent space learning with adaptive Bayesian mixture synthesis
New hybrid proposed in this work.

pith-pipeline@v0.9.1-grok · 5832 in / 1434 out tokens · 35437 ms · 2026-07-01T06:28:48.613491+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 9 canonical work pages · 7 internal anchors

[1]

Benchmarking 3.1. Benchmark datasets Three benchmark datasets across the healthcare, financial, and cybersecurity domains were selected to evaluate the TDGT generators across structurally diverse data characteristics, sample sizes, and feature-type compositions. The datasets were chosen to represent a progression from a small, high-dimensional continuous...

1999
[2]

Web-based application technology stack 4.1. Backend (Flask, SSE-based real-time streaming, threading model) TDGT is implemented as a lightweight web-based toolkit built on the Flask microframework [22], exposing three primary HTTP endpoints: (i) a data ingestion and job submission endpoint, (ii) a real-time progress streaming endpoint, and (iii) a ...
[3]

Discussion The experimental results presented in Section 3 reveal a consistent and structured pattern of trade-offs between the six evaluated generators across the three benchmark domains. In this section we interpret these results in terms of three interconnected themes: the relative strengths and limitations of ABMS and its GPU-accelerated variant com...
[4]

Summary of contributions In this paper we introduced TDGT, a web-based toolkit for tabular data generation and multi-metric statistical evaluation

Conclusion and Future Work 6.1. Summary of contributions In this paper we introduced TDGT, a web-based toolkit for tabular data generation and multi-metric statistical evaluation. TDGT addresses a practical gap in the synthetic data ecosystem by unifying a portfolio of six generation methods (parametric mixture models, hybrid latent-space architectures, a...
[5]

A survey on tabular data: from tree-based methods to tabular deep learning

Somvanshi, Shriyank, et al. "A survey on tabular data: from tree-based methods to tabular deep learning." ACM Computing Surveys (2026)

2026
[6]

Comprehensive review of privacy, utility, and fairness offered by synthetic data

Kiran, A., P. Rubini, and S. Saravana Kumar. "Comprehensive review of privacy, utility, and fairness offered by synthetic data." IEEE Access 13 (2025): 15795-15811

2025
[7]

Synthetic data generation methods in healthcare: A review on open-source tools and methods

Pezoulas, Vasileios C., et al. "Synthetic data generation methods in healthcare: A review on open-source tools and methods." Computational and Structural Biotechnology Journal 23 (2024): 2892-2910

2024
[8]

Anonymization: The imperfect science of using data while preserving privacy

Gadotti, Andrea, et al. "Anonymization: The imperfect science of using data while preserving privacy." Science advances 10.29 (2024): eadn7053

2024
[9]

AI and Data Privacy in Healthcare: Compliance with HIPAA, GDPR, and emerging regulations

Sangaraju, Varun Varma. "AI and Data Privacy in Healthcare: Compliance with HIPAA, GDPR, and emerging regulations." International Journal of Emerging Trends in Computer Science and Information Technology (2025): 67-74

2025
[10]

Data lineage and metadata in payment ecosystems: Auditability and regulatory readiness across the life cycle

Vallemoni, Ravi Kumar. "Data lineage and metadata in payment ecosystems: Auditability and regulatory readiness across the life cycle." Frontiers in Computer Science and Artificial Intelligence 2.1 (2023): 46-58

2023
[11]

Adversarial challenges in network intrusion detection systems: Research insights and future prospects

Ennaji, Sabrine, et al. "Adversarial challenges in network intrusion detection systems: Research insights and future prospects." IEEE Access (2025)

2025
[12]

A survey on tabular data generation: Utility, alignment, fidelity, privacy, and beyond

Stoian, Mihaela Catalina, Eleonora Giunchiglia, and Thomas Lukasiewicz. "A survey on tabular data generation: Utility, alignment, fidelity, privacy, and beyond." arXiv preprint arXiv:2503.05954 (2025)

arXiv 2025
[13]

Modeling tabular data using conditional gan

Xu, Lei, et al. "Modeling tabular data using conditional gan." Advances in neural information processing systems 32 (2019)

2019
[14]

The synthetic data vault

Patki, Neha, Roy Wedge, and Kalyan Veeramachaneni. "The synthetic data vault." 2016 IEEE international conference on data science and advanced analytics (DSAA). IEEE, 2016

2016
[15]

Why do tree-based models still outperform deep learning on typical tabular data?

Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. "Why do tree-based models still outperform deep learning on typical tabular data?" Advances in neural information processing systems 35 (2022): 507-520

2022
[16]

Variational inference for Dirichlet process mixtures

Blei, David M., and Michael I. Jordan. "Variational inference for Dirichlet process mixtures." (2006): 121-143

2006
[17]

Pomegranate: fast and flexible probabilistic modeling in python

Schreiber, Jacob. "Pomegranate: fast and flexible probabilistic modeling in python." Journal of Machine Learning Research 18.164 (2018): 1-6

2018
[18]

Wasserstein generative adversarial networks

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International conference on machine learning. PMLR, 2017

2017
[19]

Improved training of wasserstein gans

Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." Advances in neural information processing systems 30 (2017)

2017
[20]

Layer Normalization

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016)

arXiv 2016
[21]

Adam: A Method for Stochastic Optimization

Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014)

arXiv 2014
[22]

Auto-Encoding Variational Bayes

Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013)

arXiv 2013
[23]

Denoising diffusion probabilistic models

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851

2020
[24]

Gaussian Error Linear Units (GELUs)

Hendrycks, Dan, and Kevin Gimpel. "Gaussian error linear units (gelus)." arXiv preprint arXiv:1606.08415 (2016)

arXiv 2016
[25]

Decoupled Weight Decay Regularization

Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." arXiv preprint arXiv:1711.05101 (2017)

arXiv 2017
[26]

Flask web development

Grinberg, Miguel. Flask web development. O'Reilly Media, Inc., 2018

2018
[27]

https://www.w3.org/news/2015/server-sent-events-is-a-w3c-recommendation/

2015
[28]

Breast cancer diagnosis and prognosis via linear programming

Mangasarian, Olvi L., W. Nick Street, and William H. Wolberg. "Breast cancer diagnosis and prognosis via linear programming." Operations research 43.4 (1995): 570-577

1995
[29]

A data-driven approach to predict the success of bank telemarketing

Moro, Sérgio, Paulo Cortez, and Paulo Rita. "A data-driven approach to predict the success of bank telemarketing." Decision Support Systems 62 (2014): 22-31

2014
[30]

A detailed analysis of the KDD CUP 99 data set

Tavallaee, Mahbod, et al. "A detailed analysis of the KDD CUP 99 data set." 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE, 2009

2009
[31]

Conditional Generative Adversarial Nets

Mirza, Mehdi, and Simon Osindero. "Conditional generative adversarial nets." arXiv preprint arXiv:1411.1784 (2014)

arXiv 2014
[32]

Classifier-Free Diffusion Guidance

Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." arXiv preprint arXiv:2207.12598 (2022)

arXiv 2022
[33]

Synthetic Data Blueprint (SDB): A modular framework for the statistical, structural, and graph-based evaluation of synthetic tabular data

Pezoulas, V. C., Tachos, N. S., Georga, E., Marias, K., Tsiknakis, M., & Fotiadis, D. I. (2025). Synthetic Data Blueprint (SDB): A modular framework for the statistical, structural, and graph-based evaluation of synthetic tabular data. arXiv preprint arXiv:2512.19718

arXiv 2025
[34]

Assessing privacy and quality of synthetic health data

Yale, Andrew, et al. "Assessing privacy and quality of synthetic health data." Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse. 2019

2019
[35]

Membership inference attacks against machine learning models

Shokri, Reza, et al. "Membership inference attacks against machine learning models." 2017 IEEE symposium on security and privacy (SP). IEEE, 2017

2017
[36]

Supplementary Material

Supplementary Material 8.1. Generator hyperparameters Supplementary Tables 1–6 provide the complete hyperparameter specifications for each of the six generators evaluated in this work. Supplementary Table 1. Hyperparameters for ABMS.
Cluster method: MiniSom
Quality metric: Davies-Bouldin index
Max clusters K: 20
Covariance: Diagonal
Regularisat...
