Diffusion and Flow Matching Models for Tabular Data: A Survey

Jiayang Shi; Lincen Yang; Matthijs van Leeuwen; Niki van Stein; Qi Huang; Thomas Bäck; Zhao Yang; Zhong Li

arxiv: 2502.17119 · v2 · pith:RXK2QDN2new · submitted 2025-02-24 · 💻 cs.LG · cs.AI

Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen

Pith reviewed 2026-05-25 07:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords diffusion models · flow matching · tabular data · generative models · survey · data synthesis · imputation · anomaly detection

The pith

This is the first survey dedicated to diffusion and flow matching models for tabular data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tabular data generation faces persistent difficulties from mixed numerical and categorical features, missing values, imbalanced categories, and domain constraints, which earlier GAN- and VAE-based approaches often handle poorly because of unstable training and mode collapse. Diffusion models address this through iterative noising and denoising, while flow matching learns transport vector fields along probability paths; both offer more stable training for tasks such as synthesis, imputation, and anomaly detection. The paper collects and organizes the scattered literature on these methods, explains why direct comparisons remain elusive, and flags open issues in scalability, privacy, and constraint handling. A reader would care because tabular records dominate real-world datasets, where reliable generative tools could improve data sharing and augmentation.
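To make the two mechanisms concrete, here is a minimal sketch of the training targets they typically use, restricted to numerical columns. It assumes a DDPM-style Gaussian forward step and a linear noise-to-data path for flow matching; these are common textbook choices, not necessarily the formulation of any particular method covered by the survey. The function names, the example row, and the values of `alpha_bar` and `t` are hypothetical, and categorical columns (which need a discrete noising scheme) are left out.

```python
import math
import random


def diffusion_noise(x0, alpha_bar, rng):
    """Gaussian forward noising: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    # eps is the regression target for a noise-prediction network.
    return x_t, eps


def flow_matching_pair(noise, data, t):
    """Linear probability path from noise (t = 0) to data (t = 1)."""
    x_t = [(1.0 - t) * z + t * x for z, x in zip(noise, data)]
    # Along a straight path the target vector field is constant: data - noise.
    velocity = [x - z for z, x in zip(noise, data)]
    return x_t, velocity


rng = random.Random(0)
row = [0.5, -1.2, 3.0]  # hypothetical standardized numerical features

# Diffusion: corrupt the row; a network would learn to predict eps from x_t.
x_t, eps = diffusion_noise(row, alpha_bar=0.9, rng=rng)

# Flow matching: pick a point on the path; a network would learn to predict v.
noise = [rng.gauss(0.0, 1.0) for _ in row]
x_mid, v = flow_matching_pair(noise, row, t=0.25)
```

In this sketch the diffusion target is the injected noise, so sampling has to undo the corruption over many steps, whereas the flow matching target is a velocity that points straight from noise to data. Following `v` from `x_mid` for the remaining time `1 - t` lands exactly on `row`, which is one way to see why path design can affect sampling efficiency.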

Core claim

To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation.

What carries the argument

The survey's four-way organizational structure around data-engineering challenges, tasks, design choices, and evaluation dimensions.

If this is right

Researchers can use the organization to locate methods for specific tabular tasks such as synthesis or imputation.
Future work must address the documented gaps in scalability and constraint-aware generation.
Standardized benchmarks would reduce the current fragmentation in evaluation protocols.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A shared evaluation protocol across tasks could accelerate progress by making incremental improvements visible.
Constraint-aware variants may prove essential for regulated domains where synthetic data must obey hard rules.
Privacy and fairness analyses could be integrated into the generative process rather than applied after the fact.

Load-bearing premise

The literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions.

What would settle it

Discovery of any earlier survey whose scope is limited to diffusion and flow matching models applied to tabular data.

Figures

Figures reproduced from arXiv: 2502.17119 by Jiayang Shi, Lincen Yang, Matthijs van Leeuwen, Niki van Stein, Qi Huang, Thomas Bäck, Zhao Yang, Zhong Li.

**Figure 1.** Timeline of Generative Models for Tabular Data: Below the timeline, key advancements in traditional machine learning models and deep generative...

**Figure 2.** Taxonomy of Diffusion Models for Tabular Data.

Original abstract

Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structured records. For tabular data, however, generative modeling remains difficult: a dataset may contain numerical and categorical attributes, missing values, sensitive fields, imbalanced categories, complex feature dependencies, and domain constraints. Earlier tabular data modeling methods based on GANs or VAEs have achieved useful results, but they can suffer from unstable training, mode collapse, weak modeling of multimodal distributions, and fragile handling of mixed-type features. Diffusion models have therefore attracted growing interest because their noising-and-denoising formulation provides a flexible and stable way to model complex data distributions, and has been adapted to tabular synthesis, missing-value imputation, trustworthy data generation, and anomaly detection. Flow matching offers a closely related route by learning transport vector fields along probability paths, often with more direct control over path design and sampling efficiency. Despite this progress, the literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions. To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. We maintain updates in a GitHub repository.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is the first survey on diffusion and flow matching for tabular data and it organizes the scattered literature around practical challenges.

This survey claims to be the first dedicated to diffusion and flow matching models for tabular data, and the abstract supports that positioning. It covers the move from earlier GAN and VAE approaches, which struggled with training stability, mode collapse, and mixed numerical-categorical features, toward diffusion's noising-denoising process and flow matching's transport fields as more stable alternatives for tasks like synthesis, imputation, anomaly detection, and constrained generation.

Referee Report

0 major / 2 minor

Summary. The manuscript is a survey of diffusion and flow matching models for tabular data, claiming to be the first dedicated review of the topic. It reviews literature from June 2015 to May 2026, organizes existing work around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discusses open problems including scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. The authors state that they maintain updates in a GitHub repository.

Significance. If the coverage is comprehensive and free of selection bias, the survey would be significant for organizing an emerging, heterogeneous literature on generative models for structured data. The explicit maintenance of a GitHub repository for updates strengthens the work by providing a mechanism for ongoing relevance and community contribution.

minor comments (2)

[Abstract] The review period is stated as extending to May 2026. The authors should clarify whether this is a projected cutoff, a typographical error, or the intended scope, as the current date of the manuscript appears to precede this endpoint.
[Abstract] The abstract refers to a GitHub repository for updates but does not provide the URL. Including the repository link in the manuscript (and ideally in the abstract) would improve accessibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The assessment correctly identifies the survey's scope, organization around data-engineering challenges and tasks, coverage of open problems, and the value of the maintained GitHub repository. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in survey paper

This manuscript is explicitly a literature survey with no derivations, equations, predictions, or technical claims whose validity depends on internal self-reference. The sole novel assertion (being the first dedicated survey) is a factual statement about external literature coverage rather than a result derived from the paper's own inputs. No self-citation chains, fitted parameters renamed as predictions, or ansatzes are present. The work is therefore self-contained when checked against external benchmarks, with a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no free parameters, axioms, or invented entities introduced by the authors.

pith-pipeline@v0.9.0 · 5833 in / 994 out tokens · 26905 ms · 2026-05-25T07:58:12.718890+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering
cs.AI · 2026-04 · unverdicted · novelty 6.0

TagCC anchors statistical tabular representations to LLM-derived textual semantic concepts via contrastive learning jointly optimized with a clustering objective, outperforming prior methods on benchmarks.



work page 2019

[11] [11]

On oversampling imbalanced data with deep conditional generative models,

V . A. Fajardo, D. Findlay, C. Jaiswal, X. Yin, R. Houmanfar, H. Xie, J. Liang, X. She, and D. B. Emerson, “On oversampling imbalanced data with deep conditional generative models,” Expert Systems with Applications, vol. 169, p. 114463, 2021

work page 2021

[12] [12]

Generating synthetic data in finance: opportunities, challenges and pitfalls,

S. A. Assefa, D. Dervovic, M. Mahfouz, R. E. Tillman, P. Reddy, and M. Veloso, “Generating synthetic data in finance: opportunities, challenges and pitfalls,” in Proceedings of the First ACM International Conference on AI in Finance , 2020, pp. 1–8

work page 2020

[13] [13]

Synthetic data generation for tabular health records: A systematic review,

M. Hernandez, G. Epelde, A. Alberdi, R. Cilla, and D. Rankin, “Synthetic data generation for tabular health records: A systematic review,”Neurocomputing, vol. 493, pp. 28–45, 2022

work page 2022

[14] [14]

Handling missing data with graph representation learning,

J. You, X. Ma, Y . Ding, M. J. Kochenderfer, and J. Leskovec, “Handling missing data with graph representation learning,” Advances in Neural Information Processing Systems , vol. 33, pp. 19 075–19 087, 2020

work page 2020

[15] [15]

Gain: Missing data imputation using generative adversarial nets,

J. Yoon, J. Jordon, and M. Schaar, “Gain: Missing data imputation using generative adversarial nets,” in International conference on machine learning. PMLR, 2018, pp. 5689–5698

work page 2018

[16] [16]

Tabular and latent space synthetic data generation: a literature review,

J. Fonseca and F. Bacao, “Tabular and latent space synthetic data generation: a literature review,” Journal of Big Data , vol. 10, no. 1, p. 115, 2023

work page 2023

[17] [17]

A tutorial on energy-based learning,

Y . LeCun, S. Chopra, R. Hadsell, M. Ranzato, F. Huang et al. , “A tutorial on energy-based learning,” Predicting structured data , vol. 1, no. 0, 2006

work page 2006

[18] [18]

Auto-Encoding Variational Bayes

D. P. Kingma, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[19] [19]

Generative adversarial nets,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial nets,” Advances in neural information processing systems , vol. 27, 2014

work page 2014

[20] [20]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,”

work page

[21] [21]

Attention Is All You Need

[Online]. Available: https://arxiv.org/abs/1706.03762

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Normalizing flows: An introduction and review of current methods,

I. Kobyzev, S. J. Prince, and M. A. Brubaker, “Normalizing flows: An introduction and review of current methods,” IEEE transactions on pattern analysis and machine intelligence , vol. 43, no. 11, pp. 3964– 3979, 2020. MANUSCRIPT SUBMITTED TO IEEE FOR POSSIBLE PUBLICATION 21 TABLE VII OVERVIEW OF DIFFUSION MODELS FOR TABULAR DATA. T HE COLUMN “NUM” INDICAT...

work page 2020

[23] [23]

Deep unsupervised learning using nonequilibrium thermodynamics,

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning . PMLR, 2015, pp. 2256–2265

work page 2015

[24] [24]

Catastrophic forgetting and mode collapse in gans,

H. Thanh-Tung and T. Tran, “Catastrophic forgetting and mode collapse in gans,” in 2020 international joint conference on neural networks (ijcnn). IEEE, 2020, pp. 1–10

work page 2020

[25] [25]

Diagnosing and enhancing vae models,

B. Dai and D. Wipf, “Diagnosing and enhancing vae models,” in International Conference on Learning Representations , 2019

work page 2019

[26] [26]

Hitchhiker’s guide on energy-based models: a compre- hensive review on the relation with other generative models, sampling and statistical physics,

D. Carbone, “Hitchhiker’s guide on energy-based models: a compre- hensive review on the relation with other generative models, sampling and statistical physics,” arXiv preprint arXiv:2406.13661 , 2024

work page arXiv 2024

[27] [27]

Limitations of autoregressive models and their alternatives,

C.-C. Lin, A. Jaech, X. Li, M. R. Gormley, and J. Eisner, “Limitations of autoregressive models and their alternatives,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL- HLT), 2021

work page 2021

[28] [28]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems , vol. 33, pp. 6840–6851, 2020

work page 2020

[29] [29]

Score-based generative modeling through stochastic differential equations,

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Rep- resentations

work page

[30] [30]

Wavegrad: Estimating gradients for waveform generation,

N. Chen, Y . Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “Wavegrad: Estimating gradients for waveform generation,” in Inter- national Conference on Learning Representations , 2020

work page 2020

[31] [31]

Diffwave: A versatile diffusion model for audio synthesis,

Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” in International Conference on Learning Representations , 2020

work page 2020

[32] [32]

Argmax flows and multinomial diffusion: Learning categorical distributions,

E. Hoogeboom, D. Nielsen, P. Jaini, P. Forr ´e, and M. Welling, “Argmax flows and multinomial diffusion: Learning categorical distributions,” Advances in Neural Information Processing Systems , vol. 34, pp. 12 454–12 465, 2021

work page 2021

[33] [33]

Structured denoising diffusion models in discrete state-spaces,

J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg, “Structured denoising diffusion models in discrete state-spaces,” Ad- vances in Neural Information Processing Systems , vol. 34, pp. 17 981– 17 993, 2021

work page 2021

[34] [34]

A survey on video diffusion models,

Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, and Y .-G. Jiang, “A survey on video diffusion models,”ACM Computing Surveys, vol. 57, no. 2, pp. 1–42, 2024

work page 2024

[35] [35]

Generative diffusion models on graphs: methods and applications,

C. Liu, W. Fan, Y . Liu, J. Li, H. Li, H. Liu, J. Tang, and Q. Li, “Generative diffusion models on graphs: methods and applications,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 6702–6711

work page 2023

[36] [36]

Stasy: Score-based tabular data synthe- sis,

J. Kim, C. Lee, and N. Park, “Stasy: Score-based tabular data synthe- sis,” in The Eleventh International Conference on Learning Represen- tations, 2023

work page 2023

[37] [37]

Autodiff: combining auto-encoder and diffusion model for tabular data synthe- sizing,

N. Suh, X. Lin, D.-Y . Hsieh, M. Honarkhah, and G. Cheng, “Autodiff: combining auto-encoder and diffusion model for tabular data synthe- sizing,” in NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI

work page 2023

[38] [38]

Codi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis,

C. Lee, J. Kim, and N. Park, “Codi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis,” in International Conference on Machine Learning . PMLR, 2023, pp. 18 940–18 956

work page 2023

[39] [39]

Mixed-type tabular data synthesis with score-based diffusion in latent space,

H. Zhang, J. Zhang, Z. Shen, B. Srinivasan, X. Qin, C. Faloutsos, H. Rangwala, and G. Karypis, “Mixed-type tabular data synthesis with score-based diffusion in latent space,” in The Twelfth International Conference on Learning Representations , 2024

work page 2024

[40] [40]

Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees,

A. Jolicoeur-Martineau, K. Fatras, and T. Kachman, “Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees,” in International Conference on Artificial Intelligence and Statis- tics. PMLR, 2024, pp. 1288–1296

work page 2024

[41] [41]

Diffusion models: A comprehensive survey of methods and applications,

L. Yang, Z. Zhang, Y . Song, S. Hong, R. Xu, Y . Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023

work page 2023

[42] [42]

A survey on generative diffusion models,

H. Cao, C. Tan, Z. Gao, Y . Xu, G. Chen, P.-A. Heng, and S. Z. Li, “A survey on generative diffusion models,” IEEE Transactions on Knowledge and Data Engineering , 2024

work page 2024

[43] [43]

Diffusion models in vision: A survey,

F.-A. Croitoru, V . Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 45, no. 9, pp. 10 850–10 869, 2023

work page 2023

[44] [44]

Diffusion models in nlp: A survey,

Y . Zhu and Y . Zhao, “Diffusion models in nlp: A survey,”arXiv preprint arXiv:2303.07576, 2023

work page arXiv 2023

[45] [45]

Diffusion models for time- MANUSCRIPT SUBMITTED TO IEEE FOR POSSIBLE PUBLICATION 22 series applications: a survey,

L. Lin, Z. Li, R. Li, X. Li, and J. Gao, “Diffusion models for time- MANUSCRIPT SUBMITTED TO IEEE FOR POSSIBLE PUBLICATION 22 series applications: a survey,” Frontiers of Information Technology & Electronic Engineering, vol. 25, no. 1, pp. 19–41, 2024

work page 2024

[46] [46]

Challenges and opportunities of generative models on tabular data,

A. X. Wang, S. S. Chukova, C. R. Simpson, and B. P. Nguyen, “Challenges and opportunities of generative models on tabular data,” Applied Soft Computing , p. 112223, 2024

work page 2024

[47] [47]

Generative models for tabular data: A review,

D.-K. Kim, D. Ryu, Y . Lee, and D.-H. Choi, “Generative models for tabular data: A review,”Journal of Mechanical Science and Technology, vol. 38, no. 9, pp. 4989–5005, 2024

work page 2024

[48] [48]

A comprehensive survey on generative diffusion models for structured data,

H. Koo and T. E. Kim, “A comprehensive survey on generative diffusion models for structured data,” arXiv e-prints, pp. arXiv–2306, 2023

work page 2023

[49] [49]

An introduction to variational autoencoders,

D. P. Kingma, M. Welling et al. , “An introduction to variational autoencoders,”Foundations and Trends® in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019

work page 2019

[50] [50]

Random variables, joint distribution functions, and copulas,

A. Sklar, “Random variables, joint distribution functions, and copulas,” Kybernetika, vol. 9, no. 6, pp. 449–460, 1973

work page 1973

[51] [51]

Gaussian mixture models

D. A. Reynolds et al. , “Gaussian mixture models.” Encyclopedia of biometrics, vol. 741, no. 659-663, 2009

work page 2009

[52] [52]

Clinical reasoning over tabular data and text with bayesian networks,

P. Rabaey, J. Deleu, S. Heytens, and T. Demeester, “Clinical reasoning over tabular data and text with bayesian networks,” in International Conference on Artificial Intelligence in Medicine . Springer, 2024, pp. 229–250

work page 2024

[53] [53]

Smote: synthetic minority over-sampling technique,

N. V . Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of ar- tificial intelligence research, vol. 16, pp. 321–357, 2002

work page 2002

[54] [54]

Borderline-smote: a new over- sampling method in imbalanced data sets learning,

H. Han, W.-Y . Wang, and B.-H. Mao, “Borderline-smote: a new over- sampling method in imbalanced data sets learning,” in International conference on intelligent computing . Springer, 2005, pp. 878–887

work page 2005

[55] [55]

Synthetic minority oversampling using edited displacement-based k-nearest neighbors,

A. X. Wang, S. S. Chukova, and B. P. Nguyen, “Synthetic minority oversampling using edited displacement-based k-nearest neighbors,” Applied Soft Computing , vol. 148, p. 110895, 2023

work page 2023

[56] [56]

Smote-enc: A novel smote-based method to generate synthetic data for nominal and continuous features,

M. Mukherjee and M. Khushi, “Smote-enc: A novel smote-based method to generate synthetic data for nominal and continuous features,” Applied system innovation , vol. 4, no. 1, p. 18, 2021

work page 2021

[57] [57]

Adasyn: Adaptive synthetic sampling approach for imbalanced learning,

H. He, Y . Bai, E. A. Garcia, and S. Li, “Adasyn: Adaptive synthetic sampling approach for imbalanced learning,” in 2008 IEEE interna- tional joint conference on neural networks (IEEE world congress on computational intelligence). Ieee, 2008, pp. 1322–1328

work page 2008

[58] [58]

synthpop: Bespoke creation of synthetic data in r,

B. Nowok, G. M. Raab, and C. Dibben, “synthpop: Bespoke creation of synthetic data in r,” Journal of statistical software, vol. 74, pp. 1–26, 2016

work page 2016

[59] [59]

Modeling tabular data using conditional gan,

L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni, “Modeling tabular data using conditional gan,” Advances in neural information processing systems , vol. 32, 2019

work page 2019

[60] [60]

Goggle: Generative modelling for tabular data by learning relational structure,

T. Liu, Z. Qian, J. Berrevoets, and M. van der Schaar, “Goggle: Generative modelling for tabular data by learning relational structure,” in The Eleventh International Conference on Learning Representations, 2023

work page 2023

[61] [61]

Ctab-gan: Effective table data synthesizing,

Z. Zhao, A. Kunar, R. Birke, and L. Y . Chen, “Ctab-gan: Effective table data synthesizing,” in Asian Conference on Machine Learning . PMLR, 2021, pp. 97–112

work page 2021

[62] [62]

Ctab- gan+: Enhancing tabular data synthesis,

Z. Zhao, A. Kunar, R. Birke, H. Van der Scheer, and L. Y . Chen, “Ctab- gan+: Enhancing tabular data synthesis,” Frontiers in big Data, vol. 6, p. 1296508, 2024

work page 2024

[63] [63]

Large Language Models: A Survey

S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Ama- triain, and J. Gao, “Large language models: A survey,” arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[64] [64]

Language models are realistic tabular data generators,

V . Borisov, K. Sessler, T. Leemann, M. Pawelczyk, and G. Kasneci, “Language models are realistic tabular data generators,” in The Eleventh International Conference on Learning Representations , 2023. [Online]. Available: https://openreview.net/forum?id=cEygmQNOeI

work page 2023

[65] [65]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[66] [66]

Diffusion models beat gans on image synthesis,

P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021

work page 2021

[67] [67]

Sos: Score-based oversampling for tabular data,

J. Kim, C. Lee, Y . Shin, S. Park, M. Kim, N. Park, and J. Cho, “Sos: Score-based oversampling for tabular data,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , 2022, pp. 762–772

work page 2022

[68] [68]

Large language models (LLMs) on tabular data: Prediction, generation, and understanding - a survey,

X. Fang, W. Xu, F. A. Tan, Z. Hu, J. Zhang, Y . Qi, S. H. Sengamedu, and C. Faloutsos, “Large language models (LLMs) on tabular data: Prediction, generation, and understanding - a survey,”Transactions on Machine Learning Research , 2024. [Online]. Available: https://openreview.net/forum?id=IZnrCGF9WI

work page 2024

[69] [69]

Diffusion models for missing value imputation in tabular data,

S. Zheng and N. Charoenphakdee, “Diffusion models for missing value imputation in tabular data,” inNeurIPS 2022 First Table Representation Workshop

work page 2022

[70] [70]

What do we really know about wages? the importance of nonreporting and census imputation,

L. Lillard, J. P. Smith, and F. Welch, “What do we really know about wages? the importance of nonreporting and census imputation,”Journal of Political Economy, vol. 94, no. 3, Part 1, pp. 489–506, 1986

work page 1986

[71] [71]

Strategies for handling missing data in electronic health record derived data,

B. J. Wells, K. M. Chagin, A. S. Nowacki, and M. W. Kattan, “Strategies for handling missing data in electronic health record derived data,” Egems, vol. 1, no. 3, 2013

work page 2013

[72] [72]

A survey on missing data in machine learning,

T. Emmanuel, T. Maupong, D. Mpoeleng, T. Semong, B. Mphago, and O. Tabona, “A survey on missing data in machine learning,” Journal of Big data , vol. 8, pp. 1–37, 2021

work page 2021

[73] [73]

Inference and missing data,

D. B. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976

work page 1976

[74] [74]

Tabdiff: a unified diffusion model for multi-modal tabular data generation,

J. Shi, M. Xu, H. Hua, H. Zhang, S. Ermon, and J. Leskovec, “Tabdiff: a unified diffusion model for multi-modal tabular data generation,” in NeurIPS 2024 Third Table Representation Learning Workshop

work page 2024

[75] [75]

Generative modeling by estimating gradients of the data distribution,

Y . Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in neural information processing systems, vol. 32, 2019

work page 2019

[76] [76]

P. E. Kloeden, E. Platen, P. E. Kloeden, and E. Platen, Stochastic differential equations. Springer, 1992

work page 1992

[77] [77]

Neural ordinary differential equations,

R. T. Chen, Y . Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” Advances in neural information pro- cessing systems, vol. 31, 2018

work page 2018

[78] [78]

Classifier-free diffusion guidance,

J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Appli- cations, 2021

work page 2021

[79] [79]

Tabular data aug- mentation for machine learning: Progress and prospects of embracing generative ai,

L. Cui, H. Li, K. Chen, L. Shou, and G. Chen, “Tabular data aug- mentation for machine learning: Progress and prospects of embracing generative ai,” arXiv preprint arXiv:2407.21523 , 2024

work page arXiv 2024

[80] [80]

Missdiff: Training diffusion models on tabular data with missing values,

Y . Ouyang, L. Xie, C. Li, and G. Cheng, “Missdiff: Training diffusion models on tabular data with missing values,” in ICML 2023 Workshop on Structured Probabilistic Inference {\&} Generative Modeling , 2023

work page 2023