
arxiv: 2502.17119 · v2 · pith:RXK2QDN2new · submitted 2025-02-24 · 💻 cs.LG · cs.AI

Diffusion and Flow Matching Models for Tabular Data: A Survey

Pith reviewed 2026-05-25 07:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords diffusion models · flow matching · tabular data · generative models · survey · data synthesis · imputation · anomaly detection

The pith

The paper claims to be the first survey dedicated to diffusion and flow matching models for tabular data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tabular data generation faces persistent difficulties from mixed numerical and categorical features, missing values, imbalances, and domain constraints that earlier GAN and VAE approaches often handle unstably. Diffusion models address this through iterative noising and denoising, while flow matching learns direct transport fields, both offering more stable training for tasks like synthesis, imputation, and anomaly detection. The paper collects and organizes the scattered literature on these methods, identifies why direct comparisons remain elusive, and flags open issues in scalability, privacy, and constraint handling. A reader would care because tabular records dominate real-world datasets where reliable generative tools could improve data sharing and augmentation.
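To make the two mechanisms named above concrete, here is a minimal sketch. It is not taken from the survey. It assumes the standard closed-form DDPM forward noising step and a linear interpolation path for flow matching, applied to a single row of numerical features. The function names, the linear beta schedule, and the example row are illustrative assumptions. Categorical columns need a separate mechanism and are not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used default)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 0..t (0-indexed)."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def diffusion_noise(x0, betas, t, rng):
    """Forward noising: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, eps)]
    # A denoising network would be trained to predict eps from (xt, t).
    return xt, eps


def flow_matching_pair(x_data, t, rng):
    """Point on a straight path from noise (t=0) to data (t=1), plus its velocity."""
    z = [rng.gauss(0.0, 1.0) for _ in x_data]
    xt = [(1.0 - t) * n + t * x for n, x in zip(z, x_data)]
    velocity = [x - n for n, x in zip(z, x_data)]
    # A vector-field network would be regressed onto velocity at (xt, t).
    return xt, velocity


rng = random.Random(0)
row = [0.5, -1.2, 2.0]  # hypothetical standardized numerical features of one record
betas = linear_beta_schedule(1000)
xt_diff, eps = diffusion_noise(row, betas, t=999, rng=rng)
xt_flow, v = flow_matching_pair(row, t=0.25, rng=rng)
```

In this sketch the diffusion target is the injected noise at a discrete step, while the flow matching target is a constant velocity along a chosen path. This is one way to read the abstract's remark that flow matching gives more direct control over path design.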

Core claim

To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation.

What carries the argument

The survey's four-way organizational structure around data-engineering challenges, tasks, design choices, and evaluation dimensions.

If this is right

  • Researchers can use the organization to locate methods for specific tabular tasks such as synthesis or imputation.
  • Future work must address the documented gaps in scalability and constraint-aware generation.
  • Standardized benchmarks would reduce the current fragmentation in evaluation protocols.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A shared evaluation protocol across tasks could accelerate progress by making incremental improvements visible.
  • Constraint-aware variants may prove essential for regulated domains where synthetic data must obey hard rules.
  • Privacy and fairness analyses could be integrated into the generative process rather than applied after the fact.

Load-bearing premise

The literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions.

What would settle it

Discovery of any earlier survey whose scope is limited to diffusion and flow matching models applied to tabular data.

Figures

Figures reproduced from arXiv: 2502.17119 by Jiayang Shi, Lincen Yang, Matthijs van Leeuwen, Niki van Stein, Qi Huang, Thomas Bäck, Zhao Yang, Zhong Li.

Figure 1: Timeline of Generative Models for Tabular Data: Below the timeline, key advancements in traditional machine learning models and deep generative …

Figure 2: Taxonomy of Diffusion Models for Tabular Data.

Original abstract

Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structured records. For tabular data, however, generative modeling remains difficult: a dataset may contain numerical and categorical attributes, missing values, sensitive fields, imbalanced categories, complex feature dependencies, and domain constraints. Earlier tabular data modeling methods based on GANs or VAEs have achieved useful results, but they can suffer from unstable training, mode collapse, weak modeling of multimodal distributions, and fragile handling of mixed-type features. Diffusion models have therefore attracted growing interest because their noising-and-denoising formulation provides a flexible and stable way to model complex data distributions, and has been adapted to tabular synthesis, missing-value imputation, trustworthy data generation, and anomaly detection. Flow matching offers a closely related route by learning transport vector fields along probability paths, often with more direct control over path design and sampling efficiency. Despite this progress, the literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions. To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. We maintain updates in a GitHub repository.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript is a survey of diffusion and flow matching models for tabular data, claiming to be the first dedicated review of the topic. It reviews literature from June 2015 to May 2026, organizes existing work around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discusses open problems including scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. The authors state that they maintain updates in a GitHub repository.

Significance. If the coverage is comprehensive and free of selection bias, the survey would be significant for organizing an emerging, heterogeneous literature on generative models for structured data. The explicit maintenance of a GitHub repository for updates strengthens the work by providing a mechanism for ongoing relevance and community contribution.

minor comments (2)
  1. [Abstract] The review period is stated as extending to May 2026. The authors should clarify whether this is a projected cutoff, a typographical error, or the intended scope, as the current date of the manuscript appears to precede this endpoint.
  2. [Abstract] The abstract refers to a GitHub repository for updates but does not provide the URL. Including the repository link in the manuscript (and ideally in the abstract) would improve accessibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The assessment correctly identifies the survey's scope, organization around data-engineering challenges and tasks, coverage of open problems, and the value of the maintained GitHub repository. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in survey paper

full rationale

This manuscript is explicitly a literature survey with no derivations, equations, predictions, or technical claims whose validity depends on internal self-reference. The sole novel assertion (being the first dedicated survey) is a factual statement about external literature coverage rather than a result derived from the paper's own inputs. No self-citation chains, fitted parameters renamed as predictions, or ansatzes are present. The work therefore rests on external literature rather than on its own outputs, giving a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no free parameters, axioms, or invented entities introduced by the authors.

pith-pipeline@v0.9.0 · 5833 in / 994 out tokens · 26905 ms · 2026-05-25T07:58:12.718890+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    TagCC anchors statistical tabular representations to LLM-derived textual semantic concepts via contrastive learning jointly optimized with a clustering objective, outperforming prior methods on benchmarks.
