Declarative Outcome-Conformant Synthesis: Exact, Closed-Form Specification Satisfaction and a Conformance Benchmark

Muhammed Rasin

arxiv: 2606.08736 · v1 · pith:AZAT7HMBnew · submitted 2026-06-07 · 💻 cs.LG · cs.DB


Pith reviewed 2026-06-27 18:50 UTC · model grok-4.3

classification 💻 cs.LG cs.DB

keywords: outcome-conformant synthesis · exact aggregate · synthetic tabular data · cold-start generation · Gamma population · Lukacs characterization · conformance benchmark · SpecBench


The pith

A closed-form generator based on conditional-sum sampling from a Gamma population achieves exact satisfaction of declared analytical outcomes like aggregates, with no source data required.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines outcome-conformant synthesis as the task of generating relational tabular data that exactly matches declared targets, such as revenue curves or group shares, with no real source data to imitate. It shows that off-the-shelf learned synthesizers, even when trained on the real data, miss the declared monthly aggregate by 74 to 86 percent, while a deterministic closed-form generator reaches zero error. The method models the data as draws from a Gamma population and uses Lukacs' characterization to enforce exact conditional sums. The paper separates conformance to declared outcomes from fidelity to real data and argues that the two are orthogonal evaluation axes. A new benchmark, SpecBench, measures conformance for cold-start cases.

Core claim

A widely used family of exact-aggregate generators is exactly conditional-sum sampling of a Gamma population via Lukacs' characterization, delivering closed-form exactness on the aggregate, a closed-form marginal coefficient of variation, and scale-invariance. A controlled experiment shows that enforcing the exact aggregate costs at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, with the remainder attributable to shape-family mismatch.
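The closed-form marginal coefficient of variation named in the core claim can be illustrated with standard Dirichlet moments. The derivation below is a sketch under an assumption that is not taken from the paper: all n rows share one Gamma shape α (the symbols n, α, θ, T and W are introduced here for illustration, and the paper's own formula may be more general).

```latex
% Assumed setup (illustrative, not the paper's notation):
%   X_1,\dots,X_n i.i.d. Gamma(shape \alpha, scale \theta), declared total T.
% Conditional on \sum_i X_i = T we have X_i = T\,W_i with
%   W_i \sim \mathrm{Beta}\bigl(\alpha,\,(n-1)\alpha\bigr).
\begin{align*}
\mathbb{E}[X_i \mid S = T] &= \frac{T}{n}, \\
\operatorname{Var}[X_i \mid S = T]
  &= T^2\,\frac{\alpha\,(n-1)\alpha}{(n\alpha)^2\,(n\alpha+1)}
   = T^2\,\frac{n-1}{n^2\,(n\alpha+1)}, \\
\mathrm{CV}
  &= \frac{\sqrt{\operatorname{Var}[X_i \mid S = T]}}{\mathbb{E}[X_i \mid S = T]}
   = \sqrt{\frac{n-1}{n\alpha+1}} .
\end{align*}
```

Under this assumption the CV depends on neither the declared total T nor the Gamma scale θ, which is one way to read the scale-invariance claim, and it tends to 1/√α (the CV of the unconstrained Gamma) as n grows.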

What carries the argument

Conditional-sum sampling of a Gamma population (via Lukacs' characterization), which enforces exact declared aggregates while preserving closed-form marginal properties and determinism.
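A minimal sketch of what conditional-sum sampling of a Gamma population might look like for a single declared total, assuming equal shapes across rows: draw independent Gamma variates with a common scale, then rescale them by the declared total over their sum. The function names, parameters, and seed handling are hypothetical and are not the paper's reference system.

```python
import math
import random


def exact_sum_gamma(n, total, shape, seed=0):
    """Return n positive values whose sum equals `total` up to float rounding.

    Assumption: every row uses the same Gamma shape; the common scale is set
    to 1.0 because it cancels when dividing by the sum.
    """
    rng = random.Random(seed)  # seeded generator, so the output is reproducible
    draws = [rng.gammavariate(shape, 1.0) for _ in range(n)]
    s = math.fsum(draws)
    # Proportions draws/s follow a Dirichlet law; scaling by `total`
    # conditions the population on the declared aggregate.
    return [total * d / s for d in draws]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    n = len(values)
    mean = math.fsum(values) / n
    var = math.fsum((v - mean) ** 2 for v in values) / n
    return math.sqrt(var) / mean


# Hypothetical usage: 2000 rows that must add up to a declared revenue figure.
rows = exact_sum_gamma(n=2000, total=1_000_000.0, shape=4.0, seed=11)
cv_large = coefficient_of_variation(rows)
cv_small = coefficient_of_variation(exact_sum_gamma(2000, 1.0, 4.0, seed=11))
```

With the same seed, `cv_large` and `cv_small` agree up to rounding, because changing the declared total only rescales the same proportions.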

Load-bearing premise

The relational data generation process for declared outcomes can be modeled as draws from a Gamma population whose conditional sums satisfy the exact aggregate constraint without introducing unmodeled dependencies across the schema.

What would settle it

Run the closed-form generator on a declared aggregate and measure whether the output aggregate deviates from the target by any amount greater than floating-point error, or whether the 1-Wasserstein distance to an external marginal exceeds 0.006 after accounting for shape mismatch.
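The check described above could be scripted roughly as follows. This is a sketch under stated assumptions: the generator is a simple rescaled-Gamma stand-in, the reference marginal is an unconstrained draw from the same Gamma family, and the Wasserstein distance is normalized by the reference mean. The paper's exact normalization and targets are not given on this page, so the 0.006 figure is not asserted here.

```python
import math
import random


def wasserstein_1(a, b):
    """1-Wasserstein distance between two equal-size empirical samples."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    # For equal-size samples, W1 is the mean gap between sorted values.
    return math.fsum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def conformance_report(values, target_total, reference):
    """Aggregate error plus a mean-normalized W1 distance to a reference sample."""
    total = math.fsum(values)
    ref_mean = math.fsum(reference) / len(reference)
    return {
        "aggregate_abs_error": abs(total - target_total),
        "aggregate_rel_error": abs(total - target_total) / abs(target_total),
        "w1_normalized": wasserstein_1(values, reference) / ref_mean,
    }


# Hypothetical setup: 5000 rows, declared total of 1,000,000, Gamma shape 2.
rng = random.Random(7)
n, shape, target = 5000, 2.0, 1_000_000.0
draws = [rng.gammavariate(shape, 1.0) for _ in range(n)]
s = math.fsum(draws)
exact = [target * d / s for d in draws]  # exact-sum output
scale = target / (n * shape)             # scale giving the same expected total
reference = [rng.gammavariate(shape, scale) for _ in range(n)]  # unconstrained draw
report = conformance_report(exact, target, reference)
```

A deviation in `aggregate_rel_error` well above machine precision would count against the exactness claim; `w1_normalized` here also contains ordinary sampling noise between two finite samples, so it is not directly comparable to the paper's bound.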

Figures

Figures reproduced from arXiv: 2606.08736 by Muhammed Rasin.

**Figure 1.** The P⋆ frontier across target marginals. The exact-sum engine (solid) sits on the unconstrained same-family draw (dashed); enforcing the exact aggregate adds at most 0.006 in normalized 1-Wasserstein distance. The shaded gap to the target is shape-family mismatch.

Original abstract

We study a capability the dominant paradigm in synthetic tabular data does not provide: exact satisfaction of a declared analytical outcome with no source data. Imitation methods (copulas, GANs, diffusion) learn a real distribution and sample from it, and are judged on fidelity to real data. A large, practical class of needs is different: generating data with no source data ("cold start") that reproduces a declared outcome (a revenue curve, a churn rate, a group share) across a relational schema. Off-the-shelf imitation tools offer no interface for such targets, and no sampler can hit an exact aggregate, because sampling has variance. On a real public dataset, off-the-shelf learned synthesizers trained on that very data miss the declared monthly aggregate by 74 to 86 percent; a per-period steelman cuts the miss to about 19 percent and still cannot reach 0; a closed-form generator reaches exactly 0. We name this task outcome-conformant synthesis, argue its evaluation axis is conformance rather than fidelity, and show the two axes are orthogonal. We contribute: (1) a formal account showing a widely-used family of exact-aggregate generators is exactly conditional-sum sampling of a Gamma population (via Lukacs' characterization), with closed-form exactness, a closed-form marginal CV, and scale-invariance; a controlled experiment maps the boundary, enforcing the exact aggregate costs at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, the rest being shape-family mismatch; (2) SpecBench, to our knowledge the first benchmark to measure conformance to analytical outcomes for cold-start relational synthesis; and (3) a closed-form, deterministic reference system. Exact aggregation alone is trivial; the contribution is conformance jointly with closed-form marginals, integrity, determinism, and zero source data. We concede fidelity to imitation where real data exists.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper defines outcome-conformant synthesis as a distinct task from imitation and supplies a Gamma-based closed form plus the first benchmark for hitting exact declared aggregates with no source data.


The core contribution is a clean separation between fidelity to real data and exact conformance to declared outcomes like aggregates or shares. The work targets cold-start settings where no source data exists and shows that standard imitation tools miss targets by large margins even after adjustments. It then links a family of exact generators to conditional-sum sampling from a Gamma population via Lukacs' characterization, which supplies closed-form exactness, marginal CV, and scale invariance.

The controlled experiment is useful: enforcing the exact aggregate adds at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, with the rest attributed to shape mismatch. SpecBench is presented as the first benchmark for this conformance axis, and the paper is explicit that imitation remains preferable when real data is available.

The main soft spot is that the relational case with multiple simultaneous aggregates is not fully stress-tested in the visible material. The independence premise behind the Gamma construction could be violated when several constraints interact through joins or periods, and the abstract-level error numbers do not yet show how the closed forms hold up under those conditions. The derivation itself looks externally grounded rather than circular.

This is worth a serious referee for groups working on synthetic data for testing or business rules where exact outcomes matter more than distribution matching. The new task definition and the formal angle are substantive enough to review even if the multi-constraint handling needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that outcome-conformant synthesis enables exact satisfaction of declared analytical outcomes (e.g., aggregates across a relational schema) in cold-start settings with no source data. It argues this is achieved by showing that a family of exact-aggregate generators is equivalent to conditional-sum sampling from a Gamma population via Lukacs' characterization, yielding closed-form exactness, marginal CV, and scale-invariance. A controlled experiment bounds the 1-Wasserstein cost of enforcing the exact aggregate at no more than 0.006 relative to an external marginal. Off-the-shelf methods miss declared aggregates by 74-86%, while the proposed closed-form generator reaches exactly 0. The paper introduces SpecBench for conformance evaluation and positions conformance as orthogonal to fidelity.

Significance. If the derivation and multi-constraint extension hold, the work establishes a new paradigm and evaluation axis for synthetic data that prioritizes exact declarative conformance over distributional fidelity. This is particularly relevant for cold-start relational generation tasks where specific analytical outcomes must be met deterministically, and the closed-form properties plus the SpecBench benchmark provide concrete tools and a falsifiable testbed that current imitation methods lack.

major comments (2)

[§3] Formal account via Lukacs' characterization: The central equivalence to conditional-sum sampling of independent Gammas with common scale supplies the claimed closed-form marginal CV and scale-invariance, but the relational schema setting requires simultaneous enforcement of multiple declared outcomes (different periods/groups/joins). Conditioning on one aggregate induces dependence that violates the independence premise required for the second aggregate to remain a product of conditional Gammas; the manuscript does not supply an adjustment or proof that the closed-form properties survive this joint constraint.
[Section 5] Controlled experiment (Section 5, Wasserstein bound): The reported bound of at most 0.006 in 1-Wasserstein distance is demonstrated for a single aggregate constraint; no results or extension are given for the multi-aggregate case that the abstract and introduction identify as the core relational use case. This leaves the load-bearing claim that the method jointly satisfies exactness, closed-form marginals, and relational integrity without extra adjustments unverified.

minor comments (2)

[Abstract] The "74 to 86 percent" miss should be specified as relative or absolute error on the aggregate value, and the "per-period steelman" baseline should be defined with a citation or section reference.
SpecBench description: the benchmark schema, number of declared outcomes per instance, and exact metric definitions (e.g., how conformance is aggregated across periods) are referenced but not fully specified in the provided text, hindering reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the precise identification of the gap between the single-constraint formalization and the multi-constraint relational claims. We address both major comments below and will revise the manuscript to clarify the scope and supply the missing multi-aggregate analysis.


Referee: [§3] Formal account via Lukacs' characterization: The central equivalence to conditional-sum sampling of independent Gammas with common scale supplies the claimed closed-form marginal CV and scale-invariance, but the relational schema setting requires simultaneous enforcement of multiple declared outcomes (different periods/groups/joins). Conditioning on one aggregate induces dependence that violates the independence premise required for the second aggregate to remain a product of conditional Gammas; the manuscript does not supply an adjustment or proof that the closed-form properties survive this joint constraint.

Authors: We agree that Lukacs' characterization and the resulting closed-form marginal CV and scale-invariance rest on independence of the underlying Gamma variables. Sequential conditioning on multiple aggregates necessarily introduces dependence, so the exact closed-form properties do not automatically carry over. The manuscript presents the single-constraint derivation as the core technical result and states that the approach extends to relational schemas, but supplies neither an explicit joint proof nor an adjusted procedure. We will revise §3 to (a) explicitly delimit the single-constraint guarantees and (b) either derive an approximate multi-constraint extension or describe a practical sequential sampling algorithm with its attendant loss of closed-form marginal properties. revision: yes
Referee: [Section 5] Controlled experiment (Section 5, Wasserstein bound): The reported bound of at most 0.006 in 1-Wasserstein distance is demonstrated for a single aggregate constraint; no results or extension are given for the multi-aggregate case that the abstract and introduction identify as the core relational use case. This leaves the load-bearing claim that the method jointly satisfies exactness, closed-form marginals, and relational integrity without extra adjustments unverified.

Authors: The experiment in Section 5 is restricted to a single aggregate; the 0.006 Wasserstein bound therefore applies only to that setting. The abstract and introduction do position the method for relational schemas with multiple declared outcomes, yet no multi-constraint results are reported. We accept that this leaves the joint claim unverified. In revision we will add either (i) a multi-constraint experiment that measures the additional Wasserstein cost and any degradation in marginal properties, or (ii) a clear statement that the current closed-form guarantees and bound are proven only for single constraints, with multi-constraint support left for future work. revision: yes
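The multi-constraint gap raised in both responses can be made concrete with a small illustration. The sketch below does not implement the authors' method: it swaps in iterative proportional fitting (the Deming and Stephan adjustment, reference [11]) to make a Gamma-drawn table satisfy two crossing sets of declared totals at once. All names, targets, and the choice of a 3-by-4 table are hypothetical. The result meets both sets of totals numerically, but its cell marginals are no longer the closed-form law of the single-constraint case, which is the loss the rebuttal concedes.

```python
import math
import random


def ipf(matrix, row_targets, col_targets, iters=1000, tol=1e-12):
    """Rescale a strictly positive matrix until row and column sums match targets.

    Requires sum(row_targets) == sum(col_targets); otherwise no solution exists.
    """
    m = [list(row) for row in matrix]
    for _ in range(iters):
        # Row step: after this, every row sum equals its target.
        for i, target in enumerate(row_targets):
            factor = target / math.fsum(m[i])
            m[i] = [v * factor for v in m[i]]
        # Column step: every column sum now equals its target, but the
        # row sums may drift slightly, hence the iteration.
        for j, target in enumerate(col_targets):
            factor = target / math.fsum(row[j] for row in m)
            for row in m:
                row[j] *= factor
        worst = max(abs(math.fsum(m[i]) - t) / t for i, t in enumerate(row_targets))
        if worst < tol:
            break
    return m


# Hypothetical declared outcomes: totals per group (rows) and per period (columns).
rng = random.Random(3)
row_targets = [100.0, 200.0, 300.0]
col_targets = [150.0, 150.0, 150.0, 150.0]
start = [[rng.gammavariate(5.0, 1.0) for _ in col_targets] for _ in row_targets]
table = ipf(start, row_targets, col_targets)
```

Unlike the single-total rescaling, this procedure is iterative and only converges to the targets, so both exactness in one step and the closed-form marginal properties are given up.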

Circularity Check

0 steps flagged

No significant circularity; derivation rests on external Lukacs theorem plus controlled experiment.


The paper's central formal account equates a family of exact-aggregate generators to conditional-sum sampling of a Gamma population via Lukacs' characterization (an external theorem), from which closed-form exactness, marginal CV, and scale-invariance are stated to follow. A separate controlled experiment quantifies the 1-Wasserstein cost of enforcing the aggregate against an arbitrary external marginal. No equations or claims reduce by construction to fitted parameters defined by the target result, no self-citation is load-bearing on the core identification, and no ansatz or uniqueness result is smuggled via prior author work. The derivation chain is therefore self-contained against the cited external theorem and experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of Lukacs' characterization to the generators used for relational aggregates and on the modeling choice that the data can be represented as a Gamma population without source data.

axioms (1)

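standard math · Lukacs' characterization: for independent Gamma random variables with a common scale, the vector of proportions (each variable divided by the sum) is Dirichlet-distributed and independent of the sum, so conditioning on the sum leaves a scaled Dirichlet vector.
Invoked to establish that the exact-aggregate generators are precisely conditional-sum sampling of a Gamma population.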
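For reference, the axiom can be written out as below. The notation (shapes α_i, common scale θ, sum S, declared total T, proportions W) is assumed here for illustration and is not taken from the paper. The first line is the standard Gamma-to-Dirichlet fact; Lukacs' 1955 theorem supplies the converse, namely that for independent positive non-degenerate variables, independence of the sum and the proportion forces both to be Gamma with a common scale.

```latex
% Assumed notation: X_i independent, X_i ~ Gamma(shape \alpha_i, scale \theta).
\begin{align*}
S = \sum_{i=1}^{n} X_i
  \quad &\Longrightarrow \quad
  \Bigl(\tfrac{X_1}{S},\dots,\tfrac{X_n}{S}\Bigr)
  \sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_n)
  \ \text{ and is independent of } S, \\
(X_1,\dots,X_n) \mid S = T
  \quad &\overset{d}{=} \quad
  T\,(W_1,\dots,W_n),
  \qquad W \sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_n).
\end{align*}
```

The second line is what makes exact aggregation a sampling statement rather than a post-hoc correction: rescaling the draws by T/S is a draw from the population conditioned on its sum, and the common scale θ drops out.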

pith-pipeline@v0.9.1-grok · 5878 in / 1389 out tokens · 22449 ms · 2026-06-27T18:50:29.952911+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 7 canonical work pages

[1] W. Ågren and V. Úbeda Sosa. Hierarchical conditional tabular GAN for multi-tabular synthetic data generation. 2024. Preprint, arXiv:2411.07009.

[2] J. Aitchison. The Statistical Analysis of Compositional Data. Chapman & Hall, London, 1986. ISBN 0-412-28060-4.

[3] A. Arasu, R. Kaushik, and J. Li. DataSynth: Generating synthetic data using declarative constraints. Proc. VLDB Endowment (PVLDB), 4(12), 2011. doi: 10.14778/3402755.3402785.

[4] I. Armendáriz and M. Loulakis. Conditional distribution of heavy tailed random variables on large deviations of their sum. Stochastic Processes and their Applications, 121(5):1138–1147, 2011.

[5] M. L. Balinski and H. P. Young. Fair Representation: Meeting the Ideal of One Man, One Vote. Yale University Press, 1982. ISBN 0-300-02724-9.

[6] C. Binnig, D. Kossmann, E. Lo, and M. T. Özsu. QAGen: Generating query-aware test databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 341–352, Beijing, China, 2007. doi: 10.1145/1247480.1247520.

[7] G. C. Chow and A.-l. Lin. Best linear unbiased interpolation, distribution, and extrapolation of time series by related series. The Review of Economics and Statistics, 53(4):372–375, 1971.

[8] V. S. Chundawat, A. K. Tarun, M. Mandal, M. Lahoti, and P. Narang. TabSynDex: A universal metric for robust evaluation of synthetic tabular data. 2022. Preprint, arXiv:2207.05295.

[9] L. H. Cox. A constructive procedure for unbiased controlled rounding. Journal of the American Statistical Association, 82(398):520–524, 1987.

[10] DataCebo. SDGym and SDMetrics. Software + documentation.

[11] W. E. Deming and F. F. Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics, 11(4):427–444, 1940. doi: 10.1214/aoms/1177731829.

[12] F. T. Denton. Adjustment of monthly or quarterly series to annual totals: An approach based on quadratic minimization. Journal of the American Statistical Association, 66(333):99–102, 1971.

[13] C. Esteban, S. L. Hyland, and G. Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs. 2017. Preprint, arXiv:1706.02633.

[14] S. Golchi and D. A. Campbell. Sequentially constrained Monte Carlo. Computational Statistics & Data Analysis, 97:98–113, 2016.

[15] V. Hudovernik, M. Xu, J. Shi, L. Šubelj, S. Ermon, E. Štrumbelj, and J. Leskovec. RelDiff: Relational data generative modeling with graph-based diffusion models. 2025. Preprint, arXiv:2506.00710.

[16] A. Kotelnikov, D. Baranchuk, I. Rubachev, and A. Babenko. TabDDPM: Modelling tabular data with diffusion models. In Proc. 40th Int. Conf. on Machine Learning (ICML), PMLR, volume 202, pages 17564–17579, 2023.

[17] A. D. Lautrup, T. Hyrup, A. Zimek, and P. Schneider-Kamp. SynthEval: A framework for detailed utility and privacy evaluation of tabular synthetic data. Data Mining and Knowledge Discovery, 39(1), 2025.

[18] J. Li, Z. Zhao, M. Abdollahzadeh, B. Sikdar, and Y. C. Tay. IRG: Modular synthetic relational database generation with complex relational schemas. 2023. doi: 10.1145/3770854.3780313. Preprint, arXiv:2312.15187.

[19] S. Liu, Y. Zheng, and Y. Zhang. StructSynth: Leveraging LLMs for structure-aware tabular data synthesis in low-data regimes. 2025. Preprint, arXiv:2508.02601.

[20] E. Lo, C. Binnig, D. Kossmann, M. T. Özsu, and W.-K. Hon. A framework for testing DBMS features. The VLDB Journal, 19(2):203–230, 2010. doi: 10.1007/s00778-009-0157-y.

[21] Y. Long, L. Xu, and A. Brintrup. LLM-TabLogic: Preserving inter-column logical relationships in synthetic tabular data via prompt-guided latent diffusion. 2025. Preprint, arXiv:2503.02161.

[22] E. Lukacs. A characterization of the gamma distribution. The Annals of Mathematical Statistics, 26(2):319–324, 1955.

[23] A. Nguyen, S. Schafft, N. Hale, and J. Alfaro. FASTGEN: Fast and cost-effective synthetic tabular data generation with LLMs. 2025. Preprint, arXiv:2507.15839.

[24] NVIDIA (formerly Gretel). NeMo data designer, 2025. Apache-2.0; software + documentation.

[25] N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In IEEE Int. Conf. on Data Science and Advanced Analytics (DSAA), pages 399–410, 2016. doi: 10.1109/DSAA.2016.49.

[26] A. Sanghi, S. Ahmed, and J. R. Haritsa. Projection-compliant database generation. Proc. VLDB Endowment (PVLDB), 15(5):998–1010, 2022. doi: 10.14778/3510397.3510398.

[27] A. Sidorenko, M. Platzer, M. Scriminaci, and P. Tiwald. Benchmarking synthetic tabular data: A multi-dimensional evaluation framework. 2025. Preprint, arXiv:2504.01908.

[28] J. Szavits-Nossan, M. R. Evans, and S. N. Majumdar. Condensation transition in joint large deviations of linear statistics. Journal of Physics A: Mathematical and Theoretical, 47(45):455004, 2014.

[29] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems (NeurIPS), pages 7335–7345, 2019.

[30] Z. Yao, N. Krčo, G. Ganev, and Y.-A. de Montjoye. The DCR delusion: Measuring the privacy risk of synthetic data. 2025. Preprint, arXiv:2505.01524.

[31] C. Zhang et al. Self-reinforcing controllable synthesis of rare relational data via Bayesian calibration. 2026. Preprint, arXiv:2604.16817.
