
arxiv: 2606.08736 · v1 · pith:AZAT7HMBnew · submitted 2026-06-07 · 💻 cs.LG · cs.DB

Declarative Outcome-Conformant Synthesis: Exact, Closed-Form Specification Satisfaction and a Conformance Benchmark

Pith reviewed 2026-06-27 18:50 UTC · model grok-4.3

keywords outcome-conformant synthesis · exact aggregate · synthetic tabular data · cold-start generation · Gamma population · Lukacs characterization · conformance benchmark · SpecBench

The pith

A closed-form generator based on conditional-sum sampling from a Gamma population achieves exact satisfaction of declared analytical outcomes like aggregates, with no source data required.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines outcome-conformant synthesis as the task of generating relational tabular data that exactly matches declared targets such as revenue curves or group shares, without any real source data to imitate. It shows that off-the-shelf methods trained on data still miss those targets by large margins, while a deterministic closed-form approach reaches zero error. The method models the data as draws from a Gamma population and uses Lukacs' characterization to enforce exact conditional sums. This separates conformance to declared outcomes from fidelity to real data, and the two are shown to be independent evaluation axes. A new benchmark measures this capability for cold-start cases.

Core claim

A widely-used family of exact-aggregate generators is exactly conditional-sum sampling of a Gamma population via Lukacs' characterization, delivering closed-form exactness on the aggregate, a closed-form marginal coefficient of variation, and scale-invariance; a controlled experiment shows that enforcing the exact aggregate costs at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, with the remainder attributable to shape-family mismatch.

What carries the argument

conditional-sum sampling of a Gamma population (via Lukacs' characterization), which enforces exact declared aggregates while preserving closed-form marginal properties and determinism
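
The mechanism named above can be sketched in a few lines. This is a minimal illustration, not the paper's reference system: the function name `exact_sum_draw`, the shape value, and the seed handling are assumptions made here for the example. It draws independent Gamma values with a common scale, converts them to proportions, and rescales the proportions to the declared total, so the aggregate is met by construction rather than by sampling luck.

```python
import math
import random


def exact_sum_draw(total, n, shape=2.0, seed=0):
    """Return n positive values whose sum is `total` (up to float rounding).

    Hypothetical helper for illustration; not an API from the paper.
    """
    rng = random.Random(seed)  # fixed seed, so the output is deterministic
    # Independent Gamma draws with a common scale (1.0 here).
    gammas = [rng.gammavariate(shape, 1.0) for _ in range(n)]
    s = math.fsum(gammas)
    # The proportions g / s are Dirichlet-distributed and independent of s,
    # so rescaling them by `total` is a draw conditioned on the sum.
    return [total * g / s for g in gammas]


rows = exact_sum_draw(total=120_000.0, n=500)
print(abs(math.fsum(rows) - 120_000.0))  # deviation from the declared total

# Scale-invariance: doubling the declared total doubles every row.
doubled = exact_sum_draw(total=240_000.0, n=500)
print(all(math.isclose(d, 2 * r) for d, r in zip(doubled, rows)))
```

Because the proportions do not depend on the common scale, the scale parameter drops out entirely; only the shape and the declared total affect the output.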

Load-bearing premise

The relational data generation process for declared outcomes can be modeled as draws from a Gamma population whose conditional sums satisfy the exact aggregate constraint without introducing unmodeled dependencies across the schema.

What would settle it

Run the closed-form generator on a declared aggregate and measure whether the output aggregate deviates from the target by any amount greater than floating-point error, or whether the 1-Wasserstein distance to an external marginal exceeds 0.006 after accounting for shape mismatch.
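
A rough version of this check can be written down directly. The sketch below is an assumption-laden stand-in for the paper's experiment: it compares an unconstrained Gamma draw with the same draw rescaled to the target, not with an external marginal, and the normalization by the mean row is a choice made here, not the paper's definition of normalized distance. The helper names are hypothetical.

```python
import math
import random


def w1_equal_size(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same length")
    pairs = zip(sorted(xs), sorted(ys))
    return math.fsum(abs(a - b) for a, b in pairs) / len(xs)


def aggregate_deviation(rows, target):
    """Absolute gap between the realized aggregate and the declared target."""
    return abs(math.fsum(rows) - target)


rng = random.Random(7)
target, n, shape = 1000.0, 2000, 3.0
# Unconstrained draw whose expected sum equals the target.
free = [rng.gammavariate(shape, target / (n * shape)) for _ in range(n)]
s = math.fsum(free)
pinned = [target * x / s for x in free]  # same draw rescaled to hit the target

tol = 1e-9 * target  # assumed tolerance for floating-point error
print(aggregate_deviation(free, target) > tol)     # independent sampling misses
print(aggregate_deviation(pinned, target) <= tol)  # rescaled draw conforms
print(w1_equal_size(free, pinned) / (target / n))  # cost in units of the mean row
```

Since the rescaling is monotone, the distance between the two samples reduces to the unconstrained draw's miss on the total divided by the row count, which is one way to see why the cost of pinning the aggregate is small for large tables.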

Figures

Figures reproduced from arXiv: 2606.08736 by Muhammed Rasin.

Figure 1: The P⋆ frontier across target marginals. The exact-sum engine (solid) sits on the unconstrained same-family draw (dashed); enforcing the exact aggregate adds at most 0.006 in normalized 1-Wasserstein distance. The shaded gap to the target is shape-family mismatch.
Original abstract

We study a capability the dominant paradigm in synthetic tabular data does not provide: exact satisfaction of a declared analytical outcome with no source data. Imitation methods (copulas, GANs, diffusion) learn a real distribution and sample from it, and are judged on fidelity to real data. A large, practical class of needs is different: generating data with no source data ("cold start") that reproduces a declared outcome (a revenue curve, a churn rate, a group share) across a relational schema. Off-the-shelf imitation tools offer no interface for such targets, and no sampler can hit an exact aggregate, because sampling has variance. On a real public dataset, off-the-shelf learned synthesizers trained on that very data miss the declared monthly aggregate by 74 to 86 percent; a per-period steelman cuts the miss to about 19 percent and still cannot reach 0; a closed-form generator reaches exactly 0. We name this task outcome-conformant synthesis, argue its evaluation axis is conformance rather than fidelity, and show the two axes are orthogonal. We contribute: (1) a formal account showing a widely-used family of exact-aggregate generators is exactly conditional-sum sampling of a Gamma population (via Lukacs' characterization), with closed-form exactness, a closed-form marginal CV, and scale-invariance; a controlled experiment maps the boundary, enforcing the exact aggregate costs at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, the rest being shape-family mismatch; (2) SpecBench, to our knowledge the first benchmark to measure conformance to analytical outcomes for cold-start relational synthesis; and (3) a closed-form, deterministic reference system. Exact aggregation alone is trivial; the contribution is conformance jointly with closed-form marginals, integrity, determinism, and zero source data. We concede fidelity to imitation where real data exists.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that outcome-conformant synthesis enables exact satisfaction of declared analytical outcomes (e.g., aggregates across a relational schema) in cold-start settings with no source data. It supports this by showing that a family of exact-aggregate generators is equivalent to conditional-sum sampling from a Gamma population via Lukacs' characterization, which yields closed-form exactness, a closed-form marginal CV, and scale-invariance. A controlled experiment bounds the 1-Wasserstein cost of enforcing the exact aggregate at 0.006 relative to an external marginal. Off-the-shelf methods miss declared aggregates by 74-86%, while the proposed closed-form generator reaches exactly 0. The paper also introduces SpecBench for conformance evaluation and positions conformance as orthogonal to fidelity.

Significance. If the derivation and multi-constraint extension hold, the work establishes a new paradigm and evaluation axis for synthetic data that prioritizes exact declarative conformance over distributional fidelity. This is particularly relevant for cold-start relational generation tasks where specific analytical outcomes must be met deterministically, and the closed-form properties plus the SpecBench benchmark provide concrete tools and a falsifiable testbed that current imitation methods lack.

major comments (2)
  1. [§3] §3 (Formal Account via Lukacs' characterization): The central equivalence to conditional-sum sampling of independent Gammas with common scale supplies the claimed closed-form marginal CV and scale-invariance, but the relational schema setting requires simultaneous enforcement of multiple declared outcomes (different periods/groups/joins). Conditioning on one aggregate induces dependence that violates the independence premise required for the second aggregate to remain a product of conditional Gammas; the manuscript does not supply an adjustment or proof that the closed-form properties survive this joint constraint.
  2. [Section 5] Controlled experiment (Section 5, Wasserstein bound): The reported bound of at most 0.006 in 1-Wasserstein distance is demonstrated for a single aggregate constraint; no results or extension are given for the multi-aggregate case that the abstract and introduction identify as the core relational use case. This leaves the load-bearing claim that the method jointly satisfies exactness, closed-form marginals, and relational integrity without extra adjustments unverified.
minor comments (2)
  1. [Abstract] Abstract: the '74 to 86 percent' miss should be stated explicitly as a relative or an absolute error on the aggregate value, and the 'per-period steelman' baseline should be defined with a citation or section reference.
  2. SpecBench description: the benchmark schema, number of declared outcomes per instance, and exact metric definitions (e.g., how conformance is aggregated across periods) are referenced but not fully specified in the provided text, hindering reproducibility.
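
Both minor comments turn on how a miss is measured and summarised. The sketch below is a hypothetical illustration of why the definition matters; the function `conformance_report`, its field names, and the numbers are invented here and are not SpecBench's metric. It assumes every declared value is nonzero.

```python
import math


def conformance_report(declared, realized):
    """Per-period relative miss plus three ways of summarising it."""
    if len(declared) != len(realized):
        raise ValueError("need one realized value per declared period")
    rel = [abs(r - d) / abs(d) for d, r in zip(declared, realized)]
    total_d, total_r = math.fsum(declared), math.fsum(realized)
    return {
        "per_period": rel,
        "mean": math.fsum(rel) / len(rel),   # average miss across periods
        "max": max(rel),                     # worst single period
        "on_total": abs(total_r - total_d) / abs(total_d),  # miss on the grand total
    }


# Illustrative numbers only, not taken from the paper.
report = conformance_report(declared=[100.0, 120.0, 80.0],
                            realized=[100.0, 90.0, 100.0])
print(report)
```

In this toy case the per-period misses partly cancel in the grand total, so the same output scores very differently depending on which summary is reported.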

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the precise identification of the gap between the single-constraint formalization and the multi-constraint relational claims. We address both major comments below and will revise the manuscript to clarify the scope and supply the missing multi-aggregate analysis.

Point-by-point responses
  1. Referee: [§3] §3 (Formal Account via Lukacs' characterization): The central equivalence to conditional-sum sampling of independent Gammas with common scale supplies the claimed closed-form marginal CV and scale-invariance, but the relational schema setting requires simultaneous enforcement of multiple declared outcomes (different periods/groups/joins). Conditioning on one aggregate induces dependence that violates the independence premise required for the second aggregate to remain a product of conditional Gammas; the manuscript does not supply an adjustment or proof that the closed-form properties survive this joint constraint.

    Authors: We agree that Lukacs' characterization and the resulting closed-form marginal CV and scale-invariance rest on independence of the underlying Gamma variables. Sequential conditioning on multiple aggregates necessarily introduces dependence, so the exact closed-form properties do not automatically carry over. The manuscript presents the single-constraint derivation as the core technical result and states that the approach extends to relational schemas, but supplies neither an explicit joint proof nor an adjusted procedure. We will revise §3 to (a) explicitly delimit the single-constraint guarantees and (b) either derive an approximate multi-constraint extension or describe a practical sequential sampling algorithm with its attendant loss of closed-form marginal properties. revision: yes

  2. Referee: [Section 5] Controlled experiment (Section 5, Wasserstein bound): The reported bound of at most 0.006 in 1-Wasserstein distance is demonstrated for a single aggregate constraint; no results or extension are given for the multi-aggregate case that the abstract and introduction identify as the core relational use case. This leaves the load-bearing claim that the method jointly satisfies exactness, closed-form marginals, and relational integrity without extra adjustments unverified.

    Authors: The experiment in Section 5 is restricted to a single aggregate; the 0.006 Wasserstein bound therefore applies only to that setting. The abstract and introduction do position the method for relational schemas with multiple declared outcomes, yet no multi-constraint results are reported. We accept that this leaves the joint claim unverified. In revision we will add either (i) a multi-constraint experiment that measures the additional Wasserstein cost and any degradation in marginal properties, or (ii) a clear statement that the current closed-form guarantees and bound are proven only for single constraints, with multi-constraint support left for future work. revision: yes
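
The multi-constraint concern raised in both exchanges can be made concrete. The sketch below uses iterative proportional fitting, a standard technique swapped in here purely for illustration; it is not the paper's method and not the authors' proposed revision. It starts from a table of Gamma draws and alternately rescales rows and columns until two crossing sets of declared totals are both met. The targets and the name `ipf` are assumptions of this example.

```python
import random


def ipf(seed_table, row_targets, col_targets, iters=200, tol=1e-12):
    """Rescale a positive table until row and column sums match the targets."""
    t = [row[:] for row in seed_table]
    for _ in range(iters):
        for i, r in enumerate(row_targets):      # match each row total
            s = sum(t[i])
            t[i] = [v * r / s for v in t[i]]
        for j, c in enumerate(col_targets):      # match each column total
            s = sum(row[j] for row in t)
            for row in t:
                row[j] *= c / s
        # Column totals are now met; stop once row totals are met as well.
        if max(abs(sum(row) - r) for row, r in zip(t, row_targets)) < tol:
            break
    return t


rng = random.Random(1)
seed = [[rng.gammavariate(2.0, 1.0) for _ in range(3)] for _ in range(4)]
period_totals = [30.0, 20.0, 40.0, 10.0]  # illustrative declared totals per period
group_totals = [50.0, 30.0, 20.0]         # illustrative declared totals per group
table = ipf(seed, period_totals, group_totals)
print([round(sum(row), 6) for row in table])
```

Both sets of totals are satisfied only in the limit of the iteration, and the fitted cells are no longer a single rescaled Gamma draw, which is consistent with the rebuttal's concession that the closed-form marginal properties do not automatically carry over to the joint case.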

Circularity Check

0 steps flagged

No significant circularity; derivation rests on external Lukacs theorem plus controlled experiment.

full rationale

The paper's central formal account equates a family of exact-aggregate generators to conditional-sum sampling of a Gamma population via Lukacs' characterization (an external theorem), from which closed-form exactness, marginal CV, and scale-invariance are stated to follow. A separate controlled experiment quantifies the 1-Wasserstein cost of enforcing the aggregate against an arbitrary external marginal. No equations or claims reduce by construction to fitted parameters defined by the target result, no self-citation is load-bearing on the core identification, and no ansatz or uniqueness result is smuggled via prior author work. The derivation chain is therefore self-contained against the cited external theorem and experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of Lukacs' characterization to the generators used for relational aggregates and on the modeling choice that the data can be represented as a Gamma population without source data.

axioms (1)
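  • standard math Lukacs' characterization: for independent Gamma random variables with a common scale, the vector of proportions (each variable divided by the sum) is Dirichlet-distributed and independent of the sum, so conditioning on the sum gives that sum times a Dirichlet draw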
    Invoked to establish that the exact-aggregate generators are precisely conditional-sum sampling of a Gamma population
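
For reference, the standard fact behind this axiom can be written out, together with the marginal coefficient of variation it implies in the symmetric case. The symbols below (shapes \alpha_i, common scale \theta, declared total T, proportions P_i) are notation chosen here; the page does not reproduce the paper's own expression, so treating this CV as the paper's closed form is an assumption.

```latex
\begin{align*}
&X_1,\dots,X_n \text{ independent},\quad X_i \sim \mathrm{Gamma}(\alpha_i,\theta),\quad S=\sum_{i=1}^{n} X_i \\
&(P_1,\dots,P_n) = \left(\tfrac{X_1}{S},\dots,\tfrac{X_n}{S}\right) \sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_n),
  \quad (P_1,\dots,P_n) \text{ independent of } S \\
&(X_1,\dots,X_n) \mid S = T \;\overset{d}{=}\; T\,(P_1,\dots,P_n),
  \qquad \sum_{i=1}^{n} T P_i = T \\
&\text{Symmetric case } \alpha_i=\alpha:\quad P_i \sim \mathrm{Beta}\bigl(\alpha,(n-1)\alpha\bigr),\quad
  \mathbb{E}[T P_i] = \tfrac{T}{n},\quad
  \operatorname{Var}(T P_i) = \frac{T^2\,(n-1)}{n^2\,(n\alpha+1)} \\
&\mathrm{CV}(T P_i) = \sqrt{\frac{n-1}{n\alpha+1}}
\end{align*}
```

The final expression contains neither T nor \theta, which is the sense in which the marginal spread is scale-invariant.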


