pith. machine review for the scientific record.

arxiv: 2605.06261 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion

Eugenio Lomurno, Filippo Balzarini, Francesco Benelle, Francesca Pia Panaccione, Matteo Matteucci

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: tabular data synthesis · diffusion models · inference-time refinement · synthetic data utility · chamfer alignment · downstream task performance · TabDiff backbone

The pith

Inference-time refinement of a frozen tabular diffusion model produces synthetic data that trains downstream models better than real data does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that synthetic tabular data generated by a pre-trained diffusion model can be refined after training to close and even reverse the usual gap with real data in downstream utility. It does this through TARDIS, a framework that searches per dataset for the right combination of score-level guidance during the reverse diffusion steps and post-generation sample selection, all organized around a single pattern of symmetric alignment between synthetic and real samples. This alignment happens both continuously through gradients and discretely through ranking, without any change to the original model's weights. A sympathetic reader cares because the method works in minutes to an hour on ordinary hardware and requires no new training runs or architectural changes.

Core claim

TARDIS recovers Bidirectional Chamfer Refinement configurations on most of the 15 benchmarks and yields synthetic data that raises downstream task performance by a median 8.6 percent over models trained on real data (with strict wins on 11 of 15 datasets), while leaving the pre-trained backbone's manifold fidelity, diversity, and privacy statistics unchanged.

What carries the argument

Bidirectional Chamfer Refinement (BCR), the symmetric Chamfer functional between synthetic and real samples that is minimized both continuously via score-level gradients during reverse diffusion and discretely via batch-ranking post-generation selectors.
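
Read literally, the symmetric Chamfer functional between a synthetic batch x and a real reference batch xr, measured through the representation map φ (see Figure 1), plausibly takes the standard bidirectional form below; the paper's exact weighting and distance may differ.

    C(x, xr) = (1/|x|) Σ_{u ∈ x} min_{v ∈ xr} ‖φ(u) − φ(v)‖² + (1/|xr|) Σ_{v ∈ xr} min_{u ∈ x} ‖φ(u) − φ(v)‖²

The first term pulls each synthetic sample toward its nearest real neighbor; the second penalizes real samples left without a nearby synthetic counterpart, discouraging mode dropping.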

Load-bearing premise

The per-dataset search over guidance and selector choices reliably finds refinement settings that improve performance without overfitting to the validation objectives used inside the search.

What would settle it

Applying the same TARDIS procedure to a new collection of tabular datasets drawn from different domains and measuring no gain in downstream accuracy over either real data or the unrefined backbone.

Figures

Figures reproduced from arXiv: 2605.06261 by Eugenio Lomurno, Filippo Balzarini, Francesca Pia Panaccione, Francesco Benelle, Matteo Matteucci.

Figure 1: TARDIS pipeline. Stage I draws an oversampled noise pool Dnoise of cardinality M · Nr from the latent space Z. Stage II denoises Dnoise via reverse diffusion, perturbing the score εθ(xt, t) with the gradient of the bidirectional Chamfer functional C between the current candidate batch xt and a real reference batch xr, projected through a representation map φ; this produces the candidate pool Dcand. Stage I… (caption truncated at source; a code sketch of this guided step follows the figure list)
Figure 2: Empirical signatures of TARDIS performance: utility headroom (left) and cardinality saturation (right). Generation accounts for 47% to 90% of the wall-clock budget; Stage III selection contributes 5% to 38%, with the highest fractions on Music and News (both 38%), where the candidate pool is large; GKD distillation contributes 7% to 11% on the two datasets where it is active. The dominant cost factor is th… (caption truncated at source)
Figure 3: Stacked-bar visualization of … (caption truncated at source)
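
A minimal sketch of the Stage II guided step from Figure 1, assuming a PyTorch score network; the names (score_model, phi), the squared-distance Chamfer form, and the additive guidance term are assumptions, not the paper's verified implementation.

    import torch

    def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Bidirectional (symmetric) Chamfer functional between two batches."""
        d = torch.cdist(a, b) ** 2                 # pairwise squared distances
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

    def guided_eps(score_model, phi, x_t, t, x_real, scale):
        """One guided denoising step: perturb the frozen network output with
        the gradient of the Chamfer functional between the candidate batch
        and a real reference batch, both mapped through phi."""
        x_t = x_t.detach().requires_grad_(True)
        c = chamfer(phi(x_t), phi(x_real))
        grad = torch.autograd.grad(c, x_t)[0]      # dC/dx_t
        with torch.no_grad():
            eps = score_model(x_t, t)
        # The sign of the guidance term depends on the sampler's
        # parameterization (eps- vs. score-prediction); shown additive here.
        return eps + scale * grad
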
Original abstract

Diffusion-based generators set the current state of the art for synthetic tabular data. These methods approach but rarely exceed real-data utility, and closing this synthetic-real gap has so far been pursued exclusively at training time, via architectural advances, scaling, and retraining of monolithic generators. The inference-time alternative, i.e., refining the outputs of a pre-trained backbone with parameters left untouched, has remained largely unexplored for tabular synthesis. We introduce TARDIS (Tabular generation through Refinement, Distillation, and Inference-time Sampling), an inference-time refinement framework that operates on a frozen pre-trained backbone, configured per dataset by a Tree-structured Parzen Estimator search over score-level guidance during reverse diffusion, with each trial's objective set by an inner grid search over post-hoc sample selectors and an optional soft-label distillation step. The search space encodes a single mathematical pattern we name Bidirectional Chamfer Refinement (BCR): the symmetric Chamfer functional between synthetic and real samples is minimized both continuously, via a score-level gradient, and discretely, via batch-ranking post-generation. The per-dataset search recovers BCR-aligned configurations on most datasets, evidence for BCR as the dominant refinement pattern. Across 15 binary, multiclass, and regression benchmarks TARDIS achieves a median +8.6% downstream-task improvement over models trained on real data (95% CI [+3.3, +16.4], Wilcoxon p=0.016, 11/15 strict wins) and improves over the TabDiff backbone on all 15 datasets (mean +12.9%, p<10^-4), matching the backbone on manifold fidelity, diversity, and sample-level privacy. Inference-time refinement of a pre-trained tabular diffusion backbone reaches and exceeds real-data utility in 1 to 80 minutes on a single consumer-grade GPU.
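
The outer loop the abstract describes maps naturally onto an off-the-shelf TPE implementation. A minimal per-dataset sketch using Optuna's TPE sampler; the search bounds, the selector names, and the evaluate_downstream objective are assumptions, since the paper's exact search space is not reproduced here.

    import optuna

    def evaluate_downstream(scale: float, selector: str) -> float:
        """Placeholder: generate refined samples under these settings, train a
        downstream model on them, and return its validation utility."""
        raise NotImplementedError

    def objective(trial: optuna.Trial) -> float:
        scale = trial.suggest_float("guidance_scale", 0.0, 10.0)   # assumed bounds
        selector = trial.suggest_categorical(
            "selector", ["chamfer_topk", "none"])                  # hypothetical names
        return evaluate_downstream(scale, selector)

    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler(seed=0))
    study.optimize(objective, n_trials=50)                         # one study per dataset

Note that the value returned to the sampler is the same downstream utility later used for the headline comparison; the referee report below presses on exactly this point.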

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TARDIS, an inference-time refinement framework for frozen pre-trained tabular diffusion backbones. It configures per-dataset Tree-structured Parzen Estimator (TPE) searches over score-level guidance and post-hoc selectors (plus optional distillation) to implement Bidirectional Chamfer Refinement (BCR), claiming this recovers a dominant refinement pattern. Across 15 binary/multiclass/regression benchmarks, TARDIS reports a median +8.6% downstream-task improvement over real-data baselines (95% CI [+3.3, +16.4], Wilcoxon p=0.016, 11/15 strict wins) while matching the TabDiff backbone on fidelity, diversity, and privacy metrics.

Significance. If the reported gains are attributable to the BCR mechanism rather than per-dataset optimization, the result would be significant: it would establish that inference-time refinement of existing tabular diffusion models can close (and exceed) the synthetic-real utility gap without retraining or architectural changes, shifting emphasis from training-time advances. The statistical reporting (CIs, p-values, win counts) and focus on a frozen backbone are strengths.
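
For concreteness, the reported statistics are standard paired tests over per-dataset relative gains; a sketch with illustrative placeholder values, not the paper's per-dataset numbers:

    import numpy as np
    from scipy.stats import wilcoxon

    # Fifteen per-dataset relative gains vs. real-data training.
    # ILLUSTRATIVE placeholders only; the paper's values are not reproduced here.
    gains = np.array([0.05, 0.12, -0.02, 0.09, 0.21, 0.03, 0.07, -0.01,
                      0.16, 0.04, 0.11, 0.30, -0.03, 0.08, 0.06])

    print(f"median gain: {np.median(gains):+.1%}")   # cf. reported +8.6%
    print(f"strict wins: {(gains > 0).sum()}/15")    # cf. reported 11/15
    stat, p = wilcoxon(gains)                        # one-sample signed-rank test
    print(f"Wilcoxon p = {p:.3f}")                   # cf. reported p = 0.016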

major comments (3)
  1. [Abstract] The experimental protocol runs a fresh TPE search per dataset whose objective is downstream task performance, the same metric used to declare the +8.6% median improvement and the 11/15 wins. This leaves open whether the headline results arise from recovering a general BCR pattern or from dataset-specific exploitation of validation idiosyncrasies; a fixed BCR configuration (or median parameters) evaluated on held-out data or new domains is required to support the central claim.
  2. [Abstract, search procedure] No ablation isolates BCR from other search outcomes or reports the performance of a single, non-per-dataset BCR configuration. The claim that the search 'recovers BCR-aligned configurations on most datasets' therefore lacks direct evidence that BCR, rather than the optimization procedure itself, drives the gains over real data and the backbone.
  3. [Abstract] Dataset characteristics, exact search-space bounds for guidance scales and selectors, baseline re-implementations, and validation-fold details are not provided. Without these, it is impossible to determine whether the Wilcoxon significance and the 'exceeds real data' result are robust or sensitive to the 15 chosen benchmarks and their splits.
minor comments (2)
  1. [Abstract] The runtime range '1 to 80 minutes' should be accompanied by per-dataset GPU hours, hardware specification, and dataset sizes for reproducibility.
  2. All 15 datasets should be explicitly listed with type (binary/multiclass/regression), size, and source to allow independent verification of the benchmark suite.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications on our experimental design and committing to specific revisions that strengthen the evidence for Bidirectional Chamfer Refinement (BCR) as the driving mechanism.

Point-by-point responses
  1. Referee: [Abstract] The experimental protocol runs a fresh TPE search per dataset whose objective is downstream task performance, the same metric used to declare the +8.6% median improvement and the 11/15 wins. This leaves open whether the headline results arise from recovering a general BCR pattern or from dataset-specific exploitation of validation idiosyncrasies; a fixed BCR configuration (or median parameters) evaluated on held-out data or new domains is required to support the central claim.

    Authors: We acknowledge that the per-dataset TPE search optimizes directly for downstream performance and could in principle exploit validation-set characteristics. However, the search space is deliberately restricted to parameters that implement the BCR pattern (symmetric Chamfer minimization via score-level guidance and post-hoc selection). In the revised manuscript we will report results for a single fixed BCR configuration obtained by taking the median guidance scales and selector parameters across all 15 datasets; this fixed configuration will be evaluated on the same benchmarks to quantify how much of the reported gain persists without per-dataset re-optimization. revision: partial

  2. Referee: [Abstract, search procedure] No ablation isolates BCR from other search outcomes or reports the performance of a single, non-per-dataset BCR configuration. The claim that the search 'recovers BCR-aligned configurations on most datasets' therefore lacks direct evidence that BCR, rather than the optimization procedure itself, drives the gains over real data and the backbone.

    Authors: We agree that an explicit ablation separating BCR-aligned outcomes from other search results would provide stronger causal evidence. In the revision we will add (i) a table comparing downstream performance of the BCR-aligned configurations recovered on each dataset versus the non-BCR configurations that the TPE also evaluated, and (ii) the performance of the single median-parameter BCR configuration described above, thereby isolating the contribution of the BCR pattern from the search procedure itself. revision: yes

  3. Referee: [Abstract] Dataset characteristics, exact search-space bounds for guidance scales and selectors, baseline re-implementations, and validation-fold details are not provided. Without these, it is impossible to determine whether the Wilcoxon significance and the 'exceeds real data' result are robust or sensitive to the 15 chosen benchmarks and their splits.

    Authors: We apologize for these omissions. The revised manuscript and supplementary material will include: (a) a table summarizing the 15 datasets (size, feature types, task, source), (b) the precise numerical bounds used for the TPE search over guidance scales and selector hyperparameters, (c) exact re-implementation details for all baselines, and (d) the train/validation/test split ratios and random seeds employed for each benchmark. These additions will allow readers to assess robustness directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper is an empirical methods contribution whose central claims are measured performance improvements on 15 fixed benchmarks. The TARDIS framework explicitly includes per-dataset TPE configuration search whose objective is downstream utility; the reported +8.6% median gain and Wilcoxon statistics are therefore direct experimental outcomes of the described procedure rather than independent predictions. BCR is introduced as the mathematical pattern encoded in the search space, and the statement that the search 'recovers BCR-aligned configurations' follows from that design choice, but this interpretive remark does not reduce the headline empirical results to a tautology. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps. The derivation chain consists of standard ML experimental practice (hyperparameter search plus evaluation against real-data and backbone baselines) and is checked against external benchmarks rather than against the paper's own constructions.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard diffusion model assumptions plus the newly introduced BCR pattern and search procedure, with no independent evidence provided for BCR dominance beyond the reported empirical recovery on the tested datasets.

free parameters (2)
  • score-level guidance scale
    Per-dataset TPE search over continuous guidance during reverse diffusion
  • post-hoc selector hyperparameters
    Inner grid search per TPE trial to choose batch-ranking rules (a hypothetical ranking rule is sketched after this ledger)
axioms (2)
  • domain assumption: Pre-trained tabular diffusion backbones produce samples that can be meaningfully refined without parameter updates
    Core premise enabling frozen-backbone operation
  • ad hoc to paper: Bidirectional Chamfer Refinement is the dominant and recoverable refinement pattern across datasets
    Claimed on the basis that the search recovers BCR-aligned configurations on most datasets
invented entities (1)
  • Bidirectional Chamfer Refinement (BCR): no independent evidence
    purpose: Symmetric Chamfer minimization performed both continuously via score gradients and discretely via post-generation ranking
    Newly named mathematical pattern introduced to unify the refinement operations
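
As promised in the ledger above, a hypothetical batch-ranking selector consistent with "discrete Chamfer minimization": keep the candidates whose representations lie nearest the real reference set. The paper's BCR selector is symmetric and its hyperparameters are searched; this one-directional top-k rule is an illustrative simplification.

    import torch

    def select_topk(cand: torch.Tensor, real: torch.Tensor, n_keep: int) -> torch.Tensor:
        """Rank candidates by distance to their nearest real neighbor and keep
        the n_keep best-aligned ones (illustrative, one-directional rule)."""
        d = torch.cdist(cand, real)          # candidate-to-real distances
        score = d.min(dim=1).values          # nearest real neighbor per candidate
        keep = score.argsort()[:n_keep]      # smaller distance = better aligned
        return cand[keep]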

pith-pipeline@v0.9.0 · 5647 in / 1632 out tokens · 81729 ms · 2026-05-08T13:05:35.983356+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 10 canonical work pages · 3 internal anchors

  1. [1]

    Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures

    James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pages 115–123. PMLR, 2013.

  2. [2]

    Understanding disentangling in beta-VAE

    Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.

  3. [3]

    Density-based clustering based on hierarchical density estimates

    Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2013.

  4. [4]

    Smote: synthetic minority over-sampling technique

    Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

  5. [5]

    Xgboost: A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.

  6. [6]

    Increasing the utility of synthetic images through chamfer guidance

    Nicola Dall'Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek, Adriana Romero-Soriano, and Michal Drozdzal. Increasing the utility of synthetic images through chamfer guidance. arXiv preprint arXiv:2508.10631, 2025.

  7. [7]

    Navigating tabular data synthesis research: understanding user needs and tool capabilities

    Maria F Davila R, Sven Groen, Fabian Panse, and Wolfram Wingerath. Navigating tabular data synthesis research: understanding user needs and tool capabilities. ACM SIGMOD Record, 53(4):18–35, 2025.

  8. [8]

    Iterative subset selection for high-fidelity synthetic tabular data

    Daniel G"arber and Lea Demelius. Iterative subset selection for high-fidelity synthetic tabular data. In EurIPS 2025 Workshop: AI for Tabular Data, 2025

  9. [9]

    General data protection regulation (gdpr)

    EU GDPR. General Data Protection Regulation (GDPR), 2018.

  10. [10]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

  11. [11]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  12. [12]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  13. [13]

    Stasy: Score-based tabular data synthesis

    Jayoung Kim, Chaejeong Lee, and Noseong Park. Stasy: Score-based tabular data synthesis. arXiv preprint arXiv:2210.04018, 2022.

  14. [14]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

  15. [15]

    Tabddpm: Modelling tabular data with diffusion models

    Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. Tabddpm: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pages 17564–17579. PMLR, 2023.

  16. [16]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.

  17. [17]

    Codi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis

    Chaejeong Lee, Jayoung Kim, and Noseong Park. Codi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis. In International Conference on Machine Learning, pages 18940–18956. PMLR, 2023.

  18. [18]

    Federated knowledge recycling: Privacy-preserving synthetic data sharing

    Eugenio Lomurno and Matteo Matteucci. Federated knowledge recycling: Privacy-preserving synthetic data sharing. Pattern Recognition Letters, 190:124–130, 2025.

  19. [19]

    Synthetic image learning: Preserving performance and preventing membership inference attacks

    Eugenio Lomurno and Matteo Matteucci. Synthetic image learning: Preserving performance and preventing membership inference attacks. Pattern Recognition Letters, 190:52–58, 2025.

  20. [20]

    Tabdiff: a mixed-type diffusion model for tabular data generation

    Juntong Shi, Minkai Xu, Harper Hua, Hengrui Zhang, Stefano Ermon, and Jure Leskovec. Tabdiff: a mixed-type diffusion model for tabular data generation. arXiv preprint arXiv:2410.20626, 2024.

  21. [21]

    Tabularargn: An auto-regressive generative network for tabular data generation

    Andrey Sidorenko, Ivona Krchova, Mariana Vargas Vieyra, Paul Tiwald, Mario Scriminaci, and Michael Platzer. Tabularargn: An auto-regressive generative network for tabular data generation. In EurIPS 2025 Workshop: AI for Tabular Data, 2025.

  22. [22]

    A survey on tabular data generation: Utility, alignment, fidelity, privacy, and beyond

    Mihaela Cătălina Stoian, Eleonora Giunchiglia, and Thomas Lukasiewicz. A survey on tabular data generation: Utility, alignment, fidelity, privacy, and beyond. arXiv preprint arXiv:2503.05954, 2025.

  23. [23]

    Information-based optimal subdata selection for big data linear regression

    HaiYing Wang, Min Yang, and John Stufken. Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 114(525):393–405, 2019.

  24. [24]

    Modeling tabular data using conditional gan

    Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional gan. Advances in Neural Information Processing Systems, 32, 2019.

  25. [25]

    Mixed-type tabular data synthesis with score-based diffusion in latent space

    Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, and George Karypis. Mixed-type tabular data synthesis with score-based diffusion in latent space. arXiv preprint arXiv:2310.09656, 2023.

  26. [26]

    Ctab-gan+: Enhancing tabular data synthesis

    Zilong Zhao, Aditya Kunar, Robert Birke, Hiek Van der Scheer, and Lydia Y Chen. Ctab-gan+: Enhancing tabular data synthesis. Frontiers in Big Data, 6:1296508, 2024.