pith. machine review for the scientific record.

arxiv: 2604.08021 · v1 · submitted 2026-04-09 · 💻 cs.DB

Recognition: no theorem link

SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.DB
keywords SQL workload synthesis · query optimizer training · foreign-key graph traversal · synthetic data generation · database benchmarking · learned cost modeling · AST-based query generation · controllable SQL synthesis

The pith

SynQL generates valid, diverse SQL workloads by traversing a database's foreign-key graph and populating an abstract syntax tree under explicit parametric control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Acquiring realistic SQL workloads for training learned query optimizers is difficult because privacy rules block access to production queries and anonymized traces often omit executable text. Existing fixed benchmarks lack variety for statistical learning while language-model generators frequently produce schema errors or overly simple joins. SynQL instead walks the live foreign-key graph to deterministically construct execution-ready queries that include multi-table joins, projections, aggregations, and range predicates. A configuration vector supplies direct control over join topology, analytical intensity, and predicate selectivity. Experiments on TPC-H and IMDb schemas show the resulting workloads reach near-maximal topological diversity and support training of tree-based cost models that achieve strong accuracy on held-out synthetic data with sub-millisecond inference.
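
A minimal sketch of the downstream evaluation shape described here, assuming the synthetic corpus has already been reduced to per-query feature vectors and measured costs: fit a tree-based regressor, score R² on a held-out synthetic split, and time single-query inference. The arrays below are random placeholders, so the printed numbers mean nothing; only the harness shape is illustrated, not the paper's actual setup.

```python
# Sketch only: evaluation harness for a tree-based cost model trained on a
# synthetic corpus. X and y are random stand-ins for per-query features and
# measured costs; the paper's actual feature encoding is not reproduced here.
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((5000, 12))            # placeholder per-query feature vectors
y = rng.random(5000)                  # placeholder execution costs

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("held-out synthetic R^2:", r2_score(y_te, model.predict(X_te)))

start = time.perf_counter()
model.predict(X_te[:1])               # single-query inference
print("inference latency (ms):", (time.perf_counter() - start) * 1e3)
```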

Core claim

SynQL is a deterministic rule-based framework that traverses a database's foreign-key graph to build an abstract syntax tree for the core analytical SQL fragment of multi-table joins with projections, aggregations, and predicates. A single configuration vector Θ explicitly governs join topology (Star, Chain, Fork), analytical intensity, and predicate selectivity, guaranteeing schema and syntactic validity by construction without probabilistic generation. On TPC-H and IMDb the method yields workloads with topological entropy of 1.53 bits; tree-based cost models trained on the synthetic corpus attain R² ≥ 0.79 on held-out synthetic test sets at sub-millisecond inference latency.
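
To make "a single configuration vector Θ" concrete, here is a minimal sketch of one possible shape for it; the field names (alpha_shape, p_agg, selectivity) and value ranges are editorial guesses inferred from the abstract and figure captions, not the authors' actual interface.

```python
# Illustrative only: one plausible layout for the configuration vector Theta.
# Field names and ranges are assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Theta:
    alpha_shape: float    # topology bias: high -> Star, low -> Chain, mid -> Fork
    p_agg: float          # analytical intensity: probability a column is aggregated
    selectivity: float    # target fraction of rows passing range predicates
    max_join_depth: int   # upper bound on number of joined tables
    n_queries: int        # workload size N

star_heavy = Theta(alpha_shape=0.9, p_agg=1.0, selectivity=0.1,
                   max_join_depth=4, n_queries=10_000)
```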

What carries the argument

Foreign-key graph traversal that populates an abstract syntax tree for SQL queries, parameterized by a configuration vector Θ to control topology, intensity, and selectivity.
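
The figure captions indicate Phase I grows a join subgraph one foreign-key edge at a time, with αshape steering each expansion toward the root (Star) or the deepest frontier (Chain). The sketch below is an editorial reconstruction of that idea; the data structures, weighting formula, and the toy IMDb fragment are illustrative, not the paper's Algorithm 1.

```python
# Sketch only: alpha_shape-weighted expansion of a join blueprint over a
# foreign-key graph. Not the paper's Algorithm 1; the weighting is invented.
import random

def build_join_blueprint(fk_graph, root, depth, alpha_shape, rng=None):
    """fk_graph: dict table -> list of (neighbor_table, join_condition)."""
    rng = rng or random.Random(0)
    joined = {root: 0}                    # table -> depth at which it was joined
    edges = []
    for _ in range(depth):
        # candidate expansions: FK edges from any joined table to a new table
        candidates = [(t, nbr, cond) for t in joined
                      for nbr, cond in fk_graph.get(t, []) if nbr not in joined]
        if not candidates:
            break
        # high alpha_shape favours expanding from the root (Star);
        # low alpha_shape favours the deepest joined table (Chain)
        weights = [(alpha_shape if joined[t] == 0
                    else (1.0 - alpha_shape) * (joined[t] + 1)) + 1e-6
                   for t, _, _ in candidates]
        src, dst, cond = rng.choices(candidates, weights=weights, k=1)[0]
        joined[dst] = joined[src] + 1
        edges.append((src, dst, cond))
    return edges

imdb_fk = {"title": [("movie_info", "title.id = movie_info.movie_id"),
                     ("cast_info", "title.id = cast_info.movie_id")],
           "movie_info": [("info_type", "movie_info.info_type_id = info_type.id")]}
print(build_join_blueprint(imdb_fk, root="title", depth=3, alpha_shape=0.2))
```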

If this is right

  • Synthetic corpora can replace inaccessible production logs for training learned query optimizers.
  • Explicit control over join topology and predicate selectivity enables targeted generation of stress-test workloads.
  • Near-maximal topological entropy supports better statistical generalization than fixed-template benchmarks.
  • Trained cost models deliver accurate estimates at sub-millisecond latency suitable for real-time optimizer use.
  • The approach works across different schemas such as TPC-H and IMDb without requiring probabilistic sampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-traversal technique could be adapted to synthesize training data for related database tasks such as index recommendation or rewrite rule learning.
  • If the synthetic patterns prove sufficiently representative, the framework could reduce dependence on anonymized traces that discard executable query text.
  • Direct evaluation of SynQL-trained models against proprietary real workloads would quantify generalization beyond the synthetic domain.
  • Combining SynQL with existing benchmark suites could create hybrid evaluation pipelines that test both synthetic diversity and real-world fidelity.

Load-bearing premise

Workloads generated by traversing the foreign-key graph and controlled by Θ are sufficiently representative of real-world analytical query patterns to serve as effective training data for learned optimizers.

What would settle it

Cost models trained on SynQL data achieving markedly lower accuracy on real production query traces than on the held-out synthetic test sets would falsify the claim that the synthetic workloads are effective substitutes.

Figures

Figures reproduced from arXiv: 2604.08021 by Amit Mankodi, Kahan Mehta.

Figure 1. SynQL pipeline overview. The database catalog feeds Phase I (Algorithm 1), which produces a join blueprint under topology bias αshape. Phase II (Algorithm 2) injects semantic content and compiles each query via an AST. Configuration vector Θ governs both phases; the outer loop repeats them N times to emit workload Q. view at source ↗
Figure 2. Effect of αshape on join topology. High values attach all tables to the root R (Star); low values extend the deepest frontier (Chain); intermediate values produce branching Forks. view at source ↗
Figure 3. Detailed walkthrough of a single SynQL iteration on the IMDb schema. Phase I (steps 1–3): root table title is selected, join depth 3 is sampled, and three αshape-weighted edge expansions produce a chain subgraph with FK join conditions shown on each edge. Phase II (steps 4–6): columns are sampled with full aggregation (Pagg = 1.0), a year predicate is injected, and the AST compiler auto-appends GROUP BY… view at source ↗
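
The Phase II step summarized in the Figure 3 caption (column sampling under Pagg, predicate injection, automatic GROUP BY completion) can be illustrated with a small compile step. The sketch below is an editorial assumption about how an AST-to-SQL compiler can enforce the GROUP BY invariant by construction; compile_select, its argument layout, and the IMDb column names are illustrative, not the paper's Algorithm 2.

```python
# Sketch only: a tiny AST-to-SQL compile step that mirrors every non-aggregated
# SELECT column into GROUP BY, so "unaggregated column" errors cannot occur.
# Not the paper's Algorithm 2; names and schema details are illustrative.
def compile_select(root_table, join_edges, select_cols, agg_cols, predicates):
    select_items = [f"{func}({col})" for col, func in agg_cols] + list(select_cols)
    parts = ["SELECT " + ", ".join(select_items), "FROM " + root_table]
    for _src, dst, cond in join_edges:
        parts.append(f"JOIN {dst} ON {cond}")
    if predicates:
        parts.append("WHERE " + " AND ".join(predicates))
    if agg_cols and select_cols:          # auto-append GROUP BY for validity
        parts.append("GROUP BY " + ", ".join(select_cols))
    return "\n".join(parts)

print(compile_select(
    root_table="title",
    join_edges=[("title", "movie_info", "title.id = movie_info.movie_id")],
    select_cols=["title.production_year"],
    agg_cols=[("movie_info.id", "COUNT")],
    predicates=["title.production_year > 2000"]))
```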
read the original abstract

Database research and the development of learned query optimisers rely heavily on realistic SQL workloads. Acquiring real-world queries is increasingly difficult, however, due to strict privacy regulations, and publicly released anonymised traces typically strip out executable query text to preserve confidentiality. Existing synthesis tools fail to bridge this training data gap: traditional benchmarks offer too few fixed templates for statistical generalisation, while Large Language Model (LLM) approaches suffer from schema hallucination (fabricating non-existent columns) and topological collapse (systematically defaulting to simplistic join patterns that fail to stress-test query optimisers). We propose SynQL, a deterministic workload synthesis framework that generates structurally diverse, execution-ready SQL workloads. As a foundational step toward bridging the training-data gap, SynQL targets the core SQL fragment -- multi-table joins with projections, aggregations, and range predicates -- which dominates analytical workloads. SynQL abandons probabilistic text generation in favour of traversing the live database's foreign-key graph to populate an Abstract Syntax Tree (AST), guaranteeing schema and syntactic validity by construction. A configuration vector $\Theta$ provides explicit, parametric control over join topology (Star, Chain, Fork), analytical intensity, and predicate selectivity. Experiments on TPC-H and IMDb show that SynQL produces near-maximally diverse workloads (Topological Entropy $H = 1.53$ bits) and that tree-based cost models trained on the synthetic corpus achieve $R^2 \ge 0.79$ on held-out synthetic test sets with sub-millisecond inference latency, establishing SynQL as an effective foundation for generating training data when production logs are inaccessible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SynQL, a deterministic rule-based framework that synthesizes execution-ready SQL workloads by traversing a live database's foreign-key graph to populate an AST for multi-table joins with projections, aggregations, and range predicates. A configuration vector Θ explicitly controls join topology (Star/Chain/Fork), analytical intensity, and predicate selectivity. On TPC-H and IMDb schemas, it reports near-maximal diversity via topological entropy H=1.53 bits and shows that tree-based cost models trained on the generated corpus achieve R² ≥ 0.79 on held-out synthetic test sets with sub-millisecond inference.

Significance. If the generated workloads are shown to be representative of real analytical patterns, SynQL would provide a scalable, controllable, and validity-guaranteed source of training data for learned query optimizers where privacy constraints limit access to production logs. The deterministic FK-graph traversal and parametric control avoid common pitfalls of LLM-based synthesis such as schema hallucination. The reported entropy and latency numbers, if reproducible, would be concrete strengths.

major comments (2)
  1. [Abstract] Abstract and Experiments section: the central claim that SynQL supplies effective training data for learned optimizers rests on tree-based models reaching R² ≥ 0.79, yet this is measured only on held-out workloads produced by the identical FK-graph traversal and Θ-controlled generation process; no results are reported on the standard 22 TPC-H queries, measured DBMS execution times, or transfer to real production traces, leaving the representativeness assumption untested and risking that models simply recover the deterministic generation rules.
  2. [Abstract] Abstract: the diversity claim (Topological Entropy H = 1.53 bits) and downstream R² values are presented without any comparison to existing synthesis baselines (fixed-template benchmarks or LLM approaches) on the same schemas and downstream task, so the asserted superiority in bridging the training-data gap cannot be assessed from the reported numbers alone.
minor comments (1)
  1. [Abstract] Abstract: the definition and normalization of Topological Entropy H are not provided, making it impossible to verify whether 1.53 bits is near-maximal for the given schemas.
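
One plausible reading of the undefined quantity, offered as an editorial assumption rather than the paper's formula: if topological entropy is Shannon entropy over the Star/Chain/Fork distribution of the workload, the three-class maximum is log2(3) ≈ 1.585 bits, which would make 1.53 bits near-maximal. The sketch below computes that quantity.

```python
# Sketch only: Shannon entropy over topology classes, one possible definition
# of "Topological Entropy"; the paper's own normalization is not given here.
import math
from collections import Counter

def topological_entropy(topology_labels):
    counts = Counter(topology_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

workload = ["Star"] * 340 + ["Chain"] * 330 + ["Fork"] * 330
print(topological_entropy(workload))      # ~1.585 bits for a near-uniform mix
print(math.log2(3))                       # theoretical maximum for three classes
```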

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, clarifying the intended scope of our contributions while acknowledging limitations in the current evaluation.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: the central claim that SynQL supplies effective training data for learned optimizers rests on tree-based models reaching R² ≥ 0.79, yet this is measured only on held-out workloads produced by the identical FK-graph traversal and Θ-controlled generation process; no results are reported on the standard 22 TPC-H queries, measured DBMS execution times, or transfer to real production traces, leaving the representativeness assumption untested and risking that models simply recover the deterministic generation rules.

    Authors: We agree that the reported R² ≥ 0.79 is measured exclusively on held-out synthetic workloads generated by the same deterministic FK-graph traversal and Θ-controlled process. This choice aligns with the paper's stated goal of providing a controllable, validity-guaranteed data source specifically for scenarios where production logs are inaccessible due to privacy constraints. The high topological entropy and parametric control are intended to ensure the synthetic distribution is rich enough for model training within that regime. We acknowledge that this leaves the broader representativeness to real analytical patterns untested and creates the possibility that models may partially recover generation rules rather than general cost-model features. In the revised manuscript we will add experiments that apply the trained models to the standard 22 TPC-H queries and report measured DBMS execution times for direct comparison. We will also expand the discussion to address potential transfer and the risk of rule recovery. revision: partial

  2. Referee: [Abstract] Abstract: the diversity claim (Topological Entropy H = 1.53 bits) and downstream R² values are presented without any comparison to existing synthesis baselines (fixed-template benchmarks or LLM approaches) on the same schemas and downstream task, so the asserted superiority in bridging the training-data gap cannot be assessed from the reported numbers alone.

    Authors: We accept that the manuscript presents the entropy and R² figures without quantitative head-to-head comparisons against fixed-template or LLM-based synthesizers on identical schemas and the same downstream cost-modeling task. While the text qualitatively contrasts the limitations of templates (low diversity) and LLMs (hallucination and collapse), direct numerical evidence is absent. In the revised version we will extend the experimental section to include such baselines: we will generate comparable workloads using TPC-H templates and publicly available LLM synthesis methods on the same TPC-H and IMDb schemas, then report topological entropy and the resulting R² of tree-based cost models trained on each corpus. This will enable an objective assessment of SynQL's relative effectiveness. revision: yes

standing simulated objections not resolved
  • Direct transfer results on real production traces cannot be provided, as the authors do not have access to such proprietary data.
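
For the transfer question raised in these exchanges (and under "What would settle it" above), the check itself is simple to state: score one synthetic-trained model on both the held-out synthetic split and a set of real queries with measured runtimes, then report the gap. The sketch below assumes such real-trace features and labels exist; all data loading is hypothetical, and nothing here reproduces an experiment the paper actually ran.

```python
# Sketch only: quantify the synthetic-to-real generalization gap for a cost
# model trained on a SynQL-style corpus. Real-trace features and labels are
# assumed to be available; this is the check the referee asks for, not a result.
from sklearn.metrics import r2_score

def generalization_gap(model, X_syn_test, y_syn_test, X_real, y_real):
    r2_syn = r2_score(y_syn_test, model.predict(X_syn_test))
    r2_real = r2_score(y_real, model.predict(X_real))
    return {"synthetic_r2": r2_syn, "real_r2": r2_real, "gap": r2_syn - r2_real}
```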

Circularity Check

0 steps flagged

No significant circularity; generative framework evaluated via standard held-out split on its own outputs

full rationale

The paper presents SynQL as a deterministic, rule-based generator that traverses FK graphs under parametric control Θ to produce ASTs. The reported R² ≥ 0.79 is an experimental outcome of training tree-based regressors on one synthetic corpus and testing on a held-out portion of the same corpus; this is ordinary ML practice and does not reduce any claimed prediction to a fitted parameter or self-defined quantity by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core results. The absence of real-query validation is a limitation of external validity, not a circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that every schema contains a traversable foreign-key graph that can be used to generate representative analytical queries; no new entities are postulated and no parameters are fitted to data.

axioms (1)
  • domain assumption: The input database schema contains foreign-key relationships that can be traversed to produce valid multi-table joins.
    Invoked when describing how the AST is populated from the live database.
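
A minimal sketch of how that assumption could be checked against a live catalog before generation, assuming a PostgreSQL database reachable via psycopg2; the information_schema query is standard, but catalog details differ across systems and the connection string is hypothetical.

```python
# Sketch only: verify the schema exposes FK edges and that a chosen root table
# can reach other tables through them. Assumes PostgreSQL + psycopg2.
from collections import defaultdict, deque
import psycopg2

FK_EDGES_SQL = """
SELECT tc.table_name, ccu.table_name AS foreign_table
FROM information_schema.table_constraints tc
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"""

def fk_reachable_tables(dsn, root):
    """Return the set of tables reachable from `root` via foreign-key edges."""
    graph = defaultdict(set)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(FK_EDGES_SQL)
        for src, dst in cur.fetchall():
            graph[src].add(dst)
            graph[dst].add(src)           # joins can follow FKs in either direction
    seen, queue = {root}, deque([root])
    while queue:
        for nbr in graph[queue.popleft()]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen
```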

pith-pipeline@v0.9.0 · 5590 in / 1345 out tokens · 37248 ms · 2026-05-10T18:33:52.920576+00:00 · methodology

discussion (0)

