SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking
Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3
The pith
SynQL generates valid, diverse SQL workloads by traversing a database's foreign-key graph and populating an abstract syntax tree under explicit parametric control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SynQL is a deterministic rule-based framework that traverses a database's foreign-key graph to build an abstract syntax tree for the core analytical SQL fragment: multi-table joins with projections, aggregations, and range predicates. A single configuration vector Θ explicitly governs join topology (Star, Chain, Fork), analytical intensity, and predicate selectivity, guaranteeing schema and syntactic validity by construction without probabilistic generation. On TPC-H and IMDb the method yields workloads with near-maximal topological entropy of 1.53 bits; tree-based cost models trained on the synthetic corpus attain R² ≥ 0.79 on held-out synthetic test sets at sub-millisecond inference latency.
What carries the argument
Foreign-key graph traversal that populates an abstract syntax tree for SQL queries, parameterized by a configuration vector Θ to control topology, intensity, and selectivity.
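The mechanism is simple enough to sketch. Below is a minimal, hypothetical rendering of the Θ-controlled traversal in Python; the `Theta` fields, function names, and toy schema are assumptions for illustration, not the paper's implementation, and predicate generation is omitted.

```python
import random
from dataclasses import dataclass

@dataclass
class Theta:
    topology: str         # "star" | "chain" | "fork"
    num_joins: int        # proxy for analytical intensity
    selectivity: float    # target predicate selectivity (unused in this sketch)
    seed: int = 0         # fixed seed keeps generation deterministic

def generate_join_tree(fk_graph, theta):
    """Pick a join tree of the requested shape by walking real FK edges.

    fk_graph maps table -> list of (referenced_table, fk_column); because
    every edge is an actual foreign key, each emitted join is schema-valid
    by construction.
    """
    rng = random.Random(theta.seed)
    root = rng.choice(sorted(fk_graph))
    joins, joined = [], [root]
    while len(joins) < theta.num_joins:
        if theta.topology == "star":
            src = root                  # all joins hang off the root table
        elif theta.topology == "chain":
            src = joined[-1]            # extend from the newest table
        else:                           # "fork": branch from any joined table
            src = rng.choice(joined)
        edges = fk_graph.get(src, [])
        if not edges:
            break                       # no outgoing FKs: topology saturated
        ref_table, fk_col = rng.choice(edges)
        joins.append((src, fk_col, ref_table))
        joined.append(ref_table)
    return root, joins

# Toy TPC-H-like FK graph (also an assumption, for illustration only):
fk = {
    "lineitem": [("orders", "l_orderkey"), ("part", "l_partkey")],
    "orders": [("customer", "o_custkey")],
    "customer": [("nation", "c_nationkey")],
}
print(generate_join_tree(fk, Theta(topology="chain", num_joins=3, selectivity=0.1)))
```

Seeding the traversal is what makes the same Θ and schema reproduce the same workload, matching the paper's "deterministic" framing.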
If this is right
- Synthetic corpora can replace inaccessible production logs for training learned query optimizers.
- Explicit control over join topology and predicate selectivity enables targeted generation of stress-test workloads.
- Near-maximal topological entropy supports better statistical generalization than fixed-template benchmarks.
- Trained cost models deliver accurate estimates at sub-millisecond latency suitable for real-time optimizer use (a minimal training sketch follows this list).
- The approach works across different schemas such as TPC-H and IMDb without requiring probabilistic sampling.
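The downstream claim is conventional supervised regression. The hedged sketch below fits a scikit-learn random forest, one plausible member of the tree-based family the paper invokes, on invented per-query features; the feature encoding and synthetic cost function are placeholders, and only the train/evaluate/latency pattern mirrors the reported protocol.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder per-query features, e.g. [n_joins, n_aggregates, n_predicates,
# estimated selectivity]; the encoding and the linear "cost" are assumptions.
rng = np.random.default_rng(0)
X = rng.random((10_000, 4))
y = X @ np.array([5.0, 2.0, 1.0, 8.0]) + rng.normal(0.0, 0.5, 10_000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(r2_score(y_te, model.predict(X_te)), 3))

# Single-query inference latency; shallow tree ensembles typically answer in
# well under a millisecond, consistent with the regime the paper reports.
t0 = time.perf_counter()
model.predict(X_te[:1])
print("latency (ms):", round((time.perf_counter() - t0) * 1e3, 3))
```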
Where Pith is reading between the lines
- The same graph-traversal technique could be adapted to synthesize training data for related database tasks such as index recommendation or rewrite rule learning.
- If the synthetic patterns prove sufficiently representative, the framework could reduce dependence on anonymized traces that discard executable query text.
- Direct evaluation of SynQL-trained models against proprietary real workloads would quantify generalization beyond the synthetic domain.
- Combining SynQL with existing benchmark suites could create hybrid evaluation pipelines that test both synthetic diversity and real-world fidelity.
Load-bearing premise
Workloads generated by traversing the foreign-key graph and controlled by Θ are sufficiently representative of real-world analytical query patterns to serve as effective training data for learned optimizers.
What would settle it
If cost models trained on SynQL data achieved markedly lower accuracy on real production query traces than on the held-out synthetic test sets, the claim that the synthetic workloads are effective substitutes would be falsified.
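The test is easy to state operationally. The toy sketch below simulates the failure mode: a model trained under one cost regime (standing in for SynQL output) is scored under a shifted regime (standing in for real traces). Both distributions are fabricated for illustration; only the protocol, train on synthetic then score on real, is the point.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

def make_corpus(n, cost_fn):
    X = rng.random((n, 3))   # placeholder features: [n_joins, selectivity, n_aggs]
    return X, cost_fn(X)

def synthetic_cost(X):       # the generator's (assumed) cost regime
    return 4 * X[:, 0] + 2 * X[:, 1]

def real_cost(X):            # real traces (assumed) exercise an extra effect
    return 4 * X[:, 0] + 2 * X[:, 1] + 6 * X[:, 2]

X_tr, y_tr = make_corpus(5_000, synthetic_cost)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

X_syn, y_syn = make_corpus(1_000, synthetic_cost)
X_real, y_real = make_corpus(1_000, real_cost)
print("synthetic held-out R^2:", round(r2_score(y_syn, model.predict(X_syn)), 2))
print("real-trace R^2:        ", round(r2_score(y_real, model.predict(X_real)), 2))
# A large gap between the two scores is what would falsify the substitution claim.
```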
Original abstract
Database research and the development of learned query optimisers rely heavily on realistic SQL workloads. Acquiring real-world queries is increasingly difficult, however, due to strict privacy regulations, and publicly released anonymised traces typically strip out executable query text to preserve confidentiality. Existing synthesis tools fail to bridge this training data gap: traditional benchmarks offer too few fixed templates for statistical generalisation, while Large Language Model (LLM) approaches suffer from schema hallucination (fabricating non-existent columns) and topological collapse (systematically defaulting to simplistic join patterns that fail to stress-test query optimisers). We propose SynQL, a deterministic workload synthesis framework that generates structurally diverse, execution-ready SQL workloads. As a foundational step toward bridging the training-data gap, SynQL targets the core SQL fragment -- multi-table joins with projections, aggregations, and range predicates -- which dominates analytical workloads. SynQL abandons probabilistic text generation in favour of traversing the live database's foreign-key graph to populate an Abstract Syntax Tree (AST), guaranteeing schema and syntactic validity by construction. A configuration vector $\Theta$ provides explicit, parametric control over join topology (Star, Chain, Fork), analytical intensity, and predicate selectivity. Experiments on TPC-H and IMDb show that SynQL produces near-maximally diverse workloads (Topological Entropy $H = 1.53$ bits) and that tree-based cost models trained on the synthetic corpus achieve $R^2 \ge 0.79$ on held-out synthetic test sets with sub-millisecond inference latency, establishing SynQL as an effective foundation for generating training data when production logs are inaccessible.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SynQL, a deterministic rule-based framework that synthesizes execution-ready SQL workloads by traversing a live database's foreign-key graph to populate an AST for multi-table joins with projections, aggregations, and range predicates. A configuration vector Θ explicitly controls join topology (Star/Chain/Fork), analytical intensity, and predicate selectivity. On TPC-H and IMDb schemas, it reports near-maximal diversity via topological entropy H=1.53 bits and shows that tree-based cost models trained on the generated corpus achieve R² ≥ 0.79 on held-out synthetic test sets with sub-millisecond inference.
Significance. If the generated workloads are shown to be representative of real analytical patterns, SynQL would provide a scalable, controllable, and validity-guaranteed source of training data for learned query optimizers where privacy constraints limit access to production logs. The deterministic FK-graph traversal and parametric control avoid common pitfalls of LLM-based synthesis such as schema hallucination. The reported entropy and latency numbers, if reproducible, would be concrete strengths.
major comments (2)
- [Abstract] Abstract and Experiments section: the central claim that SynQL supplies effective training data for learned optimizers rests on tree-based models reaching R² ≥ 0.79, yet this is measured only on held-out workloads produced by the identical FK-graph traversal and Θ-controlled generation process; no results are reported on the standard 22 TPC-H queries, measured DBMS execution times, or transfer to real production traces, leaving the representativeness assumption untested and risking that models simply recover the deterministic generation rules.
- [Abstract] Abstract: the diversity claim (Topological Entropy H = 1.53 bits) and downstream R² values are presented without any comparison to existing synthesis baselines (fixed-template benchmarks or LLM approaches) on the same schemas and downstream task, so the asserted superiority in bridging the training-data gap cannot be assessed from the reported numbers alone.
minor comments (1)
- [Abstract] Abstract: the definition and normalization of Topological Entropy H are not provided, making it impossible to verify whether 1.53 bits is near-maximal for the given schemas.
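The number can at least be bounded under one natural reading, assumed here rather than taken from the paper: if $H$ is the Shannon entropy of the empirical distribution $p_t$ over the three topology classes, then

$$H \;=\; -\sum_{t \,\in\, \{\mathrm{Star},\,\mathrm{Chain},\,\mathrm{Fork}\}} p_t \log_2 p_t \;\le\; \log_2 3 \approx 1.585 \ \text{bits},$$

so the reported $H = 1.53$ bits would sit at roughly 97% of the three-class maximum. Whether this is the intended definition and normalization is precisely what the referee asks the authors to state.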
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below, clarifying the intended scope of our contributions while acknowledging limitations in the current evaluation.
Point-by-point responses
-
Referee: [Abstract] Abstract and Experiments section: the central claim that SynQL supplies effective training data for learned optimizers rests on tree-based models reaching R² ≥ 0.79, yet this is measured only on held-out workloads produced by the identical FK-graph traversal and Θ-controlled generation process; no results are reported on the standard 22 TPC-H queries, measured DBMS execution times, or transfer to real production traces, leaving the representativeness assumption untested and risking that models simply recover the deterministic generation rules.
Authors: We agree that the reported R² ≥ 0.79 is measured exclusively on held-out synthetic workloads generated by the same deterministic FK-graph traversal and Θ-controlled process. This choice aligns with the paper's stated goal of providing a controllable, validity-guaranteed data source specifically for scenarios where production logs are inaccessible due to privacy constraints. The high topological entropy and parametric control are intended to ensure the synthetic distribution is rich enough for model training within that regime. We acknowledge that this leaves the broader representativeness to real analytical patterns untested and creates the possibility that models may partially recover generation rules rather than general cost-model features. In the revised manuscript we will add experiments that apply the trained models to the standard 22 TPC-H queries and report measured DBMS execution times for direct comparison. We will also expand the discussion to address potential transfer and the risk of rule recovery. revision: partial
-
Referee: [Abstract] Abstract: the diversity claim (Topological Entropy H = 1.53 bits) and downstream R² values are presented without any comparison to existing synthesis baselines (fixed-template benchmarks or LLM approaches) on the same schemas and downstream task, so the asserted superiority in bridging the training-data gap cannot be assessed from the reported numbers alone.
Authors: We accept that the manuscript presents the entropy and R² figures without quantitative head-to-head comparisons against fixed-template or LLM-based synthesizers on identical schemas and the same downstream cost-modeling task. While the text qualitatively contrasts the limitations of templates (low diversity) and LLMs (hallucination and collapse), direct numerical evidence is absent. In the revised version we will extend the experimental section to include such baselines: we will generate comparable workloads using TPC-H templates and publicly available LLM synthesis methods on the same TPC-H and IMDb schemas, then report topological entropy and the resulting R² of tree-based cost models trained on each corpus. This will enable an objective assessment of SynQL's relative effectiveness. revision: yes
- Authors' note: direct transfer results on real production traces cannot be provided, as the authors do not have access to such proprietary data.
Circularity Check
No significant circularity; generative framework evaluated via standard held-out split on its own outputs
Full rationale
The paper presents SynQL as a deterministic, rule-based generator that traverses FK graphs under parametric control Θ to produce ASTs. The reported R² ≥ 0.79 is an experimental outcome of training tree-based regressors on one synthetic corpus and testing on a held-out portion of the same corpus; this is ordinary ML practice and does not reduce any claimed prediction to a fitted parameter or self-defined quantity by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core results. The absence of real-query validation is a limitation of external validity, not a circularity in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the input database schema contains foreign-key relationships that can be traversed to produce valid multi-table joins (a minimal extraction sketch follows).
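This assumption is cheap to check on a live database. The sketch below assumes a PostgreSQL catalog and a psycopg2 connection (the DSN is a placeholder) and builds the adjacency structure a traversal would need from information_schema; the paper traverses "the live database's foreign-key graph" but does not specify its extraction mechanism, so this is purely illustrative.

```python
from collections import defaultdict
import psycopg2

# Standard information_schema query for foreign-key edges (PostgreSQL).
FK_QUERY = """
SELECT tc.table_name, kcu.column_name, ccu.table_name AS referenced_table
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
  ON tc.constraint_name = kcu.constraint_name
 AND tc.table_schema = kcu.table_schema
JOIN information_schema.constraint_column_usage AS ccu
  ON tc.constraint_name = ccu.constraint_name
 AND tc.table_schema = ccu.table_schema
WHERE tc.constraint_type = 'FOREIGN KEY';
"""

def load_fk_graph(dsn):
    """Return {table: [(referenced_table, fk_column), ...]} from a live DB."""
    graph = defaultdict(list)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(FK_QUERY)
        for table, column, referenced in cur.fetchall():
            graph[table].append((referenced, column))
    return dict(graph)

# Example (placeholder DSN): on a TPC-H database, load_fk_graph("dbname=tpch")
# should map "lineitem" to edges such as ("orders", "l_orderkey").
```

An empty graph for the target schema would violate the domain assumption outright, since no multi-table join could be derived by construction.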