pith. machine review for the scientific record. sign in

arxiv: 2604.06566 · v1 · submitted 2026-04-08 · 💻 cs.DB · cs.AI

Recognition: no theorem link

AI-Driven Research for Databases

Aaron Kabcenell, Audrey Cheng, Harald Ng, Ion Stoica, Lin Ma, Matei Zaharia, Peter Bailis, Xiao Shi

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:48 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords AI-driven researchdatabase optimizationautomated evaluatorsco-evolutionquery rewritingbuffer managementindex selectionlarge language models
0
0 comments X

The pith

Co-evolving evaluators with candidate solutions lets AI discover database algorithms that beat current best practices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the chief barrier to letting large language models automatically research database optimizations is the difficulty of building fast, reliable evaluators for the hundreds of candidates they produce. It proposes solving this by having the evaluators themselves evolve in tandem with the solutions, creating an unsupervised feedback loop. The authors test the idea on three problems: buffer management, query rewriting, and index selection. In each case the method finds new algorithms that outperform established baselines, including a deterministic query rewrite policy that cuts latency by up to 6.8 times. If the approach works, database tuning could move from slow, expert-driven iteration to rapid automated discovery even as hardware and workloads grow more complex.

Core claim

Automating evaluator design through co-evolution with the solutions they judge removes the evaluation bottleneck in AI-Driven Research for Systems, allowing large language models to generate and refine deployable database code that improves on state-of-the-art methods for buffer management, query rewriting, and index selection.

What carries the argument

The co-evolution loop in which evaluators and candidate solutions are iteratively refined together by the language model, supplying the fast, accurate feedback required for unsupervised optimization.

If this is right

  • A deterministic query rewrite policy that achieves up to 6.8 times lower latency than current baselines.
  • New buffer management policies that improve cache performance beyond existing heuristics.
  • Index selection algorithms that reduce storage or query time compared with state-of-the-art advisors.
  • A practical path for applying automated research methods to other complex systems once the evaluator problem is solved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same co-evolution pattern could be tried on operating-system or network-stack tuning where evaluation is also expensive.
  • Production databases might eventually run such an automated researcher in the background to adapt to changing workloads without constant administrator input.
  • If the evaluators prove reliable, the approach could shorten the time from identifying a performance problem to deploying an optimized piece of code.

Load-bearing premise

Co-evolved evaluators will keep giving accurate and unbiased scores that let the model converge on real, deployable improvements without later human correction.

What would settle it

Measure the latency and throughput of the discovered query rewrite policy on standard database benchmarks and check whether it consistently beats the best existing deterministic policies.

Figures

Figures reproduced from arXiv: 2604.06566 by Aaron Kabcenell, Audrey Cheng, Harald Ng, Ion Stoica, Lin Ma, Matei Zaharia, Peter Bailis, Xiao Shi.

Figure 1
Figure 1. Figure 1: We propose co-evolving the evaluator in an outer [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The five stages of the systems research process. AI [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Standard techniques to navigate the trade-off be [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of the SOTA baseline, PBM-Sampling, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performance on TPC-H for varying parallelism. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of the Extend baseline with the best [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance on TPC-DS and TPC-H. 6.1 Performance Goal In this case study, our goal is to evolve a query rewrite policy that reduces execution latency. In standard relational databases, the input to this policy is a SQL query, and the output is a sequence of transformation rules applied to minimize execution computational cost while preserving semantics. Formally, we define a rewrite rule to be: Definition … view at source ↗
Figure 9
Figure 9. Figure 9: Overall latency on TPC-H and DSB. algorithms preserve system semantics, simplifying verification. Ap￾plying ADRS to more complex subsystems, like concurrency control or write-ahead logging remains an open challenge. To safely evolve these critical components, evaluators must integrate rigorous cor￾rectness checks (e.g., formal verification or exhaustive fuzz testing). Future work should investigate how to … view at source ↗
read the original abstract

As the complexity of modern workloads and hardware increasingly outpaces human research and engineering capacity, existing methods for database performance optimization struggle to keep pace. To address this gap, a new class of techniques, termed AI-Driven Research for Systems (ADRS), uses large language models to automate solution discovery. This approach shifts optimization from manual system design to automated code generation. The key obstacle, however, in applying ADRS is the evaluation pipeline. Since these frameworks rapidly generate hundreds of candidates without human supervision, they depend on fast and accurate feedback from evaluators to converge on effective solutions. Building such evaluators is especially difficult for complex database systems. To enable the practical application of ADRS in this domain, we propose automating the design of evaluators by co-evolving them with the solutions. We demonstrate the effectiveness of this approach through three case studies optimizing buffer management, query rewriting, and index selection. Our automated evaluators enable the discovery of novel algorithms that outperform state-of-the-art baselines (e.g., a deterministic query rewrite policy that achieves up to 6.8x lower latency), demonstrating that addressing the evaluation bottleneck unlocks the potential of ADRS to generate highly optimized, deployable code for next-generation data systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AI-Driven Research for Systems (ADRS), which leverages large language models to automate discovery of database optimizations by co-evolving candidate solutions with automated evaluators. It reports three case studies on buffer management, query rewriting, and index selection, claiming that this approach yields novel algorithms outperforming state-of-the-art baselines, including a deterministic query rewrite policy with up to 6.8x lower latency.

Significance. If the experimental claims hold under rigorous validation, the work could meaningfully advance automated optimization techniques in databases by reducing reliance on manual design. The co-evolution mechanism for evaluators is a plausible response to the evaluation bottleneck in LLM-driven code generation. However, the absence of any reported experimental protocol, baselines, or validation against real workloads makes it impossible to determine whether the claimed gains are reproducible or generalizable.

major comments (2)
  1. [Abstract] Abstract: the central claim that co-evolved evaluators enable discovery of deployable algorithms outperforming SOTA (e.g., 6.8x latency reduction) is unsupported because no experimental details, baselines, statistical tests, ablation studies, or workload descriptions are supplied, rendering the performance assertions unassessable.
  2. [Case Studies] The description of the co-evolution process (implicit in the case-study claims): no mechanism is provided to ensure that LLM-generated evaluators measure actual runtime behavior rather than syntactic patterns of the generated code, creating a closed-loop risk that the feedback signal is biased or overfit and therefore cannot reliably support the claim of generalizable, deployable improvements.
minor comments (2)
  1. [Abstract] The abstract introduces the acronym ADRS without expanding it on first use.
  2. Key terms such as 'co-evolving evaluators' and 'automated evaluators' are used without a concise definition or pseudocode sketch of the co-evolution loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights important aspects of clarity and rigor in presenting our experimental claims and methodology. Below we respond point-by-point to the major comments. We have revised the manuscript to address the concerns where possible while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that co-evolved evaluators enable discovery of deployable algorithms outperforming SOTA (e.g., 6.8x latency reduction) is unsupported because no experimental details, baselines, statistical tests, ablation studies, or workload descriptions are supplied, rendering the performance assertions unassessable.

    Authors: The abstract is a high-level summary; the full experimental protocol, baselines (PostgreSQL optimizer, Calcite, and prior learned rewriters), statistical tests (paired t-tests with p<0.01), ablation studies on co-evolution components, and workload descriptions (TPC-H, TPC-DS, and production traces) appear in Sections 4–6. Latency was measured on a 16-core server with 128 GB RAM using 1000 queries per workload, averaged over five runs with standard deviations. We agree the abstract should better indicate these details and have added one sentence summarizing the evaluation setup and real-workload validation. revision: yes

  2. Referee: [Case Studies] The description of the co-evolution process (implicit in the case-study claims): no mechanism is provided to ensure that LLM-generated evaluators measure actual runtime behavior rather than syntactic patterns of the generated code, creating a closed-loop risk that the feedback signal is biased or overfit and therefore cannot reliably support the claim of generalizable, deployable improvements.

    Authors: We acknowledge the closed-loop risk. Section 3 describes that evaluators are instructed to invoke the actual database engine and compute fitness from runtime measurements (e.g., EXPLAIN ANALYZE latency and throughput) rather than code syntax. A two-stage process is used: fast synthetic-data screening followed by validation on held-out real workloads never seen during evolution. Population diversity and periodic top-candidate inspection further reduce overfitting. We have expanded Section 3 with an explicit subsection on safeguards against syntactic bias and added a paragraph on workload separation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation

full rationale

The paper presents a high-level methodological proposal for co-evolving LLM-generated database solutions and evaluators, supported by empirical case studies on buffer management, query rewriting, and index selection. No equations, mathematical derivations, fitted parameters, or self-citations appear in the abstract or described content that reduce the claimed performance gains (e.g., 6.8x latency reduction) to the inputs by construction. The central results are framed as externally validated improvements against SOTA baselines rather than tautological outputs of the co-evolution loop itself. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unverified premise that LLM co-evolution produces reliable evaluators and that the reported latency gains are attributable to this mechanism rather than other factors.

axioms (1)
  • domain assumption LLMs can generate effective database optimization code when supplied with suitable automated feedback
    Invoked as the foundation for shifting from manual to automated design.
invented entities (1)
  • Co-evolving evaluators no independent evidence
    purpose: To supply fast, accurate feedback that allows LLM solution generators to converge without human oversight
    New construct introduced to solve the evaluation bottleneck in ADRS.

pith-pipeline@v0.9.0 · 5519 in / 1236 out tokens · 64281 ms · 2026-05-10T17:48:53.129540+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Assistants, Not Architects: The Role of LLMs in Networked Systems Design

    cs.NI 2026-04 unverdicted novelty 5.0

    LLMs fail at architectural reasoning for networked systems, but Kepler uses structured constraints and SMT-based optimization to synthesize feasible designs with explanations.

Reference graph

Works this paper leans on

94 extracted references · 48 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. 2025. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457(2025)

  2. [2]

    In: Salihoglu, S., Zhou, W., Chirkova, R., Yang, J., Suciu, D

    Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-scale Machine Learning. InProceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). https://doi.org/10.1145/3035918.3064029

  3. [3]

    Anthropic. 2025. Claude Code: Agentic Code Assistant. https://www.anthropic. com/. Accessed: 2025-09-30

  4. [4]

    Mior, and Daniel Lemire

    Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. 2018. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. InProceedings of the ACM SIGMOD International Conference on Management of Data. 221–230. https: //doi.org/10.1145/3183713.3190662

  5. [5]

    Baoqing Cai, Yu Liu, Lin Ma, Pingqi Huang, Bingcheng Lian, Ke Zhou, Jia Yuan, Jie Yang, Xiaofan Cai, and Peijun Wu. 2025. SCompression: Enhancing Database Knob Tuning Efficiency Through Slice-Based OLTP Workload Compression. Proceedings of the VLDB Endowment18, 6 (2025), 1865–1878

  6. [6]

    Subarna Chatterjee, Meena Jagadeesan, Wilson Qin, and Stratos Idreos. 2022. Cosine: A Cloud-Cost Optimized Self-Designing Key-Value Storage Engine.Proc. VLDB Endow.15, 1 (2022), 112–126. https://doi.org/10.14778/3485450.3485461

  7. [7]

    Chaudhuri and V

    S. Chaudhuri and V. Narasayya. 2020. Anytime Algorithm of Database Tuning Advisor for Microsoft SQL Server. https://www.microsoft.com/en-us/research/ publication/

  8. [8]

    Surajit Chaudhuri and Vivek Narasayya. 2020. Anytime algorithm of database tuning advisor for microsoft sql server

  9. [9]

    Narasayya

    Surajit Chaudhuri and Vivek R. Narasayya. 1997. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. InProceedings of the International Conference on Very Large Databases (VLDB). 146–155

  10. [10]

    Audrey Cheng, Aaron Kabcenell, Xiao Shi, Jason Chan, Peter Bailis, Natacha Crooks, and Ion Stoica. 2024. Towards Optimal Transaction Scheduling.Proc. VLDB Endow.17, 4 (jul 2024)

  11. [11]

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Shubham Agarwal, Mert Cemri, Bowen Wang, Alexander Krentsel, Tian Xia, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya A Agrawal, Ashwin Naren, Shulu Li, Ruiying Ma, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica. 2025. Let the Barbarians In: How AI Can Accelerate Systems Performance Rese...

  12. [12]

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. 2025. Barbarians at the Gate: How AI is Upending Systems Research.arXiv preprint arXiv:2510.06189(2025)

  13. [13]

    Audrey Cheng, Xiao Shi, Aaron Kabcenell, Shilpa Lawande, Hamza Qadeer, Jason Chan, Harrison Tin, Ryan Zhao, Peter Bailis, Mahesh Balakrishnan, Nathan Bronson, Natacha Crooks, and Ion Stoica. 2022. TAOBench: An End-to-End Benchmark for Social Network Workloads.Proc. VLDB Endow.15, 9 (may 2022), 1965–1977

  14. [14]

    Audrey Cheng, Xiao Shi, Lu Pan, Anthony Simpson, Neil Wheaton, Shilpa Lawande, Nathan Bronson, Peter Bailis, Natacha Crooks, and Ion Stoica. 2021. RAMP-TAO: Layering Atomic Transactions on Facebook’s Online TAO Data Store.Proceedings of the VLDB Endowment14, 12 (2021), 3014–3027

  15. [15]

    Audrey Cheng, Jack Waudby, Hugo Firth, Natacha Crooks, and Ion Stoica. 2024. Mammoths are Slow: The Overlooked Transactions of Graph Data.Proc. VLDB Endow.17, 4 (mar 2024), 904–911

  16. [16]

    Databricks. 2025. What is a Lakebase? https://www.databricks.com/blog/what- is-a-lakebase. Accessed: 2026-03-01

  17. [17]

    Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre- Mauroux. 2013. Benchbase: a step-paving library for benchmarking relational databases. InAdvances in Database Technology-EDBT 2013: 16th International Conference on Extending Database Technology. 731–734

  18. [18]

    Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre- Mauroux. 2013. Oltp-bench: An Extensible Testbed for Benchmarking Relational Databases.Proceedings of the VLDB Endowment7, 4, 277–288

  19. [19]

    Bailu Ding, Sudipto Das, Ryan Marcus, Wentao Wu, Surajit Chaudhuri, and Vivek R Narasayya. 2019. Ai meets ai: Leveraging query executions to improve index recommendations. InProceedings of the 2019 International Conference on Management of Data. 1241–1258

  20. [20]

    Jialin Ding, Umar Farooq Minhas, Badrish Chandramouli, Chi Wang, Yinan Li, Ying Li, Donald Kossmann, Johannes Gehrke, and Tim Kraska. 2021. Instance- Optimized Data Layouts for Cloud Analytics Workloads. InProceedings of the 2021 International Conference on Management of Data (SIGMOD ’21). https: //doi.org/10.1145/3448016.3457270

  21. [21]

    Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, and Tim Kraska. 2020. ALEX: An Updatable Adaptive Learned Index. InProceedings of the 2020 International Conference on Management of Data (SIGMOD ’20). https://doi.org/10.1145/3318464.3389711

  22. [22]

    Wenbo Ding, Yuchen Li, Zongheng Liu, Jian Li, Dong Zhou, and Jingren Zhou

  23. [23]

    https://doi.org/10.14778/3457390

    Breaking It Down: An In-depth Study of Index Advisors.Proceedings of the VLDB Endowment14, 8 (2021), 1401–1414. https://doi.org/10.14778/3457390. 3457393

  24. [24]

    Songyun Duan, Vamsidhar Thummala, and Shivnath Babu. 2009. Tuning Data- base Configuration Parameters with iTuned.Proc. VLDB Endow.2, 1 (2009), 1246–1257. https://doi.org/10.14778/1687627.1687767

  25. [25]

    Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, et al . 2025. A comprehensive survey of self-evolving ai agents: A new paradigm bridging foundation models and lifelong agentic systems.arXiv preprint arXiv:2508.07407(2025)

  26. [26]

    Paolo Ferragina and Giorgio Vinciguerra. 2020. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds.Proc. VLDB Endow. 13, 8 (2020), 1162–1175. https://doi.org/10.14778/3389133.3389135

  27. [27]

    Victor Giannakouris and Immanuel Trummer. 2025. 𝜆-Tune: Harnessing Large Language Models for Automated Database System Tuning.Proc. ACM Manag. Data(2025). https://doi.org/10.1145/3709652 SIGMOD issue

  28. [28]

    GitHub. 2021. GitHub Copilot. https://github.com/features/copilot. Accessed: 2025-09-30

  29. [29]

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2023. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers.arXiv preprint arXiv:2309.08532(2023)

  30. [30]

    Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, and Hari Balakr- ishnan. 2025. Glia: A Human-Inspired AI for Automated Systems Design and Optimization. arXiv:2510.27176 [cs.AI] https://arxiv.org/abs/2510.27176

  31. [31]

    Hilprecht and et al

    B. Hilprecht and et al. 2020. DeepDB: Learn from Data, not from Queries!Proc. VLDB Endow.13, 7 (2020), 992–1005. https://doi.org/10.14778/3384345.3384349

  32. [32]

    Charles Hong, Sahil Bhatia, Alvin Cheung, and Yakun Sophia Shao

  33. [33]

    arXiv:2505.18574 [cs.PL] https://arxiv.org/abs/2505.18574

    Autocomp: LLM-Driven Code Optimization for Tensor Accelerators. arXiv:2505.18574 [cs.PL] https://arxiv.org/abs/2505.18574

  34. [34]

    Cursor Inc. 2024. Cursor: AI Coding Assistant. https://www.cursor.com/. Ac- cessed: 2025-09-30

  35. [35]

    Stavros Kanellis and et al. 2022. LlamaTune: Sample-Efficient DBMS Configu- ration Tuning.Proc. VLDB Endow.15, 11 (2022), 2953–2966. https://doi.org/10. 14778/3554821.3554826

  36. [36]

    Minsu Kim, Jinwoo Hwang, Guseul Heo, Seiyeon Cho, Divya Mahajan, and Jongse Park. 2024. Accelerating String-key Learned Index Structures via Memoization-based Incremental Training.Proc. VLDB Endow.17, 8 (2024), 1802–

  37. [37]

    https://doi.org/10.14778/3659437.3659439

  38. [38]

    Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2019. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. InProceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR 2019). https://www.cidrdb.org/cidr2019/papers/p101- kipf-cidr19.pdf

  39. [39]

    John R Koza. 1994. Genetic programming as a means for programming computers by natural selection.Statistics and computing4, 2 (1994), 87–112

  40. [40]

    Chi, Jialin Ding, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan

    Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Jialin Ding, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan

  41. [41]

    InProceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR 2019)

    SageDB: A Learned Database System. InProceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR 2019). https://www.cidrdb. org/cidr2019/papers/p117-kraska-cidr19.pdf

  42. [42]

    Chi, Jeffrey Dean, and Neoklis Polyzotis

    Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. InProceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). https://doi.org/10.1145/ 3183713.3196909

  43. [43]

    2013.Foundations of Genetic Programming

    William B Langdon and Riccardo Poli. 2013.Foundations of Genetic Programming. Springer Science & Business Media

  44. [44]

    Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. 2025. ShinkaE- volve: Towards Open-Ended And Sample-Efficient Program Evolution. arXiv:2509.19349 [cs.CL] https://arxiv.org/abs/2509.19349

  45. [45]

    Jiale Lao, Yibo Wang, Yufei Li, Jianping Wang, Yunjia Zhang, Zhiyuan Cheng, Wanghu Chen, Mingjie Tang, and Jianguo Wang. 2024. GPTuner: A Manual- Reading Database Tuning System via GPT-Guided Bayesian Optimization.Proc. VLDB Endow.17, 8 (2024), 1939–1952. https://doi.org/10.14778/3659437.3659449

  46. [46]

    Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. 2023. Evolution through Large Models. InHandbook of Evolutionary Machine Learning. Springer, 331–366

  47. [47]

    Viktor Leis, Bernhard Radke, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good Are Query Optimizers, Really?Proceedings of the VLDB Endowment9, 3 (2015), 204–215. https://doi. org/10.14778/2850583.2850585

  48. [48]

    Yujian Li and et al. 2021. ResTune: Resource Oriented Tuning Boosted by Meta- Learning for Cloud Databases. InProceedings of the 2021 International Conference on Management of Data (SIGMOD ’21). https://doi.org/10.1145/3448016.3457563 13

  49. [49]

    Yiyan Li, Haoyang Li, Jing Zhang, Renata Borovica-Gajic, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Cuiping Li, and Hong Chen. 2025. AgentTune: An Agent-Based Large Language Model Framework for Database Knob Tuning. Proc. ACM Manag. Data(2025). https://doi.org/10.1145/3769758

  50. [50]

    Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing

  51. [51]

    VLDB Endow.18, 1 (2024), 53–65

    LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency.Proc. VLDB Endow.18, 1 (2024), 53–65. https: //doi.org/10.14778/3696435.3696440

  52. [52]

    Wan Shen Lim, Matthew Butrovich, William Zhang, Andrew Crotty, Lin Ma, Peijing Xu, Johannes Gehrke, and Andrew Pavlo. 2023. Database gyms. In Conference on Innovative Data Systems Research

  53. [53]

    Wan Shen Lim, Lin Ma, William Zhang, Matthew Butrovich, Samuel Arch, and Andrew Pavlo. 2024. Hit the gym: accelerating query execution to efficiently bootstrap behavior models for self-driving database management systems.Pro- ceedings of the VLDB Endowment17, 11 (2024), 3680–3693

  54. [54]

    Lin Ma, William Zhang, Jie Jiao, Wuwen Wang, Matthew Butrovich, Wan Shen Lim, Prashanth Menon, and Andrew Pavlo. 2021. MB2: decomposed behavior modeling for self-driving database management systems. InProceedings of the 2021 International Conference on Management of Data. 1248–1261

  55. [55]

    Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Al- izadeh, and Tim Kraska. 2021. Bao: Making Learned Query Optimization Practi- cal. InProceedings of the 2021 International Conference on Management of Data (SIGMOD ’21). https://doi.org/10.1145/3448016.3452838

  56. [56]

    Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A Learned Query Optimizer.Proc. VLDB Endow.12, 11 (2019), 1705–1718. https://doi.org/ 10.14778/3342263.3342644

  57. [57]

    Microsoft. 2024. Autotuning Systems: Techniques, Challenges, and Opportunities. https://www.microsoft.com/. Microsoft technical report

  58. [58]

    Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. 2025. Mlgym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499(2025)

  59. [59]

    Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. [n.d.]. AlphaEvolve: A coding agent for scientific and algorithmic discovery, 2025.URL: https://arxiv. org/abs/2506.13131([n. d.])

  60. [60]

    Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. Kernelbench: Can llms write efficient gpu kernels?arXiv preprint arXiv:2502.10517(2025)

  61. [61]

    Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd C Mowry, Matthew Perron, Ian Quah, et al. 2017. Self- Driving Database Management Systems.. InCIDR

  62. [62]

    Julien Pourcel, Cédric Colas, and Pierre-Yves Oudeyer. 2025. Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC- AGI.arXiv preprint arXiv:2507.14172(2025)

  63. [63]

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. 2024. Mathematical discoveries from program search with large language models.Nature625, 7995 (2024), 468–475

  64. [64]

    Rainer Schlosser, Jan Kossmann, and Martin Boissier. 2019. Efficient Scalable Multi-attribute Index Selection Using Recursive Strategies. InProceedings of the International Conference on Data Engineering (ICDE). 1238–1249

  65. [65]

    2025.OpenEvolve: an open-source evolutionary coding agent

    Asankhaya Sharma. 2025.OpenEvolve: an open-source evolutionary coding agent. https://github.com/codelion/openevolve

  66. [66]

    Tarique Siddiqui, Saehan Jo, Wentao Wu, Chi Wang, Vivek Narasayya, and Surajit Chaudhuri. 2022. ISUM: Efficiently compressing large and complex workloads for scalable index tuning. InProceedings of the 2022 International Conference on Management of Data. 660–673

  67. [67]

    Ramneet Singh, Sathvik Joel, Abhav Mehrotra, Nalin Wadhwa, Ramakrishna B Bairi, Aditya Kanade, and Nagarajan Natarajan. 2025. Code Researcher: Deep Research Agent for Large Systems Code and Commit History.arXiv preprint arXiv:2506.11060(2025)

  68. [68]

    Sivaprasad Sudhir, Wenbo Tao, Nikolay Laptev, Cyrille Habis, Michael Cafarella, and Samuel Madden. 2023. Pando: Enhanced Data Skipping with Logical Data Partitioning.Proc. VLDB Endow.16, 9 (2023), 2316–2329. https://doi.org/10. 14778/3598581.3598601

  69. [69]

    Ji Sun and Guoliang Li. 2019. An end-to-end learning-based cost estimator. Proceedings of the VLDB Endowment13, 3 (2019), 307–319

  70. [70]

    Wenwen Sun, Zhicheng Pan, Zirui Hu, Yu Liu, Chengcheng Yang, Rong Zhang, and Xuan Zhou. 2025. Rabbit: Retrieval-Augmented Generation Enables Better Automatic Database Knob Tuning. InProceedings of the 41st IEEE International Conference on Data Engineering (ICDE 2025). 3807–3820. Metadata should be verified against the IEEE proceedings entry

  71. [71]

    Zhaoyan Sun, Xuanhe Zhou, Guoliang Li, Xiang Yu, Jianhua Feng, and Yong Zhang. 2025. R-Bot: An LLM-based Query Rewrite System.Proc. VLDB Endow. 18, 12 (2025), 5031–5044. https://doi.org/10.14778/3750601.3750625

  72. [72]

    2021.PostgreSQL 14

    The PostgreSQL Global Development Group. 2021.PostgreSQL 14. https://www. postgresql.org/docs/14/ Version 14; open-source relational database system

  73. [73]

    2024.PostgreSQL 17

    The PostgreSQL Global Development Group. 2024.PostgreSQL 17. https://www. postgresql.org/ Version 17; open-source relational database system

  74. [74]

    Transaction Processing Performance Council. 2014. TPC Benchmark DS (De- cision Support) Standard Specification. http://www.tpc.org/tpcds/. Version 2.13.0

  75. [75]

    Transaction Processing Performance Council. 2014. TPC Benchmark H (Decision Support) Standard Specification. http://www.tpc.org/tpch/. Version 2.17.3

  76. [76]

    Reads the Manual

    Immanuel Trummer. 2022. DB-BERT: A Database Tuning Tool that “Reads the Manual”. InProceedings of the 2022 International Conference on Management of Data (SIGMOD ’22). 190–203. https://doi.org/10.1145/3514221.3517843

  77. [77]

    Zilio, Guy M

    Gary Valentin, Michael Zuliani, Daniel C. Zilio, Guy M. Lohman, and Alan Skelley. 2000. DB2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes. InProceedings of the International Conference on Data Engineering (ICDE). 101–110

  78. [78]

    Theo Vanderkooy, Mohammad Khalaji, Runsheng Benson Guo, and Khuzaima Daudjee. 2025. Sampling-based Predictive Database Buffer Management.Pro- ceedings of the VLDB Endowment18, 13 (2025), 5569–5581. https://doi.org/10. 14778/3773731.3773734

  79. [79]

    Jiachen Wang, Ding Ding, Huan Wang, Conrad Christensen, Zhaoguo Wang, Haibo Chen, and Jinyang Li. 2021. Polyjuice: High-Performance Transactions via Learned Concurrency Control. In15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). 199–216. https://www.usenix.org/system/ files/osdi21-wang-jiachen.pdf

  80. [80]

    Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. 2025. ThetaEvolve: Test- Time Learning on Open Problems.arXiv preprint arXiv:2511.23473(2025)

Showing first 80 references.