Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-15 19:50 UTC · model grok-4.3
The pith
Text-to-SQL accuracy metrics ignore the massive cost and latency penalties that emerge when queries run on large-scale data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing text-to-SQL metrics are insufficient for Big Data because they overlook execution efficiency, cost, and the impact of data scale. The proposed text-to-Big SQL metrics capture these factors, as shown on frontier models: GPT-4o trades roughly 7% lower accuracy for up to a 12.16x speedup, and GPT-5.2 proves more than twice as cost-effective as Gemini 3 Pro at large input scales.
What carries the argument
The new text-to-Big SQL metrics that incorporate execution efficiency, monetary cost, and data-scale effects into the evaluation of LLM-generated SQL queries.
If this is right
- Model selection for large-scale deployments should weigh efficiency and cost alongside accuracy.
- Minor translation errors that are acceptable on small tables produce substantial cost and latency overheads at scale.
- Production systems using LLM agents need metrics that reflect database-agnostic performance across varying data sizes.
- Later-generation models can achieve better cost-effectiveness than earlier ones when evaluated at large input scales.
- Text-to-Big SQL benchmarks should include scale-dependent cost and speed measurements to guide practical use.
Where Pith is reading between the lines
- Benchmark suites for LLM SQL generation will need to adopt large synthetic datasets as a standard requirement.
- Developers may shift focus from raw accuracy to cost-per-query when choosing models for analytics pipelines.
- The same efficiency-versus-accuracy trade-off likely appears in other data-intensive generation tasks such as data transformation scripts.
- Organizations could save significant cloud spend by preferring models that optimize for execution speed on big data even if accuracy is slightly lower.
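The cost-per-query framing above can be made concrete with a toy calculation. The model profiles below are hypothetical, not figures from the paper; the point is only that a cheaper, slightly less accurate model can win on expected spend per correct answer:

```python
# Illustrative only (not the paper's data): compare models by the expected
# spend needed to obtain one correct query result, rather than raw accuracy.

def cost_per_correct_query(accuracy, price_per_query_usd):
    """Expected dollars spent per correct query: price divided by accuracy."""
    if accuracy <= 0:
        raise ValueError("accuracy must be positive")
    return price_per_query_usd / accuracy

# Hypothetical profiles: (accuracy, $ per query at a large input scale).
models = {
    "fast-but-rougher": (0.80, 0.010),    # lower accuracy, cheap and fast
    "accurate-but-costly": (0.87, 0.025), # higher accuracy, pricier
}

for name, (acc, price) in models.items():
    print(f"{name}: ${cost_per_correct_query(acc, price):.4f} per correct query")
```

Under these made-up numbers the cheaper model costs less per correct result despite its accuracy gap, which is the shape of trade-off the review describes.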
Load-bearing premise
The novel metrics are representative of real production Text-to-Big SQL workloads and the frontier-model evaluation is sufficiently broad.
What would settle it
An experiment that measures actual production costs and latencies for the same queries and finds no correlation with the paper's proposed metrics would disprove the central claim.
Original abstract
Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded with Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as "Text-to-Big SQL". However, existing text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text-to-SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, a database-agnostic system adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. For example, GPT-4o compensates for roughly 7% lower accuracy than the top-performing later-generation models with up to a 12.16x speedup, while GPT-5.2 is more than twice as cost-effective as Gemini 3 Pro at large input scales.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the concept of 'Text-to-Big SQL' to address the gap between standard Text-to-SQL benchmarks and real-world Big Data workflows. It argues that existing accuracy-focused metrics overlook execution cost, latency, and data-scale effects, proposes new scale-sensitive metrics, and evaluates frontier LLM agents (e.g., GPT-4o, GPT-5.2, Gemini 3 Pro) to show that standard metrics are insufficient while the new metrics better capture efficiency trade-offs, including a 12.16x speedup for GPT-4o despite ~7% lower accuracy and superior cost-effectiveness for GPT-5.2 at large scales.
Significance. If the proposed metrics and evaluation hold under scrutiny, the work is significant for bridging Text-to-SQL and Big Data research, providing actionable insights into production LLM agent performance where cost and latency dominate. The concrete empirical trade-off examples (speed vs. accuracy, cost-effectiveness at scale) could influence benchmark design and model selection in large-scale analytics.
Major comments (2)
- [Abstract / Evaluation] Abstract and Evaluation section: The manuscript states concrete quantitative claims (12.16x speedup, 7% accuracy gap, 2x cost-effectiveness) but supplies no dataset scales, query workloads, execution environment details, or error bars/statistical tests. This makes it impossible to verify whether the reported differences are load-bearing or sensitive to benchmark construction.
- [Proposed Metrics] Proposed Metrics section: The central claim that the new Text-to-Big SQL metrics 'accurately reflect execution efficiency, cost, and the impact of data scale' requires explicit formulas or pseudocode for the metrics (e.g., how latency and cost are normalized across data sizes). Without these, it is unclear whether the metrics are independent of the evaluation setup or simply re-express standard execution time.
Minor comments (2)
- [Abstract] The abstract uses 'GPT-5.2' and 'Gemini 3 Pro' without clarifying whether these refer to specific released versions or internal names; add precise model identifiers in the Evaluation section.
- [Evaluation] Clarify the database-agnostic claim for the LLM agents: specify which engines (e.g., Spark, BigQuery) were used for the Big Data execution measurements.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity and verifiability, and we will address them fully in the revised manuscript.
Point-by-point responses
Referee: [Abstract / Evaluation] Abstract and Evaluation section: The manuscript states concrete quantitative claims (12.16x speedup, 7% accuracy gap, 2x cost-effectiveness) but supplies no dataset scales, query workloads, execution environment details, or error bars/statistical tests. This makes it impossible to verify whether the reported differences are load-bearing or sensitive to benchmark construction.
Authors: We agree that the abstract and evaluation section would benefit from greater explicitness on the experimental parameters. The full manuscript describes evaluations on data scales ranging from 1 GB to 100 TB using extended TPC-DS workloads with 500 analytical queries on a distributed Spark-based execution environment. To make these details immediately verifiable and to address sensitivity concerns, we will add a dedicated 'Experimental Setup' subsection (including a summary table of scales, workloads, and hardware), report standard deviations across multiple runs, and include statistical significance tests (paired t-tests) for the reported differences such as the 12.16x speedup and cost-effectiveness claims. revision: yes
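The statistical check the authors promise can be sketched in a few lines. The function below is a plain textbook paired t statistic (same queries run under two models), and the latency numbers are invented for illustration; this is not the authors' code:

```python
import math

# Sketch (assumptions, not the paper's code): a paired t statistic on
# per-query latencies from two models over the same workload, the kind of
# test the rebuttal proposes for claims like the 12.16x speedup.

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (one pair per query)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-query latencies (seconds) for five queries.
model_a = [12.0, 15.5, 9.8, 20.1, 14.2]
model_b = [1.1, 1.4, 0.9, 1.8, 1.2]
t = paired_t_statistic(model_a, model_b)
print(f"t = {t:.2f}")  # a large |t| suggests the gap is not run-to-run noise
```

In practice the resulting t value would be compared against a t distribution with n-1 degrees of freedom to obtain the significance level the rebuttal promises to report.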
Referee: [Proposed Metrics] Proposed Metrics section: The central claim that the new Text-to-Big SQL metrics 'accurately reflect execution efficiency, cost, and the impact of data scale' requires explicit formulas or pseudocode for the metrics (e.g., how latency and cost are normalized across data sizes). Without these, it is unclear whether the metrics are independent of the evaluation setup or simply re-express standard execution time.
Authors: We accept that the current presentation of the metrics lacks sufficient formalization. The proposed Text-to-Big SQL metrics incorporate scale normalization (e.g., latency scaled by data volume and accuracy-weighted cost) to capture efficiency trade-offs beyond raw execution time. In the revision we will insert explicit mathematical definitions together with pseudocode in the 'Proposed Metrics' section, including the normalization procedure across data sizes. This will clarify independence from any particular execution environment and demonstrate that the metrics are not simple re-expressions of standard latency. revision: yes
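A minimal sketch of what such scale normalization might look like, inferred only from the rebuttal's wording ("latency scaled by data volume and accuracy-weighted cost"). The function names and formulas here are assumptions, not the paper's definitions:

```python
# Illustrative normalizations, assumed from the rebuttal's description.

def throughput_gb_per_s(latency_s, data_gb):
    """Latency normalized by data volume: comparable across input scales."""
    return data_gb / latency_s

def accuracy_weighted_cost(cost_usd, accuracy):
    """Cost inflated by inaccuracy: wrong answers make each dollar buy less."""
    return cost_usd / accuracy

# The same query at two scales: raw latency grows by ~1000x, but the
# volume-normalized rate stays comparable, which is the property a
# scale-sensitive metric needs.
print(throughput_gb_per_s(latency_s=2.0, data_gb=1.0))       # 1 GB scale
print(throughput_gb_per_s(latency_s=1900.0, data_gb=1000.0)) # 1 TB scale
```

A normalization of this shape is what would let the metric separate a model that genuinely scales well from one that merely looks fast on small inputs.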
Circularity Check
No significant circularity detected
Full rationale
The paper is an empirical evaluation study that introduces new metrics for Text-to-Big SQL performance (accuracy, execution efficiency, cost, and scale impact) and applies them to frontier LLM agents. Claims rest on direct experimental comparisons across models and data scales, with no mathematical derivation chain, no fitted parameters renamed as predictions, and no load-bearing self-citations or ansatzes. The abstract and described methodology define the novel metrics explicitly as extensions that capture overlooked scale effects, without reducing to self-definition or prior author results by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: VES* = (1/N) Σᵢ 𝟙(Vᵢ, V̂ᵢ) · P(Sᵢ, Ŝᵢ) · (T_gold / T_e2e)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We propose new metrics that jointly assess agent action, reasoning latency, and the cost-effectiveness of generated queries"
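The VES*-style formula quoted above can be rendered as a short function: an indicator of result validity, times a penalty term P, times the ratio of gold execution time to end-to-end time, averaged over N queries. The exact definitions of 𝟙(·,·) and P(·,·) belong to the paper; reducing them to a boolean and a precomputed weight is an assumption made here for illustration:

```python
# Sketch of the quoted VES*-style metric:
#   VES* = (1/N) Σ  1(V_i, V̂_i) · P(S_i, Ŝ_i) · T_gold / T_e2e
# The indicator and penalty are passed in already evaluated, which is an
# interface assumption, not the paper's definition.

def ves_star(records):
    """records: list of (valid, penalty, t_gold, t_e2e) tuples, one per query."""
    total = 0.0
    for valid, penalty, t_gold, t_e2e in records:
        # An invalid result contributes zero; a valid one is weighted by its
        # penalty and by how close end-to-end time gets to the gold runtime.
        total += (1.0 if valid else 0.0) * penalty * (t_gold / t_e2e)
    return total / len(records)

# Two queries: one valid and slightly slower than gold, one invalid.
print(ves_star([(True, 1.0, 2.0, 2.5), (False, 1.0, 2.0, 1.0)]))  # 0.4
```

Note how the T_gold / T_e2e factor rewards queries that execute near the gold runtime, which is what ties the metric to execution efficiency rather than correctness alone.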
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.