pith. machine review for the scientific record.

arxiv: 2604.16493 · v1 · submitted 2026-04-13 · 💻 cs.DB · cs.AI · cs.CL · cs.LG

Recognition: unknown

NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

Beng Chin Ooi, Nuo Chen, Peng Lu, Quang-Trung Ta, Shizheng Hou, Wenqi Pei


Pith reviewed 2026-05-10 15:43 UTC · model grok-4.3

classification 💻 cs.DB · cs.AI · cs.CL · cs.LG
keywords NL2SQL · LLM · benchmarking framework · SQL generation · modular evaluation · schema selection · query revision · performance metrics

The pith

NL2SQLBench decomposes LLM NL2SQL systems into three modules and shows current methods have major accuracy shortfalls plus high computational costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NL2SQLBench as the first modular evaluation framework for LLM-enabled natural language to SQL systems. It decomposes these systems into three core modules—Schema Selection, Candidate Generation, and Query Revision—reviews strategies for each, and introduces fine-grained metrics to quantify module-level effectiveness and efficiency within a configurable multi-agent setup. The framework is applied to rigorously test ten representative open-source methods on the BIRD and ScienceBenchmark development sets using DeepSeek-V3 and GPT-4o mini. The evaluation finds substantial gaps, including room for accuracy gains and severe computational inefficiency that limits real-world use, while also flagging issues like inaccurate gold SQL annotations in existing datasets.
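The headline effectiveness measure in this line of work, execution accuracy, compares the result set of a predicted query against the gold query on the live database. A minimal sketch of the core idea (BIRD's official evaluation script applies additional comparison rules, so treat this as illustrative only):

```python
import sqlite3

def execution_match(db: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """Order-insensitive result-set comparison between predicted and gold SQL.
    A predicted query that fails to execute counts as incorrect."""
    try:
        pred = db.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    gold = db.execute(gold_sql).fetchall()
    return sorted(pred) == sorted(gold)

# Toy database for demonstration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (id INTEGER, name TEXT)")
db.executemany("INSERT INTO singer VALUES (?, ?)", [(1, "A"), (2, "B")])
print(execution_match(db, "SELECT COUNT(*) FROM singer",
                      "SELECT COUNT(id) FROM singer"))  # True
```

Note that a semantically wrong but executable query passes the try block and fails only at the comparison, which matches the failure mode the evaluation highlights.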

Core claim

NL2SQLBench is a modular benchmarking framework that dissects LLM-enabled NL2SQL approaches into Schema Selection, Candidate Generation, and Query Revision modules, each equipped with novel fine-grained metrics. Evaluating ten open-source methods on two datasets with two LLMs, it reveals significant accuracy shortfalls and computational inefficiency severe enough to hamper practical adoption, and it surfaces shortcomings in current benchmark datasets and evaluation rules.

What carries the argument

The three-module decomposition of NL2SQL systems (Schema Selection, Candidate Generation, Query Revision) together with the set of fine-grained metrics implemented inside a flexible multi-agent framework that supports configurable benchmarking across approaches.
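That decomposition can be pictured as three composable stages. The sketch below is our scaffolding for the idea, not the paper's implementation; the heuristics and names are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineTrace:
    """Per-question record from which module-level metrics can be computed."""
    selected_tables: list = field(default_factory=list)
    candidates: list = field(default_factory=list)
    final_sql: str = ""

def select_schema(question: str, schema: dict) -> list:
    # Schema Selection (toy heuristic): keep tables named in the question.
    return [t for t in schema if t.lower() in question.lower()]

def generate_candidates(question: str, tables: list) -> list:
    # Candidate Generation: stand-in for one or more LLM sampling calls.
    return [f"SELECT * FROM {t}" for t in tables] or ["SELECT 1"]

def revise_query(candidates: list) -> str:
    # Query Revision: stand-in for execution-guided selection/repair.
    return candidates[0]

def run_pipeline(question: str, schema: dict) -> PipelineTrace:
    trace = PipelineTrace()
    trace.selected_tables = select_schema(question, schema)
    trace.candidates = generate_candidates(question, trace.selected_tables)
    trace.final_sql = revise_query(trace.candidates)
    return trace

trace = run_pipeline("How many singers are there?", {"singer": [], "concert": []})
print(trace.final_sql)  # SELECT * FROM singer
```

Because each stage reads and writes an explicit trace, any single stage can be swapped out and scored in isolation, which is the benchmarking affordance the framework claims.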

If this is right

  • Different NL2SQL approaches can be compared fairly and systematically across individual modules.
  • Targeted improvements can focus on specific weak modules to raise overall accuracy.
  • Computational costs must be reduced substantially before widespread real-world deployment becomes viable.
  • Dataset creators need to correct inaccurate gold SQL annotations and refine evaluation rules.
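Module-level comparison presupposes computable per-module scores. One plausible example for Schema Selection, scored against the tables the gold SQL actually references (the metric names here are ours, not necessarily the paper's definitions):

```python
def schema_selection_scores(predicted: set, gold: set) -> dict:
    """Precision/recall of selected tables against gold-SQL tables.
    Illustrative: low recall caps downstream accuracy, while low
    precision bloats the prompt and the token bill."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 1.0
    return {"precision": precision, "recall": recall}

print(schema_selection_scores({"singer", "concert"}, {"singer"}))
# {'precision': 0.5, 'recall': 1.0}
```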

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Module-specific metrics could guide separate optimization of each stage rather than end-to-end tuning.
  • The framework supplies a reusable testbed that future methods can adopt to demonstrate gains over the current reference point.
  • Addressing the identified dataset annotation errors would tighten the reliability of all future NL2SQL benchmarks.

Load-bearing premise

The three-module breakdown fully covers every critical part of LLM-enabled NL2SQL systems and the new fine-grained metrics accurately reflect real-world effectiveness without bias from the multi-agent implementation.

What would settle it

A high-performing NL2SQL system that cannot be mapped onto the three proposed modules, or a user study showing that the module-level metrics fail to predict actual query success or satisfaction.

Figures

Figures reproduced from arXiv: 2604.16493 by Beng Chin Ooi, Nuo Chen, Peng Lu, Quang-Trung Ta, Shizheng Hou, Wenqi Pei.

Figure 1. Overview of LLM-enabled NL2SQL approaches and …
Figure 2. Execution accuracy using different schemas.
Figure 3. Pass@k results for multiple-candidate approaches.
Figure 4. Analysis of the Query Revision module on BIRD.
Figure 5. Analysis of the Query Revision module on ScienceBenchmark.
Figure 7. Breakdown of questions by number of approaches …
Figure 8. Correct Rates, Incorrect Rates, and Error Rates on BIRD dev set using DeepSeek-V3 for the Candidate Generation module.
Figure 9. Correct Rates, Incorrect Rates, and Error Rates on BIRD dev set using GPT-4o-mini for the Candidate Generation module.
Figure 10. Correct Rates, Incorrect Rates, and Error Rates on ScienceBenchmark dev set using DeepSeek-V3 for the Candidate Generation module.
Figure 11. Correct Rates, Incorrect Rates, and Error Rates on ScienceBenchmark dev set using GPT-4o-mini for the Candidate Generation module.
Figure 12. Correct Rates, Incorrect Rates, and Error Rates on BIRD dev set using DeepSeek-V3 for the Query Revision module.
Figure 13. Correct Rates, Incorrect Rates, and Error Rates on BIRD dev set using GPT-4o-mini for the Query Revision module.
Figure 14. Correct Rates, Incorrect Rates, and Error Rates on ScienceBenchmark dev set using DeepSeek-V3 for the Query Revision module.
Figure 15. Correct Rates, Incorrect Rates, and Error Rates on ScienceBenchmark dev set using GPT-4o-mini for the Query Revision module.
Figure 16. The coefficient heatmap of different solutions on …
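Figure 3 reports Pass@k for multiple-candidate generation. Assuming it follows the standard unbiased estimator popularized by code-generation benchmarks (the paper may define it differently), the computation is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k candidates drawn (without
    replacement) from n samples, c of which are correct, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a failing draw
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # ≈ 0.3 (equals c/n when k = 1)
```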
Original abstract

Natural Language to SQL (NL2SQL) technology empowers non-expert users to query relational databases without requiring SQL expertise. While large language models (LLMs) have greatly improved NL2SQL algorithms, their rapid development outpaces systematic evaluation, leaving a critical gap in understanding their effectiveness, efficiency, and limitations. To this end, we present NL2SQLBench, the first modular evaluation and benchmarking framework for LLM-enabled NL2SQL approaches. Specifically, we dissect NL2SQL systems into three core modules: Schema Selection, Candidate Generation, and Query Revision. For each module, we comprehensively review existing strategies and propose novel fine-grained metrics that systematically quantify module-level effectiveness and efficiency. We further implement these metrics in a flexible multi-agent framework, allowing configurable benchmarking across diverse NL2SQL approaches. Leveraging NL2SQLBench, we rigorously evaluate ten representative open-source methods on two datasets, the BIRD development set and the ScienceBenchmark development set, using two LLMs, DeepSeek-V3 and GPT-4o mini. We systematically assess each approach across the three core modules and evaluate multiple critical performance dimensions. Our evaluation reveals significant gaps in existing NL2SQL methods, highlighting not only substantial room for accuracy improvements but also the significant computational inefficiency, which severely hampers real-world adoption. Furthermore, our analysis identifies critical shortcomings in current benchmark datasets and evaluation rules, emphasizing issues such as inaccurate gold SQL annotations and limitations in existing evaluation rules. By synthesizing these insights into a unified benchmarking, our study establishes a clear reference point for fair comparison and serves as essential guidance for future targeted innovation in NL2SQL technology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces NL2SQLBench, a modular benchmarking framework for LLM-enabled Natural Language to SQL (NL2SQL) systems. It decomposes NL2SQL approaches into three core modules—Schema Selection, Candidate Generation, and Query Revision—reviews existing strategies, proposes novel fine-grained metrics for module-level effectiveness and efficiency, and implements them in a configurable multi-agent framework. The authors evaluate ten representative open-source methods on the BIRD and ScienceBenchmark development sets using DeepSeek-V3 and GPT-4o mini, assess performance across modules and multiple dimensions, identify accuracy and efficiency gaps, and highlight shortcomings in existing benchmark datasets and evaluation rules.

Significance. If the modular decomposition and metrics prove faithful to original method behaviors, NL2SQLBench would offer a valuable standardized reference for fair comparisons in LLM-based NL2SQL research. The multi-dataset, multi-LLM evaluation and explicit call-out of dataset annotation issues provide concrete guidance for targeted improvements in accuracy and computational efficiency, potentially accelerating progress in a fast-moving area.

major comments (3)
  1. [Abstract and Evaluation] Abstract and Evaluation section: The central claim that NL2SQLBench enables 'rigorous' and 'fair' comparison revealing 'significant gaps' depends on the fixed three-module decomposition plus multi-agent harness faithfully representing the ten evaluated methods. No cross-check is described showing that the modular re-implementations produce end-to-end accuracy and latency statistically indistinguishable from the original published monolithic versions on identical LLM back-ends; without this, module scores and rankings risk being harness artifacts.
  2. [Metrics proposal] Metrics proposal section: Novel fine-grained metrics are introduced for each module, yet the manuscript provides no validation (e.g., correlation with end-to-end accuracy, comparison against prior metrics, or sensitivity analysis), nor error bars or statistical significance tests on the reported module-level and overall results. This directly affects the reliability of the 'substantial room for accuracy improvements' conclusion.
  3. [Dataset analysis] Dataset analysis subsection: The identification of inaccurate gold SQL annotations and evaluation-rule limitations is useful, but the paper does not detail how these issues were handled during evaluation (e.g., exclusion, correction, or sensitivity quantification) or their quantitative impact on the ten-method rankings and gap measurements.
minor comments (1)
  1. [Abstract] Abstract: The description of the two datasets would benefit from explicit mention of the exact splits or query counts used from the development sets to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the thorough and insightful review. The comments highlight important aspects that will help improve the clarity and rigor of our work. We address each major comment below, indicating planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: The central claim that NL2SQLBench enables 'rigorous' and 'fair' comparison revealing 'significant gaps' depends on the fixed three-module decomposition plus multi-agent harness faithfully representing the ten evaluated methods. No cross-check is described showing that the modular re-implementations produce end-to-end accuracy and latency statistically indistinguishable from the original published monolithic versions on identical LLM back-ends; without this, module scores and rankings risk being harness artifacts.

    Authors: We agree that demonstrating the fidelity of our modular re-implementations to the original methods is crucial for the validity of our claims. Our implementations are based on detailed analysis of the original papers and available code repositories. However, a full statistical cross-check was not performed due to variations in original experimental setups and LLM versions. In the revised version, we will include a new subsection on 'Implementation Fidelity' where we discuss how closely each method was replicated, provide qualitative comparisons to published results, and acknowledge potential artifacts. This will strengthen the justification for our 'rigorous' and 'fair' comparison claims. revision: partial

  2. Referee: [Metrics proposal] Metrics proposal section: Novel fine-grained metrics are introduced for each module, yet the manuscript provides no validation (e.g., correlation with end-to-end accuracy, comparison against prior metrics, or sensitivity analysis), nor error bars or statistical significance tests on the reported module-level and overall results. This directly affects the reliability of the 'substantial room for accuracy improvements' conclusion.

    Authors: We acknowledge the lack of explicit validation for the proposed metrics in the current manuscript. The metrics were developed to provide granular insights into module performance that end-to-end metrics cannot capture. To address this, we will revise the Metrics proposal section to include: (1) correlation analysis between module metrics and overall accuracy, (2) comparison with existing metrics where applicable, (3) sensitivity analysis, and (4) error bars and statistical tests for the reported results. These additions will support the reliability of our conclusions regarding accuracy gaps. revision: yes

  3. Referee: [Dataset analysis] Dataset analysis subsection: The identification of inaccurate gold SQL annotations and evaluation-rule limitations is useful, but the paper does not detail how these issues were handled during evaluation (e.g., exclusion, correction, or sensitivity quantification) or their quantitative impact on the ten-method rankings and gap measurements.

    Authors: Thank you for this observation. In the evaluation, we followed the standard dataset usage without modifications to ensure fair comparison with prior benchmarks. We will expand the Dataset analysis subsection to explicitly describe our handling approach (no exclusion or correction applied) and add a quantitative sensitivity analysis, such as the impact of known annotation errors on rankings by simulating corrections on a subset. This will quantify the effect on our gap measurements. revision: yes
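One concrete shape the metric validation promised in response 2 could take: correlate a module-level score with end-to-end execution accuracy across methods and attach bootstrap error bars. The numbers below are synthetic, invented purely to show the mechanics:

```python
import random
from statistics import mean

def pearson(xs: list, ys: list) -> float:
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    if vx == 0 or vy == 0:
        return 0.0  # degenerate resample: no variance, no correlation
    return cov / (vx * vy)

def bootstrap_ci(xs: list, ys: list, n_boot: int = 2000, seed: int = 0):
    """95% percentile-bootstrap interval for the correlation,
    resampling methods with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in sample], [ys[i] for i in sample]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# Synthetic per-method scores: hypothetical schema-selection recall vs.
# end-to-end execution accuracy for ten methods (invented data).
schema_recall = [0.62, 0.71, 0.75, 0.80, 0.84, 0.88, 0.90, 0.91, 0.93, 0.95]
exec_accuracy = [0.41, 0.45, 0.50, 0.52, 0.55, 0.58, 0.60, 0.59, 0.63, 0.66]
r = pearson(schema_recall, exec_accuracy)
lo, hi = bootstrap_ci(schema_recall, exec_accuracy)
```

A tight, high interval would support using the module metric as an optimization proxy; a wide one would undercut the fine-grained-metrics claim.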

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking framework

full rationale

This paper introduces NL2SQLBench by proposing a three-module decomposition of NL2SQL systems (Schema Selection, Candidate Generation, Query Revision), defining fine-grained metrics for each, and implementing them in a configurable multi-agent harness to evaluate ten existing open-source methods on BIRD and ScienceBenchmark datasets. There are no mathematical derivations, fitted parameters, predictions, or first-principles claims that reduce to self-defined inputs by construction. All load-bearing assertions rest on direct empirical measurements and comparisons performed within the framework, with no self-citation chains, ansatz smuggling, or renaming of known results invoked to justify core results. The work is self-contained as a benchmarking contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that NL2SQL systems can be cleanly decomposed into the three named modules and that existing datasets plus the chosen LLMs provide a representative testbed, despite noted flaws in the datasets.

axioms (1)
  • domain assumption The three modules (Schema Selection, Candidate Generation, Query Revision) cover the essential components of LLM-enabled NL2SQL systems.
    Stated in the abstract as the basis for the framework design.

pith-pipeline@v0.9.0 · 5617 in / 1308 out tokens · 21992 ms · 2026-05-10T15:43:43.486560+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

84 extracted references · 49 canonical work pages · 3 internal anchors

  1. [1]

    Anastasia Ailamaki, Samuel Madden, Daniel Abadi, Gustavo Alonso, Sihem Amer-Yahia, Magdalena Balazinska, Philip A Bernstein, Peter Boncz, Michael Cafarella, Surajit Chaudhuri, et al. 2025. The Cambridge Report on Database Research.arXiv preprint arXiv:2504.11259(2025)

  2. [2]

    Arian Askari, Christian Poelitz, and Xinye Tang. 2025. Magic: Generating self- correction guideline for in-context text-to-sql. InProceedings of the AAAI Con- ference on Artificial Intelligence, Vol. 39. 23433–23441

  3. [3]

    Hasan Alp Caferoğlu and Özgür Ulusoy. 2024. E-sql: Direct schema linking via question enrichment in text-to-sql.arXiv preprint arXiv:2409.16751(2024)

  4. [4]

    Zhenbiao Cao, Yuanlei Zheng, Zhihao Fan, Xiaojin Zhang, and Wei Chen. 2024. RSL-SQL: Robust Schema Linking in Text-to-SQL Generation.arXiv preprint arXiv:2411.00073(2024)

  5. [5]

    Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexan- der Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, and Bing Xi- ang. 2023. Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness. InICLR

  6. [6]

    Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. 2024. Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Fron- tiers and Future. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

  7. [7]

    Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, et al. 2025. Command A: An enterprise-ready large language model. arXiv:2504.00698 [cs.CL] https://arxiv.org/abs/2504.00698

  8. [8]

    DeepSeek-AI. 2025. DeepSeek-V3-0324 Release. https://api-docs.deepseek.com/news/news250325. Accessed: October 31, 2025

  9. [9]

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. 2024. A Survey on In-context Learning. InProceedings of the 2024 Conference on Empir- ical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Comp...

  10. [10]

    Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, lu Chen, Jin- shu Lin, and Dongfang Lou. 2023. C3: Zero-shot Text-to-SQL with ChatGPT. arXiv:2307.07306 [cs.CL] https://arxiv.org/abs/2307.07306

  11. [11]

    Ju Fan, Zihui Gu, Songyue Zhang, Yuxin Zhang, Zui Chen, Lei Cao, Guoliang Li, Samuel Madden, Xiaoyong Du, and Nan Tang. 2024. Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL.Proc. VLDB Endow. 17, 11 (July 2024), 2750–2763. https://doi.org/10.14778/3681954.3681960

  12. [12]

    Avrilia Floratou, Fotis Psallidas, Fuheng Zhao, Shaleen Deep, Gunther Hagleither, Wangda Tan, Joyce Cahoon, Rana Alotaibi, Jordan Henkel, Abhik Singla, Alex Van Grootel, Brandon Chow, Kai Deng, Katherine Lin, Marcos Campos, K. Venkatesh Emani, Vivek Pandit, Victor Shnayder, Wenjing Wang, and Carlo Curino. 2024. NL2SQL is a solved problem... Not!

  13. [13]

    Yujian Gan, Xinyun Chen, Jinxia Xie, Matthew Purver, John R. Woodward, John Drake, and Qiaofu Zhang. 2021. Natural SQL: Making SQL Easier to Infer from Natural Language Specifications. In Findings of the Association for Computational Linguistics: EMNLP 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for...

  14. [14]

    Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation.Proc. VLDB Endow.17, 5 (January 2024), 1132–1145. https://www.vldb.org/pvldb/vol17/p1132-gao.pdf

  15. [15]

    Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yuntao Hong, Zhiling Luo, Jinyang Gao, Liyu Mou, and Yu Li. 2024. A Preview of XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL. arXiv preprint arXiv:2411.08599(2024). https://arxiv.org/abs/2411.08599

  16. [16]

    GSR-SQL. 2025. LLM Prompting for Text2SQL via Gradual SQL Refinement. https://github.com/GSR-SQL/GSR. GitHub repository. Accessed: October 31, 2025

  17. [17]

    Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Data- base with Intermediate Representation. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computationa...

  18. [18]

    Zijin Hong, Zheng Yuan, Qinggang Zhang, Hao Chen, Junnan Dong, Feiran Huang, and Xiao Huang. 2025. Next-Generation Database Interfaces: a Survey of LLM-based Text-to-SQL.IEEE Transactions on Knowledge and Data Engineering (2025), 1–20. https://doi.org/10.1109/TKDE.2025.3609486

  19. [19]

    Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models.arXiv preprint arXiv:2307.10169(2023)

  20. [20]

    Dongjun Lee, Choongwon Park, Jaehyuk Kim, and Heesoo Park. 2025. MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation. InProceedings of the 31st International Conference on Computational Linguistics, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (Eds.). Ass...

  21. [21]

    Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin SU, ZHAOQING SUO, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. 2025. Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Work- flows. InThe Thirteenth International Conference on Learning Re...

  22. [22]

    Wenqiang Lei, Weixin Wang, Zhixin Ma, Tian Gan, Wei Lu, Min-Yen Kan, and Tat-Seng Chua. 2020. Re-examining the Role of Schema Linking in Text-to- SQL. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online...

  23. [23]

    Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Dawn of Natural Language to SQL: Are We Fully Ready?Proc. VLDB Endow.17, 11 (July 2024), 3318–3331. https://doi.org/10.14778/3681954.3682003

  24. [24]

    Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Hong Chen, and Cuiping Li. 2025. OmniSQL: Synthesizing High-Quality Text-to-SQL Data at Scale. Proc. VLDB Endow. 18, 11 (July 2025), 4695–4709. https://doi.org/10.14778/3749646.3749723

  25. [25]

    Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. 2023. Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 13067–13075

  26. [26]

    Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024. CodeS: Towards Building Open-source Language Models for Text-to-SQL.Proc. ACM Manag. Data2, 3, Article 127 (May 2024), 28 pages. https://doi.org/10.1145/3654930

  27. [27]

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin C.C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM already Serve as a Database Interface? A big Bench for Large-Scale Database Grounded Text-To-SQLs. Advances in Neural Information Proc...

  28. [28]

    Xiuwen Li, Qifeng Cai, Yang Shu, Chenjuan Guo, and Bin Yang. 2025. AID-SQL: Adaptive In-Context Learning of Text-to-SQL with Difficulty-Aware Instruction and Retrieval-Augmented Generation . In2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE Computer Society, Los Alamitos, CA, USA, 3945–3957. https://doi.org/10.1109/ICDE65448.2025.00294

  29. [29]

    Yinheng Li. 2023. A Practical Survey on Zero-Shot Prompt Design for In- Context Learning. InProceedings of the 14th International Conference on Re- cent Advances in Natural Language Processing, Ruslan Mitkov and Galia An- gelova (Eds.). INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 641–647. https://aclanthology.org/2023.ranlp-1.69/

  30. [30]

    Zhishuai Li, Xiang Wang, Jingjing Zhao, Sun Yang, Guoqing Du, Xiaoru Hu, Bin Zhang, Yuxiao Ye, Ziyue Li, Rui Zhao, and Hangyu Mao. 2024. PET-SQL: A Prompt-enhanced Two-stage Text-to-SQL Framework with Cross-consistency

  31. [31]

    Jinqing Lian, Xinyi Liu, Yingxia Shao, Yang Dong, Ming Wang, Zhang Wei, Tianqi Wan, Ming Dong, and Hailin Yan. 2024. ChatBI: Towards Natural Language to Complex Business Intelligence SQL.arXiv preprint arXiv:2405.00527(2024)

  32. [32]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173. https://doi.org/10.1162/tacl_a_00638

  33. [33]

    Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. 2025. A Survey of Text-to-SQL in the Era of LLMs: Where Are We, and Where Are We Going?IEEE Trans. Knowl. Data Eng.37, 10 (2025), 5735–5754. https://doi.org/10.1109/TKDE.2025.3592032

  34. [34]

    Xinyu Liu, Shuyu Shen, Boyan Li, Nan Tang, and Yuyu Luo. 2025. NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation. InProceed- ings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2(Toronto ON, Canada)(KDD ’25). Association for Computing Machinery, New York, NY, USA, 5662–5673. https://doi.org/10.1145/37...

  35. [35]

    Lin Long, Xijun Gu, Xinjie Sun, Wentao Ye, Haobo Wang, Sai Wu, Gang Chen, and Junbo Zhao. 2025. Bridging the Semantic Gap Between Text and Table: A Case Study on NL2SQL. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=qmsX2R19p9

  36. [36]

    Shuai Lyu, Haoran Luo, Zhonghong Ou, Yifan Zhu, Xiaoran Shang, Yang Qin, and Meina Song. 2025. SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL. arXiv preprint arXiv:2502.11741 (2025)

  37. [37]

    Karime Maamari, Fadhil Abubaker, Daniel Jaroslawicz, and Amine Mhedhbi. 2024. The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models. In NeurIPS 2024 Third Table Representation Learning Workshop. https://openreview.net/forum?id=fglyh5pa7d

  39. [39]

    Karime Maamari and Amine Mhedhbi. 2024. End-to-end text-to-sql generation within an analytics insight engine.arXiv preprint arXiv:2406.12104(2024)

  40. [40]

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback.Advances in Neural Information Processing ...

  41. [41]

    Wenxin Mao, Ruiqi Wang, Jiyu Guo, Jichuan Zeng, Cuiyun Gao, Peiyi Han, and Chuanyi Liu. 2024. Enhancing Text-to-SQL Parsing through Question Rewriting and Execution-Guided Refinement. InFindings of the Association for Computa- tional Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Ban...

  42. [42]

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2025. Large language models: A survey. arXiv:2402.06196 [cs.CL] https://arxiv.org/abs/2402.06196

  43. [43]

    Anna Mitsopoulou and Georgia Koutrika. 2025. Analysis of text-to-SQL bench- marks: limitations, challenges and opportunities. InProceedings 28th International Conference on Extending Database Technology, EDBT 2025. 199–212

  44. [44]

    Ali Mohammadjafari, Anthony S. Maida, and Raju Gottumukkala. 2024. From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems. arXiv:2410.01066 [cs.CL]

  45. [45]

    Beng Chin Ooi, Shaofeng Cai, Gang Chen, Yanyan Shen, Kian-Lee Tan, Yuncheng Wu, Xiaokui Xiao, Naili Xing, Cong Yue, Lingze Zeng, Meihui Zhang, and Zhan- hao Zhao. 2024. NeurDB: an AI-powered autonomous data system.Sci. China Inf. Sci.67, 10 (2024). https://doi.org/10.1007/S11432-024-4125-9

  46. [46]

    OpenAI. 2025. GPT-4o mini: Fast, affordable small model for focused tasks. https://platform.openai.com/docs/models/gpt-4o-mini Accessed: October 31, 2025

  47. [47]

    Wenqi Pei, Hailing Xu, Henry Hengyuan Zhao, CHEN Han, Zining Zhang, Shizheng Hou, Luo Pingyi, and Bingsheng He. 2025. Optimizing Small Language Models for NL2SQL. InICLR 2025 Third Workshop on Deep Learning for Code. https://openreview.net/forum?id=xGkxWP2wE4

  48. [48]

    Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan O Arik. 2025. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=CvGqMD5OtX

  50. [50]

    Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=p53QDxSIc5

  51. [51]

    Mohammadreza Pourreza and Davood Rafiei. 2023. Evaluating Cross-Domain Text-to-SQL Models and Benchmarks. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 1601–1611. https://doi.org/10.18653/v1/2023.emnlp-main.99

  52. [52]

    Mohammadreza Pourreza and Davood Rafiei. 2024. DTS-SQL: Decomposed Text-to-SQL with Small Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 8212–8220. https://doi.org/10.18653/v1/2024.findi...

  53. [53]

    Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, and Sercan O Arik. 2025. Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL. In Second Conference on Language Modeling. https://openreview.net/forum?id=HbwkIDWQgN

  54. [54]

    Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, and Jieping Ye. 2025. ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=BAglD6NGy0

  55. [55]

    Ge Qu, Jinyang Li, Bowen Li, Bowen Qin, Nan Huo, Chenhao Ma, and Reynold Cheng. 2024. Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computation...

  56. [56]

    Tonghui Ren, Yuankai Fan, Zhenying He, Ren Huang, Jiaqi Dai, Can Huang, Yinan Jing, Kai Zhang, Yifan Yang, and X. Sean Wang. 2024. PURPLE: Making a Large Language Model a Better SQL Writer. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE Computer Society, Los Alamitos, CA, USA, 15–28. https://doi.org/10.1109/ICDE60146.2024.00009

  57. [57]

    Jaydeep Sen, Fatma Ozcan, Abdul Quamar, Greg Stager, Ashish Mittal, Manasa Jammi, Chuan Lei, Diptikalyan Saha, and Karthik Sankaranarayanan. 2019. Natural Language Querying of Complex Business Intelligence Queries. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD ’19). Association for Computing Ma...

  58. [58]

    Lei Sheng and Xu Shuai Shuai. 2025. CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Kentaro Inui, Sakriani Sakti, Haofen Wang, Der...

  59. [59]

    Liang Shi, Zhengju Tang, Nan Zhang, Xiaotong Zhang, and Zhi Yang. 2025. A Survey on Employing Large Language Models for Text-to-SQL Tasks. ACM Comput. Surv. 58, 2, Article 54 (Sept. 2025), 37 pages. https://doi.org/10.1145/3737873

  60. [60]

    Yewei Song, Saad Ezzini, Xunzhu Tang, Cedric Lothritz, Jacques Klein, Tegawendé Bissyandé, Andrey Boytsov, Ulrick Ble, and Anne Goujon. 2024. Enhancing Text-to-SQL translation for financial system design. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 252–262

  61. [61]

    Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. 2024. CHESS: Contextual Harnessing for Efficient SQL Synthesis. arXiv preprint arXiv:2405.16755 (2024)

  62. [62]

    Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2025. MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL. In Proceedings of the 31st International Conference on Computational Linguistics. 540–557

  63. [63]

    Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computationa...

  64. [64]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=1PL1NIMMrw

  65. [65]

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022). https://openreview.net/fo...

  66. [66]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.). https://openreview.net/forum?id=_VjQlMeSB_J

  67. [67]

    Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, Haonan Wang, Canshi Wei, Xiaofeng Yang, Yuhui Zhang, Yifeng Zheng, Xiuqi Huang, Minfeng Zhu, Yuxin Ma, Bin Cui, Peng Chen, and Wei Chen. 2025. DataLab: A Unified Platform for LLM-Powered Business Intelligence. In 2025 IEEE 41...

  68. [68]

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed Resources For General Chinese Embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA...

  69. [69]

    Xiangjin Xie, Guangwei Xu, Lingyan Zhao, and Ruijie Guo. 2025. OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment. Proc. ACM Manag. Data 3, 3, Article 194 (June 2025), 24 pages. https://doi.org/10.1145/3725331

  70. [70]

    Yuanzhen Xie, Xinzhou Jin, Tao Xie, Matrixmxlin Matrixmxlin, Liang Chen, Chenyun Yu, Cheng Lei, Chengxiang Zhuo, Bo Hu, and Zang Li. 2024. Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar...

  71. [71]

    Siqiao Xue, Danrui Qi, Caigao Jiang, Fangyin Cheng, Keting Chen, Zhiping Zhang, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Hong Yi, Shaodong Liu, Hongjun Yang, and Faqiang Chen. 2024. Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large...

  72. [72]

    Yicun Yang, Zhaoguo Wang, Yu Xia, Zhuoran Wei, Haoran Ding, Ruzica Piskac, Haibo Chen, and Jinyang Li. 2025. Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency. In SIGMOD

  73. [73]

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 3911–3921. ht...

  75. [75]

    Hongwei Yuan, Xiu Tang, Ke Chen, Lidan Shou, Gang Chen, and Huan Li. 2025. CogSQL: A Cognitive Framework for Enhancing Large Language Models in Text-to-SQL Translation. In AAAI. 25778–25786. https://doi.org/10.1609/aaai.v39i24.34770

  76. [76]

    Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold Liu, Rui Zhao, Ziyue Li, and Hangyu Mao. 2024. Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation. arXiv:2403.02951 [cs.CL] https://arxiv.org/abs/2403.02951

  77. [77]

    Meihui Zhang, Zhaoxuan Ji, Zhaojing Luo, Yuncheng Wu, and Chengliang Chai. 2024. Applications and Challenges for Large Language Models: From Data Management Perspective. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). 5530–5541. https://doi.org/10.1109/ICDE60146.2024.00441

  79. [79]

    Yi Zhang, Jan Deriu, George Katsogiannis-Meimarakis, Catherine Kosten, Georgia Koutrika, and Kurt Stockinger. 2023. ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems. Proc. VLDB Endow. 17, 4 (2023), 685–698

  80. [80]

    Fuheng Zhao, Shaleen Deep, Fotis Psallidas, Avrilia Floratou, Divyakant Agrawal, and Amr El Abbadi. 2025. Sphinteract: Resolving Ambiguities in NL2SQL through User Interaction. Proc. VLDB Endow. 18, 4 (2025), 1145–1158. https://doi.org/10.14778/3717755.3717772

Showing first 80 references.