arxiv: 2510.17586 · v3 · submitted 2025-10-20 · 💻 cs.DB

DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework

Pith reviewed 2026-05-18 06:05 UTC · model grok-4.3

classification 💻 cs.DB
keywords Text-to-SQL, Large language models, Software engineering, Orchestration, BIRD benchmark, Spider benchmark, Schema linking, N-version generation

The pith

Treating Text-to-SQL as structured software development reaches 89.8 percent accuracy on Spider with modest models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing Text-to-SQL systems fall short because they lack structured orchestration across the full workflow rather than because the underlying models are too small. It presents DeepEye-SQL as a framework that models the task after the software development life cycle, using four linked stages to ground the query, generate multiple candidate solutions, verify them before running, and pick the best one with execution feedback. With open-source mixture-of-experts models of roughly 30 billion total parameters and no fine-tuning, the system records 73.5 percent execution accuracy on the BIRD development set, 75.07 percent on the official BIRD test leaderboard, and 89.8 percent on Spider test. A sympathetic reader would see this as evidence that careful process design can deliver more reliable natural-language database access than simply using bigger models or more training data.

Core claim

DeepEye-SQL reframes Text-to-SQL as the development of a small software program executed through a verifiable process guided by the Software Development Life Cycle. The framework integrates four synergistic stages: robust schema linking that enforces relational closure, N-version SQL generation for fault tolerance, a Syntax-Logic-Quality toolchain that intercepts errors before execution, and confidence-aware selection that resolves ambiguity through execution-guided adjudication rather than simple majority voting. Using open-source MoE LLMs without any fine-tuning, it achieves 73.5 percent execution accuracy on BIRD-Dev, 75.07 percent on the official BIRD-Test leaderboard, and 89.8 percent on Spider-Test.
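
The text above does not define "relational closure", so the following is only one possible reading, offered as a sketch: once schema linking has picked a set of tables, any bridge tables on the foreign-key paths between them are added so that the selected tables can actually be joined. The function names (`fk_path`, `close_over_joins`) and the toy schema are hypothetical and are not taken from the paper.

```python
from collections import deque

def fk_path(graph, src, dst):
    """Shortest chain of tables from src to dst over foreign-key edges (BFS)."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # the two tables are not connected by foreign keys

def close_over_joins(linked, fks):
    """Add the bridge tables needed to join every linked table to the first one."""
    graph = {}
    for a, b in fks:  # treat each foreign key as an undirected edge
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    closed = set(linked)
    tables = sorted(linked)
    for other in tables[1:]:
        path = fk_path(graph, tables[0], other)
        if path:
            closed.update(path)
    return closed

# Hypothetical schema: customers <- orders <- order_items -> products
fks = [("orders", "customers"), ("order_items", "orders"), ("order_items", "products")]
closed = close_over_joins({"customers", "products"}, fks)
print(sorted(closed))
```

Under this reading, a linker that returns only `customers` and `products` would be widened to include `orders` and `order_items`, since no join between the two is possible without them.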

What carries the argument

The four synergistic stages modeled on the Software Development Life Cycle that together provide intent grounding, fault tolerance, pre-execution verification, and confidence-based selection.
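
As a rough illustration of the confidence-based selection stage, the sketch below groups candidate queries by their execution result, scores the largest group by its share of all candidates, and accepts its first member when that share exceeds a threshold. It is a minimal sketch under stated assumptions: the names `run`, `select_sql`, `theta_conf` and `adjudicate` are hypothetical, results are compared without regard to row order, and the low-confidence path is left to a caller-supplied function because the page does not describe it.

```python
import sqlite3
from collections import defaultdict

def run(conn, sql):
    """Execute sql; return a hashable, order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def select_sql(conn, candidates, theta_conf=0.6, adjudicate=None):
    """Return (chosen_sql, confidence) for a pool of candidate queries."""
    clusters = defaultdict(list)
    for sql in candidates:
        result = run(conn, sql)
        if result is not None:
            clusters[result].append(sql)
    if not clusters:
        return None, 0.0
    ranked = sorted(clusters.values(), key=len, reverse=True)
    top = ranked[0]
    # Assumption: failed candidates still count in the denominator.
    confidence = len(top) / len(candidates)
    if confidence > theta_conf or adjudicate is None:
        return top[0], confidence        # shortcut, or plain majority fallback
    return adjudicate(ranked), confidence  # hypothetical low-confidence path

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp(id INTEGER, dept TEXT, salary INTEGER);
INSERT INTO emp VALUES (1, 'a', 10), (2, 'a', 20), (3, 'b', 30);
""")
candidates = [
    "SELECT dept, SUM(salary) FROM emp GROUP BY dept",
    "SELECT dept, SUM(salary) FROM emp GROUP BY dept ORDER BY dept DESC",
    "SELECT dept, COUNT(*) FROM emp GROUP BY dept",
    "SELECT dept, SUM(salary) FROM employees GROUP BY dept",  # no such table
]
chosen, conf = select_sql(conn, candidates)
print(chosen, conf)
```

With no `adjudicate` function supplied, the sketch degenerates to majority voting over execution results, which is the baseline the paper claims to improve on.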

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged verification approach could be tested on other structured output tasks such as generating API calls or data transformation scripts.
  • Applying the framework to production databases with evolving schemas would show whether the schema-linking stage remains effective outside benchmark conditions.
  • If the verification rules were extended with domain-specific checks, accuracy on specialized enterprise databases might improve further.

Load-bearing premise

The Syntax-Logic-Quality toolchain reliably intercepts errors before execution and N-version generation plus confidence-aware selection together provide fault tolerance that exceeds simple majority voting.
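
To make the idea of a pre-execution check concrete, here is a minimal sketch of a syntax-level gate, assuming a SQLite backend. It asks SQLite to compile the statement with `EXPLAIN QUERY PLAN`, which surfaces syntax errors and unknown tables or columns without running the query itself. The helper name `precheck` and the toy table are hypothetical, and the sketch covers only the syntax layer; the logic and quality checks are not specified on this page and are not modeled.

```python
import sqlite3

def precheck(conn, sql):
    """Compile sql without running it; return None if it compiles, else the error text."""
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql)
    except sqlite3.Error as exc:
        return str(exc)
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp(id INTEGER, dept TEXT, salary INTEGER)")

samples = [
    "SELECT dept, SUM(salary) FROM emp GROUP BY dept",  # compiles
    "SELECT wage FROM emp",                             # unknown column
    "SELECT * FROM staff",                              # unknown table
    "SELECT dept FROM emp WHERE",                       # incomplete statement
]
report = {sql: precheck(conn, sql) for sql in samples}
for sql, error in report.items():
    print("OK  " if error is None else "FAIL", sql, "" if error is None else "-> " + error)
```

A query that compiles can still be semantically wrong, which is why the premise depends on the logic and quality checks as well as on this kind of gate.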

What would settle it

An independent run on the same BIRD or Spider test sets that finds many queries passing the toolchain yet failing at execution time, or that shows confidence-aware selection performing no better than majority voting, would falsify the performance advantage.
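
The comparison described above could be organized roughly as follows: freeze the candidate pools, then score different selectors on exactly the same pools. This is a sketch under stated assumptions. Each pool is a hypothetical pair of (execution results per candidate, gold result), results are stand-in strings, and `majority_vote`, `accuracy` and `oracle_accuracy` are illustrative names, not functions from the paper.

```python
from collections import Counter

def majority_vote(results):
    """Index of the first candidate whose result is the most common one (failures ignored)."""
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None
    winner, _ = counts.most_common(1)[0]
    return results.index(winner)

def accuracy(pools, selector):
    """Fraction of pools where the selector picks a candidate matching the gold result."""
    hits = 0
    for results, gold in pools:
        choice = selector(results)
        if choice is not None and results[choice] == gold:
            hits += 1
    return hits / len(pools)

def oracle_accuracy(pools):
    """Upper bound: fraction of pools containing at least one correct candidate."""
    return sum(gold in results for results, gold in pools) / len(pools)

# Toy pools; None marks a candidate that failed at execution time.
pools = [
    (["A", "A", "B"], "A"),   # majority is right
    (["A", "A", "B"], "B"),   # majority is wrong, minority is right
    ([None, "C", "C"], "C"),  # one candidate failed
]
baseline = accuracy(pools, majority_vote)
ceiling = oracle_accuracy(pools)
print(baseline, ceiling)
```

Any alternative selector with the same signature can be passed to `accuracy`. The gap between `baseline` and `ceiling` is the most that any selection rule could add on a fixed pool, so a selector that does not move accuracy above `baseline` would count against the claimed advantage.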

Figures

Figures reproduced from arXiv: 2510.17586 by Boyan Li, Chong Chen, Yinan Mei, Yuyu Luo, Zhujun Xue.

Figure 1
Figure 1: Key Idea of DeepEye-SQL. … (LLMs) [42, 47, 52, 56] have substantially advanced Text-to-SQL, achieving strong results on benchmarks such as Spider [51] and BIRD [18]. For example, Alpha-SQL [15] leverages dynamic multi-step reasoning, while XiYan-SQL [21] improves SQL generation and multi-candidate SQL selection through task-specific fine-tuning. Despite these advances, state-of-the-art performance on the BIRD…
Figure 2
Figure 2: DeepEye-SQL, a plug-and-play Text-to-SQL framework, consistently surpasses prior SOTA methods using ∼30B open-source LLMs without any task-specific fine-tuning. … JOIN, NULL/result) that trigger targeted LLM repair, ensuring verifiable correctness (addressing C3). Finally, in release, we introduce confidence-aware selection that clusters execution results to estimate confidence and then takes a high-confi…
Figure 3
Figure 3: DeepEye-SQL Overview. … (e.g., syntax, JOIN correctness). If a flaw is detected, the tool provides an explicit and actionable directive to the LLM for a targeted revision, mirroring a formal bug report and fix cycle. Phase-4: Confidence-aware SQL Selection. The final phase corresponds to the release stage, governed by a Quality Gate. Instead of simply choosing the most common answer, this stage arbitrates …
Figure 6
Figure 6: Confidence-Gated Selection. Based on the confidence of the top-ranked candidate S1, our framework follows one of two distinct paths, as detailed in Algorithm 5. High-Confidence Shortcut. If the confidence score Conf(S1) exceeds a predefined high-confidence threshold θ_conf, we conclude that there is overwhelming agreement among the generated candidates. In this scenario, we directly select S1 as the fina…
Figure 5
Figure 5: Correctness overlap analysis of three SQL genera…
Figure 7
Figure 7: Execution accuracy vs. Confidence Shortcut Threshold on BIRD-Dev dataset with Qwen3-Coder-30B-A3B model. To answer this, we first analyze the relationship between voting confidence and correctness, and then present a direct comparison of our method against the standard self-consistency approach. The core motivation behind our Confidence-aware SQL Selection is the observation that the reliability of consist…
Original abstract

Large language models (LLMs) have advanced Text-to-SQL, yet existing solutions still fall short of system-level reliability. The limitation is not merely in individual modules (e.g., schema linking, reasoning, and verification) but more critically in the lack of structured orchestration that enforces correctness across the entire workflow. This gap motivates a paradigm shift: treating Text-to-SQL not as free-form language generation but as a software-engineering problem that demands structured, verifiable orchestration. We present DeepEye-SQL, a software-engineering-inspired framework that reframes Text-to-SQL as the development of a small software program, executed through a verifiable process guided by the Software Development Life Cycle (SDLC). DeepEye-SQL integrates four synergistic stages: it grounds user intent through robust schema linking, enforcing relational closure; enhances fault tolerance with N-version SQL generation; ensures deterministic verification via a "Syntax-Logic-Quality" tool-chain that intercepts errors pre-execution; and introduces confidence-aware selection that leverages execution-guided adjudication to resolve ambiguity beyond simple majority voting. Leveraging open-source MoE LLMs (~30B total, ~3B activated parameters) without any fine-tuning, DeepEye-SQL achieves 73.5% execution accuracy on BIRD-Dev, 75.07% on the official BIRD-Test leaderboard, and 89.8% on Spider-Test, outperforming state-of-the-art solutions that rely on larger models or extensive training. This highlights that principled orchestration, rather than LLM scaling alone, is key to achieving system-level reliability in Text-to-SQL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DeepEye-SQL, a Text-to-SQL framework that reframes the task as a software-engineering problem guided by the Software Development Life Cycle (SDLC). It consists of four synergistic stages: robust schema linking to ground user intent and enforce relational closure, N-version SQL generation to enhance fault tolerance, a Syntax-Logic-Quality toolchain for deterministic pre-execution verification, and confidence-aware selection that uses execution-guided adjudication to resolve ambiguity beyond simple majority voting. Using open-source MoE LLMs (~30B total parameters) without fine-tuning, the framework reports execution accuracies of 73.5% on BIRD-Dev, 75.07% on the official BIRD-Test leaderboard, and 89.8% on Spider-Test, outperforming prior state-of-the-art methods that rely on larger models or extensive training.

Significance. If the central claims hold, the work would demonstrate that structured orchestration and verifiable stages can deliver competitive Text-to-SQL performance with smaller open-source models, shifting emphasis from LLM scaling to system-level design. This has potential implications for building reliable, production-grade database query interfaces and could encourage similar SDLC-inspired approaches in other LLM-driven data tasks.

major comments (2)
  1. [Abstract and evaluation sections] The central claim that the Syntax-Logic-Quality toolchain plus confidence-aware selection deliver fault tolerance exceeding simple majority voting is load-bearing, yet no ablation study isolates the selection step while holding the N-version candidate pool fixed. Without this controlled comparison, it remains possible that reported gains derive primarily from multi-generation diversity rather than the orchestration's adjudication logic.
  2. [Description of the four synergistic stages] The manuscript provides no quantitative breakdown or controlled experiments showing the incremental benefit of confidence-aware selection over majority voting on identical candidates, nor details on how execution-guided adjudication resolves ambiguity in practice. This weakens the argument that principled SDLC-style stages outperform basic ensembling.
minor comments (2)
  1. [Abstract] The abstract states results on BIRD-Dev, BIRD-Test, and Spider-Test but provides no error bars, standard deviations across runs, or details on how post-hoc design choices in the stages affect final accuracies.
  2. [Abstract] Implementation details for the open-source MoE LLMs (specific model names, exact parameter counts, and prompting templates) are referenced but not fully specified, limiting reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The concerns regarding the need for controlled ablations of the confidence-aware selection component are well-taken and point to an opportunity to strengthen the empirical support for our claims. We address each major comment below and commit to revisions that directly respond to the feedback.

Point-by-point responses
  1. Referee: [Abstract and evaluation sections] The central claim that the Syntax-Logic-Quality toolchain plus confidence-aware selection deliver fault tolerance exceeding simple majority voting is load-bearing, yet no ablation study isolates the selection step while holding the N-version candidate pool fixed. Without this controlled comparison, it remains possible that reported gains derive primarily from multi-generation diversity rather than the orchestration's adjudication logic.

    Authors: We agree that an explicit ablation isolating the confidence-aware selection while holding the N-version candidate pool fixed would provide clearer evidence that the adjudication logic contributes beyond diversity from multi-generation alone. The manuscript reports overall framework results and comparisons to prior SOTA, but does not contain this specific controlled experiment. In the revised version we will add the requested ablation, reporting execution accuracy for majority voting versus execution-guided adjudication on identical candidate sets from the same N-version generation stage. This will be placed in the evaluation section and referenced from the abstract. Revision: yes.

  2. Referee: [Description of the four synergistic stages] The manuscript provides no quantitative breakdown or controlled experiments showing the incremental benefit of confidence-aware selection over majority voting on identical candidates, nor details on how execution-guided adjudication resolves ambiguity in practice. This weakens the argument that principled SDLC-style stages outperform basic ensembling.

    Authors: We acknowledge that the current description of the four stages would benefit from quantitative incremental analysis and concrete examples of ambiguity resolution. The manuscript explains the design of execution-guided adjudication but does not include side-by-side metrics on identical candidates or case studies. In revision we will add (1) a quantitative breakdown comparing confidence-aware selection to majority voting on fixed candidate pools and (2) practical examples illustrating cases where adjudication correctly selects the right SQL when majority voting fails. These additions will be incorporated into the section describing the synergistic stages. Revision: yes.

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents DeepEye-SQL as an engineering framework with four described stages (schema linking, N-version generation, Syntax-Logic-Quality toolchain, confidence-aware selection) and reports execution accuracies on external public benchmarks (BIRD-Dev, BIRD-Test, Spider-Test). No equations, first-principles derivations, fitted parameters, or self-referential definitions appear in the provided text. Performance results are measured against independent datasets rather than quantities defined internally by the framework itself. The central claims rest on empirical evaluation and orchestration description without any reduction of outputs to inputs by construction, self-citation load-bearing, or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that SDLC principles transfer directly to Text-to-SQL without introducing new free parameters or invented entities; no fitted constants or new particles/forces are postulated.

axioms (1)
  • domain assumption: Software Development Life Cycle principles can be mapped to Text-to-SQL to enforce correctness across the workflow.
    Invoked in the motivation and framework overview as the basis for the four synergistic stages.

pith-pipeline@v0.9.0 · 5817 in / 1323 out tokens · 40838 ms · 2026-05-18T06:05:50.970719+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  2. Data-aware candidate selection in NL2SQL translation via small separating instances

    cs.DB 2026-05 unverdicted novelty 6.0

    A selection technique based on separating instances and provenance outperforms baselines for choosing among 2-3 NL2SQL candidates on a BIRD-DEV subset without consistency scores.

  3. FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    FlexSQL reaches 65.4% on Spider2-Snow by allowing agents to flexibly explore schemas, generate diverse plans, choose SQL or Python execution, and apply two-tiered repair.

  4. DPC: Training-Free Text-to-SQL Candidate Selection via Dual-Paradigm Consistency

    cs.DB 2026-04 unverdicted novelty 6.0

    DPC selects correct text-to-SQL outputs by enforcing execution consistency between SQL and Python on an adversarially constructed minimal distinguishing database.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 4 Pith papers · 8 internal anchors

  1. [1]

    Gemini: A Family of Highly Capable Multimodal Models

    Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michae...

  2. [2]

    Jinheon Baek, Horst Samulowitz, Oktie Hassanzadeh, Dharmashankar Subramanian, Sola Shirai, Alfio Gliozzo, and Debarun Bhattacharjya. 2025. Knowledge Base Construction for Knowledge-Augmented Text-to-SQL. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, Joyce Nabende, Ekaterin...

  3. [3]

    Zhenbiao Cao, Yuanlei Zheng, Zhihao Fan, Xiaojin Zhang, Wei Chen, and Xiang Bai. 2024. RSL-SQL: Robust Schema Linking in Text-to-SQL Generation. CoRR abs/2411.00073 (2024). arXiv:2411.00073 doi:10.48550/ARXIV.2411.00073

  4. [4]

    Liming Chen and Algirdas Avizienis. 1978. N-version programming: A fault-tolerance approach to reliability of software operation. In Proc. 8th IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-8), Vol. 1. 3–9

  5. [5]

    chroma core. 2025. The AI-native open-source embedding database. https://github.com/chroma-core/chroma Accessed: 2025-10-17

  6. [6]

    Yeounoh Chung, Gaurav Tarlok Kakkar, Yu Gan, Brenton Milne, and Fatma Ozcan. Is Long Context All You Need? Leveraging LLM’s Extended Context for NL2SQL. Proc. VLDB Endow. 18, 8 (2025), 2735–2747

  7. [7]

    Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. Proc. VLDB Endow. 17, 5 (2024), 1132–1145. doi:10.14778/3641204.3641221

  8. [8]

    Robert M. Hierons, Kirill Bogdanov, Jonathan P. Bowen, Rance Cleaveland, John Derrick, Jeremy Dick, Marian Gheorghe, Mark Harman, Kalpesh Kapoor, Paul J. Krause, Gerald Lüttgen, Anthony J. H. Simons, Sergiy A. Vilkomir, Martin R. Woodward, and Hussein Zedan. 2009. Using formal specifications to support testing. ACM Comput. Surv. 41, 2 (2009), 9:1–9:76. doi:...

  9. [9]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. CoRR abs/2409.12186 (2024). arXiv:2409.12186 doi:10.48550/ARXIV.2409.12186

  10. [10]

    Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Madry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau,...

  11. [11]

    Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai ... Gemma 3 Technical Report. CoRR abs/2503.19786 (2025). arXiv:2503.19786 doi:10.48550/ARXIV.2503.19786

  12. [12]

    Janet L. Kolodner. 1993. Case-Based Reasoning. Morgan Kaufmann. doi:10.1016/C2009-0-27670-7

  13. [13]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023, Jason Flinn, Margo I. Se...

  14. [14]

    Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Dawn of Natural Language to SQL: Are We Fully Ready? [Experiment, Analysis & Benchmark]. Proc. VLDB Endow. 17, 11 (2024), 3318–3331. doi:10.14778/3681954.3682003

  15. [15]

    Boyan Li, Jiayi Zhang, Ju Fan, Yanwei Xu, Chong Chen, Nan Tang, and Yuyu Luo. 2025. Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search. In Forty-second International Conference on Machine Learning. OpenReview.net. https://openreview.net/forum?id=kGg1ndttmI

  16. [16]

    Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Hong Chen, and Cuiping Li. OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale. Proc. VLDB Endow. 18, 11 (2025), 4695–4709. https://www.vldb.org/pvldb/vol18/p4695-li.pdf

  17. [17]

    Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024. CodeS: Towards Building Open-source Language Models for Text-to-SQL. Proc. ACM Manag. Data 2, 3 (2024), 127. doi:10.1145/3654930

  18. [18]

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. Advances in Neural Information Processing Systems 36 (2024)

  19. [19]

    Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. 2025. A Survey of Text-to-SQL in the Era of LLMs: Where Are We, and Where Are We Going? IEEE Trans. Knowl. Data Eng. 37, 10 (2025), 5735–5754

  20. [20]

    Xinyu Liu, Shuyu Shen, Boyan Li, Nan Tang, and Yuyu Luo. 2025. NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation. CoRR abs/2503.11984 (2025). arXiv:2503.11984 doi:10.48550/ARXIV.2503.11984

  21. [21]

    Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, and Jingren Zhou. 2025. XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL. CoRR abs/2507.04701 (2025). arXiv:2507.04701 doi:10.48550/ARXIV.2507.04701

  22. [22]

    Tianqi Luo, Chuhan Huang, Leixian Shen, Boyan Li, Shuyu Shen, Wei Zeng, Nan Tang, and Yuyu Luo. 2025. nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning. arXiv preprint arXiv:2503.12880 (2025).

  23. [23]

    Yuyu Luo, Guoliang Li, Ju Fan, Chengliang Chai, and Nan Tang. 2025. Natural Language to SQL: State of the Art and Open Problems. Proc. VLDB Endow. 18, 12 (2025), 5466–5471

  24. [24]

    Yuyu Luo, Xuedi Qin, Nan Tang, and Guoliang Li. 2018. DeepEye: Towards Automatic Data Visualization. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018. IEEE Computer Society, 101–112. doi:10.1109/ICDE.2018.00019

  25. [25]

    Yuyu Luo, Nan Tang, Guoliang Li, Chengliang Chai, Wenbo Li, and Xuedi Qin. Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks. In SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, Guoliang Li, Zhanhuai Li, Stratos Idreos, and Divesh Srivastava (Eds.). ACM, 1235–1247. doi:10.1145/3448016.3457261

  26. [26]

    Yuyu Luo, Nan Tang, Guoliang Li, Jiawei Tang, Chengliang Chai, and Xuedi Qin. Natural Language to Visualization by Neural Machine Translation. IEEE Trans. Vis. Comput. Graph. 28, 1 (2022), 217–226. doi:10.1109/TVCG.2021.3114848

  27. [27]

    Peixian Ma, Boyan Li, Runzhi Jiang, Ju Fan, Nan Tang, and Yuyu Luo. 2024. A Plug-and-Play Natural Language Rewriter for Natural Language to SQL. CoRR abs/2412.17068 (2024). arXiv:2412.17068 doi:10.48550/ARXIV.2412.17068

  28. [28]

    Karime Maamari, Fadhil Abubaker, Daniel Jaroslawicz, and Amine Mhedhbi. The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models. CoRR abs/2408.07702 (2024). arXiv:2408.07702 doi:10.48550/ARXIV.2408.07702

  29. [29]

    Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 4 (2020), 824–836. doi:10.1109/TPAMI.2018.2889473

  30. [30]

    Joel Ossher, Sushil Krishna Bajracharya, and Cristina Videira Lopes. 2010. Automated dependency resolution for open source software. In Proceedings of the 7th International Working Conference on Mining Software Repositories, MSR 2010 (Co-located with ICSE), Cape Town, South Africa, May 2-3, 2010, Proceedings, Jim Whitehead and Thomas Zimmermann (Eds.). I...

  31. [31]

    Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan Ö. Arik. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=CvGqMD5OtX

  32. [32]

    Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Alice Oh, Tristan Naumann, Amir Globerson, Kate Sae...

  33. [33]

    Ge Qu, Jinyang Li, Bowen Qin, Xiaolong Li, Nan Huo, Chenhao Ma, and Reynold Cheng. 2025. SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, Joyce Na...

  34. [34]

    M. G. Rekoff. 1985. On reverse engineering. IEEE Trans. Syst. Man Cybern. 15, 2 (1985), 244–252. doi:10.1109/TSMC.1985.6313354

  35. [35]

    Per Runeson. 2006. A Survey of Unit Testing Practices. IEEE Softw. 23, 4 (2006), 22–29. doi:10.1109/MS.2006.91

  36. [36]

    Nayan B. Ruparelia. 2010. Software development lifecycle models. ACM SIGSOFT Softw. Eng. Notes 35, 3 (2010), 8–13

  37. [37]

    Joscha Schnell and Gunther Reinhart. 2016. Quality management for battery production: a quality gate concept. Procedia CIRP 57 (2016), 568–573

  38. [38]

    Philip Sedgwick. 2012. Pearson’s correlation coefficient. BMJ 345 (2012)

  39. [39]

    Lei Sheng and Shuai-Shuai Xu. 2025. CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning. CoRR abs/2505.13271 (2025). arXiv:2505.13271 doi:10.48550/ARXIV.2505.13271

  40. [40]

    Lei Sheng, Shuai-Shuai Xu, and Wei Xie. 2025. BASE-SQL: A powerful open source Text-To-SQL baseline approach. CoRR abs/2502.10739 (2025). arXiv:2502.10739 doi:10.48550/ARXIV.2502.10739

  41. [41]

    Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson, and Parisa Ghane. Automatic Metadata Extraction for Text-to-SQL. CoRR abs/2505.19988 (2025). arXiv:2505.19988 doi:10.48550/ARXIV.2505.19988

  42. [42]

    Zhihao Shuai, Boyan Li, Siyu Yan, Yuyu Luo, and Weikai Yang. 2025. DeepVIS: Bridging Natural Language and Data Visualization Through Step-wise Reasoning. CoRR abs/2508.01700 (2025). arXiv:2508.01700 doi:10.48550/ARXIV.2508.01700

  43. [43]

    Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. 2024. CHESS: Contextual Harnessing for Efficient SQL Synthesis. CoRR abs/2405.16755 (2024). arXiv:2405.16755 doi:10.48550/ARXIV.2405.16755

  44. [44]

    Martyn Thomas and Frank E. McGarry. 1994. Top-Down vs. Bottom-Up Process Improvement. IEEE Softw. 11, 4 (1994), 12–13. doi:10.1109/52.300121

  45. [45]

    Peter Ulbrich, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, Wolfgang Schroder-Preikschat, and Reiner Schmid. 2012. Eliminating single points of failure in software-based redundancy. In 2012 Ninth European Dependable Computing Conference. IEEE, 49–60

  46. [46]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/forum?id=1PL1NIMMrw

  47. [47]

    Yifan Wu, Jingze Shi, Bingheng Wu, Jiayi Zhang, Xiaotian Lin, Nan Tang, and Yuyu Luo. 2025. Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting. CoRR abs/2505.19716 (2025)

  48. [48]

    Xiangjin Xie, Guangwei Xu, Lingyan Zhao, and Ruijie Guo. 2025. OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment. Proc. ACM Manag. Data 3, 3 (2025), 194:1–194:24. doi:10.1145/3725331

  49. [49]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha...

  50. [50]

    Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, and Chang Zhou. 2024. Synthesizing Text-to-SQL Data from Weak and Strong LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). A...

  51. [51]

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In EMNLP. Association for Computational Linguistics, 3911–3921

  52. [52]

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. 2025. AFlow: Automating Agentic Workflow Generation. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenRevi...

  53. [53]

    Qingjie Zhang, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, Minlie Huang, Ke Xu, Hewu Li, Liu Yan, and Han Qiu. 2025. Understanding the Dark Side of LLMs’ Intrinsic Self-Correction. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wan...

  54. [54]

    Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, and Guoliang Li. 2025. Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards. CoRR abs/2505.04671 (2025)

  55. [55]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. CoRR abs/2506.05176 (2025). arXiv:2506.05176 doi:10.48550/ARXIV.2506.05176

  56. [56]

    Yizhang Zhu, Shiyin Du, Boyan Li, Yuyu Luo, and Nan Tang. 2024. Are Large Language Models Good Statisticians? In NeurIPS

  57. [57]

    Yizhang Zhu, Runzhi Jiang, Boyan Li, Nan Tang, and Yuyu Luo. 2025. EllieSQL: Cost-Efficient Text-to-SQL with Complexity-Aware Routing. CoRR abs/2503.22402 (2025)