SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

Ankan Bansal; Chuan Lei; Daniel Garcia; Dan Roth; Fjona Parllaku; Marianne Menglin Liu; Rongguang Wang; Sai Ashish Somayajula; Sujeeth Bharadwaj; Sujith Ravi

arxiv: 2606.11424 · v1 · pith:NXI27MLPnew · submitted 2026-06-09 · 💻 cs.CL


Sai Ashish Somayajula, Marianne Menglin Liu, Chuan Lei, Fjona Parllaku, Daniel Garcia, Rongguang Wang, Syed Fahad Allam Shah, Ankan Bansal, Sujeeth Bharadwaj, Tao Sheng, Sujith Ravi, Dan Roth


Pith reviewed 2026-06-27 13:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords NL-to-SQL, ambiguity resolution, synthetic query logs, execution probing, natural language interfaces to databases, SQL generation, schema grounding


The pith

SOMA-SQL resolves multi-source ambiguity in NL-to-SQL by building synthetic query logs and running targeted execution probes to select or repair the correct SQL without human input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SOMA-SQL as a method that first constructs synthetic query logs to ground schema meanings and steer candidate SQL generation, then issues probing queries guided by an ambiguity taxonomy to gather evidence from execution results. This evidence lets the system pick or fix the final SQL for questions that are underspecified across user intent, schema structure, and model output. A sympathetic reader would care because real-world NL-to-SQL systems break on large, ambiguous databases and vague questions, and current fixes either demand human clarification or fail to scale. If the approach works as described, NL database interfaces could operate reliably on unseen schemas and query styles without ongoing human oversight or retraining.

Core claim

SOMA-SQL constructs synthetic query logs to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without human-in-the-loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.
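To make "candidate disagreements" concrete, the following is a minimal sketch of one way such disagreement could be detected: execute every candidate and group the candidates by the rows they return. It is not the paper's implementation. The function name `group_candidates_by_result`, the toy `orders` schema, and the use of SQLite are all assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict


def group_candidates_by_result(conn, candidates):
    """Group candidate SQL strings by the rows they return.

    Candidates that raise an error are collected under the key None.
    Row order is ignored by sorting rows on their repr (an assumption
    made here; ordered questions would need an order-sensitive key).
    """
    groups = defaultdict(list)
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
            key = tuple(sorted(rows, key=repr))
        except sqlite3.Error:
            key = None
        groups[key].append(sql)
    return dict(groups)


# Toy database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        (1, 'acme', 100.0, 'paid'),
        (2, 'acme', 50.0, 'refunded'),
        (3, 'zenith', 70.0, 'paid');
    """
)

# Question: "What is the total revenue per customer?"
# It is unclear whether refunded orders count as revenue.
candidates = [
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer",
    "SELECT customer, SUM(amount) FROM orders WHERE status = 'paid' GROUP BY customer",
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer",
]

groups = group_candidates_by_result(conn, candidates)
# More than one result group means the candidates encode different readings.
disagreement = len(groups) > 1
```

In this toy case the first and third candidates fall into one group and the second into another, so the split itself points at the open question (whether refunds count) that a probing query would then need to settle.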

What carries the argument

Synthetic query log construction paired with ambiguity-taxonomy-driven probing queries that generate execution-based disambiguation evidence.

If this is right

NL-to-SQL systems can resolve ambiguity across user questions, schemas, and model outputs without requiring human clarification.
Performance gains hold on ambiguous questions and extend to previously unseen schemas and query distributions.
The method scales to large, complex databases where static schema representations alone are insufficient.
Execution probing supplies runtime evidence that complements candidate generation from language models.
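As an illustration of how a probe can supply runtime evidence, here is a minimal sketch of a value probe: before trusting a filter literal taken from the question, look at what the column actually stores. This corresponds loosely to the value-mismatch case that the paper's prompts call AmbiValue, but the helper names `probe_distinct_values` and `resolve_literal`, the toy `orders` table, and the substring matching rule are assumptions, not the paper's method.

```python
import sqlite3


def probe_distinct_values(conn, table, column, limit=20):
    """Return up to `limit` distinct values stored in table.column.

    Identifiers are interpolated into the SQL text, so they must come
    from the trusted schema, never from user input.
    """
    sql = f'SELECT DISTINCT "{column}" FROM "{table}" LIMIT ?'
    return [row[0] for row in conn.execute(sql, (limit,))]


def resolve_literal(question_value, stored_values):
    """Return stored string values that contain the question's wording, ignoring case."""
    needle = question_value.strip().lower()
    return [v for v in stored_values if isinstance(v, str) and needle in v.lower()]


# Toy database (hypothetical): the status column stores upper-case codes.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, status TEXT);
    INSERT INTO orders VALUES (1, 'PAID'), (2, 'REFUNDED'), (3, 'PAID');
    """
)

# A candidate written straight from the question ("paid orders") matches nothing.
naive_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'paid'"
).fetchone()[0]

# The probe shows which literal the database actually uses.
stored = probe_distinct_values(conn, "orders", "status")
matches = resolve_literal("paid", stored)

# Repair the filter with the probed literal.
repaired_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = ?", (matches[0],)
).fetchone()[0]
```

The design point is that the evidence comes from the database rather than from the model's guess: the naive filter silently returns zero rows, while the probed literal recovers the intended rows.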

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same log-and-probe pattern could be tested on other structured output tasks such as API call generation or knowledge-graph querying.
If probing cost stays low, the technique might reduce reliance on large-scale domain-specific fine-tuning for each new database.
Active evidence collection at inference time offers a route to make black-box model outputs more trustworthy without changing the underlying generator.

Load-bearing premise

That synthetic query logs and probing queries will produce enough disambiguation evidence to correctly select or repair SQL for arbitrary unseen schemas and underspecified questions.

What would settle it

Running SOMA-SQL on a fresh benchmark containing schemas and question distributions withheld from all development stages and finding execution accuracy no higher than the strongest baseline.

Figures

Figures reproduced from arXiv: 2606.11424 by Ankan Bansal, Chuan Lei, Daniel Garcia, Dan Roth, Fjona Parllaku, Marianne Menglin Liu, Rongguang Wang, Sai Ashish Somayajula, Sujeeth Bharadwaj, Sujith Ravi, Syed Fahad Allam Shah, Tao Sheng.

**Figure 1.** An overview of SOMA-SQL consisting of (a) Synthetic Query Log Construction for Disambiguation, (b) Ambiguity-aware SQL Probing and Correction, and (c) Multi-SQL Generation.

Original abstract

Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguities across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these neither scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via targeted synthetic query logs and ambiguity-driven probing. SOMA-SQL constructs synthetic query logs to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without a human in the loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SOMA-SQL proposes synthetic logs plus taxonomy-driven probing to handle ambiguity in NL-to-SQL and claims 13% gains, but the abstract supplies no method or experiment details to check whether those gains hold.


The paper introduces SOMA-SQL, which resolves multi-source ambiguity in NL-to-SQL by constructing synthetic query logs and using execution probing guided by an ambiguity taxonomy.

This is new in its autonomous approach that combines log-based grounding with targeted probing to select or repair SQL without human input.

It does well at framing the problem around ambiguity in questions, schemas, and model outputs, and proposes a scalable alternative to existing methods.

The claimed 13% average improvement and up to 16.7% on ambiguous questions across six benchmarks would be meaningful for the field if the experiments are rigorous.

The soft spots are clear from the abstract alone. No details are given on synthetic log construction, the structure of the taxonomy, how probing queries are generated from candidate disagreements, or the experimental controls and baselines. This makes it impossible to verify the generalization to unseen schemas or the robustness of the gains.

The load-bearing assumption is that the probing evidence will be sufficient and generalizable, but without the method specifics, that can't be checked.

This paper is for people in the NL-to-SQL community who are trying to make systems work on real, ambiguous inputs. A reader could find the taxonomy and probing idea useful even if the results need more scrutiny.

It deserves peer review to see the full details and whether the evidence supports the claims.

Referee Report

2 major / 0 minor

Summary. The paper claims that SOMA-SQL resolves multi-source ambiguity in NL-to-SQL by constructing synthetic query logs to ground schema interpretation and executing targeted probing queries based on an ambiguity taxonomy and candidate disagreements to select or repair the correct SQL. This approach is said to generalize across unseen schemas and query distributions without human-in-the-loop. On six public benchmarks, it improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

Significance. If the empirical claims hold and the method generalizes as described, this would represent a meaningful advance in autonomous ambiguity resolution for NL2SQL systems, addressing a central failure mode without reliance on human clarification.

major comments (2)

[Abstract] The abstract states accuracy improvements but supplies no experimental details, baseline descriptions, error bars, dataset statistics, or controls, so it is impossible to assess whether the data support the claim.
[Method] The description of synthetic query log construction, the ambiguity taxonomy, probing query generation, and candidate selection mechanics is high-level only; without these specifics or any validation of the load-bearing assumption that they produce sufficient disambiguation evidence for arbitrary unseen schemas, the central claim cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We address each major comment point by point below.


Referee: [Abstract] The abstract states accuracy improvements but supplies no experimental details, baseline descriptions, error bars, dataset statistics, or controls, so it is impossible to assess whether the data support the claim.

Authors: Abstracts are by design concise overviews of the contribution and key results. The manuscript supplies the requested details in the Experiments section (benchmarks, baselines, dataset statistics, controls for ambiguity, and results including error bars and per-benchmark breakdowns). The abstract already notes the six benchmarks and the 13.0% average gain. We will revise the abstract to incorporate one additional sentence summarizing the experimental setup and statistical controls.
revision: partial

Referee: [Method] The description of synthetic query log construction, the ambiguity taxonomy, probing query generation, and candidate selection mechanics is high-level only; without these specifics or any validation of the load-bearing assumption that they produce sufficient disambiguation evidence for arbitrary unseen schemas, the central claim cannot be evaluated.

Authors: The Method section supplies the concrete mechanics: the procedure for generating synthetic query logs from schema metadata, the full ambiguity taxonomy with categories and examples, the algorithm for generating probing queries from candidate disagreements, and the selection/repair logic based on execution outcomes. Generalization to unseen schemas is validated empirically across the six benchmarks (which include schema and query distribution shifts), with larger gains precisely on the ambiguous subset. If the editor wishes, we can add pseudocode and an additional ablation validating the probing step.
revision: no

Circularity Check

0 steps flagged

No significant circularity


The paper presents an empirical NL-to-SQL system that constructs synthetic query logs, applies an ambiguity taxonomy for probing queries, and selects/repairs SQL based on execution evidence. All claims reduce to measured execution accuracy gains on public benchmarks rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text as load-bearing steps; the method is described as a practical pipeline whose validity is tested externally via generalization experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven effectiveness of the introduced synthetic-log and probing mechanisms for autonomous disambiguation; the evaluation assumes public benchmarks capture representative ambiguity.

axioms (1)

Domain assumption: Public NL2SQL benchmarks are representative of real-world multi-source ambiguity scenarios.
The generalization claim is supported only by results on six public benchmarks.

pith-pipeline@v0.9.1-grok · 5783 in / 1230 out tokens · 26435 ms · 2026-06-27T13:10:26.913372+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 1 canonical work pages

[1]

Asking clarifying questions in open-domain information-seeking conversations

Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W Bruce Croft. Asking clarifying questions in open-domain information-seeking conversations. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 475–484, 2019

2019
[2]

Beaver: an enterprise benchmark for text-to-sql. arXiv preprint arXiv:2409.02038, 2024

Peter Baile Chen, Devin Yang, Weiyue Li, Fabian Wenz, Yi Zhang, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, and Michael Stonebraker. Beaver: an enterprise benchmark for text-to-sql. arXiv preprint arXiv:2409.02038, 2024

Pith/arXiv arXiv 2024
[3]

Enrichindex: Using llms to enrich retrieval indices offline. arXiv preprint arXiv:2504.03598, 2025

Peter Baile Chen, Tomer Wolfson, Michael Cafarella, and Dan Roth. Enrichindex: Using llms to enrich retrieval indices offline. arXiv preprint arXiv:2504.03598, 2025

arXiv 2025
[4]

Reforce: A text-to-sql agent with self-refinement, consensus enforcement, and column exploration. arXiv preprint arXiv:2502.00675, 2025

Minghang Deng, Ashwin Ramachandran, Canwen Xu, Lanxiang Hu, Zhewei Yao, Anupam Datta, and Hao Zhang. Reforce: A text-to-sql agent with self-refinement, consensus enforcement, and column exploration. arXiv preprint arXiv:2502.00675, 2025

arXiv 2025
[5]

Ambisql: Interactive ambiguity detection and resolution for text-to-sql

Zhongjun Ding, Yin Lin, Tianjing Zeng, Rong Zhu, Bolin Ding, and Jingren Zhou. Ambisql: Interactive ambiguity detection and resolution for text-to-sql. In Companion of the International Conference on Management of Data, pages 26–29, 2026

2026
[6]

NL2SQL is a solved problem. . . not!

Avrilia Floratou, Fotis Psallidas, Fuheng Zhao, Shaleen Deep, Gunther Hagleither, Wangda Tan, Joyce Cahoon, Rana Alotaibi, Jordan Henkel, Abhik Singla, Alex Van Grootel, Brandon Chow, Kai Deng, Katherine Lin, Marcos Campos, K. Venkatesh Emani, Vivek Pandit, Victor Shnayder, Wenjing Wang, and Carlo Curino. NL2SQL is a solved problem. . . not! In Proceeding...

2024
[7]

Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363, 2023

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363, 2023

arXiv 2023
[8]

Sqlens: An end-to-end framework for error detection and correction in text-to-sql. Advances in Neural Information Processing Systems, 38:135571–135604, 2026

Yue Gong, Chuan Lei, Xiao Qin, Kapil Vaidya, Balakrishnan Narayanaswamy, and Tim Kraska. Sqlens: An end-to-end framework for error detection and correction in text-to-sql. Advances in Neural Information Processing Systems, 38:135571–135604, 2026

2026
[9]

Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/gemma-4/,
[11]

A survey on llm-as-a-judge. The Innovation, 2024

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024

2024
[12]

Bird-interact: Re-imagining text-to-sql evaluation via lens of dynamic interactions

Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, et al. Bird-interact: Re-imagining text-to-sql evaluation via lens of dynamic interactions. In The Fourteenth International Conference on Learning Representations, 2026

2026
[13]

How to approach ambiguous queries in conversational search: A survey of techniques, approaches, tools, and challenges. ACM Computing Surveys, 55(6):1–40, 2022

Kimiya Keyvan and Jimmy Xiangji Huang. How to approach ambiguous queries in conversational search: A survey of techniques, approaches, tools, and challenges. ACM Computing Surveys, 55(6):1–40, 2022

2022
[14]

Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024

arXiv 2024
[15]

Deepeye-sql: A software-engineering-inspired text-to-sql framework. Proceedings of the ACM on Management of Data, 4(3 (SIGMOD)):1–28, 2026

Boyan Li, Chong Chen, Zhujun Xue, Yinan Mei, and Yuyu Luo. Deepeye-sql: A software-engineering-inspired text-to-sql framework. Proceedings of the ACM on Management of Data, 4(3 (SIGMOD)):1–28, 2026

2026
[16]

Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36:42330–42357, 2023

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36:42330–42357, 2023

2023
[17]

Oraplan–sql: A planning-centric framework for complex bilingual nl2sql reasoning

Marianne Menglin Liu, Sai Ashish Somayajula, Syed Fahad Allam Shah, Sujith Ravi, and Dan Roth. Oraplan–sql: A planning-centric framework for complex bilingual nl2sql reasoning. In International Joint Conference on Knowledge Graphs, pages 537–544. Springer, 2025

2025
[18]

Xiyan-sql: A novel multi-generator framework for text-to-sql

Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, et al. Xiyan-sql: A novel multi-generator framework for text-to-sql. IEEE Transactions on Knowledge and Data Engineering, 2026

2026
[19]

Natural language to SQL: State of the art and open problems. Proceedings of the VLDB Endowment, 18(12):5466–5471,

Yuyu Luo, Guoliang Li, Ju Fan, Chengliang Chai, and Nan Tang. Natural language to SQL: State of the art and open problems. Proceedings of the VLDB Endowment, 18(12):5466–5471,
[20]

doi: 10.14778/3750601.3750696

work page doi:10.14778/3750601.3750696
[21]

SQLGlot: A no-dependency sql parser, transpiler, optimizer, and engine

Toby Mao and SQLGlot contributors. SQLGlot: A no-dependency sql parser, transpiler, optimizer, and engine. https://github.com/tobymao/sqlglot, 2023. Accessed: 2026-04-29

2023
[22]

AtomSQL: Interactive disambiguation of NL-to-SQL via user-guided atom-level alignment. Proceedings of the VLDB Endowment, 19, 2026

Aritra Mazumder, Parth Desai, Fuheng Zhao, and Anna Fariha. AtomSQL: Interactive disambiguation of NL-to-SQL via user-guided atom-level alignment. Proceedings of the VLDB Endowment, 19, 2026. Demo Track, VLDB 2026

2026
[23]

Ambigqa: Answering ambiguous open-domain questions

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. Ambigqa: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, 2020

2020
[24]

Introducing gpt-5

OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025. Accessed: 2026-05-06

2025
[25]

Introducing gpt-5.4

OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/,
[26]

Accessed: 2026-05-06

2026
[27]

Introducing gpt-5.3-codex

OpenAI. Introducing gpt-5.3-codex. https://openai.com/index/introducing-gpt-5-3-codex/, 2026. Accessed: 2026-05-06

2026
[28]

Beaver (oracle conversion)

Oracle Corporation. Beaver (oracle conversion). https://github.com/oracle-samples/beaver, 2024. GitHub repository, accessed: 2026-04-27

2024
[29]

Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36:36339–36348, 2023

Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36:36339–36348, 2023

2023
[30]

Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql

Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan Arik. Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql. In International Conference on Learning Representations, volume 2025, pages 60385–60415, 2025

2025
[31]

Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information

Sudha Rao and Hal Daumé III. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746, 2018

2018
[32]

An introduction to information theory

Fazlollah M Reza. An introduction to information theory. Courier Corporation, 1994

1994
[33]

Oracle db skills: Sql development

Kris Rice. Oracle db skills: Sql development. https://github.com/krisrice/oracle-db-skills/tree/main?tab=readme-ov-file#sql-development, 2025. GitHub repository, accessed 2026-05-06

2025
[34]

Ambrosia: A benchmark for parsing ambiguous questions into database queries. Advances in Neural Information Processing Systems, 37:90600–90628, 2024

Irina Saparina and Mirella Lapata. Ambrosia: A benchmark for parsing ambiguous questions into database queries. Advances in Neural Information Processing Systems, 37:90600–90628, 2024

2024
[35]

Agenticdata: An agentic data analytics system for heterogeneous data, 2025

Ji Sun, Guoliang Li, Peiyao Zhou, Yihui Ma, Jingzhe Xu, and Yuan Li. Agenticdata: An agentic data analytics system for heterogeneous data, 2025. URL https://arxiv.org/abs/2508.05002.

2025
[36]

Chess: Contextual harnessing for efficient sql synthesis. arXiv preprint arXiv:2405.16755, 2024

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. Chess: Contextual harnessing for efficient sql synthesis. arXiv preprint arXiv:2405.16755, 2024

Pith/arXiv arXiv 2024
[37]

Pv-sql: Synergizing database probing and rule-based verification for text-to-sql agents. arXiv preprint arXiv:2604.17653, 2026

Yuan Tian and Tianyi Zhang. Pv-sql: Synergizing database probing and rule-based verification for text-to-sql agents. arXiv preprint arXiv:2604.17653, 2026

Pith/arXiv arXiv 2026
[38]

Odin: A nl2sql recommender to handle schema ambiguity. arXiv preprint arXiv:2505.19302, 2025

Kapil Vaidya, Abishek Sankararaman, Jialin Ding, Chuan Lei, Xiao Qin, Balakrishnan Narayanaswamy, and Tim Kraska. Odin: A nl2sql recommender to handle schema ambiguity. arXiv preprint arXiv:2505.19302, 2025

arXiv 2025
[39]

Know what i don’t know: Handling ambiguous and unknown questions for text-to-sql

Bing Wang, Yan Gao, Zhoujun Li, and Jian-Guang Lou. Know what i don’t know: Handling ambiguous and unknown questions for text-to-sql. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5701–5714, 2023

2023
[40]

Mac-sql: A multi-agent collaborative framework for text-to-sql

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. Mac-sql: A multi-agent collaborative framework for text-to-sql. In Proceedings of the 31st International Conference on Computational Linguistics, pages 540–557, 2025

2025
[41]

Autolink: Autonomous schema exploration and expansion for scalable schema linking in text-to-sql at scale

Ziyang Wang, Yuanlei Zheng, Zhenbiao Cao, Xiaojin Zhang, Zhongyu Wei, Pei Fu, Zhenbo Luo, Wei Chen, and Xiang Bai. Autolink: Autonomous schema exploration and expansion for scalable schema linking in text-to-sql at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33809–33817, 2026

2026
[42]

Justice or prejudice? quantifying biases in llm-as-a-judge. arXiv preprint arXiv:2410.02736, 2024

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al. Justice or prejudice? quantifying biases in llm-as-a-judge. arXiv preprint arXiv:2410.02736, 2024

Pith/arXiv arXiv 2024
[43]

Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, 2018

2018
[44]

Archer: A human-labeled text-to-sql dataset with arithmetic, commonsense and hypothetical reasoning

Danna Zheng, Mirella Lapata, and Jeff Pan. Archer: A human-labeled text-to-sql dataset with arithmetic, commonsense and hypothetical reasoning. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 94–111, 2024.

2024
[45]

Which <entity> ... ?

Return columns: <...> General Rules • Use ONLY column and table names in schema. Do not invent names. • Prefer scalar subqueries for single values; use JOINs only when necessary. • Keep each step focused (3–6 steps total). • End with exactly one line: Return columns: <exact list> • Use at most two guardrails per plan. Projection Minimality (Strict) • Re...
[46]

Input: question, seed SQL (primary patch target), candidate SQLs, implementation difference report of candidate SQLs, schema evidence, and optional external info
[47]

(a) For each question, assign question_id, taxonomy (AmbiSchema, AmbiValue, AmbiIntent), ambiguity_question, why_high_impact, and probe_sql_rationale

Generate a clarifying-question plan. (a) For each question, assign question_id, taxonomy (AmbiSchema, AmbiValue, AmbiIntent), ambiguity_question, why_high_impact, and probe_sql_rationale. (b) Generate exactly 8–9 questions. (c) If implementation-diff is available, exactly 3 must be diff-grounded and the rest taxonomy-grounded; otherwise all taxonomy-grounded
[48]

(a) AmbiSchema: ambiguous schema mappings (table/column/grain/metric)

Generate and execute ambiguity probes. (a) AmbiSchema: ambiguous schema mappings (table/column/grain/metric). (b) AmbiValue: question values do not directly align with database values, leading to ambiguous filter literals. (c) AmbiIntent: ambiguous SQL semantics (e.g., ORDER BY vs. GROUP BY). (d) Write one probe SQL per question and execute all probes, saving S...
[49]

Convert probe results into resolved assumptions
[50]

(a) Patch only the seed SQL baseline; use candidates as reference only

Apply a minimal probe-evidence-backed SQL patch. (a) Patch only the seed SQL baseline; use candidates as reference only. (b) Apply changes only when supported by probe evidence; keep edits minimal and localized. (c) Use implementation diff as hypothesis context only, never as a patch source. (d) Ensure all rewrites are probe-evidence-backed (ambiguity_res...
[51]

analyze trends

Run final sanity and freeze. (a) Execute final SQL with execute_sql.py. (b) If execution fails or returns empty, fall back to baseline SQL. (c) Always produce fixed_sql/<instance_id>.sql; limit to 3 attempts. Constraints • Do not read or use gold SQL or gold execution during fixing. • No SQL patch before probe evidence exists. • If implementation diff is pr...
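The fallback rule in the "final sanity and freeze" step can be sketched as follows. This is a hedged illustration only: the prompt's `execute_sql.py` is not available here, so SQLite stands in for the execution layer, the function name `freeze_sql` is hypothetical, and reading "limit to 3 attempts" as "try at most three patched queries in order" is an assumption. Writing the result to `fixed_sql/<instance_id>.sql` is omitted.

```python
import sqlite3

MAX_ATTEMPTS = 3  # the quoted prompt limits fixing to 3 attempts


def freeze_sql(conn, patched_attempts, baseline_sql):
    """Return the first patched SQL that executes and yields rows, else the baseline.

    Only the first MAX_ATTEMPTS patched queries are considered.
    """
    for sql in patched_attempts[:MAX_ATTEMPTS]:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # execution failed: move on to the next attempt
        if rows:
            return sql  # executed and returned a non-empty result
    return baseline_sql  # every attempt failed or came back empty


# Toy database (hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (3);")

baseline = "SELECT x FROM t"
attempts = [
    "SELECT y FROM t",                # fails: no such column
    "SELECT x FROM t WHERE x > 100",  # runs but returns no rows
    "SELECT x FROM t WHERE x > 1",    # runs and returns rows
]
final_sql = freeze_sql(conn, attempts, baseline)
```

Treating an empty result as a failure is taken directly from the quoted rule; it guards against over-restrictive patches, at the cost of rejecting questions whose correct answer is genuinely empty.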
[52]

Parse the question and any clarifying Q&A into a precise intent
[53]

Compare all 10 candidates against that intent
[54]

Check schema validity and semantic faithfulness
[55]

Select the single best candidate
[56]

largest floor number

Output the candidate number and the exact SQL text, unchanged. Output Format Candidate: <N> <exact SQL copied verbatim from candidate N> Input Fields • Question: {question} • Clarifying Ambiguity Q&A: {ambi_blob} • Candidate SQLs (10): {generated_sql_list} • Schema: {references} Hard Requirement • Output only the candidate number and the selected SQL. • N...
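The output contract quoted above (a candidate number followed by the SQL copied verbatim) lends itself to a mechanical check. The sketch below parses a selector response and verifies that the SQL matches the named candidate exactly. The function name `parse_selection`, the 1-based numbering, and the tolerance for surrounding whitespace are assumptions made for illustration; the paper's own checker, if any, is not shown in this excerpt.

```python
import re

# "Candidate: <N>" followed by whitespace and the SQL text (may span lines).
CANDIDATE_RE = re.compile(r"^\s*Candidate:\s*(\d+)\s+(.*\S)\s*$", re.DOTALL)


def parse_selection(output, candidates):
    """Return (number, sql) if the output names a valid candidate and copies it verbatim.

    Raises ValueError when the format is wrong, the number is out of range,
    or the SQL differs from the named candidate.
    """
    match = CANDIDATE_RE.match(output)
    if match is None:
        raise ValueError("output does not follow 'Candidate: <N> <SQL>'")
    number = int(match.group(1))
    sql = match.group(2)
    if not 1 <= number <= len(candidates):
        raise ValueError(f"candidate number {number} is out of range")
    if sql != candidates[number - 1].strip():
        raise ValueError("selected SQL was not copied verbatim")
    return number, sql


# Hypothetical candidates and selector output.
candidates = [
    "SELECT 1",
    "SELECT name FROM t\nWHERE id = 2",
]
number, sql = parse_selection(
    "Candidate: 2\nSELECT name FROM t\nWHERE id = 2\n", candidates
)
```

A check like this enforces the "unchanged" requirement by string equality, so a selector that silently edits the chosen SQL is rejected rather than trusted.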


[1] [1]

Asking clarifying questions in open-domain information-seeking conversations

Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W Bruce Croft. Asking clarifying questions in open-domain information-seeking conversations. InProceedings of the 42nd international acm sigir conference on research and development in information retrieval, pages 475–484, 2019

2019

[2] [2]

Beaver: an enterprise benchmark for text-to-sql.arXiv preprint arXiv:2409.02038, 2024

Peter Baile Chen, Devin Yang, Weiyue Li, Fabian Wenz, Yi Zhang, Nesime Tatbul, Michael Cafarella, Ça˘gatay Demiralp, and Michael Stonebraker. Beaver: an enterprise benchmark for text-to-sql.arXiv preprint arXiv:2409.02038, 2024

Pith/arXiv arXiv 2024

[3] [3]

Enrichindex: Using llms to enrich retrieval indices offline.arXiv preprint arXiv:2504.03598, 2025

Peter Baile Chen, Tomer Wolfson, Michael Cafarella, and Dan Roth. Enrichindex: Using llms to enrich retrieval indices offline.arXiv preprint arXiv:2504.03598, 2025

arXiv 2025

[4] [4]

Reforce: A text-to-sql agent with self-refinement, consensus enforcement, and column exploration.arXiv preprint arXiv:2502.00675, 2025

Minghang Deng, Ashwin Ramachandran, Canwen Xu, Lanxiang Hu, Zhewei Yao, Anupam Datta, and Hao Zhang. Reforce: A text-to-sql agent with self-refinement, consensus enforcement, and column exploration.arXiv preprint arXiv:2502.00675, 2025

arXiv 2025

[5] [5]

Ambisql: Interactive ambiguity detection and resolution for text-to-sql

Zhongjun Ding, Yin Lin, Tianjing Zeng, Rong Zhu, Bolin Ding, and Jingren Zhou. Ambisql: Interactive ambiguity detection and resolution for text-to-sql. InCompanion of the International Conference on Management of Data, pages 26–29, 2026

2026

[6] [6]

Venkatesh Emani, Vivek Pandit, Victor Shnayder, Wenjing Wang, and Carlo Curino

Avrilia Floratou, Fotis Psallidas, Fuheng Zhao, Shaleen Deep, Gunther Hagleither, Wangda Tan, Joyce Cahoon, Rana Alotaibi, Jordan Henkel, Abhik Singla, Alex Van Grootel, Brandon Chow, Kai Deng, Katherine Lin, Marcos Campos, K. Venkatesh Emani, Vivek Pandit, Victor Shnayder, Wenjing Wang, and Carlo Curino. NL2SQL is a solved problem. . . not! In Proceeding...

2024

[7] [7]

Text-to-sql empowered by large language models: A benchmark evaluation.arXiv preprint arXiv:2308.15363, 2023

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-sql empowered by large language models: A benchmark evaluation.arXiv preprint arXiv:2308.15363, 2023

arXiv 2023

[8] [8]

Yue Gong, Chuan Lei, Xiao Qin, Kapil Vaidya, Balakrishnan Narayanaswamy, and Tim Kraska. Sqlens: An end-to-end framework for error detection and correction in text-to-sql. Advances in Neural Information Processing Systems, 38:135571–135604, 2026

[9] [9]

Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/gemma-4/,

[10] [11]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024

[11] [12]

Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, et al. Bird-interact: Re-imagining text-to-sql evaluation via lens of dynamic interactions. In The Fourteenth International Conference on Learning Representations, 2026

[12] [13]

Kimiya Keyvan and Jimmy Xiangji Huang. How to approach ambiguous queries in conversational search: A survey of techniques, approaches, tools, and challenges. ACM Computing Surveys, 55(6):1–40, 2022

[13] [14]

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024

[14] [15]

Boyan Li, Chong Chen, Zhujun Xue, Yinan Mei, and Yuyu Luo. Deepeye-sql: A software-engineering-inspired text-to-sql framework. Proceedings of the ACM on Management of Data, 4(3 (SIGMOD)):1–28, 2026

[15] [16]

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36:42330–42357, 2023

[16] [17]

Marianne Menglin Liu, Sai Ashish Somayajula, Syed Fahad Allam Shah, Sujith Ravi, and Dan Roth. Oraplan–sql: A planning-centric framework for complex bilingual nl2sql reasoning. In International Joint Conference on Knowledge Graphs, pages 537–544. Springer, 2025

[17] [18]

Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, et al. Xiyan-sql: A novel multi-generator framework for text-to-sql. IEEE Transactions on Knowledge and Data Engineering, 2026

[18] [19]

Yuyu Luo, Guoliang Li, Ju Fan, Chengliang Chai, and Nan Tang. Natural language to SQL: State of the art and open problems. Proceedings of the VLDB Endowment, 18(12):5466–5471,

[19] [20]

doi: 10.14778/3750601.3750696


[20] [21]

Toby Mao and SQLGlot contributors. SQLGlot: A no-dependency sql parser, transpiler, optimizer, and engine. https://github.com/tobymao/sqlglot, 2023. Accessed: 2026-04-29

[21] [22]

Aritra Mazumder, Parth Desai, Fuheng Zhao, and Anna Fariha. AtomSQL: Interactive disambiguation of NL-to-SQL via user-guided atom-level alignment. Proceedings of the VLDB Endowment, 19, 2026. Demo Track, VLDB 2026

[22] [23]

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. Ambigqa: Answering ambiguous open-domain questions. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 5783–5797, 2020

[23] [24]

OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025. Accessed: 2026-05-06

[24] [25]

OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-05-06

[26] [27]

OpenAI. Introducing gpt-5.3-codex. https://openai.com/index/introducing-gpt-5-3-codex/, 2026. Accessed: 2026-05-06

[27] [28]

Oracle Corporation. Beaver (oracle conversion). https://github.com/oracle-samples/beaver, 2024. GitHub repository, accessed: 2026-04-27

[28] [29]

Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in neural information processing systems, 36:36339–36348, 2023

[29] [30]

Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan Arik. Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql. In International Conference on Learning Representations, volume 2025, pages 60385–60415, 2025

[30] [31]

Sudha Rao and Hal Daumé III. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746, 2018

[31] [32]

Fazlollah M Reza. An introduction to information theory. Courier Corporation, 1994

[32] [33]

Kris Rice. Oracle db skills: Sql development. https://github.com/krisrice/oracle-db-skills/tree/main?tab=readme-ov-file#sql-development, 2025. GitHub repository, accessed 2026-05-06

[33] [34]

Irina Saparina and Mirella Lapata. Ambrosia: A benchmark for parsing ambiguous questions into database queries. Advances in Neural Information Processing Systems, 37:90600–90628, 2024

[34] [35]

Ji Sun, Guoliang Li, Peiyao Zhou, Yihui Ma, Jingzhe Xu, and Yuan Li. Agenticdata: An agentic data analytics system for heterogeneous data, 2025. URL https://arxiv.org/abs/2508.05002.

[35] [36]

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. Chess: Contextual harnessing for efficient sql synthesis. arXiv preprint arXiv:2405.16755, 2024

[36] [37]

Yuan Tian and Tianyi Zhang. Pv-sql: Synergizing database probing and rule-based verification for text-to-sql agents. arXiv preprint arXiv:2604.17653, 2026

[37] [38]

Kapil Vaidya, Abishek Sankararaman, Jialin Ding, Chuan Lei, Xiao Qin, Balakrishnan Narayanaswamy, and Tim Kraska. Odin: A nl2sql recommender to handle schema ambiguity. arXiv preprint arXiv:2505.19302, 2025

[38] [39]

Bing Wang, Yan Gao, Zhoujun Li, and Jian-Guang Lou. Know what i don’t know: Handling ambiguous and unknown questions for text-to-sql. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5701–5714, 2023

[39] [40]

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. Mac-sql: A multi-agent collaborative framework for text-to-sql. In Proceedings of the 31st International Conference on Computational Linguistics, pages 540–557, 2025

[40] [41]

Ziyang Wang, Yuanlei Zheng, Zhenbiao Cao, Xiaojin Zhang, Zhongyu Wei, Pei Fu, Zhenbo Luo, Wei Chen, and Xiang Bai. Autolink: Autonomous schema exploration and expansion for scalable schema linking in text-to-sql at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33809–33817, 2026

[41] [42]

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al. Justice or prejudice? quantifying biases in llm-as-a-judge. arXiv preprint arXiv:2410.02736, 2024

[42] [43]

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 3911–3921, 2018

[43] [44]

Danna Zheng, Mirella Lapata, and Jeff Pan. Archer: A human-labeled text-to-sql dataset with arithmetic, commonsense and hypothetical reasoning. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 94–111, 2024.

A Related Works

Beyond NL2SQL: Ambiguity in NL. Ambiguit...

[44] [45]

Which <entity> ... ?

Return columns: <...>

General Rules
• Use ONLY column and table names in schema. Do not invent names.
• Prefer scalar subqueries for single values; use JOINs only when necessary.
• Keep each step focused (3–6 steps total).
• End with exactly one line: Return columns: <exact list>
• Use at most two guardrails per plan.

Projection Minimality (Strict)
• Re...

[45] [46]

Input: question, seed SQL (primary patch target), candidate SQLs, implementation difference report of candidate SQLs, schema evidence, and optional external info

[46] [47]

Generate a clarifying-question plan.
(a) For each question, assign question_id, taxonomy (AmbiSchema, AmbiValue, AmbiIntent), ambiguity_question, why_high_impact, and probe_sql_rationale.
(b) Generate exactly 8–9 questions.
(c) If implementation-diff is available, exactly 3 must be diff-grounded and the rest taxonomy-grounded; otherwise all taxonomy-grounded.

[47] [48]

Generate and execute ambiguity probes.
(a) AmbiSchema: ambiguous schema mappings (table/column/grain/metric).
(b) AmbiValue: question values do not directly align with database values, leading to ambiguous filter literals.
(c) AmbiIntent: ambiguous SQL semantics (e.g., ORDER BY vs. GROUP BY).
(d) Write one probe SQL per question and execute all probes, saving S...
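The probe step above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: SQLite stands in for the target database, and the Probe record, the run_probes helper, the row_limit parameter, and the orders table are hypothetical names introduced here.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Probe:
    # Hypothetical record; field names loosely follow the plan fields in the prompt.
    question_id: str
    taxonomy: str  # "AmbiSchema", "AmbiValue", or "AmbiIntent"
    probe_sql: str


def run_probes(conn, probes, row_limit=5):
    """Execute one probe SQL per clarifying question and collect the evidence."""
    evidence = {}
    for probe in probes:
        try:
            rows = conn.execute(probe.probe_sql).fetchmany(row_limit)
            error = None
        except sqlite3.Error as exc:
            # A failed probe is still recorded, so later steps can see it.
            rows, error = [], str(exc)
        evidence[probe.question_id] = {
            "taxonomy": probe.taxonomy,
            "rows": rows,
            "error": error,
        }
    return evidence


# Toy database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "SHIPPED"), (2, "PENDING"), (3, "SHIPPED")],
)

probes = [
    # AmbiValue: which literals does the column actually hold?
    Probe("q1", "AmbiValue", "SELECT DISTINCT status FROM orders ORDER BY status"),
    # AmbiSchema: does a guessed column exist at all? (It does not here.)
    Probe("q2", "AmbiSchema", "SELECT shipped_flag FROM orders LIMIT 1"),
]
evidence = run_probes(conn, probes)
```

In this toy run the AmbiValue probe shows that the stored literals are upper-case, which is the kind of evidence a later step could use to fix a filter value.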

[48] [49]

Convert probe results into resolved assumptions

[49] [50]

Apply a minimal probe-evidence-backed SQL patch.
(a) Patch only the seed SQL baseline; use candidates as reference only.
(b) Apply changes only when supported by probe evidence; keep edits minimal and localized.
(c) Use implementation diff as hypothesis context only, never as a patch source.
(d) Ensure all rewrites are probe-evidence-backed (ambiguity_res...

[50] [51]

analyze trends

Run final sanity and freeze.
(a) Execute final SQL with execute_sql.py.
(b) If execution fails or returns empty, fall back to baseline SQL.
(c) Always produce fixed_sql/<instance_id>.sql; limit to 3 attempts.

Constraints
• Do not read or use gold SQL or gold execution during fixing.
• No SQL patch before probe evidence exists.
• If implementation diff is pr...
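The fallback rule in the final sanity step can be sketched as below. This is a minimal sketch under assumptions: SQLite replaces the execute_sql.py script (whose interface is not shown in the source), "3 attempts" is read here as at most three patched SQL variants, and run_sql, freeze_sql, and the floors table are hypothetical names.

```python
import sqlite3


def run_sql(conn, sql):
    """Return the result rows, or None if execution raises an error."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def freeze_sql(conn, patched_attempts, baseline_sql, max_attempts=3):
    """Pick the first patched SQL that executes and returns rows.

    If every attempt fails or returns an empty result, fall back to the
    baseline SQL, so a final SQL is always produced.
    """
    for sql in patched_attempts[:max_attempts]:
        rows = run_sql(conn, sql)
        if rows:  # None (failed) and [] (empty) both trigger the fallback path
            return sql
    return baseline_sql


# Toy database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE floors (building TEXT, floor_no INTEGER)")
conn.executemany("INSERT INTO floors VALUES (?, ?)", [("A", 3), ("B", 7)])

baseline = "SELECT MAX(floor_no) FROM floors"
broken = "SELECT MAX(floor_num) FROM floors"  # invalid column name
empty = "SELECT floor_no FROM floors WHERE floor_no > 100"  # no rows
valid = "SELECT building FROM floors ORDER BY floor_no DESC LIMIT 1"

kept_baseline = freeze_sql(conn, [broken, empty], baseline)
kept_patch = freeze_sql(conn, [broken, valid], baseline)
```

Treating an empty result the same as an execution error is the conservative choice the prompt describes: a patch that silently filters everything out is not trusted over the baseline.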

[51] [52]

Parse the question and any clarifying Q&A into a precise intent

[52] [53]

Compare all 10 candidates against that intent

[53] [54]

Check schema validity and semantic faithfulness

[54] [55]

Select the single best candidate

[55] [56]

largest floor number

Output the candidate number and the exact SQL text, unchanged.

Output Format
Candidate: <N>
<exact SQL copied verbatim from candidate N>

Input Fields
• Question: {question}
• Clarifying Ambiguity Q&A: {ambi_blob}
• Candidate SQLs (10): {generated_sql_list}
• Schema: {references}

Hard Requirement
• Output only the candidate number and the selected SQL.
• N...
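A consumer of this output format might validate the selector's reply as sketched below. This is an assumption-laden illustration rather than code from the paper: parse_selection is a hypothetical helper, candidate numbering is assumed to be 1-based, and "verbatim" is checked after stripping surrounding whitespace.

```python
import re


def parse_selection(output, candidates):
    """Parse 'Candidate: <N>' followed by the SQL, and check it is a verbatim copy."""
    match = re.match(r"\s*Candidate:\s*(\d+)\s*\n(.*)", output, re.DOTALL)
    if match is None:
        raise ValueError("output does not follow the 'Candidate: <N>' format")
    index = int(match.group(1))
    sql = match.group(2).strip()
    if not 1 <= index <= len(candidates):
        raise ValueError(f"candidate number {index} is out of range")
    if sql != candidates[index - 1].strip():
        raise ValueError("selected SQL was modified instead of copied verbatim")
    return index, sql


# Hypothetical candidate list, shortened to two entries for illustration.
candidates = [
    "SELECT MAX(floor_no) FROM floors",
    "SELECT building FROM floors ORDER BY floor_no DESC LIMIT 1",
]
reply = "Candidate: 2\nSELECT building FROM floors ORDER BY floor_no DESC LIMIT 1\n"
selected = parse_selection(reply, candidates)
```

Rejecting a reply whose SQL differs from the numbered candidate enforces the "unchanged" requirement mechanically instead of relying on the model to comply.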
