SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing
Pith reviewed 2026-06-27 13:10 UTC · model grok-4.3
The pith
SOMA-SQL resolves multi-source ambiguity in NL-to-SQL by building synthetic query logs and running targeted execution probes to select or repair the correct SQL without human input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SOMA-SQL constructs synthetic query logs to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without human-in-the-loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.
What carries the argument
Synthetic query log construction paired with ambiguity-taxonomy-driven probing queries that generate execution-based disambiguation evidence.
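As a rough illustration of the probing idea (not the paper's implementation), the sketch below uses Python's built-in sqlite3 module on a toy table. Two candidate SQLs disagree on a filter literal, a probe query inspects the values actually stored, and the candidate whose literal appears in the data is kept. The table `orders`, its rows, both candidate queries, and the helper `pick_by_probe` are hypothetical and chosen only for illustration.

```python
import sqlite3

def pick_by_probe(conn, candidates, probe_sql):
    """Return the first candidate whose filter literal appears in the probe result.

    `candidates` maps a filter literal to the SQL that uses it (hypothetical shape).
    """
    observed = {row[0] for row in conn.execute(probe_sql)}
    for literal, sql in candidates.items():
        if literal in observed:
            return sql
    return None  # the probe gave no evidence; the caller must decide what to do

# Toy database standing in for an unseen schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "SHIPPED"), (2, "PENDING"), (3, "SHIPPED")],
)

# Two candidates that disagree only on the casing of the filter literal.
candidates = {
    "shipped": "SELECT COUNT(*) FROM orders WHERE status = 'shipped'",
    "SHIPPED": "SELECT COUNT(*) FROM orders WHERE status = 'SHIPPED'",
}

# The probe asks the database which values actually exist.
probe = "SELECT DISTINCT status FROM orders"
chosen = pick_by_probe(conn, candidates, probe)
count = conn.execute(chosen).fetchone()[0]
```

Here the disagreement between candidates is what motivates the probe, and the probe result, rather than a model's guess, decides between them.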
If this is right
- NL-to-SQL systems can resolve ambiguity across user questions, schemas, and model outputs without requiring human clarification.
- Performance gains hold on ambiguous questions and extend to previously unseen schemas and query distributions.
- The method scales to large, complex databases where static schema representations alone are insufficient.
- Execution probing supplies runtime evidence that complements candidate generation from language models.
Where Pith is reading between the lines
- The same log-and-probe pattern could be tested on other structured output tasks such as API call generation or knowledge-graph querying.
- If probing cost stays low, the technique might reduce reliance on large-scale domain-specific fine-tuning for each new database.
- Active evidence collection at inference time offers a route to make black-box model outputs more trustworthy without changing the underlying generator.
Load-bearing premise
That synthetic query logs and probing queries will produce enough disambiguation evidence to correctly select or repair SQL for arbitrary unseen schemas and underspecified questions.
What would settle it
Running SOMA-SQL on a fresh benchmark containing schemas and question distributions withheld from all development stages and finding execution accuracy no higher than the strongest baseline.
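The comparison described above depends on how execution accuracy is computed. A minimal sketch, assuming a prediction counts as correct when its result rows match the gold query's rows as an unordered multiset (benchmarks differ on details such as row order, so this is one plausible definition rather than the paper's), could look like this. The helper names `run` and `execution_accuracy` are hypothetical.

```python
import sqlite3
from collections import Counter

def run(conn, sql):
    """Execute SQL and return its rows as a multiset, or None if execution fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result multisets match."""
    hits = 0
    for predicted, gold in pairs:
        got = run(conn, predicted)
        # A prediction that fails to execute never counts as a hit.
        if got is not None and got == run(conn, gold):
            hits += 1
    return hits / len(pairs)
```

Running the same function over the baseline's predictions and SOMA-SQL's predictions on a withheld benchmark would give the two numbers the test calls for.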
Original abstract
Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguity across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these do not scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via targeted synthetic query log and ambiguity-driven probing. SOMA-SQL constructs synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without human-in-the-loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that SOMA-SQL resolves multi-source ambiguity in NL-to-SQL by constructing synthetic query logs to ground schema interpretation and executing targeted probing queries based on an ambiguity taxonomy and candidate disagreements to select or repair the correct SQL. This approach is said to generalize across unseen schemas and query distributions without human-in-the-loop. On six public benchmarks, it improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.
Significance. If the empirical claims hold and the method generalizes as described, this would represent a meaningful advance in autonomous ambiguity resolution for NL2SQL systems, addressing a central failure mode without reliance on human clarification.
major comments (2)
- [Abstract] The abstract states accuracy improvements but supplies no experimental details, baseline descriptions, error bars, dataset statistics, or controls, so it is impossible to assess whether the data support the claim.
- [Method] The description of synthetic query log construction, the ambiguity taxonomy, probing query generation, and candidate selection mechanics is high-level only; without these specifics or any validation of the load-bearing assumption that they produce sufficient disambiguation evidence for arbitrary unseen schemas, the central claim cannot be evaluated.
Simulated Author's Rebuttal
We thank the referee for their comments. We address each major comment point by point below.
- Referee: [Abstract] The abstract states accuracy improvements but supplies no experimental details, baseline descriptions, error bars, dataset statistics, or controls, so it is impossible to assess whether the data support the claim.
  Authors: Abstracts are by design concise overviews of the contribution and key results. The manuscript supplies the requested details in the Experiments section (benchmarks, baselines, dataset statistics, controls for ambiguity, and results including error bars and per-benchmark breakdowns). The abstract already notes the six benchmarks and the 13.0% average gain. We will revise the abstract to incorporate one additional sentence summarizing the experimental setup and statistical controls.
  Revision: partial
- Referee: [Method] The description of synthetic query log construction, the ambiguity taxonomy, probing query generation, and candidate selection mechanics is high-level only; without these specifics or any validation of the load-bearing assumption that they produce sufficient disambiguation evidence for arbitrary unseen schemas, the central claim cannot be evaluated.
  Authors: The Method section supplies the concrete mechanics: the procedure for generating synthetic query logs from schema metadata, the full ambiguity taxonomy with categories and examples, the algorithm for generating probing queries from candidate disagreements, and the selection/repair logic based on execution outcomes. Generalization to unseen schemas is validated empirically across the six benchmarks (which include schema and query distribution shifts), with larger gains precisely on the ambiguous subset. If the editor wishes, we can add pseudocode and an additional ablation validating the probing step.
  Revision: no
Circularity Check
No significant circularity
The paper presents an empirical NL-to-SQL system that constructs synthetic query logs, applies an ambiguity taxonomy for probing queries, and selects/repairs SQL based on execution evidence. All claims reduce to measured execution accuracy gains on public benchmarks rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text as load-bearing steps; the method is described as a practical pipeline whose validity is tested externally via generalization experiments.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Public NL2SQL benchmarks are representative of real-world multi-source ambiguity scenarios.
Prompt excerpts
Return columns: <...>
General Rules
- Use ONLY column and table names in schema. Do not invent names.
- Prefer scalar subqueries for single values; use JOINs only when necessary.
- Keep each step focused (3–6 steps total).
- End with exactly one line: Return columns: <exact list>
- Use at most two guardrails per plan.
Projection Minimality (Strict)
- Re...
- Input: question, seed SQL (primary patch target), candidate SQLs, implementation difference report of candidate SQLs, schema evidence, and optional external info.
- Generate a clarifying-question plan.
  (a) For each question, assign question_id, taxonomy (AmbiSchema, AmbiValue, AmbiIntent), ambiguity_question, why_high_impact, and probe_sql_rationale.
  (b) Generate exactly 8–9 questions.
  (c) If implementation-diff is available, exactly 3 must be diff-grounded and the rest taxonomy-grounded; otherwise all taxonomy-grounded.
- Generate and execute ambiguity probes.
  (a) AmbiSchema: ambiguous schema mappings (table/column/grain/metric).
  (b) AmbiValue: question values do not directly align with database values, leading to ambiguous filter literals.
  (c) AmbiIntent: ambiguous SQL semantics (e.g., ORDER BY vs. GROUP BY).
  (d) Write one probe SQL per question and execute all probes, saving S...
- Convert probe results into resolved assumptions.
- Apply a minimal probe-evidence-backed SQL patch.
  (a) Patch only the seed SQL baseline; use candidates as reference only.
  (b) Apply changes only when supported by probe evidence; keep edits minimal and localized.
  (c) Use implementation diff as hypothesis context only, never as a patch source.
  (d) Ensure all rewrites are probe-evidence-backed (ambiguity_res...
- Run final sanity and freeze.
  (a) Execute final SQL with execute_sql.py.
  (b) If execution fails or returns empty, fall back to baseline SQL.
  (c) Always produce fixed_sql/<instance_id>.sql; limit to 3 attempts.
Constraints
- Do not read or use gold SQL or gold execution during fixing.
- No SQL patch before probe evidence exists.
- If implementation diff is pr...
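The final sanity step in the excerpt above (execute the patched SQL, fall back to the baseline if execution fails or returns nothing, and stop after three attempts) can be sketched as follows. This is an illustration under stated assumptions: sqlite3 stands in for the excerpt's execute_sql.py, whose interface is not shown, and the function name `finalize_sql` and the reading of "attempts" as an ordered list of patched queries are hypothetical.

```python
import sqlite3

def finalize_sql(conn, patched_attempts, baseline_sql, max_attempts=3):
    """Return the first patched SQL that executes and yields rows, else the baseline.

    Only the first `max_attempts` entries of `patched_attempts` are tried.
    """
    for sql in patched_attempts[:max_attempts]:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # execution failed; move on to the next attempt
        if rows:
            return sql  # executed and returned at least one row
    # Every attempt failed or came back empty: keep the unpatched baseline.
    return baseline_sql
```

Falling back to the baseline means a patch that does not execute, or that executes to an empty result, is never shipped in place of a query that at least ran.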
- Parse the question and any clarifying Q&A into a precise intent.
- Compare all 10 candidates against that intent.
- Check schema validity and semantic faithfulness.
- Select the single best candidate.
- Output the candidate number and the exact SQL text, unchanged.
Output Format
Candidate: <N>
<exact SQL copied verbatim from candidate N>
Input Fields
- Question: {question}
- Clarifying Ambiguity Q&A: {ambi_blob}
- Candidate SQLs (10): {generated_sql_list}
- Schema: {references}
Hard Requirement
- Output only the candidate number and the selected SQL.
- N...
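Because the selection excerpt requires the SQL to be copied verbatim from the chosen candidate, a downstream check is straightforward to write. The sketch below parses the `Candidate: <N>` format and rejects any output whose SQL differs from candidate N. The function name `parse_selection` is hypothetical, and the 1-based numbering is an assumption the excerpt does not state.

```python
def parse_selection(output, candidates):
    """Parse 'Candidate: <N>' followed by SQL and verify the SQL matches candidate N.

    Assumes candidates are numbered from 1 (not stated in the excerpt).
    Raises ValueError on any format or verbatim-copy violation.
    """
    header, _, body = output.strip().partition("\n")
    if not header.startswith("Candidate:"):
        raise ValueError("missing 'Candidate: <N>' header")
    number = int(header.split(":", 1)[1].strip())  # ValueError if not an integer
    if not 1 <= number <= len(candidates):
        raise ValueError("candidate number out of range")
    sql = body.strip()
    if sql != candidates[number - 1].strip():
        raise ValueError("SQL was not copied verbatim from the chosen candidate")
    return number, sql
```

A check like this catches a selector that quietly rewrites the query instead of choosing among the given candidates.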