
arxiv: 2606.11424 · v1 · pith:NXI27MLPnew · submitted 2026-06-09 · 💻 cs.CL

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

Pith reviewed 2026-06-27 13:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords NL-to-SQL · ambiguity resolution · synthetic query logs · execution probing · natural language interfaces to databases · SQL generation · schema grounding

The pith

SOMA-SQL resolves multi-source ambiguity in NL-to-SQL by building synthetic query logs and running targeted execution probes to select or repair the correct SQL without human input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SOMA-SQL as a method that first constructs synthetic query logs to ground schema meanings and steer candidate SQL generation, then issues probing queries guided by an ambiguity taxonomy to gather evidence from execution results. This evidence lets the system pick or fix the final SQL for questions that are underspecified across user intent, schema structure, and model output. A sympathetic reader would care because real-world NL-to-SQL systems break on large, ambiguous databases and vague questions, and current fixes either demand human clarification or fail to scale. If the approach works as described, NL database interfaces could operate reliably on unseen schemas and query styles without ongoing human oversight or retraining.

Core claim

SOMA-SQL constructs synthetic query logs to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without human-in-the-loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.
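
To make the probe-then-select step concrete, here is a minimal sketch using Python's standard-library sqlite3 module. It is not the paper's implementation: the `orders` table, the two candidate queries, the probe, and the helper `pick_by_probe` are all hypothetical, and the sketch covers only one simple case, where candidates disagree on a filter literal and a probe checks which literal actually occurs in the data.

```python
import sqlite3

def pick_by_probe(conn, candidates, probe_sql):
    """Return the first candidate whose filter literal appears in the probe result.

    candidates: list of (literal, sql) pairs (a hypothetical representation).
    probe_sql: a query returning the distinct values of the disputed column.
    """
    observed = {row[0] for row in conn.execute(probe_sql)}
    for literal, sql in candidates:
        if literal in observed:
            return sql
    return None  # no probe evidence: leave the decision to a fallback

# Toy database standing in for a real schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "SHIPPED"), (2, "PENDING"), (3, "SHIPPED")])

# Two candidates that differ only in how they spell the filter value.
candidates = [
    ("shipped", "SELECT COUNT(*) FROM orders WHERE status = 'shipped'"),
    ("SHIPPED", "SELECT COUNT(*) FROM orders WHERE status = 'SHIPPED'"),
]
probe = "SELECT DISTINCT status FROM orders"

chosen = pick_by_probe(conn, candidates, probe)
count = conn.execute(chosen).fetchone()[0]
print(chosen, count)
```

The probe costs one extra query but turns a guess about the stored spelling into evidence read from the database itself, which is the general idea the core claim describes.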

What carries the argument

Synthetic query log construction paired with ambiguity-taxonomy-driven probing queries that generate execution-based disambiguation evidence.
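
The summary does not say how the synthetic log is built, so the following is only an illustrative guess at the simplest possible version: template queries derived from schema metadata and kept only if they execute. The function `synthetic_log`, the query template, and the `customers` table are assumptions made for this sketch; the paper's own procedure may differ substantially.

```python
import sqlite3

def synthetic_log(conn):
    """Build (description, sql) pairs from schema metadata, keeping only queries that run."""
    log = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name").fetchall()]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for row in columns:
            col = row[1]
            sql = f'SELECT COUNT(DISTINCT "{col}") FROM "{table}"'
            desc = f"How many distinct {col} values are in {table}?"
            try:
                conn.execute(sql).fetchall()  # validate by execution
            except sqlite3.Error:
                continue  # drop templates that do not run on this schema
            log.append((desc, sql))
    return log

# Toy schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")

log = synthetic_log(conn)
for desc, sql in log:
    print(desc, "->", sql)
```

Each entry pairs a natural-language description with a query known to execute, which is the kind of grounding signal a generator could be shown as context.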

If this is right

  • NL-to-SQL systems can resolve ambiguity across user questions, schemas, and model outputs without requiring human clarification.
  • Performance gains hold on ambiguous questions and extend to previously unseen schemas and query distributions.
  • The method scales to large, complex databases where static schema representations alone are insufficient.
  • Execution probing supplies runtime evidence that complements candidate generation from language models.
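
One way to picture how runtime evidence complements generation is to execute every candidate and group them by result: if all candidates agree there is nothing to probe, and if they split into groups, the split marks where a probe is needed. The sketch below is a hypothetical illustration with the standard-library sqlite3 module; `group_by_result`, the `employees` table, and the candidate queries are not taken from the paper.

```python
import sqlite3
from collections import defaultdict

def group_by_result(conn, candidate_sqls):
    """Group candidate queries by an order-insensitive fingerprint of their results."""
    groups = defaultdict(list)
    for sql in candidate_sqls:
        try:
            rows = conn.execute(sql).fetchall()
            key = tuple(sorted(rows, key=repr))
        except sqlite3.Error as exc:
            key = ("error", type(exc).__name__)  # failing queries form their own group
        groups[key].append(sql)
    return dict(groups)

# Toy data (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("a", "eng", 100), ("b", "ops", 120)])

candidates = [
    "SELECT MAX(salary) FROM employees",
    "SELECT salary FROM employees ORDER BY salary DESC LIMIT 1",
    "SELECT MAX(salary) FROM employees WHERE dept = 'eng'",
]
groups = group_by_result(conn, candidates)
needs_probe = len(groups) > 1  # disagreement signals an unresolved ambiguity
print(len(groups), needs_probe)
```

Here the first two candidates agree and the third, which silently restricts the question to one department, does not, so the disagreement localizes the ambiguity to the filter.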

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same log-and-probe pattern could be tested on other structured output tasks such as API call generation or knowledge-graph querying.
  • If probing cost stays low, the technique might reduce reliance on large-scale domain-specific fine-tuning for each new database.
  • Active evidence collection at inference time offers a route to make black-box model outputs more trustworthy without changing the underlying generator.

Load-bearing premise

That synthetic query logs and probing queries will produce enough disambiguation evidence to correctly select or repair SQL for arbitrary unseen schemas and underspecified questions.

What would settle it

Running SOMA-SQL on a fresh benchmark containing schemas and question distributions withheld from all development stages and finding execution accuracy no higher than the strongest baseline.
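
The comparison this test calls for reduces to computing execution accuracy for two systems on the same held-out questions. The sketch below shows one common way to compute it, treating a prediction as correct when its result rows match the gold query's rows regardless of order. The helper `execution_accuracy`, the toy table, and all queries are hypothetical, and real benchmarks differ on details such as whether row order matters.

```python
import sqlite3

def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result rows match."""
    hits = 0
    for predicted, gold in pairs:
        gold_rows = sorted(conn.execute(gold).fetchall(), key=repr)
        try:
            pred_rows = sorted(conn.execute(predicted).fetchall(), key=repr)
        except sqlite3.Error:
            continue  # a query that fails to run counts as a miss
        hits += pred_rows == gold_rows
    return hits / len(pairs)

# Toy held-out set (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

gold = ["SELECT SUM(x) FROM t", "SELECT x FROM t",
        "SELECT COUNT(*) FROM t", "SELECT MAX(x) FROM t"]
system_preds = ["SELECT SUM(x) FROM t", "SELECT x FROM t ORDER BY x DESC",
                "SELECT COUNT(x) FROM t", "SELECT MIN(x) FROM t"]
baseline_preds = ["SELECT SUM(x) FROM t", "SELECT x FROM t WHERE x > 1",
                  "SELECT COUNT(*) FROM t", "SELECT x FROM missing_table"]

system_acc = execution_accuracy(conn, list(zip(system_preds, gold)))
baseline_acc = execution_accuracy(conn, list(zip(baseline_preds, gold)))
print(system_acc, baseline_acc)
```

On genuinely withheld schemas, a `system_acc` no higher than `baseline_acc` would be the outcome that counts against the generalization claim.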

Figures

Figures reproduced from arXiv: 2606.11424 by Ankan Bansal, Chuan Lei, Daniel Garcia, Dan Roth, Fjona Parllaku, Marianne Menglin Liu, Rongguang Wang, Sai Ashish Somayajula, Sujeeth Bharadwaj, Sujith Ravi, Syed Fahad Allam Shah, Tao Sheng.

Figure 1: An overview of SOMA-SQL consisting of (a) Synthetic Query Log Construction for Disambiguation, (b) Ambiguity-aware SQL Probing and Correction, and (c) Multi-SQL Generation.

Original abstract

Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguity across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these do not scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via targeted synthetic query log and ambiguity-driven probing. SOMA-SQL constructs synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without human-in-the-loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that SOMA-SQL resolves multi-source ambiguity in NL-to-SQL by constructing synthetic query logs to ground schema interpretation and executing targeted probing queries based on an ambiguity taxonomy and candidate disagreements to select or repair the correct SQL. This approach is said to generalize across unseen schemas and query distributions without human-in-the-loop. On six public benchmarks, it improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

Significance. If the empirical claims hold and the method generalizes as described, this would represent a meaningful advance in autonomous ambiguity resolution for NL2SQL systems, addressing a central failure mode without reliance on human clarification.

major comments (2)
  1. [Abstract] The abstract states accuracy improvements but supplies no experimental details, baseline descriptions, error bars, dataset statistics, or controls, so it is impossible to assess whether the data support the claim.
  2. [Method] The description of synthetic query log construction, the ambiguity taxonomy, probing query generation, and candidate selection mechanics is high-level only; without these specifics or any validation of the load-bearing assumption that they produce sufficient disambiguation evidence for arbitrary unseen schemas, the central claim cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We address each major comment point by point below.

  1. Referee: [Abstract] The abstract states accuracy improvements but supplies no experimental details, baseline descriptions, error bars, dataset statistics, or controls, so it is impossible to assess whether the data support the claim.

    Authors: Abstracts are by design concise overviews of the contribution and key results. The manuscript supplies the requested details in the Experiments section (benchmarks, baselines, dataset statistics, controls for ambiguity, and results including error bars and per-benchmark breakdowns). The abstract already notes the six benchmarks and the 13.0% average gain. We will revise the abstract to incorporate one additional sentence summarizing the experimental setup and statistical controls. revision: partial

  2. Referee: [Method] The description of synthetic query log construction, the ambiguity taxonomy, probing query generation, and candidate selection mechanics is high-level only; without these specifics or any validation of the load-bearing assumption that they produce sufficient disambiguation evidence for arbitrary unseen schemas, the central claim cannot be evaluated.

    Authors: The Method section supplies the concrete mechanics: the procedure for generating synthetic query logs from schema metadata, the full ambiguity taxonomy with categories and examples, the algorithm for generating probing queries from candidate disagreements, and the selection/repair logic based on execution outcomes. Generalization to unseen schemas is validated empirically across the six benchmarks (which include schema and query distribution shifts), with larger gains precisely on the ambiguous subset. If the editor wishes, we can add pseudocode and an additional ablation validating the probing step. revision: no

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical NL-to-SQL system that constructs synthetic query logs, applies an ambiguity taxonomy for probing queries, and selects/repairs SQL based on execution evidence. All claims reduce to measured execution accuracy gains on public benchmarks rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text as load-bearing steps; the method is described as a practical pipeline whose validity is tested externally via generalization experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven effectiveness of the introduced synthetic-log and probing mechanisms for autonomous disambiguation; the evaluation assumes public benchmarks capture representative ambiguity.

axioms (1)
  • Domain assumption: Public NL2SQL benchmarks are representative of real-world multi-source ambiguity scenarios.
    The generalization claim is supported only by results on six public benchmarks.

pith-pipeline@v0.9.1-grok · 5783 in / 1230 out tokens · 26435 ms · 2026-06-27T13:10:26.913372+00:00 · methodology
