A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases
Pith reviewed 2026-07-01 06:07 UTC · model grok-4.3
The pith
A semantic layer and the SMQ intermediate form let an NL2SQL agent reach 94.15% execution accuracy on the 547-task Spider2-snow benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decoupling semantic intent from physical SQL execution through a curated semantic layer and an SMQ intermediate representation, the agent composes verified building blocks into final queries, delivering substantially higher accuracy on heterogeneous enterprise schemas than schema-only generation methods.
What carries the argument
The Semantic Model Query (SMQ), a compact intermediate representation of semantic intent that a deterministic compiler maps to dialect-specific SQL.
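To make the idea concrete, here is a minimal sketch of an intermediate representation compiled deterministically into dialect-specific SQL. The paper's actual SMQ grammar and compiler are not shown in this review, so everything below is an assumption made for illustration: the `SMQ` fields, the `SEMANTIC_LAYER` dictionary, the physical names `T_ORD_HDR`, `ORD_AMT`, `ORD_DT`, `RGN_CD`, and the function `compile_smq` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical semantic layer: business names mapped to physical expressions.
SEMANTIC_LAYER = {
    "table": "T_ORD_HDR",
    "measures": {"total_revenue": "SUM(ORD_AMT)"},
    "dimensions": {"order_date": "ORD_DT", "region": "RGN_CD"},
}

# Dialect-specific rendering of "truncate a date to its month".
MONTH_TRUNC = {
    "sqlite": "strftime('%Y-%m', {col})",
    "bigquery": "DATE_TRUNC({col}, MONTH)",
    "snowflake": "DATE_TRUNC('MONTH', {col})",
}


@dataclass(frozen=True)
class SMQ:
    """Illustrative stand-in for a Semantic Model Query (not the paper's format)."""
    measure: str
    group_by: Tuple[str, ...] = ()
    month_of: Optional[str] = None  # dimension to bucket by month, if any


def compile_smq(query: SMQ, dialect: str, layer: dict = SEMANTIC_LAYER) -> str:
    """Map an SMQ to SQL with no model call: same input, same output."""
    if dialect not in MONTH_TRUNC:
        raise ValueError(f"unsupported dialect: {dialect}")
    if query.measure not in layer["measures"]:
        raise KeyError(f"unknown measure: {query.measure}")

    select, groups = [], []
    for name in query.group_by:
        column = layer["dimensions"][name]  # KeyError on unknown dimensions
        select.append(f"{column} AS {name}")
        groups.append(column)
    if query.month_of is not None:
        column = layer["dimensions"][query.month_of]
        expr = MONTH_TRUNC[dialect].format(col=column)
        select.append(f"{expr} AS {query.month_of}_month")
        groups.append(expr)
    select.append(f"{layer['measures'][query.measure]} AS {query.measure}")

    sql = f"SELECT {', '.join(select)} FROM {layer['table']}"
    if groups:
        sql += f" GROUP BY {', '.join(groups)}"
    return sql
```

Calling `compile_smq(SMQ("total_revenue", ("region",), "order_date"), "snowflake")` and then the same query with `"sqlite"` yields two different SQL strings from one unchanged query object, which is the sense in which the language model never has to know the dialect.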
If this is right
- The same agent workflow can target multiple SQL dialects without retraining the language model.
- Accuracy gains depend on semantic-layer quality rather than model scale alone.
- The constrained think-act loop reduces invalid SQL output by restricting actions to verified SMQ operations.
- End-to-end evaluation frameworks become feasible for comparing semantic-layer versus schema-only approaches.
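The constrained think-act loop in the third point can be sketched as a loop that refuses any action outside a fixed whitelist. This is only an illustration under stated assumptions: the action names, the `run_loop` function, the `TOOLS` table, and the scripted policy standing in for the language model are all hypothetical and are not taken from the paper.

```python
# Hypothetical whitelist: the agent may only inspect the layer, compile, or stop.
ALLOWED_ACTIONS = {"list_measures", "compile_smq", "finish"}

# Stand-in for verified building blocks produced by a deterministic compiler.
MEASURES = {
    "total_revenue": "SELECT SUM(ORD_AMT) AS total_revenue FROM T_ORD_HDR",
}

TOOLS = {
    "list_measures": lambda _arg: sorted(MEASURES),
    "compile_smq": lambda name: MEASURES[name],
}


def run_loop(policy, tools, max_steps=5):
    """Alternate think and act; reject actions outside the whitelist."""
    observation = "start"
    trace = []
    for _ in range(max_steps):
        thought, action, arg = policy(observation)
        if action not in ALLOWED_ACTIONS:
            # The rejection is fed back as an observation instead of executed.
            observation = f"rejected: {action!r} is not an allowed action"
            trace.append((thought, action, observation))
            continue
        if action == "finish":
            trace.append((thought, action, arg))
            return arg, trace
        observation = tools[action](arg)
        trace.append((thought, action, observation))
    return None, trace  # step budget exhausted without an answer


def make_scripted_policy():
    """Scripted replacement for the language model, for demonstration only."""
    script = [
        ("try raw SQL directly", "run_raw_sql", "DROP TABLE orders"),
        ("see which measures exist", "list_measures", None),
        ("compile the revenue measure", "compile_smq", "total_revenue"),
    ]
    step = 0

    def policy(observation):
        nonlocal step
        if step < len(script):
            move = script[step]
            step += 1
            return move
        # After the script ends, return the last observation as the answer.
        return ("the compiled SQL is the answer", "finish", observation)

    return policy


answer, trace = run_loop(make_scripted_policy(), TOOLS)
```

In this run the first proposed action, free-form SQL, is rejected and never reaches a database; only the compiled building block is returned as the answer.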
Where Pith is reading between the lines
- The approach may lower the volume of labeled query examples needed for training by shifting grounding work to the semantic layer.
- Similar mediation layers could be applied to other structured query languages or API call generation tasks.
- Automated tools for maintaining semantic-layer consistency would be a natural next engineering step.
Load-bearing premise
The curated semantic layer accurately captures business intent across the enterprise schema without introducing systematic errors or bias that would invalidate the SMQ-to-SQL compilation.
What would settle it
Running the same agent on a fresh enterprise schema supplied with an incomplete or mismatched semantic layer and observing whether execution accuracy falls below that of a direct schema-only baseline.
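Deciding whether accuracy "falls below" a baseline requires some allowance for sampling noise on a few hundred tasks. The sketch below computes a Wilson score interval for a proportion. The count of 515 correct tasks is inferred here from the reported 94.15% of 547 and is not stated in the source; no baseline figure appears in the source, so none is supplied. The helper names `wilson_interval` and `clearly_below` are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def clearly_below(interval_a, interval_b):
    """Conservative check: interval_a lies entirely below interval_b."""
    return interval_a[1] < interval_b[0]


# 515 is inferred from 0.9415 * 547, not quoted from the paper.
reported = wilson_interval(515, 547)
```

Non-overlapping intervals are a rough, conservative criterion. Because both configurations would be run on the same tasks, a paired comparison of per-task outcomes would be the sharper test.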
Original abstract
Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names, heterogeneous SQL dialects, and complex analytical workloads requiring nested aggregations, temporal reasoning, and multi-table joins. We present a semantic-layer-mediated NL2SQL agent that decouples semantic intent from physical SQL execution. Rather than generating SQL directly over raw schemas, the agent reasons over a curated semantic layer through a compact intermediate representation called the Semantic Model Query (SMQ). A deterministic compiler translates each SMQ into dialect-specific SQL, providing verified building blocks that the agent composes into the final query. The system employs a constrained think-act loop, supports SQLite, BigQuery, and Snowflake backends, and is integrated into an end-to-end evaluation framework. Using Gemini 3 Pro, the system achieves 94.15% execution accuracy on the 547-task Spider2-snow benchmark, ranking third on the official leaderboard and substantially outperforming schema-only approaches. We describe the system architecture, SMQ representation, agent workflow, evaluation results, and discuss semantic-layer quality and the trade-off between improved grounding and overfitting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a semantic-layer-mediated NL2SQL agent that reasons over a human-curated semantic layer using a compact intermediate representation called Semantic Model Query (SMQ). A deterministic compiler translates SMQ into dialect-specific SQL for backends including SQLite, BigQuery, and Snowflake. The agent employs a constrained think-act loop. On the 547-task Spider2-snow benchmark, the system with Gemini 3 Pro achieves 94.15% execution accuracy (third on the official leaderboard) and substantially outperforms schema-only baselines.
Significance. If the central performance claim holds after addressing the evaluation gaps, the work would demonstrate that semantic-layer mediation plus deterministic compilation can materially improve NL2SQL accuracy on heterogeneous enterprise schemas with cryptic names and complex analytical patterns. The deterministic compiler is a concrete strength for verifiability. The result would be of interest to both academic NL2SQL and industrial deployment communities, provided the contribution of the curated layer versus the agent architecture can be isolated.
major comments (1)
- [Abstract] Abstract: the 94.15% execution accuracy and the gap versus schema-only approaches are obtained by generating SMQ over a human-curated semantic layer. No ablation is reported that measures accuracy when the layer is replaced by an automatically derived or null layer, even though the abstract itself flags the "trade-off between improved grounding and overfitting." This measurement is required to attribute the reported gains to the SMQ compiler and constrained loop rather than benchmark-specific curation.
minor comments (2)
- [Abstract] Abstract and evaluation description: no details are supplied on the evaluation protocol, error analysis, baseline re-implementations, or statistical significance testing, which limits assessment of whether the data support the ranking and accuracy claims.
- The manuscript mentions integration into an end-to-end evaluation framework but provides no description of how the 547 tasks were executed or how execution accuracy was computed across the three supported dialects.
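For readers unfamiliar with the metric the second comment asks about, one common way to compute execution accuracy is to run the predicted and reference queries against the same database and compare the returned rows. The sketch below does this on an in-memory SQLite database, treating results as order-insensitive multisets. It is an assumption about the general technique, not a description of the paper's framework or of the official Spider 2.0 evaluator; the table, the queries, and the function names are invented for the example.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, tasks):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    correct = 0
    for predicted_sql, gold_sql in tasks:
        gold = run(conn, gold_sql)
        predicted = run(conn, predicted_sql)
        if gold is not None and predicted == gold:
            correct += 1
    return correct / len(tasks)


# Toy database and tasks, made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EU", 10), ("EU", 20), ("US", 5)],
)

tasks = [
    # Same rows in a different order: counted as correct.
    ("SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region DESC",
     "SELECT region, SUM(amount) FROM orders GROUP BY region"),
    # Valid SQL, wrong answer.
    ("SELECT COUNT(*) FROM orders", "SELECT SUM(amount) FROM orders"),
    # Invalid SQL: counted as incorrect.
    ("SELECT FROM orders", "SELECT COUNT(*) FROM orders"),
]

accuracy = execution_accuracy(conn, tasks)
```

Choices such as ignoring row order, tolerating extra columns, or rounding floats change the score, which is why the protocol details requested above matter for interpreting a leaderboard number.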
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below and commit to revisions that directly respond to the concern about isolating the contribution of the human-curated semantic layer.
- Referee: [Abstract] Abstract: the 94.15% execution accuracy and the gap versus schema-only approaches are obtained by generating SMQ over a human-curated semantic layer. No ablation is reported that measures accuracy when the layer is replaced by an automatically derived or null layer, even though the abstract itself flags the "trade-off between improved grounding and overfitting." This measurement is required to attribute the reported gains to the SMQ compiler and constrained loop rather than benchmark-specific curation.
Authors: We agree that the current evaluation does not include an ablation that replaces the human-curated semantic layer with either an automatically derived layer or a null layer, and that such a measurement would strengthen attribution of gains specifically to the SMQ compiler and constrained think-act loop. The schema-only baselines already remove the semantic layer and show a substantial gap, but they do not test an automatically derived alternative. We will add this ablation in the revised manuscript: we will automatically derive a semantic layer from the raw schema (following standard schema-to-semantic mapping heuristics) and re-run the 547-task Spider2-snow evaluation with Gemini 3 Pro under otherwise identical conditions. The results and updated discussion of the grounding-overfitting trade-off will be included in Section 5 and the abstract.
Revision: yes
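The rebuttal does not say which heuristics would be used to derive a layer automatically. As one possible reading, the sketch below expands abbreviated column names and classifies numeric non-key columns as summable measures. The abbreviation table, the type list, the `_id` rule, and the example schema are all assumptions chosen for illustration, not the authors' method.

```python
import re

# Hypothetical abbreviation dictionary; a real one would be far larger.
ABBREVIATIONS = {
    "ord": "order", "amt": "amount", "dt": "date",
    "rgn": "region", "cd": "code", "hdr": "header",
}

NUMERIC_TYPES = {"INTEGER", "INT", "NUMERIC", "DECIMAL", "FLOAT", "NUMBER"}


def readable_name(physical):
    """Expand known abbreviations in a physical identifier."""
    tokens = [t for t in re.split(r"[_\W]+", physical.lower()) if t]
    return "_".join(ABBREVIATIONS.get(t, t) for t in tokens)


def derive_layer(schema):
    """Build a naive semantic layer from {table: {column: sql_type}}."""
    layer = {}
    for table, columns in schema.items():
        entry = {"table": table, "measures": {}, "dimensions": {}}
        for column, sql_type in columns.items():
            name = readable_name(column)
            is_numeric = sql_type.upper() in NUMERIC_TYPES
            # Assumed rule: numeric columns are measures unless they look like keys.
            if is_numeric and not name.endswith("_id"):
                entry["measures"][f"total_{name}"] = f"SUM({column})"
            else:
                entry["dimensions"][name] = column
        layer[readable_name(table)] = entry
    return layer


# Made-up schema with cryptic names of the kind the abstract describes.
raw_schema = {
    "T_ORD_HDR": {
        "ORD_ID": "INTEGER",
        "ORD_AMT": "NUMERIC",
        "ORD_DT": "DATE",
        "RGN_CD": "TEXT",
    }
}

auto_layer = derive_layer(raw_schema)
```

A layer built this way carries no business knowledge beyond what the names and types reveal, which is what would make it a useful control: any accuracy retained with it is harder to attribute to benchmark-specific curation.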
Circularity Check
No significant circularity; performance claim is externally benchmarked.
The paper's central result is an empirical execution accuracy of 94.15% on the independent Spider2-snow benchmark (547 tasks). This is measured output against an external test set rather than a quantity derived by construction from fitted parameters, self-citations, or redefinitions internal to the system. The semantic layer is described as human-curated input, but the reported metric does not reduce to that curation by algebraic identity or statistical forcing; the abstract explicitly flags the grounding-vs-overfitting trade-off without claiming the accuracy is forced by the layer definition itself. No equations, uniqueness theorems, or self-citation chains appear in the provided text that would collapse the result to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: a high-quality curated semantic layer exists that accurately represents enterprise data semantics without bias or incompleteness.
invented entities (2)
- Semantic Model Query (SMQ): no independent evidence
- Semantic layer: no independent evidence