A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases

Ha Jeong Kim; Saksonita Khoeurn; Ye Ji Yoon

arxiv: 2606.31041 · v1 · pith:W5IYLFAKnew · submitted 2026-06-30 · 💻 cs.CL

Pith reviewed 2026-07-01 06:07 UTC · model grok-4.3

keywords: NL2SQL, semantic layer, Semantic Model Query, enterprise databases, natural language to SQL, SQL generation, agent workflow, intermediate representation

The pith

A semantic layer and an SMQ intermediate representation let an NL2SQL agent reach 94.15 percent execution accuracy on the Spider2-snow enterprise benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an NL2SQL agent that reasons over a curated semantic layer rather than raw database schemas. Queries are expressed in a compact intermediate representation called Semantic Model Query (SMQ), which a deterministic compiler then translates into dialect-specific SQL for backends including SQLite, BigQuery, and Snowflake. This architecture uses a constrained think-act loop and achieves 94.15 percent execution accuracy on the 547-task Spider2-snow benchmark with Gemini 3 Pro, ranking third on the leaderboard while outperforming direct schema approaches. The work focuses on handling cryptic column names, heterogeneous dialects, and complex analytical workloads typical of enterprise databases.

Core claim

By decoupling semantic intent from physical SQL execution through a curated semantic layer and an SMQ intermediate representation, the agent composes verified building blocks into final queries, delivering substantially higher accuracy on heterogeneous enterprise schemas than schema-only generation methods.

What carries the argument

The Semantic Model Query (SMQ), a compact intermediate representation of semantic intent that a deterministic compiler maps to dialect-specific SQL.
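
To make the mechanism concrete, here is a minimal sketch of what "a compact intermediate representation compiled deterministically into dialect-specific SQL" could look like. The paper's actual SMQ schema and compiler are not shown on this page, so everything here is an assumption made for illustration: the `SMQ` dataclass, the `SEMANTIC_LAYER` dictionary, the `compile_smq` function, and the cryptic table and column names.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical semantic layer: business-facing names mapped to cryptic
# physical columns. Table and column names are invented for illustration.
SEMANTIC_LAYER = {
    "table": "T_ORD_001",
    "dimensions": {"region": "RGN_CD", "order_year": "ORD_YR"},
    "metrics": {"revenue": ("SUM", "NET_AMT"), "orders": ("COUNT", "ORD_ID")},
}

# Identifier quoting differs by dialect: BigQuery uses backticks, while
# SQLite and Snowflake accept double-quoted identifiers.
QUOTE = {"sqlite": '"{}"', "bigquery": "`{}`", "snowflake": '"{}"'}


@dataclass(frozen=True)
class SMQ:
    """Toy Semantic Model Query: which metrics, grouped by which dimensions."""
    metrics: List[str]
    group_by: List[str] = field(default_factory=list)
    limit: Optional[int] = None


def compile_smq(smq: SMQ, dialect: str, layer: dict = SEMANTIC_LAYER) -> str:
    """Deterministically translate an SMQ into SQL for one dialect."""
    if dialect not in QUOTE:
        raise ValueError(f"unsupported dialect: {dialect}")
    q = QUOTE[dialect].format
    # Unknown names raise KeyError, so only terms defined in the layer compile.
    dims = [(name, layer["dimensions"][name]) for name in smq.group_by]
    mets = [(name, *layer["metrics"][name]) for name in smq.metrics]
    select = [f"{q(col)} AS {q(name)}" for name, col in dims]
    select += [f"{func}({q(col)}) AS {q(name)}" for name, func, col in mets]
    sql = f"SELECT {', '.join(select)} FROM {q(layer['table'])}"
    if dims:
        sql += " GROUP BY " + ", ".join(q(col) for _, col in dims)
    if smq.limit is not None:
        sql += f" LIMIT {int(smq.limit)}"
    return sql


query = SMQ(metrics=["revenue"], group_by=["region"])
for dialect in ("sqlite", "bigquery", "snowflake"):
    print(dialect, "->", compile_smq(query, dialect))
```

The point of the sketch is the division of labour: the model only has to name `revenue` and `region`, while the mapping to physical columns and the dialect differences live in deterministic code that can be tested independently of any language model.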

If this is right

The same agent workflow can target multiple SQL dialects without retraining the language model.
Accuracy gains depend on semantic-layer quality rather than model scale alone.
The constrained think-act loop reduces invalid SQL output by restricting actions to verified SMQ operations.
End-to-end evaluation frameworks become feasible for comparing semantic-layer versus schema-only approaches.
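
The constrained think-act loop mentioned in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the action names in `ALLOWED_ACTIONS`, the `run_loop` function, the scripted policy standing in for the language model, and the stub tools are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical whitelist: the agent may only inspect the semantic layer,
# compile an SMQ, or finish. Raw SQL against the schema is not an action.
ALLOWED_ACTIONS = {"list_metrics", "list_dimensions", "compile_smq", "finish"}

Step = Tuple[str, str, Any]  # (thought, action, observation)


def run_loop(policy: Callable[[List[Step]], Tuple[str, str, Any]],
             tools: Dict[str, Callable[[Any], Any]],
             max_steps: int = 8) -> Tuple[Any, List[Step]]:
    """Alternate think and act steps, rejecting any action off the whitelist."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        if action not in ALLOWED_ACTIONS:
            # The rejection is fed back as an observation so the policy can recover.
            history.append((thought, action, "rejected: action not permitted"))
            continue
        if action == "finish":
            return arg, history
        history.append((thought, action, tools[action](arg)))
    raise RuntimeError("step budget exhausted without a final answer")


def scripted_policy(history: List[Step]) -> Tuple[str, str, Any]:
    """Stand-in for the language model: a fixed script keyed on step count."""
    step = len(history)
    if step == 0:
        return ("Try raw SQL first", "run_raw_sql", "SELECT * FROM T_ORD_001")
    if step == 1:
        return ("See which metrics exist", "list_metrics", None)
    if step == 2:
        return ("Compose from a verified block", "compile_smq", {"metrics": ["revenue"]})
    return ("Return the compiled SQL", "finish", history[-1][2])


# Stub tools with canned outputs; a real system would query the layer and compiler.
TOOLS = {
    "list_metrics": lambda _: ["revenue", "orders"],
    "list_dimensions": lambda _: ["region", "order_year"],
    "compile_smq": lambda smq: "SELECT SUM(NET_AMT) AS revenue FROM T_ORD_001",
}

answer, trace = run_loop(scripted_policy, TOOLS)
print(answer)
for thought, action, observation in trace:
    print(f"{action}: {observation}")
```

In this sketch the off-whitelist `run_raw_sql` attempt is recorded and rejected rather than executed, which is one plausible reading of how restricting actions keeps the agent inside the semantic model.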

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may lower the volume of labeled query examples needed for training by shifting grounding work to the semantic layer.
Similar mediation layers could be applied to other structured query languages or API call generation tasks.
Automated tools for maintaining semantic-layer consistency would be a natural next engineering step.

Load-bearing premise

The curated semantic layer accurately captures business intent across the enterprise schema without introducing systematic errors or bias that would invalidate the SMQ-to-SQL compilation.

What would settle it

Running the same agent on a fresh enterprise schema supplied with an incomplete or mismatched semantic layer and observing whether execution accuracy falls below that of a direct schema-only baseline.

Original abstract

Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names, heterogeneous SQL dialects, and complex analytical workloads requiring nested aggregations, temporal reasoning, and multi-table joins. We present a semantic-layer-mediated NL2SQL agent that decouples semantic intent from physical SQL execution. Rather than generating SQL directly over raw schemas, the agent reasons over a curated semantic layer through a compact intermediate representation called the Semantic Model Query (SMQ). A deterministic compiler translates each SMQ into dialect-specific SQL, providing verified building blocks that the agent composes into the final query. The system employs a constrained think-act loop, supports SQLite, BigQuery, and Snowflake backends, and is integrated into an end-to-end evaluation framework. Using Gemini 3 Pro, the system achieves 94.15% execution accuracy on the 547-task Spider2-snow benchmark, ranking third on the official leaderboard and substantially outperforming schema-only approaches. We describe the system architecture, SMQ representation, agent workflow, evaluation results, and discuss semantic-layer quality and the trade-off between improved grounding and overfitting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core idea is to route NL2SQL through a curated semantic layer and a compact SMQ intermediate representation that compiles to dialect-specific SQL. The system reaches 94.15% execution accuracy on Spider2-snow, but the gain is not isolated from the contribution of the curated layer itself.

The main thing to know is that this system reaches third place on the Spider2-snow leaderboard by letting the LLM reason over a human-curated semantic layer via a new intermediate representation called SMQ, then compiling that representation deterministically to SQL for SQLite, BigQuery, or Snowflake. The constrained think-act loop keeps the agent within the semantic model rather than the raw schema.

The new pieces are the SMQ representation itself and the compiler that turns it into verified building blocks. This setup directly targets the pain points of enterprise schemas (hundreds of tables, cryptic names, nested aggregations, and multiple dialects), which most academic NL2SQL work ignores. The multi-backend support and end-to-end framework are also concrete engineering contributions that could be reused.

The architecture description and workflow are clear enough that a practitioner could implement the main loop. The authors themselves note the grounding-versus-overfitting trade-off, which shows they are thinking about the right risks.

The soft spot is the missing ablation: there is no measurement of what happens when the curated layer is replaced by an automatic extraction or by the raw schema alone. Without that, the 94% result cannot be cleanly attributed to SMQ or the think-act loop rather than the layer already encoding the needed joins and business logic. Evaluation details such as baseline re-implementations, error breakdown, and statistical tests are also absent from the abstract, so the soundness claim is hard to assess.

This is for engineers and researchers who build or evaluate production NL2SQL tools on real heterogeneous databases. A reader in that group would get usable ideas from the architecture even if the numbers need more scrutiny.

It deserves peer review so the layer construction process and the missing ablations can be checked against the full methods.

Referee Report

1 major / 2 minor

Summary. The paper presents a semantic-layer-mediated NL2SQL agent that reasons over a human-curated semantic layer using a compact intermediate representation called Semantic Model Query (SMQ). A deterministic compiler translates SMQ into dialect-specific SQL for backends including SQLite, BigQuery, and Snowflake. The agent employs a constrained think-act loop. On the 547-task Spider2-snow benchmark, the system with Gemini 3 Pro achieves 94.15% execution accuracy (third on the official leaderboard) and substantially outperforms schema-only baselines.

Significance. If the central performance claim holds after addressing the evaluation gaps, the work would demonstrate that semantic-layer mediation plus deterministic compilation can materially improve NL2SQL accuracy on heterogeneous enterprise schemas with cryptic names and complex analytical patterns. The deterministic compiler is a concrete strength for verifiability. The result would be of interest to both academic NL2SQL and industrial deployment communities, provided the contribution of the curated layer versus the agent architecture can be isolated.

major comments (1)

[Abstract] The 94.15% execution accuracy and the gap versus schema-only approaches are obtained by generating SMQ over a human-curated semantic layer. No ablation is reported that measures accuracy when the layer is replaced by an automatically derived or null layer, even though the abstract itself flags the "trade-off between improved grounding and overfitting." This measurement is required to attribute the reported gains to the SMQ compiler and constrained loop rather than benchmark-specific curation.
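
The requested ablation amounts to holding the tasks and model fixed while varying only the semantic layer. A minimal harness for that design might look like the sketch below. It makes no claim about the authors' framework: `run_ablation`, the three condition names, and `toy_solve` are hypothetical, and the toy solver is a deliberately crude stand-in for the real agent, so the numbers it produces say nothing about Spider2-snow.

```python
from typing import Callable, Dict, FrozenSet, Iterable

Layer = FrozenSet[str]  # toy layer: the set of business terms it defines
Task = FrozenSet[str]   # toy task: the business terms a question needs


def run_ablation(tasks: Iterable[Task],
                 layers: Dict[str, Layer],
                 solve: Callable[[Task, Layer], bool]) -> Dict[str, float]:
    """Return accuracy per layer condition with tasks and solver held fixed."""
    task_list = list(tasks)
    return {
        name: sum(bool(solve(task, layer)) for task in task_list) / len(task_list)
        for name, layer in layers.items()
    }


def toy_solve(task: Task, layer: Layer) -> bool:
    """Stand-in for the agent: succeeds only if the layer defines every term."""
    return task <= layer


# Invented toy data: four questions and three layer conditions.
TASKS = [
    frozenset({"revenue"}),
    frozenset({"revenue", "region"}),
    frozenset({"churn"}),
    frozenset({"region"}),
]
LAYERS = {
    "curated": frozenset({"revenue", "region", "churn"}),
    "auto_derived": frozenset({"revenue", "region"}),  # misses a business-only term
    "null": frozenset(),
}

for condition, accuracy in run_ablation(TASKS, LAYERS, toy_solve).items():
    print(f"{condition}: {accuracy:.2f}")
```

In a real run, `solve` would invoke the full agent and compare execution results against gold answers; the harness shape is what matters, since it makes the layer the only independent variable.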

minor comments (2)

[Abstract] No details are supplied on the evaluation protocol, error analysis, baseline re-implementations, or statistical significance testing, which limits assessment of whether the data support the ranking and accuracy claims.
The manuscript mentions integration into an end-to-end evaluation framework but provides no description of how the 547 tasks were executed or how execution accuracy was computed across the three supported dialects.
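
For readers unfamiliar with the metric, execution accuracy is commonly computed by running the predicted and gold queries and comparing their results. The sketch below shows one simple variant using an in-memory SQLite database. It is an assumption-laden illustration: the order-insensitive multiset comparison, the `same_result` and `execution_accuracy` helpers, and the toy `orders` table are hypothetical, and the official Spider 2.0 evaluation scripts may apply different matching rules.

```python
import sqlite3
from collections import Counter
from typing import Iterable, Tuple


def same_result(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same rows, ignoring row order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as incorrect
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)


def execution_accuracy(conn: sqlite3.Connection,
                       pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose results match."""
    outcomes = [same_result(conn, predicted, gold) for predicted, gold in pairs]
    return sum(outcomes) / len(outcomes)


# Toy database and query pairs, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EU", 10.0), ("EU", 20.0), ("US", 5.0)])

PAIRS = [
    # Same rows in a different order: counted as correct.
    ("SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region DESC",
     "SELECT region, SUM(amount) FROM orders GROUP BY region"),
    # Executes, but returns a different count: incorrect.
    ("SELECT COUNT(*) FROM orders",
     "SELECT COUNT(*) FROM orders WHERE amount > 10"),
    # References a missing column: fails to execute, incorrect.
    ("SELECT nope FROM orders",
     "SELECT COUNT(*) FROM orders"),
]

print(f"execution accuracy: {execution_accuracy(conn, PAIRS):.3f}")
```

Under a definition of this kind, 94.15% on 547 tasks is consistent with 515 correct tasks (515/547 ≈ 0.9415), although the raw count is not stated on this page.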

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and commit to revisions that directly respond to the concern about isolating the contribution of the human-curated semantic layer.

Referee: [Abstract] The 94.15% execution accuracy and the gap versus schema-only approaches are obtained by generating SMQ over a human-curated semantic layer. No ablation is reported that measures accuracy when the layer is replaced by an automatically derived or null layer, even though the abstract itself flags the "trade-off between improved grounding and overfitting." This measurement is required to attribute the reported gains to the SMQ compiler and constrained loop rather than benchmark-specific curation.

Authors: We agree that the current evaluation does not include an ablation that replaces the human-curated semantic layer with either an automatically derived layer or a null layer, and that such a measurement would strengthen attribution of gains specifically to the SMQ compiler and constrained think-act loop. The schema-only baselines already remove the semantic layer and show a substantial gap, but they do not test an automatically derived alternative. We will add this ablation in the revised manuscript: we will automatically derive a semantic layer from the raw schema (following standard schema-to-semantic mapping heuristics) and re-run the 547-task Spider2-snow evaluation with Gemini 3 Pro under otherwise identical conditions. The results and updated discussion of the grounding-overfitting trade-off will be included in Section 5 and the abstract.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claim is externally benchmarked.

The paper's central result is an empirical execution accuracy of 94.15% on the independent Spider2-snow benchmark (547 tasks). This is measured output against an external test set rather than a quantity derived by construction from fitted parameters, self-citations, or redefinitions internal to the system. The semantic layer is described as human-curated input, but the reported metric does not reduce to that curation by algebraic identity or statistical forcing; the abstract explicitly flags the grounding-vs-overfitting trade-off without claiming the accuracy is forced by the layer definition itself. No equations, uniqueness theorems, or self-citation chains appear in the provided text that would collapse the result to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central performance claim depends on the existence and quality of a curated semantic layer and the correctness of the deterministic SMQ-to-SQL compiler, neither of which receives independent verification outside the reported benchmark result.

axioms (1)

Domain assumption: a high-quality curated semantic layer exists that accurately represents enterprise data semantics without bias or incompleteness.
The agent reasons exclusively over this layer; any mismatch would invalidate the SMQ construction and downstream compilation.

invented entities (2)

Semantic Model Query (SMQ): no independent evidence
purpose: Compact intermediate representation that decouples semantic intent from physical SQL execution.
Newly introduced construct whose correctness is asserted but not independently evidenced.
Semantic layer: no independent evidence
purpose: Curated abstraction that grounds natural language intent over heterogeneous physical schemas.
Core dependency whose quality is discussed but not externally validated.

pith-pipeline@v0.9.1-grok · 5746 in / 1216 out tokens · 33800 ms · 2026-07-01T06:07:18.407366+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 3 canonical work pages · 1 internal anchor

[1] F. Lei, J. Chen, Y. Ye, R. Cao, et al., “Spider 2.0: Evaluating language models on real-world enterprise text-to-SQL workflows,” in Proc. Int. Conf. Learning Representations (ICLR), 2025.

[2] T. Yu, R. Zhang, K. Yang, et al., “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task,” in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 3911–3921.

[3] J. Li, B. Hui, G. Qu, et al., “Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs (BIRD),” in Advances in Neural Information Processing Systems (NeurIPS), 2023.

[4] M. Pourreza and D. Rafiei, “DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.

[5] D. Gao, H. Wang, Y. Li, et al., “Text-to-SQL empowered by large language models: A benchmark evaluation,” Proc. VLDB Endowment, vol. 17, no. 5, pp. 1132–1145, 2024.

[6] X. Dong, C. Zhang, Y. Ge, et al., “C3: Zero-shot text-to-SQL with ChatGPT,” arXiv:2307.07306, 2023.

[7] B. Wang, C. Ren, J. Yang, et al., “MAC-SQL: A multi-agent collaborative framework for text-to-SQL,” in Proc. Int. Conf. Computational Linguistics (COLING), 2025.

[8] S. Talaei, M. Pourreza, Y.-C. Chang, et al., “CHESS: Contextual harnessing for efficient SQL synthesis,” arXiv:2405.16755, 2024.

[9] S. Yao, J. Zhao, D. Yu, et al., “ReAct: Synergizing reasoning and acting in language models,” in Proc. Int. Conf. Learning Representations (ICLR), 2023.

[10] dbt Labs, “dbt Semantic Layer and MetricFlow,” Technical documentation, 2024. [Online]. Available: https://docs.getdbt.com/docs/build/about-metricflow

[11] M. Deng, A. Ramachandran, C. Xu, et al., “ReFoRCE: A text-to-SQL agent with self-refinement, consensus enforcement, and column exploration,” arXiv:2502.00675, 2025.

[12] XLang Lab, “Spider 2.0 leaderboard,” 2026. [Online]. Available: https://spider2-sql.github.io/
