pith. machine review for the scientific record.

arxiv: 2604.27261 · v1 · submitted 2026-04-29 · 💻 cs.DB


SynSQL: Synthesizing Relational Databases for Robust Evaluation of Text-to-SQL Systems


Pith reviewed 2026-05-07 10:17 UTC · model grok-4.3

classification 💻 cs.DB
keywords text-to-SQL · database synthesis · large language models · evaluation robustness · synthetic data · relational databases · query execution · structured generation

The pith

Large language models can synthesize alternative databases that expose errors masked by static text-to-SQL benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating text-to-SQL systems usually depends on a single static database, yet the same SQL query can succeed or fail depending on the actual data instance. The work investigates whether large language models can synthesize new databases that fit a given question and schema while remaining internally consistent. If so, these databases allow testing models under varied data conditions rather than against one fixed instance. Testing ten models on three benchmarks shows performance drops of 3 to 14 percent on the synthesized databases, indicating that many apparent successes in current evaluations may be tied to specific benchmark data rather than general understanding.
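
To make the protocol concrete, here is a minimal sketch (Python with SQLite) of how execution accuracy and the reported drop could be measured. The helper names, file paths, and the order-insensitive multiset comparison of result sets are illustrative assumptions, not the paper's code.

```python
import sqlite3
from collections import Counter

def run_query(db_path: str, sql: str):
    """Execute a query; return its result as an order-insensitive multiset, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # execution failure counts as incorrect
    return Counter(rows)

def execution_accuracy(pairs, db_path):
    """Fraction of (predicted, gold) SQL pairs whose results match on one database instance."""
    hits = 0
    for predicted_sql, gold_sql in pairs:
        pred = run_query(db_path, predicted_sql)
        gold = run_query(db_path, gold_sql)
        hits += int(pred is not None and pred == gold)
    return hits / len(pairs)

# Hypothetical usage: the drop is the gap between static and synthesized instances.
# static_acc = execution_accuracy(pairs, "spider_static.sqlite")
# synth_acc  = execution_accuracy(pairs, "synsql_generated.sqlite")
# drop = static_acc - synth_acc  # the paper reports 3-14% across models and benchmarks
```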

Core claim

SynSQL is a framework that synthesizes relational databases by decomposing the task into schema selection, question-guided data synthesis, and constraint-aware critique with iterative refinement. Evaluated on the generated databases, text-to-SQL models lose 3-14% accuracy relative to the original static instances, revealing errors that standard benchmarks conceal.

What carries the argument

The SynSQL framework that conditions database synthesis on question-schema alignment through three stages of selection, synthesis, and critique.
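
A minimal sketch of what such a three-stage loop could look like, assuming hypothetical llm_select_schema, llm_synthesize_rows, and llm_critique helpers that wrap model calls. The stage names follow the paper; the control flow, score threshold, and round limit are illustrative assumptions.

```python
from typing import Callable

def synthesize_database(question: str, schema: dict,
                        llm_select_schema: Callable,
                        llm_synthesize_rows: Callable,
                        llm_critique: Callable,
                        max_rounds: int = 3, threshold: int = 8):
    """Three-stage loop: schema selection -> question-guided synthesis -> critique."""
    # Stage 1: keep only the tables and columns relevant to the question.
    sub_schema = llm_select_schema(question, schema)

    data, feedback = None, None
    for _ in range(max_rounds):
        # Stage 2: generate rows conditioned on the question and any prior feedback.
        data = llm_synthesize_rows(question, sub_schema, feedback)

        # Stage 3: score against criteria such as key integrity, schema coverage,
        # complexity, variety, and relevance (scored 1-10 in the paper's figures).
        scores, feedback = llm_critique(question, sub_schema, data)
        if min(scores.values()) >= threshold:
            break  # all criteria acceptable; stop refining
    return sub_schema, data
```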

If this is right

  • Text-to-SQL performance is more variable across database instances than static tests reveal.
  • Current benchmarks may overestimate the reliability of these systems due to data-specific artifacts.
  • LLM-based synthesis provides a method for creating controlled variations for robustness testing.
  • Analysis of generation quality can identify where LLMs struggle with relational constraints.
  • Structured data synthesis offers a way to probe LLM reasoning in constrained environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar synthesis approaches could test robustness in other areas like code generation or data querying.
  • Instance variability might be a general issue in many natural language to structured output tasks.
  • Refining the critique stage could make the method even more effective for uncovering model weaknesses.
  • The work implies that evaluation protocols should move toward multi-instance testing for better reliability measures.

Load-bearing premise

The synthesized databases are sufficiently semantically meaningful and schema-consistent that performance differences reflect actual model limitations rather than issues created during synthesis.

What would settle it

Demonstrating that the generated databases frequently fail basic consistency checks or that model accuracies remain unchanged when switching to them would undermine the claim.
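
Such checks are cheap to run on a SQLite instance. A hedged sketch of the kind of sanity suite meant here: referential integrity via PRAGMA foreign_key_check, plus verifying that each gold query executes and returns a non-empty result. This is our illustration of "basic consistency checks", not the paper's validation code.

```python
import sqlite3

def sanity_check(db_path: str, gold_queries: list[str]) -> dict:
    """Basic consistency checks on a synthesized SQLite database."""
    report = {"fk_violations": 0, "gold_exec_ok": 0, "gold_nonempty": 0}
    with sqlite3.connect(db_path) as conn:
        # Referential integrity: each row returned here is a foreign-key violation.
        report["fk_violations"] = len(conn.execute("PRAGMA foreign_key_check").fetchall())
        for sql in gold_queries:
            try:
                rows = conn.execute(sql).fetchall()
            except sqlite3.Error:
                continue  # gold query fails outright on this instance
            report["gold_exec_ok"] += 1
            report["gold_nonempty"] += int(len(rows) > 0)
    return report
```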

Figures

Figures reproduced from arXiv: 2604.27261 by Davood Rafiei, Mohammadamin Habibollah.

Figure 1: Overview of the SynSQL framework. The schema selector identifies relevant …
Figure 2: (a) Breakdown of failure cases (84 failures out of 500 BIRD dev questions). Schema selector failures: omitted tables/columns. Semantic failures: alignment mismatches and NL ambiguity. (b) Impact of the critic component on each of the five data quality criteria in SynSQL, using Gemini-2.5-Flash on Spider and BIRD dev sets. Spider results exclude Hint Alignment as evidence/hint entries are not present in Sp…
Figure 3: Impact of the critic component on success …
Figure 4: Relational validity and data completeness of SynSQL and Vanilla-generated …
Figure 5: Success rate and compound execution accuracy of SynSQL vs. GPT-4.1-Mini on …
Figure 6: An example of schema selection failure. The synthetic data omits the setCode column from set translations, leading to a failed query. SynSQL has generated data for the setCode column in cards, but omitted the setCode column from set translations during schema selection. The gold query joins both tables on setCode, leading to failure. However, the synthetic data still contains valid setCode values, just not…
Figure 7: Example of misinterpretation: the synthetic data contains values such as owner (lowercase) in the type column, while the gold query expects OWNER (uppercase). This case-sensitivity mismatch leads to a failed query. Here, the synthetic database reflects the casing found in the question or evidence, but the gold query expects a different case. Such mismatches between generated data and gold query expectation…
Figure 8: An example of misinterpretation due to synthetic data not matching gold query conditions. The synthetic data contains values that do not satisfy the gold query's WHERE clause, leading to failure. The gold query expects district.A3 = 'Prague', but the synthetic data contains values such as Prague 1, Prague 2, and Prague 3. Here, the LLM generated region names with appended numbers, resulting in a mismatch w…
Figure 9: An example of misinterpretation due to inconsistencies between question/evidence and gold query in the BIRD dev set. The synthetic data aligns with the question, but not the gold query, leading to failure. In this case, the question and evidence refer to cryokinesis (lowercase), while the gold query expects 'Cryokinesis' (capitalized). The synthetic database contains power name = 'cryokinesis', resulting i…
Figure 10: Impact of the critic component on compound execution accuracy …
Figure 11: Example of critic feedback highlighting deficiencies in data complexity and variety, prompting regeneration of synthetic data to better align with question intent and evaluation robustness. The critic also frequently identifies key integrity violations, such as non-unique primary keys or referential integrity breaches.
Figure 12: Example of critic feedback highlighting issues in foreign key integrity, leading to regeneration that enforces schema integrity. More examples of critic feedback are shown in Figures 13 and 14, demonstrating the critic's consistent role in identifying and rectifying data quality problems.
Figure 13: An example of critic feedback. Question: Which country is the oldest driver from? Evidence: date of birth refers to drivers.dob; the larger the birthday value, the younger the person is, and vice versa. Feedback: Increase the variety and range of birth dates to better highlight the oldest driver and include edge cases such as multiple drivers born on the same day or very close dates. Add explicit foreign ke…
Figure 14: An example of critic feedback. This systematic feedback mechanism ensures that subsequent iterations produce more robust test databases that can effectively distinguish between semantically correct and incorrect SQL queries. Overall, the critic's feedback focuses on: (1) key integrity and schema coverage to ensure structural validity, (2) presence of edge cases and boundary values, (3) diversity in categor…
Figure 15: An example from the formula 1 database (question 1000). Moreover, SynSQL ensures that values within each row are meaningfully related and contextually accurate. For example, if a row in the races table has the year set to 2024, all corresponding data in that row (such as race name or date) is consistent with that year. Similarly, in the circuits table, if the location is Monza, the country is set to Italy…
Figure 16: Generated synthetic table races for question 1000 from the formula 1 database. The synthetic data contains realistic values that align with the question intent.
Figure 17: Generated synthetic table circuits for question 1000 from the formula 1 database. The synthetic data contains realistic values that align with the question intent.
Figure 18: An example of inconsistencies between gold query and database contents in the Spider dev set. SynSQL aligns with the question, leading to recovery of such inconsistencies. Question: Which city and country is the Alton airport at? Evidence: N/A. Gold Query: SELECT City, Country FROM AIRPORTS WHERE AirportName = "Alton"
Figure 19: An example of inconsistencies between gold query and database contents in the Spider dev set. SynSQL aligns with the question, leading to recovery of such inconsistencies.
Figure 20: The prompt template used for column selection in the schema selector component of SynSQL.
Figure 21: The prompt template used for column expansion in the schema selector component of SynSQL.
Figure 22: The prompt template used for the data synthesis component of SynSQL.
Figure 23: The prompt template used for the data critic component of SynSQL.
Original abstract

Evaluating text-to-SQL systems remains largely fragile: correctness is typically judged by executing predicted and gold SQL queries on a single static database, even though the same queries may behave differently under alternative database instances. This raises a broader language modeling question: Can large language models synthesize semantically meaningful, schema-consistent relational data directly from a natural language question? If so, such generation can serve as a controlled mechanism for stress-testing text-to-SQL systems beyond fixed benchmark databases. We introduce SynSQL, a framework that synthesizes test databases conditioned on question-schema alignment rather than gold SQL queries. SynSQL decomposes the task into three stages: (1) schema selection, (2) question-guided data synthesis, and (3) constraint-aware critique with iterative refinement, framing database construction as structured generation under semantic and relational constraints. Across ten text-to-SQL models on Spider, BIRD, and Spider 2.0, SynSQL-generated databases reveal performance drops of 3-14% compared to static evaluation, exposing errors masked by benchmark artifacts. We further analyze generation quality, constraint adherence, and failure modes, highlighting both the promise and limitations of LLMs in structured data synthesis. Our findings position synthetic database generation as a new lens for studying LLM reasoning, controllability, and robustness in structured environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SynSQL, an LLM-based three-stage framework (schema selection, question-guided data synthesis, and constraint-aware critique with refinement) for generating alternative relational databases conditioned on natural language questions and schemas. It evaluates ten text-to-SQL models on Spider, BIRD, and Spider 2.0, claiming that these synthetic databases expose performance drops of 3-14% relative to static benchmark evaluation, thereby revealing model errors masked by fixed database artifacts. The work also reports analyses of generation quality, constraint adherence, and failure modes.

Significance. If the central assumption holds—that the synthesized databases are semantically faithful alternatives where gold queries execute correctly and preserve question intent—then SynSQL offers a practical method for stress-testing text-to-SQL robustness and studying LLM controllability in structured generation. This could shift evaluation practices away from single static databases toward distribution-shift testing, with implications for both database systems and LLM reasoning research.

major comments (2)
  1. [Abstract] The headline claim of 3-14% performance drops exposing masked errors is load-bearing on the assumption that SynSQL databases are valid test instances. However, the abstract and high-level pipeline description provide no quantitative validation (e.g., execution success rates or result consistency metrics for gold SQL queries on the new instances) to rule out synthesis artifacts as the source of drops rather than genuine model fragility.
  2. [Section 3] Section 3 (Methodology): The constraint-aware critique stage is described only at a high level with no details on how implicit constraints (foreign-key integrity, value distributions implied by the NL question) are enforced or measured at scale, nor any reported success rates for the iterative refinement process. This directly affects whether observed drops can be attributed to distribution shift.
minor comments (1)
  1. [Abstract] The abstract states that generation quality and constraint adherence are analyzed, but does not name the specific metrics or thresholds used; adding these would improve clarity without altering the core contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on validation and methodological clarity. We address each point below and commit to revisions that strengthen the presentation without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of 3-14% performance drops exposing masked errors is load-bearing on the assumption that SynSQL databases are valid test instances. However, the abstract and high-level pipeline description provide no quantitative validation (e.g., execution success rates or result consistency metrics for gold SQL queries on the new instances) to rule out synthesis artifacts as the source of drops rather than genuine model fragility.

    Authors: We agree that the abstract would benefit from explicit reference to the quantitative validations already performed in the paper. The manuscript reports analyses of generation quality, constraint adherence, and failure modes, including execution success of gold queries and consistency checks across instances. We will revise the abstract to include these metrics (e.g., gold query execution rates and result consistency) to directly address the concern that drops may stem from synthesis artifacts. revision: yes

  2. Referee: [Section 3] Section 3 (Methodology): The constraint-aware critique stage is described only at a high level with no details on how implicit constraints (foreign-key integrity, value distributions implied by the NL question) are enforced or measured at scale, nor any reported success rates for the iterative refinement process. This directly affects whether observed drops can be attributed to distribution shift.

    Authors: We acknowledge that Section 3 presents the critique stage at a high level. We will expand this section with concrete details on enforcement mechanisms for implicit constraints such as foreign-key integrity and question-implied value distributions, along with the specific verification steps used at scale. We will also add the reported success rates of the iterative refinement process from our experiments to demonstrate that the synthesized databases maintain semantic fidelity. revision: yes
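
One plausible enforcement mechanism for the foreign-key integrity discussed above (our assumption, not the authors' stated implementation) is to load synthesized rows with SQLite's foreign-key enforcement switched on, so violating inserts fail immediately and can be routed back to the critique stage as feedback.

```python
import sqlite3

def load_with_fk_enforcement(db_path: str, insert_statements: list[str]) -> list[str]:
    """Apply inserts with foreign keys enforced; return the statements that violated them."""
    rejected = []
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
    for stmt in insert_statements:
        try:
            conn.execute(stmt)
        except sqlite3.IntegrityError:
            rejected.append(stmt)  # candidate feedback for the critique stage
    conn.commit()
    conn.close()
    return rejected
```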

Circularity Check

0 steps flagged

No circularity; empirical synthesis framework evaluated independently on benchmarks

Full rationale

The paper introduces SynSQL as an independent three-stage LLM-based pipeline (schema selection, question-guided synthesis, constraint-aware critique) to generate alternative databases from NL questions and schemas. Reported results consist of direct empirical measurements: performance drops of 3-14% across ten models on Spider/BIRD/Spider 2.0 when swapping to the synthesized instances. No equations, fitted parameters, or predictions are derived by construction from the evaluation outcomes themselves. No self-citations are used to justify uniqueness theorems or ansatzes, and the synthesis process is not defined in terms of the observed drops. The central claim rests on the (separately analyzed) quality of the generated databases rather than reducing to a tautology or self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unproven capability of LLMs to perform reliable structured data synthesis under semantic and relational constraints; no free parameters, explicit axioms, or new invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1111 out tokens · 66407 ms · 2026-05-07T10:17:44.542123+00:00 · methodology

