ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL

Chao Hu; Chen Hou; Danqing Huang; Dazhen Deng; Defeng Xie; Haoxuan Li; Haozhe Feng; Huawei Zheng; Minfeng Zhu; Peng Chen

arXiv: 2606.05836 · v1 · pith:56D2OTZ5 · submitted 2026-06-04 · 💻 cs.CL


Zhaorui Yang, Huawei Zheng, Sen Yang, Yuhui Zhang, Haoxuan Li, Zhizhen Yu, Xuan Yi, Chen Hou, Defeng Xie, Chao Hu, Minfeng Zhu, Dazhen Deng, Haozhe Feng, Danqing Huang, Yingcai Wu, Peng Chen, Wei Chen


Pith reviewed 2026-06-28 01:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords Text-to-SQL, Agentic Framework, SQL-Python Integration, Schema Pruning, Data Profiling, Enterprise Databases, Large Language Models


The pith

ProSPy structures Text-to-SQL reasoning into four stages that combine automatic profiling, schema pruning, dialect-agnostic SQL, and Python analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ProSPy to address the limits of single-query Text-to-SQL on enterprise databases that feature large heterogeneous schemas, incomplete metadata, and complex questions. It organizes the workflow into automatic profiling to extract fine-grained data evidence, progressive pruning to focus on task-relevant schema parts, a dialect-agnostic SQL layer to fetch intermediate views, and Python-based analysis for flexible computation. This combination aims to leverage SQL efficiency on large data while adding Python flexibility and lowering dependence on unreliable metadata. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show consistent gains over baselines, reaching 60.15 percent and 60.51 percent execution accuracy with Claude-4.5-Opus without majority voting, plus improved robustness across SQL dialects.

Core claim

ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects.
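The first stage, automatic profiling, is described here only at a high level. The paper's pruning prompts say field profiles carry data type, value range, and examples, so a minimal sketch might compute those per column. It assumes a SQLite connection; `profile_table` and its output keys are hypothetical, not ProSPy's actual profiler.

```python
import sqlite3


def profile_table(conn: sqlite3.Connection, table: str, n_examples: int = 3) -> list:
    """Hypothetical profiler: one dict per column with declared type,
    null count, distinct count, min/max, and a few example values."""
    profiles = []
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    for _, name, decl_type, *_ in conn.execute(f'PRAGMA table_info("{table}")').fetchall():
        col = f'"{name}"'  # identifiers come from the catalog; embedded quotes are not escaped here
        nulls, distinct, lo, hi = conn.execute(
            f'SELECT SUM({col} IS NULL), COUNT(DISTINCT {col}), MIN({col}), MAX({col}) '
            f'FROM "{table}"'
        ).fetchone()
        # A few non-null sample values act as lightweight "data evidence" for the LLM.
        examples = [row[0] for row in conn.execute(
            f'SELECT DISTINCT {col} FROM "{table}" WHERE {col} IS NOT NULL LIMIT ?',
            (n_examples,),
        )]
        profiles.append({"column": name, "type": decl_type, "nulls": nulls,
                         "distinct": distinct, "min": lo, "max": hi,
                         "examples": examples})
    return profiles
```

Profiles like these can be serialized into the pruning prompt so the model judges relevance from actual values rather than from possibly missing column descriptions.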

What carries the argument

The four-stage profiling-driven SQL-Python agentic framework that extracts evidence, prunes schemas, generates views via SQL, and completes analysis in Python.

If this is right

Outperforms strong baselines on Spider 2.0-Lite and Spider 2.0-Snow with both open-source and proprietary models.
Achieves execution accuracies of 60.15 percent and 60.51 percent with Claude-4.5-Opus without majority voting.
Remains robust to SQL dialect variations.
Delivers a favorable trade-off between schema recall and precision.
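This summary does not define the schema recall and precision metrics. Assuming they are set-based over `table.column` identifiers, they could be computed as below; the `missing` list exposes exactly the failure mode described under "What would settle it". The function name and identifier format are illustrative assumptions.

```python
def schema_recall_precision(retained: set, required: set):
    """Set-based schema metrics over "table.column" identifiers (assumed definition).

    recall    = |retained & required| / |required|  -> did pruning keep everything needed?
    precision = |retained & required| / |retained|  -> how much irrelevant schema survived?
    Also returns the required elements that pruning dropped.
    """
    hits = retained & required
    recall = len(hits) / len(required) if required else 1.0
    precision = len(hits) / len(retained) if retained else 0.0
    missing = sorted(required - retained)
    return recall, precision, missing
```

The conservative pruning prompts ("false retention is more acceptable than false exclusion") deliberately favor recall over precision in this trade-off.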

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same profiling-plus-pruning pattern could be tested on non-relational or streaming data sources where schema information is even less stable.
The Python analysis stage opens the possibility of embedding statistical or machine-learning steps directly after SQL retrieval without separate pipelines.
Lower reliance on metadata documentation could make the approach useful for legacy enterprise systems that lack up-to-date schema descriptions.

Load-bearing premise

Automatic profiling reliably extracts the needed fine-grained data evidence and progressive pruning keeps all task-relevant information without critical omissions in heterogeneous enterprise schemas.

What would settle it

A test case on a heterogeneous enterprise database where the profiling step misses key data patterns or pruning drops a required column, producing an incorrect final result that a single correct SQL query would have avoided.

Figures

Figures reproduced from arXiv: 2606.05836 by Chao Hu, Chen Hou, Danqing Huang, Dazhen Deng, Defeng Xie, Haoxuan Li, Haozhe Feng, Huawei Zheng, Minfeng Zhu, Peng Chen, Sen Yang, Wei Chen, Xuan Yi, Yingcai Wu, Yuhui Zhang, Zhaorui Yang, Zhizhen Yu.

**Figure 1.** Overview of ProSPy, which combines a dialect-agnostic DSL for SQL-based retrieval with Python-based analysis to support scalable retrieval, shield SQL dialect differences, and perform complex data analysis for enterprise-scale Text-to-SQL.

**Figure 2.** Illustration of the ProSPy framework. ProSPy consists of four stages: (A) data profiling, (B) progressive …

**Figure 3.** Time distribution across different stages.

**Figure 4.** Error distribution across different stages.

**Figure 5.** Prompt used for the first-pass table pruning stage (Part 1/2).

**Figure 5.** Prompt used for the first-pass table pruning stage (Part 2/2).

**Figure 6.** Prompt used for the field pruning stage (Part 1/2).

**Figure 6.** Prompt used for the field pruning stage (Part 2/2).

**Figure 7.** Prompt used for data fetching (Part 1/6).

**Figure 8.** Prompt used for data fetching (Part 2/6).

**Figure 9.** Prompt used for data fetching (Part 3/6).

**Figure 10.** Prompt used for data fetching (Part 4/6).

**Figure 11.** Prompt used for data fetching (Part 5/6).

**Figure 12.** Prompt used for data fetching (Part 6/6).

Original abstract

Large language models have substantially advanced Text-to-SQL systems, yet applying them to enterprise-scale databases remains challenging. Real-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query. To address these challenges, we propose ProSPy, a Profiling-driven SQL-Python agentic framework for enterprise-scale Text-to-SQL. ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show that ProSPy consistently outperforms strong baselines with both open-source and proprietary models, achieving execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus, without majority voting. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade-off between schema recall and precision.
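Figure 1 mentions a dialect-agnostic DSL for SQL retrieval, but its syntax is not given here. As a hypothetical illustration only, a neutral `ViewSpec` could be rendered per dialect, with the differences confined to a renderer (here just identifier quoting). `ViewSpec`, `render`, and the supported dialect names are assumptions, not ProSPy's interface. Note that Snowflake treats double-quoted identifiers as case-sensitive, so names must match the catalog's stored case.

```python
from dataclasses import dataclass, field
from typing import Optional

_OPS = {"=", "!=", "<", "<=", ">", ">="}  # comparison operators this sketch allows


@dataclass
class ViewSpec:
    """Hypothetical dialect-neutral request for an intermediate view."""
    table: str                                  # dotted path, e.g. "db.schema.table"
    columns: list
    where: list = field(default_factory=list)   # (column, op, value) triples, ANDed
    limit: Optional[int] = None


def literal(value) -> str:
    # Only numbers and quote-free strings, so no dialect-specific escaping is needed.
    if isinstance(value, bool) or value is None:
        raise TypeError("unsupported literal")
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str) and "'" not in value and "\\" not in value:
        return f"'{value}'"
    raise TypeError("unsupported literal")


def quote_table(path: str, dialect: str) -> str:
    if dialect == "bigquery":
        return f"`{path}`"  # one backtick-quoted project.dataset.table path
    if dialect in ("snowflake", "sqlite"):
        return ".".join(f'"{part}"' for part in path.split("."))  # quote each part
    raise ValueError(f"unsupported dialect: {dialect}")


def render(spec: ViewSpec, dialect: str) -> str:
    table = quote_table(spec.table, dialect)  # also rejects unknown dialects
    quote = (lambda c: f"`{c}`") if dialect == "bigquery" else (lambda c: f'"{c}"')
    sql = f"SELECT {', '.join(quote(c) for c in spec.columns)} FROM {table}"
    conds = []
    for col, op, value in spec.where:
        if op not in _OPS:
            raise ValueError(f"unsupported operator: {op}")
        conds.append(f"{quote(col)} {op} {literal(value)}")
    if conds:
        sql += " WHERE " + " AND ".join(conds)
    if spec.limit is not None:
        sql += f" LIMIT {int(spec.limit)}"
    return sql
```

Because the model only emits the neutral spec, dialect quirks stay in one deterministic function instead of being re-learned in every generated query.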

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

ProSPy gives a workable four-stage pipeline for enterprise Text-to-SQL that mixes profiling, pruning, dialect-agnostic SQL views, and Python analysis, with consistent benchmark gains but no major theoretical shift.


ProSPy structures Text-to-SQL into four stages: automatic profiling for data evidence, progressive schema pruning, dialect-agnostic SQL views for intermediates, and Python for downstream analysis. This setup targets large schemas, weak metadata, and dialect issues, reaching about 60% execution accuracy on Spider 2.0-Lite and Spider 2.0-Snow with Claude-4.5-Opus and beating baselines without majority voting.

The design maps clearly onto the stated enterprise problems. The hybrid SQL-Python split makes sense for queries that pure SQL struggles with, and the dialect-agnostic interface plus robustness checks add practical value. The paper reports direct baseline comparisons, cross-dialect results, and recall-precision trade-offs, which support the claims without obvious internal contradictions.
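To make the SQL-Python split concrete, here is a minimal sketch assuming SQLite as the backend: SQL does the filtering and projection over the large table, and Python computes a per-group median, an aggregate that core SQLite does not provide. `fetch_view` and `median_by_group` are hypothetical helpers, not part of ProSPy.

```python
import sqlite3
import statistics


def fetch_view(conn: sqlite3.Connection, sql: str, params=()) -> list:
    """SQL side: push filtering/projection to the database, return a small view as dicts."""
    cur = conn.execute(sql, params)
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


def median_by_group(rows: list, key: str, value: str) -> dict:
    """Python side: an analysis step that is awkward or unavailable in plain SQLite SQL."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[value])
    return {k: statistics.median(v) for k, v in groups.items()}
```

Only the intermediate view crosses into Python, so the database still handles the heavy scan while Python supplies the flexible computation.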

Soft spots stay minor. Automatic profiling and pruning could still drop key details on messier real schemas, though the ablations and dialect tests address this directly and keep it from becoming a load-bearing flaw. Results are tied to the Spider 2.0 variants, so generalization beyond those sets remains open. Top numbers use a proprietary model, but open-source results are also shown.

This paper suits readers working on applied Text-to-SQL systems or agentic workflows for production databases. Anyone needing concrete ideas on schema handling or hybrid execution would find the stage breakdown and metrics useful.

The work shows coherent engagement with the literature and its own evidence, so it deserves a serious referee.

Referee Report

1 major / 2 minor

Summary. The paper proposes ProSPy, a profiling-driven SQL-Python agentic framework for enterprise Text-to-SQL. It structures reasoning into four stages—automatic profiling to extract fine-grained data evidence, progressive schema pruning into task-relevant contexts, dialect-agnostic SQL for intermediate views, and Python for downstream analysis—targeting challenges of large heterogeneous schemas, incomplete metadata, and dialect variation. Experiments on Spider 2.0-Lite and Spider 2.0-Snow report execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus (no majority voting), outperforming baselines, with additional analysis on dialect robustness and schema recall/precision trade-offs.

Significance. If the empirical results hold, the work offers a practical pipeline that combines SQL efficiency on large data with Python flexibility for complex analysis, while explicitly reducing metadata dependence. Credit is due for the ablation-style analysis and cross-dialect results that directly test the pipeline components, as well as the direct baseline comparisons on Spider 2.0 variants.

major comments (1)

[§4] §4 (Experiments) and associated tables: the reported execution accuracies rest on the assumption that automatic profiling and progressive pruning preserve task-relevant information without critical omissions; while ablations and cross-dialect results are supplied, a quantitative breakdown of omission-induced failures on heterogeneous schemas would strengthen the load-bearing claim.

minor comments (2)

[Abstract] Abstract: the four-stage description is clear but could explicitly note the Spider 2.0 variants used for each accuracy figure.
[§3] Notation for schema pruning thresholds or profiling granularity is introduced without a dedicated definition table or equation reference.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the single major comment below.


Referee: [§4] §4 (Experiments) and associated tables: the reported execution accuracies rest on the assumption that automatic profiling and progressive pruning preserve task-relevant information without critical omissions; while ablations and cross-dialect results are supplied, a quantitative breakdown of omission-induced failures on heterogeneous schemas would strengthen the load-bearing claim.

Authors: We agree that explicitly quantifying the proportion of failures attributable to information loss during automatic profiling or progressive pruning would provide stronger support for the claim that these stages preserve task-relevant information on heterogeneous schemas. Our existing schema recall/precision analysis and component ablations already demonstrate the overall effectiveness and trade-offs, but they do not isolate omission-induced errors. In the revised manuscript we will add a targeted error analysis subsection that samples failure cases from Spider 2.0-Lite and Spider 2.0-Snow, manually categorizes them by root cause (profiling omission, pruning omission, SQL generation, Python analysis, or other), and reports the percentages. This will directly address the referee's request without altering the reported accuracies. revision: yes
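The promised error analysis amounts to labeling sampled failures and normalizing the counts. A minimal sketch using the five root-cause categories named in the response; the function name is hypothetical, and any labels fed to it in testing are placeholders, not results from the paper.

```python
from collections import Counter

# Root-cause categories as listed in the authors' response.
CATEGORIES = ("profiling omission", "pruning omission", "SQL generation", "Python analysis", "other")


def root_cause_breakdown(labels: list) -> dict:
    """Map manually assigned root-cause labels to percentages (one decimal place)."""
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:  # catch labeling typos instead of silently dropping cases
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    return {c: round(100 * counts[c] / total, 1) if total else 0.0 for c in CATEGORIES}
```

The sum of "profiling omission" and "pruning omission" would give the omission-induced failure share the referee asked for.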

Circularity Check

0 steps flagged

No circularity: empirical framework with reported accuracies


The paper describes a four-stage agentic pipeline (profiling, pruning, dialect-agnostic SQL views, Python analysis) and reports execution accuracies (60.15% and 60.51% on Spider 2.0 variants) as direct experimental outcomes from comparisons to baselines. No equations, fitted parameters presented as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The central claims rest on empirical measurements rather than any derivation chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no information on free parameters, axioms, or invented entities; ledger is empty by necessity.

pith-pipeline@v0.9.1-grok · 5816 in / 1137 out tokens · 17435 ms · 2026-06-28T01:54:48.755299+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references

[1]

Text-to-SQL as dual-state reasoning: Integrating adaptive context and progressive generation. arXiv preprint arXiv:2511.21402.

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. 2025. Spider 2.0: Evaluat...

arXiv 2025
[2]

This is a multi-round iterative process; in each round, you only need to exclude the most obviously irrelevant tables
[3]

The input includes complete field information (profiling); please refer to field information to judge table relevance
[4]

If any field in a table is related to the question, that table must be retained
[5]

In subsequent rounds, you will receive a more streamlined schema and can further exclude less relevant tables
[6]

This stage only excludes tables, not fields (field exclusion is done in the next stage). Please analyze according to the following conservative exclusion strategy. Exclude Obviously Irrelevant Tables: based on question semantics and field information, only exclude tables that meet...
[7]

The table's core business domain clearly does not match the question's topic (e.g., the question is about user behavior, but the table only stores system logs)
[8]

The table's time range is completely incompatible with the question's time requirements (e.g., the question requires 2024 data, but the table only contains data before 2022)

[9]

ALL fields of the table are irrelevant to the question (please carefully check field profiling information)
[10]

The table is NOT an intermediate table or dimension table that other related tables must JOIN through
[11]

The table does NOT contain data related to any entity, concept, or synonym mentioned in the question. Exclusion principle: as long as a table may have a direct, indirect, or potential association with the question, firmly retain it.
[13]

Table names must exactly match those provided in the input, including case and path
[14]

Do not exclude any tables that may be indirectly related through JOINs, subqueries, or complex logic
[15]

If no tables can be safely excluded, return an empty array
[18]

When the question involves multi-entity comparisons, relationship chain queries, or complex aggregations, be extremely conservative and tend to retain more tables
[19]

Only exclude the most obviously irrelevant tables in each round; leave uncertain ones for subsequent rounds
[20]

**The number of excluded_tables must be less than the total number of input tables**; excluding all tables is not allowed.

Prompt for Schema Pruning Fields (Part 1/2): You are a rigorous database query expert, skilled at identifying data fields that are irrelevant to natural language questions.
[21]

This is a multi-round iterative process; in each round, you only need to exclude the most obviously irrelevant fields
[22]

Tables have already been filtered in the first stage; this stage only excludes fields
[23]

In subsequent rounds, you will receive a more streamlined schema and can further exclude less relevant fields
[24]

Please carefully refer to field profiling information (data type, value range, examples, etc.) to judge field relevance. Please analyze according to the following conservative exclusion strategy. Exclude Obviously Irrelevant Fields (Processed by Category): based on question semant...
[25]

Exclusions must be based on clear, indisputable reasons; any uncertainty should result in retention
[26]

Field names must exactly match those provided in the input, including case and path
[27]

If no fields can be safely excluded, return an empty array
[28]

Only return JSON, without any prefix, suffix, or Markdown formatting
[29]

Conservative is the first principle: false retention is more acceptable than false exclusion
[30]

When the question involves multi-entity comparisons, relationship chain queries, or complex aggregations, be extremely conservative and tend to retain more fields
[31]

Only exclude the most obviously irrelevant fields in each round; leave uncertain ones for subsequent rounds.

Prompt for Data Fetching (Part 1/6): You are a data scientist and SQL expert proficient in data analysis. Based on the user's question and table schema information, you can generate mul...
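The prompt fragments above describe a conservative, multi-round exclusion loop with hard output constraints: JSON only, names matching the input exactly, fewer exclusions than input tables, and an empty array when nothing can be safely excluded. A minimal sketch of a driver enforcing those constraints follows. It assumes the model replies with an object of the form `{"excluded_tables": [...]}` (the key appears in the prompt; the exact envelope is an assumption); `ask_model` stands in for any LLM call, and `max_rounds` is a hypothetical cap.

```python
import json
from typing import Callable


def validate_exclusions(raw: str, tables: list) -> list:
    """Parse one round of model output and enforce the prompt's hard constraints."""
    data = json.loads(raw)  # bare JSON expected; Markdown fences raise json.JSONDecodeError
    excluded = data["excluded_tables"]
    unknown = [t for t in excluded if t not in tables]  # names must match exactly, incl. case
    if unknown:
        raise ValueError(f"names not in input: {unknown}")
    if len(set(excluded)) >= len(tables):
        raise ValueError("must exclude fewer tables than were provided")
    return excluded


def prune_tables(tables: list, ask_model: Callable[[list], str], max_rounds: int = 5) -> list:
    """Hypothetical multi-round driver: each round removes only the obvious exclusions."""
    remaining = list(tables)
    for _ in range(max_rounds):
        excluded = set(validate_exclusions(ask_model(remaining), remaining))
        if not excluded:  # empty array: nothing else can be safely excluded
            break
        remaining = [t for t in remaining if t not in excluded]
    return remaining
```

The same loop applies to the field-pruning stage by swapping the table list for field names; rejecting invalid rounds outright, rather than repairing them, keeps the "any uncertainty should result in retention" principle intact.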
