Recognition: unknown
AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views
Pith reviewed 2026-05-10 17:05 UTC · model grok-4.3
The pith
Agent-generated common table expressions decompose complex natural language database questions so that large language models can translate them into executable SQL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AV-SQL decomposes complex Text-to-SQL into a pipeline of specialized LLM agents. Central to AV-SQL is the concept of agentic views: agent-generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas. AV-SQL operates in three stages: (1) a rewriter agent compresses and clarifies the input query; (2) a view generator agent processes schema chunks to produce agentic views; and (3) a planner, generator, and revisor agent collaboratively compose these views into the final SQL query.
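As a rough illustration of this three-stage architecture, the sketch below wires the agents together around a generic `call_llm` text-in/text-out client; the prompts, the NONE convention, the revision loop, and the function signature are assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch of the AV-SQL pipeline described above. `call_llm` is any
# callable that maps a prompt string to a model response string; all prompts
# and control flow here are illustrative assumptions, not the paper's code.

def av_sql(question: str, schema_chunks: list[str], call_llm, max_revisions: int = 3) -> str:
    # Stage 1: rewriter agent compresses and clarifies the input query.
    rewritten = call_llm(f"Rewrite this database question concisely and unambiguously:\n{question}")

    # Stage 2: view generator agent maps each schema chunk to agentic views (CTEs).
    views: list[str] = []
    for chunk in schema_chunks:
        cte = call_llm(
            f"Schema chunk:\n{chunk}\nQuestion:\n{rewritten}\n"
            "Return one SQL CTE capturing any intermediate logic relevant to the "
            "question, or the single word NONE if this chunk is irrelevant."
        )
        if cte.strip().upper() != "NONE":
            views.append(cte)

    # Stage 3: planner, generator, and revisor compose the views into final SQL.
    plan = call_llm(f"Plan a SQL query for:\n{rewritten}\nusing these views:\n{views}")
    sql = call_llm(f"Write executable SQL that follows this plan:\n{plan}\nViews:\n{views}")
    for _ in range(max_revisions):
        feedback = call_llm(f"Check this SQL for syntax or schema-linking errors:\n{sql}")
        if feedback.strip().upper().startswith("OK"):
            break
        sql = call_llm(f"Revise the SQL using this feedback:\n{feedback}\nSQL:\n{sql}")
    return sql
```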
What carries the argument
Agentic views: agent-generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas.
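To make the mechanism concrete, the snippet below shows how a handful of agent-generated CTEs could be prefixed to a final query via a WITH clause; the table names, view bodies, and final SELECT are invented for illustration and do not come from the paper.

```python
# Hypothetical agentic views keyed by name; bodies are ordinary SELECT statements.
agentic_views = {
    "active_customers": (
        "SELECT customer_id, region FROM customers WHERE status = 'active'"
    ),
    "order_totals": (
        "SELECT customer_id, SUM(amount) AS total_spent FROM orders GROUP BY customer_id"
    ),
}

# Final query written against the views instead of the raw schema.
final_select = (
    "SELECT ac.region, AVG(ot.total_spent) AS avg_spend "
    "FROM active_customers ac JOIN order_totals ot USING (customer_id) "
    "GROUP BY ac.region"
)

# Compose: WITH name1 AS (...), name2 AS (...) <final SELECT>
with_clause = ", ".join(f"{name} AS ({body})" for name, body in agentic_views.items())
final_sql = f"WITH {with_clause} {final_select}"
print(final_sql)
```

Because the final SELECT mentions only the view names, the composing agents never need the full schema in context, which is the claimed benefit of filtering schema elements through the views.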
Load-bearing premise
That the agent-generated CTE views reliably encapsulate intermediate logic and correctly filter relevant schema elements from large schemas without introducing semantic errors or losing critical information needed for the final query.
What would settle it
A test set of queries that require a join or filter across tables the view generator might skip: if the final SQL runs without error but returns incorrect rows because a required table or condition was omitted from the views, the central claim is falsified.
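One way to operationalize that test, as a sketch: run the predicted and gold SQL against the same database and compare row multisets, so that a query which executes cleanly but drops a required join or condition is flagged as wrong. The sqlite3 backend and the exact-match comparison are assumptions, not the benchmarks' official scorers.

```python
import sqlite3
from collections import Counter

def classify_prediction(db_path: str, predicted_sql: str, gold_sql: str) -> str:
    """Return 'correct', 'wrong_rows', or 'error' for one prediction.

    'wrong_rows' is the failure mode described above: the SQL executes
    without error, but an omitted table or condition changes the result.
    """
    conn = sqlite3.connect(db_path)
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        try:
            pred_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return "error"
        # Compare as multisets so row order does not matter.
        return "correct" if Counter(gold_rows) == Counter(pred_rows) else "wrong_rows"
    finally:
        conn.close()
```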
Original abstract
Text-to-SQL is the task of translating natural language queries into executable SQL for a given database, enabling non-expert users to access structured data without writing SQL manually. Despite rapid advances driven by large language models (LLMs), existing approaches still struggle with complex queries in real-world settings, where database schemas are large and questions require multi-step reasoning over many interrelated tables. In such cases, providing the full schema often exceeds the context window, while one-shot generation frequently produces non-executable SQL due to syntax errors and incorrect schema linking. To address these challenges, we introduce AV-SQL, a framework that decomposes complex Text-to-SQL into a pipeline of specialized LLM agents. Central to AV-SQL is the concept of agentic views: agent-generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas. AV-SQL operates in three stages: (1) a rewriter agent compresses and clarifies the input query; (2) a view generator agent processes schema chunks to produce agentic views; and (3) a planner, generator, and revisor agent collaboratively compose these views into the final SQL query. Extensive experiments show that AV-SQL achieves 70.38% execution accuracy on the challenging Spider 2.0 benchmark, outperforming state-of-the-art baselines, while remaining competitive on standard datasets with 85.59% on Spider, 72.16% on BIRD and 63.78% on KaggleDBQA. Our source code is available at https://github.com/pminhtam/AV-SQL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. AV-SQL is a multi-agent LLM framework for Text-to-SQL that decomposes complex queries using a rewriter agent, a view generator that produces agentic views (LLM-generated CTEs from schema chunks), and a planner/generator/revisor pipeline to compose the final SQL. The paper claims execution accuracies of 85.59% on Spider, 72.16% on BIRD, 63.78% on KaggleDBQA, and 70.38% on Spider 2.0, outperforming state-of-the-art baselines, with source code released.
Significance. If the performance gains can be robustly attributed to the agentic views mechanism, the work would advance Text-to-SQL for large-schema, multi-step queries by providing a structured decomposition that mitigates context limits and linking errors. The open-source release is a clear strength for reproducibility. However, without isolating the contribution of the views, the significance remains provisional.
major comments (3)
- [§4] §4 (Experiments): The reported 70.38% execution accuracy on Spider 2.0 and outperformance claims lack any ablation studies that isolate the agentic views (e.g., comparing the full pipeline to a version without view generation or with direct schema input), so the central empirical result cannot yet be attributed to the proposed mechanism rather than the multi-agent structure alone.
- [§3.2] §3.2 (View Generator Agent): No per-stage evaluation of agentic view quality is provided, such as a correctness metric for the generated CTEs, manual inspection for semantic errors in predicates/joins, or analysis of cross-chunk schema omissions; this is load-bearing because the framework relies on these LLM-generated intermediates being reliable without introducing incorrect logic.
- [§4.1] §4.1 (Experimental Setup): Details on controls for the stochastic LLM agents (temperature, number of runs, prompt sensitivity, or error analysis of failure cases) are absent, undermining confidence in the benchmark numbers and the claim that gains stem from reliable decomposition.
minor comments (2)
- [§3] The description of schema chunking in §3 lacks explicit pseudocode or parameters for chunk size and overlap, which would clarify how cross-chunk dependencies are handled; a hedged sketch of what such chunking could look like follows this list.
- Table or figure captions for benchmark results should include exact baseline names and versions for direct comparison.
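For concreteness, a minimal sketch of what such chunking pseudocode might look like, with explicit size and overlap parameters; the table-level granularity and the default values are assumptions, not the paper's settings.

```python
# Hedged sketch: group table DDL statements into overlapping schema chunks.
# chunk_size and overlap defaults are placeholders, not the paper's values.

def chunk_schema(table_ddls: list[str], chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Split a large schema (one DDL string per table) into overlapping chunks.

    Repeating the last `overlap` tables of each chunk at the start of the next
    keeps joins that straddle a boundary visible to the view generator in at
    least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(table_ddls), step):
        chunks.append("\n\n".join(table_ddls[start:start + chunk_size]))
        if start + chunk_size >= len(table_ddls):
            break
    return chunks
```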
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical support and transparency of AV-SQL. We agree with the need for clearer isolation of the agentic views contribution, per-stage analysis, and experimental controls. We will revise the paper accordingly, as detailed in the point-by-point responses below.
Point-by-point responses
Referee: [§4] §4 (Experiments): The reported 70.38% execution accuracy on Spider 2.0 and outperformance claims lack any ablation studies that isolate the agentic views (e.g., comparing the full pipeline to a version without view generation or with direct schema input), so the central empirical result cannot yet be attributed to the proposed mechanism rather than the multi-agent structure alone.
Authors: We agree that ablation studies are necessary to isolate the contribution of agentic views from the broader multi-agent pipeline. In the revised manuscript, we will add a dedicated ablation subsection in §4. This will include results on Spider 2.0 for: (1) the full AV-SQL pipeline, (2) a variant without the view generator (directly feeding schema chunks to the planner/generator), and (3) a single-agent baseline using the full schema. These comparisons will quantify the specific benefit of the decomposition mechanism. revision: yes
Referee: [§3.2] §3.2 (View Generator Agent): No per-stage evaluation of agentic view quality is provided, such as a correctness metric for the generated CTEs, manual inspection for semantic errors in predicates/joins, or analysis of cross-chunk schema omissions; this is load-bearing because the framework relies on these LLM-generated intermediates being reliable without introducing incorrect logic.
Authors: We acknowledge that evaluating the quality of the generated agentic views is essential given their central role. We will add a new analysis subsection (likely in §3.2 or §4) that includes: manual inspection of a random sample of 50 generated views for semantic correctness, join accuracy, and predicate fidelity; an automated metric based on whether each view executes successfully against the database; and discussion of cross-chunk schema coverage by comparing view columns to the full schema. This will provide evidence on the reliability of the intermediates. revision: yes
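The automated executability metric mentioned above could be as simple as the following probe, which wraps each view in a LIMIT 0 query against a SQLite copy of the database; the wrapper and the metric definition are assumptions rather than the authors' planned implementation.

```python
import sqlite3

def view_executes(db_path: str, view_name: str, view_body: str) -> bool:
    """Return True if a single agentic view runs against the database.

    Wrapping the CTE in a LIMIT 0 query keeps the check cheap; any syntax
    or schema-linking error surfaces as a sqlite3.Error.
    """
    probe = f"WITH {view_name} AS ({view_body}) SELECT * FROM {view_name} LIMIT 0"
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(probe)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

def view_execution_rate(db_path: str, views: dict[str, str]) -> float:
    """Fraction of generated views (name -> body) that execute successfully."""
    if not views:
        return 0.0
    return sum(view_executes(db_path, n, b) for n, b in views.items()) / len(views)
```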
Referee: [§4.1] §4.1 (Experimental Setup): Details on controls for the stochastic LLM agents (temperature, number of runs, prompt sensitivity, or error analysis of failure cases) are absent, undermining confidence in the benchmark numbers and the claim that gains stem from reliable decomposition.
Authors: We will expand §4.1 with the missing experimental controls. Specifically, we will report: the temperature settings used for each agent (e.g., 0.0 for the generator to promote determinism where possible), the number of independent runs (with mean and standard deviation), prompt sensitivity tests on a subset of queries, and a categorized error analysis of failure cases on Spider 2.0 distinguishing view-generation errors from final-SQL errors. This will increase confidence in the reported numbers. revision: yes
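A small sketch of the kind of controls being promised here, assuming per-agent temperatures and repeated runs aggregated as mean and standard deviation; all values are placeholders, not the paper's settings.

```python
import statistics

# Hypothetical per-agent sampling temperatures (placeholders, not the paper's values).
AGENT_TEMPERATURES = {
    "rewriter": 0.3,
    "view_generator": 0.2,
    "planner": 0.2,
    "generator": 0.0,  # deterministic where possible, as noted above
    "revisor": 0.0,
}

def summarize_runs(accuracies: list[float]) -> str:
    """Report mean and standard deviation of execution accuracy over runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return f"{mean:.2f} ± {std:.2f} over {len(accuracies)} runs"

# Example with three hypothetical runs on one benchmark split.
print(summarize_runs([70.1, 70.5, 69.8]))
```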
Circularity Check
No circularity; empirical benchmark results with no derivation chain
full rationale
The paper describes an agentic pipeline (rewriter, view-generator, planner/generator/revisor) that produces CTE views and final SQL, then reports execution accuracies on public benchmarks (Spider 2.0 at 70.38%, Spider at 85.59%, etc.). No equations, parameter fits, uniqueness theorems, or first-principles derivations appear; the central claims are end-to-end empirical measurements rather than any quantity that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any claimed result. The work is therefore self-contained empirical engineering with no detectable circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM agents can generate syntactically and semantically valid CTEs that correctly capture intermediate query logic from schema chunks.
- Domain assumption: Decomposing the query via views avoids context-window overflow and reduces syntax and schema-linking errors compared with one-shot generation.
invented entities (1)
- agentic view: no independent evidence