Bridge the Last-Mile Gap to Semantic Analytics: Compiling Natural-Language Queries into Semantic Operator Pipelines

Ruyu Li; Sairam Gurajada; Wenkai Dong; Yifan Wang

arxiv: 2606.04641 · v1 · pith:DWFDKUHInew · submitted 2026-06-03 · 💻 cs.DB

Pith reviewed 2026-06-28 04:01 UTC · model grok-4.3

classification 💻 cs.DB

keywords natural language to pipeline · semantic operators · query compilation · heterogeneous data · database middleware · semantic analytics

The pith

NL2Pipe compiles natural-language questions into executable semantic operator pipelines via three-phase compilation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NL2Pipe as a middleware that turns natural-language questions into pipelines of semantic operators for tables, text, and images. It solves the problem that current semantic systems require manual operator selection, ordering, predicate writing, and backend adaptation, which is hard for non-experts and impossible for runtime-varying automated workflows. The approach separates data-aware reasoning from backend code by first linking question entities to actual data and implicit bridges, then planning backend-agnostic operators, and finally generating executable code from an auto-generated reference. This makes the same planner work across backends. Evaluation on complex cross-source workloads shows up to 60 percent higher F1 scores with bounded cost and competitive latency.

Core claim

NL2Pipe treats the translation from natural language to semantic pipelines as a three-phase compilation: a Query-Data Linker grounds entities and finds implicit bridges across sources, a Semantic Planner builds a backend-agnostic sequence of operators with natural-language predicates, and a Code Generator produces executable code using an auto-generated backend reference document. This separation lets one planning process support multiple backends while handling heterogeneous data without manual intervention.
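To make the separation of phases concrete, here is a minimal sketch of the data flow, not the paper's implementation. Every name in it (`LinkedQuery`, `PlanStep`, `link`, `plan`, `generate`, and the two `toy_*` backends) is hypothetical, exact string matching stands in for the LLM calls, and the sample data is invented. The point it illustrates is that one plan can be rendered for more than one backend because backend syntax only appears in the last function.

```python
from dataclasses import dataclass, field


@dataclass
class LinkedQuery:
    """Output of the (hypothetical) linker phase."""
    question: str
    grounded: dict[str, str]                 # question entity -> source it was found in
    bridges: list[str] = field(default_factory=list)


@dataclass
class PlanStep:
    """One backend-agnostic operator with a natural-language predicate."""
    op: str          # "filter", "join", "map", ...
    predicate: str


def link(question: str, sources: dict[str, list[str]]) -> LinkedQuery:
    """Phase 1 stand-in: ground entities by substring match, find shared values."""
    q = question.lower()
    grounded = {v: name for name, values in sources.items()
                for v in values if v.lower() in q}
    seen: dict[str, set[str]] = {}
    for name, values in sources.items():
        for v in values:
            seen.setdefault(v, set()).add(name)
    # Toy notion of a bridge: a value shared by several sources but not named in the question.
    bridges = sorted(v for v, names in seen.items()
                     if len(names) > 1 and v not in grounded)
    return LinkedQuery(question, grounded, bridges)


def plan(linked: LinkedQuery) -> list[PlanStep]:
    """Phase 2 stand-in: emit operators with no backend syntax in them."""
    steps = [PlanStep("filter", f"row mentions {e}") for e in linked.grounded]
    steps += [PlanStep("join", f"records refer to the same {b}") for b in linked.bridges]
    steps.append(PlanStep("map", f"answer: {linked.question}"))
    return steps


# Phase 3 stand-in: the only place where backend-specific syntax lives.
# Both templates are invented; they are not the API of any real system.
TEMPLATES = {
    "toy_chain": "data.{op}({pred!r})",
    "toy_calls": "{op}(data, predicate={pred!r})",
}


def generate(steps: list[PlanStep], backend: str) -> list[str]:
    template = TEMPLATES[backend]
    return [template.format(op=s.op, pred=s.predicate) for s in steps]


if __name__ == "__main__":
    sources = {"table": ["Boskovice", "Brno"], "passages": ["Brno"]}
    steps = plan(link("How many people live in Boskovice?", sources))
    for backend in TEMPLATES:
        print(backend, generate(steps, backend))
```

Swapping the target backend changes only the template lookup in `generate`; `link` and `plan` are untouched, which is the reuse the core claim describes.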

What carries the argument

The three-phase compiler consisting of Query-Data Linker, Semantic Planner, and Code Generator, where the linker grounds entities and discovers bridge entities to connect data sources.

If this is right

Non-expert users and automated workflows can issue natural-language questions over heterogeneous data without manually designing pipelines.
The same planning logic supports multiple semantic operator backends by isolating backend details to the code generation phase.
Complex cross-source queries achieve higher accuracy than manual or baseline approaches while keeping cost bounded.
Queries and data can vary at runtime because the compiler adapts the pipeline automatically rather than requiring static manual design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The linker step may become the main bottleneck when data sources lack clear entity overlaps or when implicit bridges are ambiguous.
Extending the planner to include cost or latency estimates during planning could further optimize the generated pipelines.
The approach could apply to additional data modalities such as video or audio if the linker and operator set are expanded accordingly.
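As an illustration of the cost-aware planning extension suggested above, the sketch below estimates the cost of a linear pipeline and reorders independent filters. It is an editorial example, not something the paper describes: the `Step` fields, the per-record costs, and the selectivities are all assumed values, and the ordering rule is the classic rank heuristic for independent predicates (ascending cost divided by one minus selectivity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    cost_per_record: float   # assumed cost of applying the operator to one record
    selectivity: float       # assumed fraction of records that survive the step


def estimate_cost(steps: list[Step], n_records: int) -> float:
    """Sum the per-step cost, shrinking the record count after every step."""
    total, remaining = 0.0, float(n_records)
    for s in steps:
        total += remaining * s.cost_per_record
        remaining *= s.selectivity
    return total


def rank(step: Step) -> float:
    """Rank of an independent filter; lower rank should run earlier."""
    if step.selectivity >= 1.0:
        return float("inf")          # a step that drops nothing goes last
    return step.cost_per_record / (1.0 - step.selectivity)


def order_filters(steps: list[Step]) -> list[Step]:
    return sorted(steps, key=rank)


if __name__ == "__main__":
    cheap = Step("cheap_filter", cost_per_record=1.0, selectivity=0.5)
    pricey = Step("pricey_filter", cost_per_record=4.0, selectivity=0.25)
    best = order_filters([pricey, cheap])
    print([s.name for s in best], estimate_cost(best, 100))
```

With these assumed numbers, running the cheap filter first costs 300 units for 100 records, against 425 for the reverse order. The heuristic only applies to filters whose selectivities are independent; joins and maps would need a different treatment.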

Load-bearing premise

The Query-Data Linker can reliably ground question entities against heterogeneous data sources and discover implicit bridge entities that connect tables, text, and images without requiring backend-specific knowledge during planning.

What would settle it

A test set of natural-language questions over tables, text, and images where the linker fails to identify the required bridge entities, causing the planner to produce invalid or incomplete pipelines.

Figures

Figures reproduced from arXiv: 2606.04641 by Ruyu Li, Sairam Gurajada, Wenkai Dong, Yifan Wang.

**Figure 1.** End-to-end architecture of NL2Pipe, illustrated on the running example “How often does the folk festival in the South Moravian city with fewer people than Boskovice but more than others take place?”. The front-end Query Linker (Phase A, two LLM calls) grounds question entities against the data (“Boskovice” maps to a table cell, population 11,622) and discovers bridge entities not named in the question— ran…

**Figure 2.** Bridge entity discovery on a HybridQA example. Top: the table and text passages seen by the system. The yellow …

**Figure 3.** Overview of an example auto-generated Reference …

**Figure 4.** Quality–cost trade-off on the three backends over varying datasets. Each point is a (pipeline, dataset) combination …

Original abstract

Automated AI workflows increasingly rely on natural-language reasoning over heterogeneous data, but lack a practical way to execute it through optimized semantic data systems. Recent semantic operator systems, such as Palimpzest and LOTUS, expose declarative operators for filtering, joining, mapping, and aggregating over tables, text, and images using natural-language predicates. However, these systems require users to manually choose operators, order them, write predicates, and adapt the pipeline to backend-specific APIs. This is difficult for non-experts, brittle across backends, and infeasible for automated workflows where queries and data vary at runtime. We present NL2Pipe, a middleware system that compiles natural-language questions into executable semantic operator pipelines, treating this as a three-phase compilation problem. First, a Query-Data Linker grounds question entities against the actual data and discovers implicit bridge entities needed to connect tables, text, and images. Second, a Semantic Planner produces a backend-agnostic action plan of semantic operators and natural-language predicates. Third, a Code Generator translates the plan into executable code for a target backend using an auto-generated reference document capturing operator signatures, example pipelines, and backend constraints. This separates data-aware reasoning from backend-specific code generation, letting the same planning logic support multiple backends. Evaluation shows NL2Pipe substantially outperforms baselines on complex cross-source workloads (e.g., up to 60% higher F1) while maintaining bounded cost and competitive latency. This demonstrates that automatic compilation from natural language to semantic operator pipelines is both practical and effective for bringing semantic analytics to non-expert users and automated AI workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

NL2Pipe gives a clean three-phase split for compiling NL questions into semantic operator pipelines, but the 60% F1 claim sits on thin evidence.

NL2Pipe turns natural-language questions into pipelines of semantic operators for tables, text, and images. It breaks the job into a Query-Data Linker that grounds entities and finds bridge connections, a Semantic Planner that outputs a backend-agnostic sequence of operators and predicates, and a Code Generator that emits runnable code for a chosen system.

The separation of concerns is the clearest advance. Prior systems like Palimpzest and LOTUS already expose the operators; this work automates the choice and ordering while keeping the planner independent of any one backend. That reuse across targets is a practical improvement for automated workflows where queries and data sources change.

The reported gains—up to 60% higher F1 on complex cross-source workloads with bounded cost—are the main selling point. Yet the abstract supplies no workload descriptions, baseline details, or breakdown of where the linker succeeded or failed. Without those, it is difficult to know whether the architecture itself drove the difference or whether the gains trace to specific implementation choices.

The linker is presented as the critical first step that must discover implicit bridges without backend knowledge. If that component is only sketched at a high level, the downstream planner starts from incomplete input and the overall claims become harder to evaluate.

The paper is aimed at researchers and engineers who already work with semantic operators and want to lower the barrier for non-expert or runtime-driven use. Readers focused on data systems and NL interfaces will see the engineering structure clearly.

I would send it to peer review. The motivation is real, the three-phase design is a reasonable response, and the idea is worth referee scrutiny even if the current evidence needs expansion on the linker and the experiments.

Referee Report

2 major / 1 minor

Summary. The manuscript presents NL2Pipe, a middleware system that compiles natural-language questions into executable semantic operator pipelines for heterogeneous data sources including tables, text, and images. The system is structured as a three-phase compilation process: (1) Query-Data Linker that grounds question entities and discovers implicit bridge entities, (2) Semantic Planner that generates a backend-agnostic action plan of semantic operators and predicates, and (3) Code Generator that translates the plan into backend-specific executable code using an auto-generated reference document. The evaluation claims that NL2Pipe substantially outperforms baselines on complex cross-source workloads, achieving up to 60% higher F1 scores while maintaining bounded cost and competitive latency.

Significance. If the results hold, this work would be significant in bridging the gap between natural language reasoning and optimized semantic data systems. By automating the selection and ordering of operators and adapting to different backends, it makes semantic analytics practical for non-experts and automated workflows, potentially increasing the adoption of systems like Palimpzest and LOTUS.

major comments (2)

Abstract: The central performance claim of up to 60% higher F1 on complex cross-source workloads is presented without details on the specific workloads, baselines used, or error analysis. This prevents verification of whether the data supports the claim, particularly given that the Query-Data Linker is described as backend-agnostic but its reliability in grounding entities and discovering bridges is not independently evaluated.
Abstract: The Query-Data Linker is identified as the first and critical phase, yet no mechanism, algorithm, or evaluation is provided for how it grounds entities against heterogeneous data sources or discovers implicit bridge entities without requiring backend-specific knowledge. This is load-bearing for attributing the F1 gains to the architecture rather than to the linker's success on the evaluated workloads.

minor comments (1)

The abstract mentions 'bounded cost' but does not specify what cost metric is used or how it is bounded.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below. We agree that the abstract would benefit from additional context and will revise accordingly while pointing to the detailed descriptions already present in the body of the manuscript.

Referee: [—] Abstract: The central performance claim of up to 60% higher F1 on complex cross-source workloads is presented without details on the specific workloads, baselines used, or error analysis. This prevents verification of whether the data supports the claim, particularly given that the Query-Data Linker is described as backend-agnostic but its reliability in grounding entities and discovering bridges is not independently evaluated.

Authors: The abstract is necessarily concise, but the full manuscript provides the requested details in Section 5. Workloads are the heterogeneous cross-source QA tasks (tables+text+images) defined in 5.1; baselines include direct LLM prompting, hand-crafted pipelines, and NL-to-pipeline systems in 5.2; F1 results and error analysis appear in 5.3–5.4. The linker is backend-agnostic because it operates solely on a unified schema and embeddings. We will revise the abstract to include a brief reference to the evaluation setup and add an isolated precision/recall evaluation of the linker. revision: yes
Referee: [—] Abstract: The Query-Data Linker is identified as the first and critical phase, yet no mechanism, algorithm, or evaluation is provided for how it grounds entities against heterogeneous data sources or discovers implicit bridge entities without requiring backend-specific knowledge. This is load-bearing for attributing the F1 gains to the architecture rather than to the linker's success on the evaluated workloads.

Authors: Section 3.1 of the manuscript describes the Query-Data Linker in detail: entity grounding uses embedding-based similarity against a backend-agnostic metadata index, while implicit bridge discovery employs LLM reasoning over candidate connections derived from the same index. No backend APIs are invoked. The end-to-end evaluation in Section 5 incorporates linker performance, but we agree a standalone ablation would strengthen attribution. We will add pseudocode for the algorithm and a dedicated linker-accuracy subsection in the revision. revision: yes
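The simulated rebuttal describes similarity-based grounding against a metadata index and promises a standalone precision/recall evaluation of the linker. The sketch below shows what such a check could look like under loud assumptions: character-trigram counts stand in for real embeddings, the threshold of 0.5 is arbitrary, and `embed`, `cosine`, `ground`, and `precision_recall` are hypothetical helper names, not functions from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: counts of padded character trigrams."""
    t = f"  {text.lower()} "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def ground(mention: str, index: list[str], threshold: float = 0.5):
    """Return the most similar index entry, or None if nothing clears the threshold."""
    query = embed(mention)
    best, best_score = None, 0.0
    for entry in index:
        score = cosine(query, embed(entry))
        if score > best_score:
            best, best_score = entry, score
    return best if best_score >= threshold else None


def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Set-based precision and recall for grounded or bridge entities."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    index = ["Boskovice", "Brno"]          # invented metadata index
    print(ground("boskovice", index))      # matches the index entry
    print(ground("zzz", index))            # nothing similar enough
    print(precision_recall({"a", "b"}, {"a", "c"}))
```

Scoring the linker this way, separately from end-to-end F1, is what would let a reader tell linker failures apart from planner or code-generation failures.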

Circularity Check

0 steps flagged

No circularity: system design with external empirical evaluation

The paper presents NL2Pipe as a three-phase middleware system (Query-Data Linker, Semantic Planner, Code Generator) for compiling NL queries to semantic operator pipelines. No equations, fitted parameters, or first-principles derivations appear; performance claims (up to 60% higher F1) rest on external benchmark comparisons rather than any reduction to self-defined inputs or self-citations. The Query-Data Linker is described as a component whose reliability is evaluated empirically, not derived by construction from the planner or generator. This is a standard systems paper whose central claims are falsifiable against independent workloads and baselines, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5833 in / 1064 out tokens · 24483 ms · 2026-06-28T04:01:23.011376+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 1 linked inside Pith

[1]

Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, and Matei Zaharia. 2024. Text2SQL is Not Enough: Unifying AI and Databases with TAG. arXiv preprint arXiv:2408.14717 (2024)

arXiv 2024
[2]

Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. 2023. Autonomous Chemical Research with Large Language Models. Nature 624 (2023), 570–578

2023
[3]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021)

Pith/arXiv arXiv 2021
[4]

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Yang Wang. 2020. HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data. In Findings of EMNLP

2020
[5]

Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, and Dale Schuurmans. 2024. UQE: A Query Engine for Unstructured Databases. arXiv:2407.09522 [cs.DB] https://arxiv.org/abs/2407.09522

arXiv 2024
[6]

Darryl Hannan, Akshay Jain, and Mohit Bansal. 2020. ManyModalQA: Modality Disambiguation and QA over Diverse Inputs. In AAAI

2020
[7]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al
[8]

MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. In ICLR
[9]

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781

2020
[10]

Vishwajeet Kumar, Yash Gupta, Saneem Chemmengath, Jaydeep Sen, Soumen Chakrabarti, Samarth Bharadwaj, and Feifei Pan. 2023. Multi-Row, Multi-Span Distant Supervision For Table+Text Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). 8080–8094

2023
[11]

Jiale Lao, Andreas Zimmerer, Olga Ovcharenko, Tianji Cong, Matthew Russo, Gerardo Vitagliano, Michael Cochez, Fatma Özcan, Gautam Gupta, Thibaud Hottelier, H. V. Jagadish, Kris Kissel, Sebastian Schelter, Andreas Kipf, and Immanuel Trummer. 2025. SemBench: A Benchmark for Semantic Query Processing Engines. arXiv preprint arXiv:2511.01716 (2025)

arXiv 2025
[12]

Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. 2025. CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models. In NAACL

2025
[13]

Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baile Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Rana Shahout, and Gerardo Vitagliano. 2025. Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing. In Proc. CIDR

2025
[14]

Shicheng Liu, Jialiang Xu, Wesley Tjangnaka, Sina Semnani, Chen Yu, and Monica Lam. 2024. SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models. In Findings of NAACL

2024
[15]

Liana Patel, Siddharth Jha, Melissa Pan, Harshit Gupta, Parth Asawa, Carlos Guestrin, and Matei Zaharia. 2025. Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS. Proc. VLDB Endow. 18, 11 (2025), 4171–4184

2025
[16]

Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. In NeurIPS

2023
[17]

Matthew Russo, Sivaprasad Sudhir, Gerardo Vitagliano, Chunwei Liu, Tim Kraska, Samuel Madden, and Michael Cafarella. 2026. Abacus: A Cost-Based Optimizer for Semantic Operator Systems. Proc. VLDB Endow. 19, 5 (2026), 1060–1073

2026
[18]

Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G. Parameswaran, and Eugene Wu. 2025. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. Proc. VLDB Endow. 18, 9

2025
[19]

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2024. HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in Hugging Face. In NeurIPS

2024
[20]

Ji Sun, Guoliang Li, Peiyao Zhou, Yihui Ma, Jingzhe Xu, and Yuan Li. 2025. AgenticData: An Agentic Data Analytics System for Heterogeneous Data. arXiv preprint arXiv:2508.05002 (2025)

arXiv 2025
[21]

Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, and Jonathan Berant. 2021. MultiModalQA: Complex Question Answering over Text, Tables, and Images. In ICLR

2021
[22]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal
[23]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). 10014–10037
[24]

Matthias Urban and Carsten Binnig. 2024. Demonstrating CAESURA: Language Models as Multi-Modal Query Planners. In Companion of the 2024 International Conference on Management of Data (SIGMOD)

2024
[25]

Jiayi Wang and Jianhua Feng. 2025. Unify: An Unstructured Data Analytics System. In Proc. ICDE

2025
[26]

Wenhan Xiong, Xiang Lorraine Li, Srinivasan Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Wen-tau Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oğuz. 2021. Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval. In International Conference on Learning Representations (ICLR)

2021
[27]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering. In EMNLP

2018
[28]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR

2023
[29]

Le Zhang, Yihong Wu, Fengran Mo, Jian-Yun Nie, and Aishwarya Agrawal
[30]

MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model. arXiv preprint arXiv:2310.13265 (2023)

arXiv 2023
[31]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In NeurIPS Datasets and Benchmarks Track

2023
[32]

Yongwei Zhou, Junwei Bao, Chaoqun Duan, Youzheng Wu, Xiaodong He, and Tiejun Zhao. 2022. UniRPG: Unified Discrete Reasoning over Table and Text as Program Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 7494–7507. https...

2022
[33]

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In ACL

2021
[34]

Junhao Zhu, Lu Chen, Xiangyu Ke, Ziquan Fang, Tianyi Li, Yunjun Gao, and Christian S. Jensen. 2025. Beyond Relational: Semantic-Aware Multi-Modal Analytics with LLM-Native Query Optimization. arXiv preprint arXiv:2511.19830 (2025)

arXiv 2025
