pith. machine review for the scientific record.

arxiv: 2603.02537 · v2 · submitted 2026-03-03 · 💻 cs.DB

Recognition: no theorem link

Large Language Model-Enhanced Relational Operators: Taxonomy, Benchmark, and Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:33 UTC · model grok-4.3

classification 💻 cs.DB
keywords LLM-enhanced relational operators · taxonomy · benchmark · semantic query processing · database systems · large language models · data imputation · entity matching

The pith

A unified taxonomy places LLM-enhanced relational operators into five categories, and a benchmark of 350 queries identifies concrete design best practices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper brings order to scattered uses of large language models inside database operators by grouping them under one taxonomy. It then supplies LROBench, a large set of test queries drawn from many real databases and domains, to measure how different implementations actually perform. From those measurements the authors extract practical rules for choosing operator designs and show that applying those rules improves complete multi-operator pipelines. A sympathetic reader would care because semantic queries that mix filtering, matching, and imputation are becoming common, and without shared definitions and tests progress stays fragmented.

Core claim

LLM-Enhanced Relational Operators are defined as LLM calls that keep a strict relational input-output interface. The taxonomy sorts them into Select, Match, Impute, Cluster, and Order, each with listed operands and implementation variants. Evaluation on the 290 single-operator and 60 multi-operator queries of LROBench yields measurable performance differences across variants, from which the authors extract empirical best practices; these practices are then used to build an LRO suite that is compared directly against existing multi-operator systems on complex semantic workloads.
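The strict relational input-output contract can be made concrete with a minimal sketch: a Select-style LRO takes tuples in and returns tuples out, with the LLM hidden behind a predicate function. The names below (`semantic_select`, `stub_llm`) are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Iterable

# Minimal sketch of the LRO contract: relational tuples in, relational
# tuples out. The predicate stands in for an LLM call; here an offline
# stub is used so the sketch runs without a model.
Row = dict

def semantic_select(rows: Iterable[Row], prompt: str,
                    llm_predicate: Callable[[str, Row], bool]) -> list:
    """Select LRO: keep rows the (possibly LLM-backed) predicate accepts."""
    return [row for row in rows if llm_predicate(prompt, row)]

# Offline stub standing in for a real model call.
def stub_llm(prompt: str, row: Row) -> bool:
    return "italian" in str(row.get("cuisine", "")).lower()

rows = [{"name": "Trattoria", "cuisine": "Italian"},
        {"name": "Izakaya", "cuisine": "Japanese"}]
kept = semantic_select(rows, "serves Italian food?", stub_llm)
print(kept)  # only the Italian restaurant survives the semantic filter
```

The same interface shape applies to the other four categories: Match and Cluster consume pairs or sets of tuples, Impute fills cells, Order ranks rows, but each still reads and emits relations.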

What carries the argument

The LRO taxonomy, which aligns operators into Select, Match, Impute, Cluster, and Order categories together with their operand and implementation variants, supported by the LROBench benchmark of 290 single-LRO and 60 multi-LRO queries across 27 databases.

If this is right

  • Designers can select LRO implementations according to the measured performance differences for better accuracy or speed on individual tasks.
  • Instantiating a multi-LRO system with the identified best practices yields higher end-to-end effectiveness on complex semantic queries than current systems.
  • The five-category taxonomy supplies a consistent language for comparing and extending operators across different research efforts.
  • Releasing the full benchmark data and evaluation code allows direct reproduction and extension of the reported measurements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of the taxonomy could encourage database engines to expose standard interfaces for semantic operators rather than ad-hoc LLM calls.
  • The same evaluation approach might be applied to measure how these operators behave under streaming or distributed execution settings.
  • Hybrid query planners could treat the five LRO types as first-class algebraic operators when optimizing mixed relational and semantic workloads.
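The planner idea in the last bullet can be sketched under simple assumptions: when a cheap relational filter and an expensive semantic filter commute under conjunction, ordering them by estimated per-tuple cost lets the cheap one prune the LLM's input. The cost figures below are invented for illustration, not measurements from the paper.

```python
# Toy cost-based planner treating filters as commuting algebraic operators:
# order conjunctive filters cheapest-first so the plain relational predicate
# shrinks the input before any LLM call. Cost numbers are invented.
def plan_filters(filters):
    """Order conjunctive filters by ascending per-tuple cost."""
    return sorted(filters, key=lambda f: f["cost_per_tuple"])

filters = [
    {"name": "semantic_select", "cost_per_tuple": 2e-3},    # one LLM call per row
    {"name": "relational_select", "cost_per_tuple": 1e-8},  # plain predicate eval
]
order = [f["name"] for f in plan_filters(filters)]
print(order)  # relational filter runs first, reducing LLM invocations
```

A real optimizer would also account for selectivity, not just cost, but the algebraic framing is the same.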

Load-bearing premise

The 290 single and 60 multi queries chosen for the benchmark adequately represent the full range of real-world semantic query needs and operating logics.

What would settle it

A fresh collection of semantic queries drawn from an additional domain that produces materially different performance rankings or best-practice recommendations would falsify the claim that the current benchmark results generalize.

Figures

Figures reproduced from arXiv: 2603.02537 by Bolin Ding, Jingren Zhou, Rong Zhu, Tianjing Zeng, Yin Lin, Yunxiang Su, Zhewei Wei, Zhongjun Ding.

Figure 1: Examples of standalone-LRO components and multi-LRO systems. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]

Figure 2: An example to illustrate LLM-enhanced operators. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

Figure 3: Single-LRO query numbers by LRO types and … [PITH_FULL_IMAGE:figures/full_fig_p010_3.png]

Figure 4: Scoring dimensions of multi-LRO queries. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]

Figure 5: Overall score distribution of multi-LRO queries. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png]

Figure 6: Effectiveness and cost of Filter at scale (LLM judge score and cost in dollars vs. number of input tuples, LLM-ALL vs. LLM-ONE). [PITH_FULL_IMAGE:figures/full_fig_p014_6.png]

Figure 7: Effectiveness and cost of Impute at scale. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png]
Original abstract

With the development of large language models (LLMs), numerous studies integrate LLMs through operator-like components to enhance relational data processing tasks, e.g., filters with semantic predicates, knowledge-augmented table imputation, reasoning-driven entity matching and more challenging semantic query processing. These components invoke LLMs while preserving a relational input/output interface, which we refer to as LLM-Enhanced Relational Operators (LROs). From an operator perspective, unfortunately, these existing LROs suffer from fragmented definition, various implementation strategies and inadequate evaluation benchmarks. To this end, in this paper, we first establish a unified LRO taxonomy to align existing LROs, and categorize them into: Select, Match, Impute, Cluster and Order, along with their operands and implementation variants. Second, we design LROBench, a comprehensive benchmark featuring 290 single-LRO queries and 60 multi-LRO queries, spanning 27 databases across more than 10 domains. LROBench covers all operating logics and operand granularities in its single-LRO workload, and provides challenging multi-LRO queries stratified by query complexity. Based on these, we evaluate individual LROs under various implementations, deriving practical insights into LRO design choices and summarizing our empirical best practices. We further compare the end-to-end performance of existing multi-LRO systems against an LRO suite instantiated with these best practices, in order to investigate how to design an effective LRO set for multi-LRO systems targeting complex semantic queries. Last, to facilitate future work, we outline promising future directions and open-source all benchmark data and evaluation code, available at https://github.com/LROBench/LROBench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a unified taxonomy for LLM-Enhanced Relational Operators (LROs), categorizing them into Select, Match, Impute, Cluster, and Order along with their operands and implementation variants. It introduces LROBench, a benchmark with 290 single-LRO queries and 60 multi-LRO queries across 27 databases in more than 10 domains, which is asserted to cover all operating logics and operand granularities for single-LRO workloads. The work evaluates individual LRO implementations, derives empirical best practices, compares end-to-end performance of multi-LRO systems against an LRO suite using those practices, and open-sources the benchmark data and code.

Significance. If the benchmark coverage and evaluation results hold, the taxonomy and LROBench could standardize fragmented research on LLM-augmented relational operators, enabling reproducible comparisons and guiding design choices for semantic query processing. The open-sourcing of data, queries, and code is a clear strength that supports future work in this emerging area.

major comments (1)
  1. [Abstract] Abstract and LROBench description: the claim that the 290 single-LRO queries 'cover all operating logics and operand granularities' is load-bearing for the taxonomy alignment and downstream best-practice recommendations, yet no formal enumeration of the logic space, sampling frame, or completeness argument is provided; queries are described as manually curated and stratified only by complexity, leaving open the possibility that sub-classes (e.g., multi-hop semantic predicates or domain-specific constraints) are underrepresented.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments, which help clarify the benchmark construction. We address the major comment below and have prepared revisions to strengthen the presentation of LROBench coverage.

Point-by-point responses
  1. Referee: [Abstract] Abstract and LROBench description: the claim that the 290 single-LRO queries 'cover all operating logics and operand granularities' is load-bearing for the taxonomy alignment and downstream best-practice recommendations, yet no formal enumeration of the logic space, sampling frame, or completeness argument is provided; queries are described as manually curated and stratified only by complexity, leaving open the possibility that sub-classes (e.g., multi-hop semantic predicates or domain-specific constraints) are underrepresented.

    Authors: We agree that the original wording of the coverage claim was insufficiently justified. Section 3 defines the taxonomy by operator categories (Select, Match, Impute, Cluster, Order) together with operand types and granularities. The 290 queries were manually constructed by first enumerating the Cartesian product of these taxonomy dimensions and then instantiating each cell with representative examples drawn from the 27 databases. Stratification occurred along both operator type and complexity, not complexity alone. While a formal completeness proof is not possible for an open semantic space, we will add a new subsection (4.2) that explicitly lists the enumerated logic combinations, provides the sampling rationale, and includes a table mapping each taxonomy cell to the number of queries. We have also incorporated additional multi-hop predicate and domain-constraint examples into the released benchmark. The abstract claim will be revised to state that LROBench covers the operating logics and operand granularities defined by the taxonomy. These changes preserve the empirical results while addressing the concern directly. revision: partial
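The enumeration described in the rebuttal can be sketched as a Cartesian product over taxonomy dimensions, with each resulting cell instantiated by benchmark queries. The granularity values below are assumed examples for illustration, not the paper's exact operand list.

```python
# Hypothetical sketch of coverage-by-enumeration: cross the five operator
# categories with assumed operand granularities; each cell would then be
# populated with representative benchmark queries.
import itertools

operators = ["Select", "Match", "Impute", "Cluster", "Order"]
granularities = ["cell", "row", "column", "table"]  # assumed values

cells = list(itertools.product(operators, granularities))
print(len(cells))  # 5 categories x 4 granularities = 20 cells to fill
```

A table mapping each cell to its query count, as the rebuttal proposes for Section 4.2, is exactly a tally over this product.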

Circularity Check

0 steps flagged

No circularity in taxonomy or benchmark construction

full rationale

The paper surveys existing LLM-enhanced operators from the literature to establish a taxonomy (Select/Match/Impute/Cluster/Order), then manually curates 290 single-LRO and 60 multi-LRO queries stratified by complexity across 27 databases. No equations, parameter fitting, predictions, or self-citation chains appear in the central claims. The coverage assertion follows directly from the taxonomy categories rather than reducing to a self-definitional loop or fitted input. All benchmark data and code are open-sourced, making the work externally verifiable without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework assumes LLMs can reliably perform semantic operations when wrapped as relational operators; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption LLMs invoked via operator-like components preserve relational input/output interfaces while adding semantic capabilities
    Core premise stated in the abstract for defining LROs.
invented entities (1)
  • LRO (LLM-Enhanced Relational Operator) no independent evidence
    purpose: Unified category for LLM components that act as relational operators
    New conceptual grouping introduced to align existing work; no independent falsifiable evidence provided beyond the taxonomy itself.

pith-pipeline@v0.9.0 · 5622 in / 1202 out tokens · 56904 ms · 2026-05-15T17:33:26.036890+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniTQA: A Cost-Aware System for Hybrid Query Processing over Semi-Structured Data

cs.DB · 2026-04 · unverdicted · novelty 6.0

    OmniTQA integrates LLM semantic reasoning as a first-class query operator with classical relational operators in a cost-aware planner for hybrid structured and semi-structured data.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 1 Pith paper · 3 internal anchors
