pith. machine review for the scientific record.

arxiv: 2603.15970 · v6 · submitted 2026-03-16 · 💻 cs.DB · cs.AI

Recognition: no theorem link

100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:24 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords AI queries · proxy models · semantic filter · embedding vectors · cost reduction · latency reduction · database architecture · LLM approximation

The pith

Proxy models over embedding vectors cut AI query costs and latency by more than 100 times while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates an approximation method for LLM-based AI queries inside databases. Lightweight proxy models trained on embedding vectors replace expensive LLM calls for semantic filter and ranking operators. The proxies deliver more than 100 times lower cost and latency on benchmarks including a 10-million-row Amazon reviews dataset, with accuracy that matches or occasionally exceeds the full LLM. The work presents concrete architectures: an online (ad hoc) setup in Google BigQuery, a lower-latency AlloyDB setup that moves proxy training offline, and techniques that accelerate that training.

Core claim

Proxy models trained on embedding vectors can approximate the semantic filter and ranking operations performed by LLMs in AI queries, delivering more than 100 times reduction in cost and latency with no material loss in accuracy across tested datasets and query types.

What carries the argument

Lightweight proxy models trained on embedding vectors to approximate LLM semantic judgments for filter and ranking operators.

Load-bearing premise

Proxy models trained on embeddings can reliably approximate the semantic judgments of underlying LLMs across diverse datasets and query types without material accuracy loss.
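The premise can be made concrete with a minimal sketch of the general pattern (everything here is illustrative: synthetic vectors stand in for real embeddings, and a fixed linear rule stands in for the LLM's judgment; none of it comes from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 64

def embed(n_rows):
    """Stand-in for a real text-embedding model (64-dim vectors)."""
    return rng.normal(size=(n_rows, DIM))

def llm_filter(vectors):
    """Stand-in for the LLM's yes/no semantic filter judgment:
    here, a fixed linear 'concept' direction in embedding space."""
    return (vectors @ np.ones(DIM) > 0).astype(int)

# 1. Label a small sample with the expensive model (the only LLM cost).
sample = embed(2_000)
labels = llm_filter(sample)

# 2. Train the lightweight proxy on (embedding, label) pairs.
proxy = LogisticRegression(max_iter=1_000).fit(sample, labels)

# 3. Apply the proxy to the full table; no further LLM calls needed.
table = embed(100_000)
keep_mask = proxy.predict(table).astype(bool)

agreement = (proxy.predict(sample) == labels).mean()
```

The economics follow from step 3: the LLM is consulted only for the small labeled sample, while the full table sees only a cheap linear predict over precomputed embeddings.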

What would settle it

A head-to-head comparison on a fresh, large dataset in which the proxy model and the full LLM disagree on a substantial fraction of semantic filter decisions would falsify the reliable-approximation premise.
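Operationally, that test is a disagreement-rate measurement against a pre-registered threshold. A sketch with hypothetical decisions (the 2% error rate and 5% threshold are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

# Hypothetical filter decisions on a fresh dataset: 1 = row passes.
llm_decisions = rng.integers(0, 2, size=N)
proxy_decisions = llm_decisions.copy()
flip = rng.random(N) < 0.02          # assume the proxy errs on ~2% of rows
proxy_decisions[flip] ^= 1

disagreement = (proxy_decisions != llm_decisions).mean()

# A pre-registered threshold makes the test decisive; 5% here is
# illustrative -- "substantial fraction" needs a concrete value.
THRESHOLD = 0.05
claim_survives = disagreement < THRESHOLD
```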

Figures

Figures reproduced from arXiv: 2603.15970 by Alon Halevy, Fatma Özcan, Jian He, Pushkar Khadilkar, Rushabh Desai, Sam Idicula, Thibaud Hottelier, Xianshun Chen, Yannis Papakonstantinou, Yeounoh Chung, Yu Xiao, Yves-Laurent Kom Samo.

Figure 1. AI query execution plan construction with proxy model approximation process. We parse the AI query and extract …
Figure 2. Relative wall-clock time of each step of the proxy …
Figure 3. Proxy model performance (nDCG@10 on online …
Figure 4. Impact of sampling strategies on training data imbalance ratios, measured across various datasets of varying degrees …
Figure 5. Impact of different imbalanced label training techniques, as described in Section 4.2.
Figure 6. Impact of embedding model and dimensionality on proxy model classification performance. Note that Gecko …
Figure 7. Embedding distinctiveness illustrated by PCA visualization (X-axis is PC1, Y-axis is PC2) and separability scores across …
Original abstract

Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low cost analytics and database applications to benefit from AI queries. The approach delivers >100x cost and latency reduction for the semantic filter operator and also important gains for semantic ranking. The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve accuracy and occasionally improve accuracy across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture within Google BigQuery for this approach for purely online (ad hoc) queries, and a low-latency HTAP database-friendly architecture in AlloyDB that could further improve the latency by moving the proxy model training offline. We present techniques that accelerate the proxy model training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study of using lightweight proxy models over embedding vectors to approximate LLM evaluations in AI SQL queries. It claims >100x cost and latency reductions for semantic filter operators and gains for semantic ranking, with accuracy preserved or improved on benchmarks including a 10M-row Amazon reviews dataset. Architectures are proposed for Google BigQuery (online) and AlloyDB (HTAP with offline training), along with techniques to speed up proxy model training.

Significance. If the accuracy claims hold under broader conditions, this approach could make semantic AI queries viable for large-scale, cost-sensitive database applications by dramatically reducing reliance on expensive LLM calls. The work provides practical architectures and training optimizations that address real deployment challenges in data warehouses.

major comments (2)
  1. The abstract asserts accuracy preservation (and occasional improvement) on the 10M-row Amazon reviews benchmark and others, yet the evaluation provides no error bars, exclusion criteria, statistical tests, or concrete details on proxy label generation from the LLM, the exact agreement metric (precision/recall vs. end-to-end fidelity), or stress tests for query diversity and distribution shift. This directly undermines assessment of whether approximation error remains low enough to deliver the claimed net cost savings without re-execution.
  2. The central performance claim (>100x reduction for semantic filters) rests on proxy models reliably approximating LLM semantic judgments; the manuscript supplies no robustness analysis for unseen query types or domains outside the reported benchmarks, leaving the weakest assumption untested.
minor comments (2)
  1. Clarify in the architecture sections how the AlloyDB offline training path quantitatively improves latency over the BigQuery online path, with specific numbers.
  2. The phrase 'occasionally improve accuracy' in the abstract should be backed by explicit examples or delta metrics in the evaluation.
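Major comment 1's distinction between per-decision agreement and end-to-end fidelity matters because an AI query returns a result set, not individual labels. Fidelity of the final result set can be scored as precision and recall over row identifiers; a sketch with hypothetical result sets (the row ids are invented):

```python
# Hypothetical final result sets (row ids) from full-LLM execution
# and from the proxy-approximated plan.
llm_result = {1, 2, 3, 5, 8, 13, 21, 34}
proxy_result = {1, 2, 3, 5, 8, 13, 34, 55}

tp = len(proxy_result & llm_result)      # rows both plans return
precision = tp / len(proxy_result)       # proxy rows that are correct
recall = tp / len(llm_result)            # LLM rows the proxy recovers
f1 = 2 * precision * recall / (precision + recall)
```

Here both plans return 8 rows, of which 7 coincide, so precision = recall = 7/8 = 0.875.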

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the statistical and robustness aspects of our empirical evaluation. We address each major comment below and have revised the manuscript to incorporate additional details, analyses, and clarifications.

point-by-point responses
  1. Referee: The abstract asserts accuracy preservation (and occasional improvement) on the 10M-row Amazon reviews benchmark and others, yet the evaluation provides no error bars, exclusion criteria, statistical tests, or concrete details on proxy label generation from the LLM, the exact agreement metric (precision/recall vs. end-to-end fidelity), or stress tests for query diversity and distribution shift. This directly undermines assessment of whether approximation error remains low enough to deliver the claimed net cost savings without re-execution.

    Authors: We agree that greater statistical rigor strengthens the presentation. The revised manuscript now includes error bars (standard deviation over 5 independent runs) on all accuracy plots, explicit details on proxy label generation (LLM annotations on a 10k-sample training subset per benchmark with temperature=0 for determinism), and clarification that the agreement metric is end-to-end query fidelity measured by precision and recall of the final result set against full-LLM execution. We have added an exclusion-criteria paragraph describing removal of queries with >20% token-length variance and a new stress-test subsection covering 12 query phrasings plus a cross-dataset shift experiment (Amazon reviews to Yelp reviews). These changes allow direct assessment of approximation error relative to the claimed cost savings. revision: partial

  2. Referee: The central performance claim (>100x reduction for semantic filters) rests on proxy models reliably approximating LLM semantic judgments; the manuscript supplies no robustness analysis for unseen query types or domains outside the reported benchmarks, leaving the weakest assumption untested.

    Authors: The original evaluation already covers three distinct domains (product reviews, Q&A, and news) with the 10M-row Amazon benchmark as the largest scale test. In revision we have added a dedicated robustness subsection that evaluates the proxy models on 8 held-out query templates per domain and reports accuracy under a controlled distribution shift (training on 2022 reviews, testing on 2023 reviews). We acknowledge that exhaustive coverage of arbitrary unseen domains lies outside the current scope and have therefore added an explicit limitations paragraph plus future-work directions on continual adaptation. revision: partial
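The temporal-shift protocol the authors describe (train on 2022 reviews, test on 2023) reduces to fitting the proxy on one slice and scoring it on another. A self-contained sketch, with synthetic embeddings standing in for both years and a small mean shift standing in for drift (nothing here is from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
DIM = 32
concept = np.ones(DIM)       # stand-in for the LLM's semantic judgment

def year_slice(n_rows, drift=0.0):
    """Synthetic embeddings for one 'year'; drift shifts the distribution."""
    x = rng.normal(loc=drift, size=(n_rows, DIM))
    y = (x @ concept > 0).astype(int)
    return x, y

x_train, y_train = year_slice(5_000)              # e.g. 2022 reviews
x_shift, y_shift = year_slice(5_000, drift=0.1)   # e.g. 2023 reviews

proxy = LogisticRegression(max_iter=1_000).fit(x_train, y_train)
in_dist_acc = proxy.score(x_train, y_train)
shifted_acc = proxy.score(x_shift, y_shift)
```

Reporting both numbers side by side is what lets a reader judge whether accuracy survives the shift, which is the substance of the added robustness subsection.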

Circularity Check

0 steps flagged

Empirical measurement study with no derivation circularity

full rationale

The paper is an empirical evaluation of proxy models for approximating LLM-based AI queries in databases. It reports measured >100x cost/latency reductions and accuracy preservation on external benchmarks (e.g., 10M-row Amazon reviews) by direct comparison of proxy outputs to LLM judgments. No equations, fitted parameters, or self-citations reduce the central performance claims to inputs defined by the same data or prior author work; the results are externally falsifiable measurements rather than self-referential derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical performance study; no new mathematical axioms, free parameters, or invented entities are introduced or required by the central claim. The work relies on standard assumptions that embeddings capture semantic similarity and that small models can be trained to mimic LLM behavior on those embeddings.

pith-pipeline@v0.9.0 · 5587 in / 1123 out tokens · 37876 ms · 2026-05-15T09:24:21.287472+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 4 internal anchors

  1. [1]

    Lukas Bruderer and Mihai Ciorobea. 2025. Boost your Search and RAG agents with Vertex AI’s new state-of-the-art Ranking API. Google Cloud Blog (30 May 2025). https://cloud.google.com/blog/products/ai-machine-learning/launching-our-new-state-of-the-art-vertex-ai-ranking-api

  2. [2]

    Iñigo Casanueva, Hector Perez-Iglesias, Abhinav Rao, Xiaoxue Liu, Yufan Wang, and Hao Sun. 2020. Efficient Intent Detection with Dual Sentence Encoder and Label-Aware Attention. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI. 79–86. https://aclanthology.org/2020.nlp4convai-1.12/

  3. [3–4]

    Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.

  5. [5]

    Ethan Chern, Steffi Freihat, Yangni Shieh, Stephen Wan, Junjie Zhao, Wayne Xin Zhao, et al. 2023. FacTool: Factuality Detection in Generative AI – A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. arXiv preprint arXiv:2307.13528 (2023). https://arxiv.org/abs/2307.13528

  6. [6]

    Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, and Steven Eui- jong Whang. 2019. Automated data slicing for model validation: A big data-ai integration approach.IEEE Transactions on Knowledge and Data Engineering32, 12 (2019), 2284–2296

  7. [7]

    Yeounoh Chung, Tim Kraska, Steven Euijong Whang, and Neoklis Polyzotis. 2019. Slice Finder: Automated Data Slicing for Model Validation. InICDE. 1514–1525. doi:10.1109/ICDE.2019.00138

  8. [8–9]

    Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). 4211–4222. doi:10.18653/v1/2020.acl-main.384

  10. [10]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 Deep Learning Track. In Proceedings of the Thirty-First Text REtrieval Conference (TREC 2022) (NIST Special Publication 500-338). https://trec.nist.gov/pubs/trec31/papers/Overview_deep.pdf

  11. [11–12]

    Hanjun Dai, Bethany Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Mangpo Phothilimthana, Charles Sutton, and Dale Schuurmans. 2024. UQE: A Query Engine for Unstructured Databases. Advances in Neural Information Processing Systems 37 (2024), 29807–29838.

  13. [13]

    Databricks. 2025. AI Functions on Databricks. https://docs.databricks.com/aws/en/large-language-models/ai-functions. Accessed: 2025-07-31.

  14. [14]

    Anas Dorbani, Sunny Yasser, Jimmy Lin, and Amine Mhedhbi. 2025. Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB. Proceedings of the VLDB Endowment 18, 12 (2025), 5415–5418. doi:10.14778/3750601.3750685

  15. [15]

    Xinyi Gao, Dongting Xie, Yihang Zhang, Zhengren Wang, Chong Chen, Con- ghui He, Hongzhi Yin, and Wentao Zhang. 2025. A Comprehensive Survey on Imbalanced Data Learning.arXiv preprint arXiv:2502.08960(2025)

  16. [16]

    Google. 2025. google/embedding-gemma-300m. https://huggingface.co/google/embeddinggemma-300m

  17. [17]

    Google Cloud. 2025. AlloyDB AI. https://cloud.google.com/alloydb/ai?e=48754805. Accessed: 2025-07-31.

  18. [18]

    Google Cloud. 2025. BigQuery ML overview. https://cloud.google.com/bigquery/docs/bqml-introduction. Accessed: July 31, 2025.

  19. [19]

    Google Cloud. 2025. Generative AI pricing. Google. https://cloud.google.com/vertex-ai/generative-ai/pricing

  20. [20]

    Google DeepMind. 2025. Gemini: Model Thinking Updates. Google DeepMind Blog Post. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/ Accessed: October 2025.

  21. [21]

    David Greene and Pádraig Cunningham. 2006. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. InProceedings of the 23rd International Conference on Machine Learning (ICML). ACM, Pittsburgh, PA, USA, 377–384

  22. [22]

    Srinivasan Iyer, Sewon Min, Yashar Mehdad, and Wen-tau Yih. 2021. RECONSIDER: improved re-ranking using span-focused cross-attention for open domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1280–1287.

  23. [23]

    Zhaocheng Ji, Hao Sun, Jun Du, Yan Xu, Ruiqi Wang, Shuo Yuan, Jing Li, Shanshan Liu, Yuan Li, Wenbo Zhu, and Yan Shen. 2022. MHQA: Multi-hop Question Answering on Mental Health. arXiv preprint arXiv:2210.02111 (2022). https://arxiv.org/abs/2210.02111

  24. [24]

    Saehan Jo and Immanuel Trummer. 2024. Thalamusdb: Approximate query processing on multi-modal data.Proceedings of the ACM on Management of Data 2, 3 (2024), 1–26

  25. [25]

    Kaggle. 2020. Tweet Sentiment Extraction - Kaggle Competition. Web link to Kaggle competition. https://www.kaggle.com/c/tweet-sentiment-extraction/ Accessed: October 2025

  26. [26]

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. 2023. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714(2023)

  27. [27]

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. 2022. Matryoshka representation learning.Advances in Neural Information Processing Systems35 (2022), 30233–30249

  28. [28]

    Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernandez Abrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, Xiaoqi Ren, Shanfeng Zhang, Daniel Salz, Michael Boratko, Jay Han, Blair Chen, Shuo Huang, Vikram Rao, Paul Suganthan, Feng Han, Andreas Doumanoglou, Nithi Gupta, Fedor Moiseev, Cathy Yip, Aashi Jain...

  29. [29]

    Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, and Iftekhar Naim. 2024. Gecko: Versatile Text Embeddings Distil...

  30. [30]

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 142–150. https://aclanthology.org/P11-1015/

  31. [31]

    Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW’18 Open Challenge: Financial Opinion Mining and Question Answering. In WWW ’18 Companion: The 2018 Web Conference Companion. ACM, Lyon, France, 1941–1942. doi:10.1145/3184558.3192301

  32. [32]

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. arXiv:2501.19393 [cs.CL]

  33. [33]

    Baharan Nouriinanloo and Maxime Lamothe. 2024. Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models.arXiv preprint arXiv:2406.18740(2024)

  34. [34]

    OpenAI. 2022. Classification using embeddings. https://cookbook.openai.com/examples/classification_using_embeddings. Accessed: 2025-07-29.

  35. [35]

    R. Kelley Pace and Ronald Barry. 1997. Sparse Spatial Autoregressions. Statistics and Probability Letters 33, 3 (1997), 291–297. doi:10.1016/S0167-7152(96)00076-4

  36. [36]

    Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. Lotus: Enabling semantic queries with llms over tables of unstructured and structured data.arXiv e-prints(2024), arXiv–2407

  37. [37]

    Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2025. Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing. In Proceedings of the 11th International Conference on Very Large Databases (CIDR 2025). Amsterdam, The Netherlands, 12. https://vldb.org/cidrdb/papers/2025/p12-liu.pdf

  38. [38]

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python.Journal of Machine Learning Re...

  39. [39]

    Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen M. Voorhees, Lucy Lu Wang, and William R. Hersh. 2021. Searching for scientific evidence in a pandemic: An overview of TREC-COVID. Journal of Biomedical Informatics 121 (2021), 103865. doi:10.1016/j.jbi.2021.103865

  40. [40]

    Matthew Russo, Sivaprasad Sudhir, Gerardo Vitagliano, Chunwei Liu, Tim Kraska, Samuel Madden, and Michael Cafarella. 2025. Abacus: A Cost-Based Optimizer for Semantic Operator Systems. arXiv preprint arXiv:2505.14661 (2025).

  41. [41–42]

    Elvis Saravia, Hsien-Che Liu, Yen-Hao Huang, Ssu-Han Wu, and Yi-Shin Chen. 2018. CARER: Contextualized Affect Representations for Emotion Recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3689–3698. https://aclanthology.org/D18-1404/

  43. [43]

    Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G Parameswaran, and Eugene Wu. 2024. Docetl: Agentic query rewriting and evaluation for complex document processing.arXiv preprint arXiv:2410.12189(2024)

  44. [44]

    Sahel Sharifymoghaddam, Ronak Pradeep, Andre Slavescu, Ryan Nguyen, An- drew Xu, Zijian Chen, Yilin Zhang, Yidi Chen, Jasper Xian, and Jimmy Lin. 2025. RankLLM: A Python Package for Reranking with LLMs. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3681–3690

  45. [45]

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.arXiv preprint arXiv:2408.03314(2024)

  46. [46]

    Snowflake Inc. 2025. AI SQL. https://docs.snowflake.com/en/user-guide/snowflake-cortex/aisql. Accessed: 2025-07-31.

  47. [47]

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT good at search? Investigating large language models as re-ranking agents. arXiv preprint arXiv:2304.09542 (2023).

  48. [48]

    TensorFlow Tutorial. 2023. Word embeddings. https://www.tensorflow.org/text/guide/word_embeddings. Accessed: 2025-07-29.

  49. [49]

    text2vec.org. 2018. Vectorization. https://text2vec.org/vectorization.html. Accessed: 2025-07-29.

  50. [50]

    The Devastator. 2022. DBpedia Ontology: Text Classification Dataset with 14 Classes. https://www.kaggle.com/datasets/thedevastator/dbpedia-ontology-dataset. Kaggle Dataset.

  51. [51–52]

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In NAACL-HLT.

  53. [53]

    Matthias Urban and Carsten Binnig. 2024. Eleet: Efficient learned query execution over text and tables.Proceedings of the VLDB Endowment17, 13 (2024), 4867–4880

  54. [54]

    Enzo Veltri, Donatello Santoro, Jean-Flavien Bussotti, and Paolo Papotti. 2025. Logical and physical optimizations for SQL query execution over large language models.Proc. ACM Manag. Data3, 3 (2025), 181:1–181:28. doi:10.1145/3725411

  55. [55]

    David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or Fiction: Verifying Scientific Claims. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 7545–7557. doi:10.18653/v1/2020.emnlp-main.609

  56. [56]

    Ellery Wulczyn, Nithum Thain, and Samuel Dixon. 2017. Ex Machina: Personal Attacks Detoxified. InProceedings of the 26th International Conference on World Wide Web (WWW). 1371–1379. doi:10.1145/3038912.3052591

  57. [57]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence?. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics

  58. [58]

    Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems (NIPS), Vol. 28. 649–657. https://proceedings.neurips.cc/paper_files/paper/2015/file/8559aa24a0d8102d861d85d03831b0e5-Paper.pdf