Selectivity Estimation for Semantic Filters on Image Data

Carsten Binnig; Gabriele Sanmartino; Matthias Urban; Paolo Papotti; Vu Huy Nguyen

arxiv: 2606.04610 · v1 · pith:P52I7TCYnew · submitted 2026-06-03 · 💻 cs.DB


classification 💻 cs.DB

keywords: semantic, queries, data, range, execution, filters, selectivity, histograms


Semantic data systems integrate Large Language Models (LLMs) and Vision-Language Models (VLMs) directly into database query execution, enabling expressive queries on multi-modal data. However, optimizing these queries requires accurate selectivity estimates to determine the most efficient operator execution order. Contemporary systems rely on online sample-based profiling, a process that incurs severe latency overheads and struggles with low-selectivity queries. In this paper, we introduce Semantic Histograms, a novel selectivity estimator for semantic filters on image data that leverages shared embedding spaces to bypass traditional profiling. We observe that all semantic filters are implicit range queries, as each matches a range of different images. Some filter predicates are more general, yielding a wide range, while others are more specific, yielding a narrower range. To address the challenge of implicit ranges, we propose two approaches to estimate a query's specificity, with an ensemble of the two performing best. The evaluation shows that Semantic Histograms can reduce the end-to-end runtime overhead of query optimization and execution by up to 86%.
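To make the "implicit range query" view concrete, here is a minimal sketch and not the paper's estimator. It assumes that image and text embeddings live in one shared space, that a filter matches every image whose cosine similarity to the query embedding is at least some threshold, and that this threshold is already known. The abstract does not describe how the two proposed approaches estimate specificity, so `threshold` is simply passed in. The function names and the random stand-in embeddings are hypothetical.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def estimate_selectivity(image_embs, query_emb, threshold):
    """Fraction of images whose cosine similarity to the query is >= threshold.

    Assumption: `threshold` stands in for the query's specificity.
    A general predicate corresponds to a low threshold (wide range),
    a specific predicate to a high threshold (narrow range).
    """
    sims = normalize(image_embs) @ normalize(query_emb)
    return float(np.mean(sims >= threshold))


# Random stand-in embeddings; a real system would obtain these from a
# vision-language model with a shared text/image embedding space.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 64))
query_emb = rng.normal(size=64)

wide = estimate_selectivity(image_embs, query_emb, threshold=0.0)
narrow = estimate_selectivity(image_embs, query_emb, threshold=0.2)
print(f"general predicate: {wide:.3f}, specific predicate: {narrow:.3f}")
```

Raising the threshold can only shrink the set of matching images, so the more specific predicate never receives a higher estimate than the more general one. The hard part, which this sketch leaves out, is choosing the threshold for a given predicate without running the filter.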


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SemCEB: A Cardinality Estimation Benchmark for Semantic Operators
cs.DB 2026-06 unverdicted novelty 7.0

SemCEB is the first benchmark for cardinality estimation over semantic operators, evaluating sampling methods and Semantic Histograms on accuracy, cost, latency, and memory using 102 queries on a real-world dataset.