Selectivity Estimation for Semantic Filters on Image Data

📅 2026-06-03

🤖 AI Summary
Existing semantic data systems estimate the selectivity of semantic filters in multimodal queries through online sampling, which adds high latency and performs poorly for low-selectivity filters. This work proposes Semantic Histograms, a selectivity estimator that models semantic filtering as implicit range queries in a shared embedding space derived from large language models (LLMs) and vision-language models (VLMs). The method introduces two complementary strategies for estimating how specific a query is, and an ensemble of the two performs best. By eliminating the need for online sampling, Semantic Histograms reduces the end-to-end runtime overhead of query optimization and execution by up to 86%.
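To make the "implicit range query" view concrete, here is a minimal sketch, not the paper's estimator. It assumes that images and filter text are already embedded in the same space, and that a filter matches every image whose cosine similarity to the filter embedding is at least some threshold. The function names, the toy 2-D vectors, and the thresholds are all hypothetical; choosing the threshold (that is, the filter's specificity) is the part the paper's two strategies address, and how they do so is not shown here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def estimate_selectivity(query_vec, image_vecs, threshold):
    """Fraction of images whose similarity to the query is >= threshold.

    The threshold stands in for the width of the implicit range (an assumption):
    a general filter corresponds to a low threshold, a specific one to a high threshold.
    """
    if not image_vecs:
        return 0.0
    hits = sum(1 for v in image_vecs if cosine(query_vec, v) >= threshold)
    return hits / len(image_vecs)

# Toy, hand-made 2-D "embeddings"; real embeddings would come from a VLM.
images = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
query = (1.0, 0.0)

broad = estimate_selectivity(query, images, threshold=-0.5)   # general filter, wide range
narrow = estimate_selectivity(query, images, threshold=0.95)  # specific filter, narrow range
print(broad, narrow)
```

The same query embedding yields a larger selectivity under the wide range than under the narrow one, which is why a selectivity estimate depends on how specific the filter is and not only on where its embedding lies.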
📝 Abstract
Semantic data systems integrate Large Language Models (LLMs) and Vision-Language Models (VLMs) directly into database query execution, enabling expressive queries on multi-modal data. However, optimizing these queries requires accurate selectivity estimates to determine the most efficient operator execution order. Contemporary systems rely on online sample-based profiling, a process that incurs severe latency overheads and struggles with low-selectivity queries. In this paper, we introduce Semantic Histograms, a novel selectivity estimator for semantic filters on image data that leverages shared embedding spaces to bypass traditional profiling. We observe that all semantic filters are implicit range queries, as they match a range of different images. Some filter predicates are more general, yielding a wide range, while others are more specific, yielding a smaller range. To address the challenge of implicit ranges, we propose two approaches to estimate the queries' specificity, with an ensemble of the two performing best. The evaluation shows that Semantic Histograms can reduce the end-to-end runtime overhead of query optimization and execution by up to 86%.
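The link between selectivity estimates and operator execution order can be illustrated with a small sketch. It uses the standard rank-ordering rule for independent filters (ascending cost / (1 - selectivity)), which is a textbook heuristic and not confirmed to be the optimizer used in the paper. The filter names, selectivities, and per-row costs below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Filter:
    name: str
    selectivity: float  # estimated fraction of rows that pass the filter
    cost: float         # hypothetical per-row cost, e.g. one model call

def expected_cost(filters):
    """Expected per-row cost of applying the filters in the given order."""
    total, surviving = 0.0, 1.0
    for f in filters:
        total += surviving * f.cost   # only surviving rows reach this filter
        surviving *= f.selectivity
    return total

def order_filters(filters):
    """Sort by cost / (1 - selectivity); filters that drop nothing go last."""
    def rank(f):
        drop = 1.0 - f.selectivity
        return f.cost / drop if drop > 0 else float("inf")
    return sorted(filters, key=rank)

filters = [
    Filter("contains an animal", selectivity=0.60, cost=1.0),
    Filter("shows a red vintage car", selectivity=0.02, cost=1.0),
]
plan = order_filters(filters)
print([f.name for f in plan], expected_cost(filters), expected_cost(plan))
```

With equal per-row costs, running the more specific filter first means the second filter sees far fewer rows. This is also why poor estimates for low-selectivity filters are costly: those are the filters whose placement changes the plan cost the most.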
Problem

Research questions and friction points this paper is trying to address.

Selectivity Estimation
Semantic Filters
Image Data
Query Optimization
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Histograms
Selectivity Estimation
Vision-Language Models
Query Optimization
Embedding Space