🤖 AI Summary
Problem: Existing spatial data search systems treat dataset-level (coarse-grained) and data-point-level (fine-grained) retrieval separately, which leaves their indexing and optimization mechanisms misaligned. Method: This paper proposes Spadas, a multi-granularity search system built on a novel unified spatial index that supports cross-granularity joint queries while reducing high-dimensional indexing overhead and sensitivity to outliers. It introduces, for the first time, approximate bound computation with guaranteed error bounds and batch-wise pruning to enable cross-granularity joint optimization. The system supports multiple distance metrics and efficient approximate querying. Contribution/Results: Evaluated on six real-world spatial data repositories, Spadas achieves a speedup of 1–3 orders of magnitude over state-of-the-art methods. It has been deployed as a publicly accessible online service, and its practicality and scalability are validated through representative application scenarios.
📝 Abstract
There has been increased interest in data search as a means of finding relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they treat the two types of search independently. To enable search operations ranging from the coarse-grained dataset level to the fine-grained data point level, we provide an integrated approach that supports diverse query types and distance metrics. In this paper, we focus on designing a multi-granularity spatial data search system, called Spadas, that supports both dataset and data point search operations. To address the high cost of indexing and the susceptibility to outliers, we propose a unified index that drastically improves query efficiency in various scenarios by organizing data appropriately and removing outliers from datasets. Moreover, to accelerate all data search operations, we propose a set of pruning mechanisms based on the unified index, including fast bound estimation, approximation with an error bound, and batch pruning, to effectively filter out non-relevant datasets and points. Finally, we report the results of a detailed experimental evaluation on six spatial data repositories, in which Spadas is orders of magnitude faster than state-of-the-art algorithms, and we demonstrate its effectiveness through a case study. An online spatial data search system based on Spadas has also been implemented and made accessible to users.
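
To make the idea of bound-based pruning across granularities concrete, the following is a minimal Python sketch. It is not the Spadas index or algorithm. It assumes 2D points, Euclidean distance, and one axis-aligned bounding box per dataset as the coarse-level summary. The function names (`mbr`, `min_dist_to_box`, `nearest_point`) are hypothetical and chosen only for illustration.

```python
import math


def mbr(points):
    """Axis-aligned bounding box (x_lo, y_lo, x_hi, y_hi) of a dataset."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def min_dist_to_box(q, box):
    """Lower bound on the distance from q to any point inside the box."""
    x_lo, y_lo, x_hi, y_hi = box
    dx = max(x_lo - q[0], 0.0, q[0] - x_hi)
    dy = max(y_lo - q[1], 0.0, q[1] - y_hi)
    return math.hypot(dx, dy)


def nearest_point(q, datasets):
    """Find the data point closest to q across all datasets.

    Returns (distance, dataset_name, point, datasets_scanned).
    Assumption: `datasets` maps a name to a non-empty list of (x, y) tuples.
    """
    # Coarse level: one cheap lower bound per dataset, visited in ascending order.
    # A real system would precompute the boxes at indexing time.
    bounds = sorted(
        (min_dist_to_box(q, mbr(pts)), name) for name, pts in datasets.items()
    )
    best_dist, best_name, best_point = math.inf, None, None
    scanned = 0
    for lower, name in bounds:
        if lower >= best_dist:
            # No remaining dataset can contain a closer point: prune them all.
            break
        scanned += 1
        # Fine level: exact distances only inside datasets that survive pruning.
        for p in datasets[name]:
            d = math.dist(q, p)
            if d < best_dist:
                best_dist, best_name, best_point = d, name, p
    return best_dist, best_name, best_point, scanned


if __name__ == "__main__":
    repo = {
        "a": [(0.0, 0.0), (1.0, 1.0)],
        "b": [(10.0, 10.0), (11.0, 12.0)],
        "c": [(50.0, 50.0), (60.0, 60.0)],
    }
    print(nearest_point((0.9, 0.9), repo))
```

Because the datasets are visited in order of their lower bounds, the scan can stop as soon as one bound is no smaller than the best exact distance found so far. Every dataset after that point is discarded without touching its points.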