A Theoretical Framework for Distribution-Aware Dataset Search

📅 2025-03-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses distribution-aware dataset retrieval: answering percentile and preference queries (e.g., thresholds on the $k$-th largest projection along a query vector) over a repository of datasets, in both centralized and federated settings. We propose the first theoretically backed indexing framework for this problem and show, via a lower bound, that near-linear-space data structures with polylogarithmic query time are not achievable in the centralized setting. To work around this, we design approximate indexes that use $\tilde{O}(N)$ space and answer queries in output-sensitive time $\tilde{O}(1+\mathrm{OUT})$. Each index reports every dataset that satisfies the predicate (no false negatives), and every reported dataset satisfies the predicate up to an additive error of $\varepsilon + 2\delta$, where $\delta$ is the error of the synopses. The framework covers three components: a percentile index (Ptile), a preference index (Pref), and synopsis-based querying for the federated setting.

📝 Abstract

Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristics. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings. Let $\mathcal{P}=\{P_1,\ldots,P_N\}$ be a repository of $N$ datasets, where $P_i\subset \mathbb{R}^d$, for $d=O(1)$. We study the percentile indexing (Ptile) problem and the preference indexing (Pref) problem under the centralized and the federated setting. In the centralized setting we assume direct access to the datasets. In the federated setting we assume access to a synopsis of each dataset. The goal of Ptile is to construct a data structure such that given a predicate (rectangle $R$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $|P_j\cap R|/|P_j|\in \theta$. The goal of Pref is to construct a data structure such that given a predicate (vector $v$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $\omega(P_j,v)\in \theta$, where $\omega(P_j,v)$ is the inner-product of the $k$-th largest projection of $P_j$ on $v$. We first show that we cannot hope for near-linear data structures with polylogarithmic query time in the centralized setting. Next we show $\tilde{O}(N)$ space data structures that answer Ptile and Pref queries in $\tilde{O}(1+OUT)$ time, where $OUT$ is the output size. Each data structure returns a set of indexes $J$ such that i) for every $P_i$ that satisfies the predicate, $i\in J$ and ii) if $j\in J$ then $P_j$ satisfies the predicate up to an additive error $\varepsilon+2\delta$, where $\varepsilon\in(0,1)$ and $\delta$ is the error of synopses.
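
To make the two predicates concrete, here is a minimal brute-force sketch in Python of what an exact Ptile or Pref query returns. It is not the paper's data structure: it scans every dataset, so it only serves as reference semantics. The function names (`ptile_query`, `pref_query`, `omega`), the closed-interval reading of $\theta$, and the 0-based indexes are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Interval = Tuple[float, float]   # assumed closed interval [lo, hi]
Rect = Sequence[Interval]        # one closed interval per dimension


def in_rect(p: Point, rect: Rect) -> bool:
    """True if point p lies inside the axis-parallel rectangle."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(p, rect))


def ptile_query(repo: List[List[Point]], rect: Rect, theta: Interval) -> List[int]:
    """Indexes j with |P_j ∩ R| / |P_j| in theta (linear scan, not an index)."""
    lo, hi = theta
    out = []
    for j, dataset in enumerate(repo):
        frac = sum(in_rect(p, rect) for p in dataset) / len(dataset)
        if lo <= frac <= hi:
            out.append(j)
    return out


def omega(dataset: List[Point], v: Point, k: int) -> float:
    """Inner product with v of the point whose projection on v is k-th largest."""
    scores = sorted((sum(a * b for a, b in zip(p, v)) for p in dataset), reverse=True)
    return scores[k - 1]


def pref_query(repo: List[List[Point]], v: Point, theta: Interval, k: int) -> List[int]:
    """Indexes j with omega(P_j, v) in theta; datasets with fewer than k points are skipped."""
    lo, hi = theta
    return [j for j, dataset in enumerate(repo)
            if len(dataset) >= k and lo <= omega(dataset, v, k) <= hi]


# Hypothetical toy repository with two 2-D datasets.
repo = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)],
    [(0.0, 0.0), (10.0, 10.0), (11.0, 11.0)],
]
ptile_hits = ptile_query(repo, rect=[(0.0, 1.5), (0.0, 1.5)], theta=(0.4, 0.6))
pref_hits = pref_query(repo, v=(1.0, 0.0), theta=(5.0, 15.0), k=2)
```

In the toy repository, half of the first dataset's points fall in the rectangle and one third of the second's, so only the first matches the Ptile predicate. For the Pref predicate with $k=2$, the second-largest projections are 2 and 10, so only the second dataset matches.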

Problem

Research questions and friction points this paper is trying to address.

Develops a theoretical framework for dataset search with distributional characteristics.

Addresses percentile and preference indexing in centralized and federated settings.

Proposes efficient data structures for querying datasets with bounded error guarantees.
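
The bounded error guarantee can be read as a check on any returned answer set. The sketch below assumes that "up to an additive error $\varepsilon+2\delta$" means the exact statistic lies in the query interval widened by $\varepsilon+2\delta$ on each side. That reading, the function name `satisfies_guarantee`, and the toy values are assumptions for illustration, not the paper's definitions.

```python
from typing import Iterable, Sequence, Tuple


def satisfies_guarantee(values: Sequence[float],
                        theta: Tuple[float, float],
                        answer: Iterable[int],
                        eps: float,
                        delta: float) -> bool:
    """Check an answer set against the two-sided guarantee.

    values[j] is the exact statistic of dataset j (for Ptile, the fraction
    of its points inside the query rectangle).
    """
    lo, hi = theta
    slack = eps + 2 * delta                      # assumed additive widening
    reported = set(answer)
    exact = {j for j, x in enumerate(values) if lo <= x <= hi}
    no_false_negatives = exact <= reported       # every true match is reported
    bounded_false_positives = all(
        lo - slack <= values[j] <= hi + slack for j in reported
    )
    return no_false_negatives and bounded_false_positives


# Hypothetical exact statistics for three datasets.
values = [0.50, 0.38, 0.10]
theta = (0.4, 0.6)

ok_exact = satisfies_guarantee(values, theta, [0], eps=0.02, delta=0.01)
ok_near_miss = satisfies_guarantee(values, theta, [0, 1], eps=0.02, delta=0.01)
bad_missing = satisfies_guarantee(values, theta, [1], eps=0.02, delta=0.01)
bad_far = satisfies_guarantee(values, theta, [0, 2], eps=0.02, delta=0.01)
```

With $\varepsilon=0.02$ and $\delta=0.01$ the slack is $0.04$, so reporting dataset 1 (statistic 0.38) alongside dataset 0 is acceptable, while omitting dataset 0 or reporting dataset 2 (statistic 0.10) violates the guarantee.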

Innovation

Methods, ideas, or system contributions that make the work stand out.

Theoretical framework for distribution-aware dataset search

Unified data discovery in centralized and decentralized settings

Efficient percentile and preference indexing structures

🔎 Similar Papers

Effective and General Distance Computation for Approximate Nearest Neighbor Search