🤖 AI Summary
This paper addresses the space bottleneck of static function data structures. Conventional static functions already avoid storing the key set, but even compressed variants cannot use less space than the zero-order empirical entropy of the value sequence. Method: We propose Learned Static Functions (LSFs), which assume a fixed key set, support only point queries, and permit arbitrary outputs for out-of-set keys. LSFs use machine learning to predict, for each key, a probability distribution over the values, derive from it a key-specific adaptive prefix code for the true value, and embed the resulting codewords in classical static function structures. Contribution/Results: The core innovation is the co-design of probabilistic prediction with deterministic data structures, yielding a model-driven compact representation. Experiments show space reductions of up to one order of magnitude on real-world datasets and up to three orders of magnitude on synthetic data, breaking the zero-order entropy barrier.
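A minimal sketch of the key-specific coding step, as we understand it from the summary: given a model's predicted distribution over the values for one key, build a Huffman prefix code so that likelier values get shorter codewords. The function name and the example distribution below are illustrative, not taken from the paper.

```python
import heapq

def huffman_code(dist):
    """Build a binary prefix code from a {value: probability} map.

    Illustrative only: in an LSF this distribution would come from the
    learned model's prediction for one specific key, so likelier values
    get shorter codewords for that key.
    """
    # Heap entries: (probability, tie_breaker, tree); a tree is either a
    # value (leaf) or a (left, right) pair of subtrees.
    heap = [(p, i, v) for i, (v, p) in enumerate(sorted(dist.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-value case
        return {heap[0][2]: "0"}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (t1, t2)))
        counter += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: an actual value
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

# A model that is confident about this key's value pays roughly one bit,
# regardless of how diverse the value sequence is overall.
predicted = {"A": 0.9, "B": 0.05, "C": 0.05}
print(huffman_code(predicted))   # {'B': '00', 'C': '01', 'A': '1'}: the likely value gets 1 bit
```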
📝 Abstract
We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.
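The toy sketch below illustrates the overall idea end to end, under heavy simplifications: a plain Python dict stands in for the classic static function over (key, bit-index) pairs (a real static function answers such point queries without storing the keys, which is where the space savings come from), the prefix code is a simple unary code rather than an optimal one, and `predict`, the keys, and the values are placeholders. None of these names or choices come from the paper.

```python
def key_specific_code(dist):
    # Unary prefix code over values ranked by predicted probability: the
    # i-th likeliest value gets i ones followed by a zero. A real LSF would
    # use an optimal code (e.g. the Huffman sketch above); unary keeps this
    # short while preserving the "likelier value, shorter codeword" idea.
    ranked = sorted(dist, key=dist.get, reverse=True)
    return {v: "1" * i + "0" for i, v in enumerate(ranked)}

class ToyLearnedStaticFunction:
    """Dict-backed stand-in for a learned static function (illustration only)."""

    def __init__(self, items, predict):
        self.predict = predict
        self.bits = {}                               # stand-in for a static function
        for key, value in items.items():
            code = key_specific_code(predict(key))   # key-specific prefix code
            for i, bit in enumerate(code[value]):
                self.bits[(key, i)] = bit            # store the codeword bit by bit

    def query(self, key):
        # Rebuild the same key-specific code at query time, then read stored
        # bits until they spell out a complete codeword. A real static
        # function returns arbitrary bits for out-of-set keys, so out-of-set
        # queries simply decode to an arbitrary value, as the problem allows.
        decode = {cw: v for v, cw in key_specific_code(self.predict(key)).items()}
        word = ""
        for i in range(max(len(cw) for cw in decode)):
            word += self.bits.get((key, i), "0")     # default mimics "arbitrary" bits
            if word in decode:
                return decode[word]
        return None                                  # unreachable for in-set keys

# Usage with a deliberately trivial "model" that guesses a key's value from
# its parity; the data set here is made up for the example.
predict = lambda k: {"even": 0.9, "odd": 0.1} if k % 2 == 0 else {"odd": 0.9, "even": 0.1}
data = {0: "even", 1: "odd", 2: "even", 3: "odd", 7: "odd"}
lsf = ToyLearnedStaticFunction(data, predict)
assert all(lsf.query(k) == v for k, v in data.items())
```

Because the same model and code can be rebuilt at query time, only the codeword bits ever need to be stored; when the model is confident and correct, each key costs about one bit, which is how a learned static function can fall below the zero-order entropy of the value sequence.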