🤖 AI Summary
Extracting high-frequency n-grams from massive datasets, especially for large n, is difficult to do accurately, efficiently, and deterministically.
Method: This paper proposes Intergrams, a hardware-aware multi-pass algorithm that exploits the power-law distribution of n-gram frequencies. It generates candidate n-grams from frequent (n−1)-grams, applies frequency-based pruning, and incorporates low-level optimizations to progressively shrink the search space across iterative passes. Theoretical analysis guides algorithm design to ensure exactness and strong scalability.
Results: On real-world large-scale datasets, Intergrams achieves 10.3×–33× speedup over the state-of-the-art method. It is the first deterministic approach to break the performance bottleneck for extracting high-frequency n-grams with large n, while guaranteeing correctness and scalability.
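The core idea of the candidate-generation step can be illustrated with a simplified sketch: pass n only counts n-grams whose (n−1)-gram prefix and suffix were both frequent in the previous pass, which progressively shrinks the search space. This is a minimal illustration of the Apriori-style pruning principle, not the paper's implementation; the function name, parameters, and all hardware-aware optimizations described in the paper are omitted or assumed here.

```python
from collections import Counter

def frequent_ngrams(tokens, max_n, min_count):
    """Multi-pass extraction of frequent n-grams (simplified sketch).

    Pass n counts only candidate n-grams whose (n-1)-gram prefix and
    suffix were both frequent in the previous pass, so each pass
    shrinks the search space. The paper's hardware-aware, low-level
    optimizations are not reflected here.
    """
    # Pass 1: count unigrams and prune by frequency.
    frequent = {1: {(t,): c
                    for t, c in Counter(tokens).items()
                    if c >= min_count}}

    for n in range(2, max_n + 1):
        prev = frequent[n - 1]
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Candidate check: both (n-1)-gram halves must be frequent,
            # otherwise this n-gram cannot reach the frequency threshold.
            if gram[:-1] in prev and gram[1:] in prev:
                counts[gram] += 1
        frequent[n] = {g: c for g, c in counts.items() if c >= min_count}
        if not frequent[n]:
            break  # No frequent n-grams; longer ones cannot exist.
    return frequent
```

For example, on the token stream `"a b a b c a b".split()` with `min_count=2`, the unigram pass keeps `a` and `b`, the bigram pass keeps only `("a", "b")`, and the trigram pass terminates because no candidate has two frequent bigram halves.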
📝 Abstract
The number of n-gram features grows exponentially in n, making it computationally demanding to compute the most frequent n-grams even for n as small as 3. Motivated by our production machine learning system built on n-gram features, we ask: is it possible to accurately, deterministically, and quickly recover the top-k most frequent n-grams? We devise a multi-pass algorithm called Intergrams that constructs candidate n-grams from the preceding (n−1)-grams. By designing this algorithm with hardware in mind, our approach yields more than an order of magnitude speedup (up to 33×!) over the next known fastest algorithm, even when similar optimizations are applied to the other algorithm. Using the empirical power-law distribution over n-grams, we also provide theory to inform the efficacy of our multi-pass approach. Our code is available at https://github.com/rcurtin/Intergrams.