🤖 AI Summary
Extracting high-frequency n-grams from massive datasets, especially for large n, is difficult to do accurately, efficiently, and deterministically.
Method: This paper proposes Intergrams, a hardware-aware multi-pass algorithm that exploits the power-law distribution of n-gram frequencies. It generates candidate n-grams from frequent (n−1)-grams, applies frequency-based pruning, and incorporates low-level optimizations to progressively shrink the search space across iterative passes. Theoretical analysis guides algorithm design to ensure exactness and strong scalability.
Results: On real-world large-scale datasets, Intergrams achieves 10.3×–33× speedup over the state-of-the-art method. It is the first deterministic approach to break the performance bottleneck for extracting high-frequency n-grams with large n, while guaranteeing correctness and scalability.
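The core idea of the candidate-generation step can be illustrated with a simplified sketch: pass n only counts n-grams whose (n−1)-gram prefix and suffix were both frequent in the previous pass, which progressively shrinks the search space. This is a minimal illustration of the Apriori-style pruning principle, not the paper's implementation; the function name, parameters, and all hardware-aware optimizations described in the paper are omitted or assumed here.

```python
from collections import Counter

def frequent_ngrams(tokens, max_n, min_count):
    """Multi-pass extraction of frequent n-grams (simplified sketch).

    Pass n counts only candidate n-grams whose (n-1)-gram prefix and
    suffix were both frequent in the previous pass, so each pass
    shrinks the search space. The paper's hardware-aware, low-level
    optimizations are not reflected here.
    """
    # Pass 1: count unigrams and prune by frequency.
    frequent = {1: {(t,): c
                    for t, c in Counter(tokens).items()
                    if c >= min_count}}

    for n in range(2, max_n + 1):
        prev = frequent[n - 1]
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Candidate check: both (n-1)-gram halves must be frequent,
            # otherwise this n-gram cannot reach the frequency threshold.
            if gram[:-1] in prev and gram[1:] in prev:
                counts[gram] += 1
        frequent[n] = {g: c for g, c in counts.items() if c >= min_count}
        if not frequent[n]:
            break  # No frequent n-grams; longer ones cannot exist.
    return frequent
```

For example, on the token stream `"a b a b c a b".split()` with `min_count=2`, the unigram pass keeps `a` and `b`, the bigram pass keeps only `("a", "b")`, and the trigram pass terminates because no candidate has two frequent bigram halves.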
📝 Abstract
The number of n-gram features grows exponentially in n, making it computationally demanding to compute the most frequent n-grams even for n as small as 3. Motivated by our production machine learning system built on n-gram features, we ask: is it possible to accurately, deterministically, and quickly recover the top-k most frequent n-grams? We devise a multi-pass algorithm called Intergrams that constructs candidate n-grams from the preceding (n−1)-grams. By designing this algorithm with hardware in mind, our approach yields more than an order of magnitude speedup (up to 33×!) over the next known fastest algorithm, even when similar optimizations are applied to the other algorithm. Using the empirical power-law distribution over n-grams, we also provide theory to inform the efficacy of our multi-pass approach. Our code is available at https://github.com/rcurtin/Intergrams.