On Computing the Smallest Suffixient Set

📅 2024-07-26

🏛️ SPIRE

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This paper solves the problem of computing the smallest suffixient set of a text T, a question left open by the work that introduced suffixient sets. A suffixient set of cardinality q, combined with an oracle offering fast random access on T, supports a data structure of O(q) words that finds all maximal exact matches (MEMs) of a query pattern with high probability, so a smaller set means a smaller index. The authors describe three algorithms: a simple quadratic-time algorithm, an O(n + r̄|Σ|)-time algorithm that runs in compressed working space (r̄ is the number of runs in the Burrows–Wheeler transform of the reversed text and |Σ| is the alphabet size), and an optimal O(n)-time algorithm. An implementation of the compressed-space algorithm is shown experimentally to use a small memory footprint on repetitive text collections.

📝 Abstract

Let T ∈ Σ^n be a text over alphabet Σ. A suffixient set S ⊆ [n] for T is a set of positions such that, for every one-character right-extension T[i,j] of every right-maximal substring T[i,j-1] of T, there exists x ∈ S such that T[i,j] is a suffix of T[1,x]. It was recently shown that, given a suffixient set of cardinality q and an oracle offering fast random access on T (for example, a straight-line program), there is a data structure of O(q) words (on top of the oracle) that can quickly find all Maximal Exact Matches (MEMs) of any query pattern P in T with high probability. The paper introducing suffixient sets left open the problem of computing the smallest such set; in this paper, we solve this problem by describing a simple quadratic-time algorithm, an O(n + r̄|Σ|)-time algorithm running in compressed working space (r̄ is the number of runs in the Burrows-Wheeler transform of T reversed), and an optimal O(n)-time algorithm computing the smallest suffixient set. We present an implementation of our compressed-space algorithm and show experimentally that it uses a small memory footprint on repetitive text collections.
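
To make the definition in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not one of the paper's three algorithms: it simply enumerates substrings and tries subsets in order of size, so it is only usable on very short texts. The function names (`right_extensions`, `is_suffixient`, `smallest_suffixient_set`) are hypothetical, and the sketch assumes 1-based positions, no terminator symbol appended to T, and that the empty string counts as right-maximal when at least two distinct characters occur in T.

```python
from itertools import combinations


def right_extensions(text):
    """Collect every one-character right-extension of a right-maximal substring.

    Assumption: a substring is right-maximal when it is followed by at least
    two distinct characters somewhere in the text (the empty string included).
    """
    n = len(text)
    followers = {}  # substring -> set of characters that follow an occurrence
    for i in range(n):
        for j in range(i, n):
            # text[i:j] (possibly empty) is followed by text[j]
            followers.setdefault(text[i:j], set()).add(text[j])
    return {
        alpha + c
        for alpha, chars in followers.items()
        if len(chars) >= 2
        for c in chars
    }


def is_suffixient(text, positions):
    """Check the definition: every right-extension is a suffix of some T[1,x]."""
    prefixes = [text[:x] for x in positions]  # T[1,x] with 1-based x
    return all(
        any(p.endswith(ext) for p in prefixes)
        for ext in right_extensions(text)
    )


def smallest_suffixient_set(text):
    """Exhaustive search over subsets of [n], smallest cardinality first."""
    n = len(text)
    for k in range(n + 1):
        for candidate in combinations(range(1, n + 1), k):
            if is_suffixient(text, candidate):
                return set(candidate)
    # Not reached: the full set [n] is always suffixient, because every
    # right-extension occurs in the text and therefore ends at some position.


if __name__ == "__main__":
    print(smallest_suffixient_set("abab"))
    print(smallest_suffixient_set("aab"))
```

Under these assumptions, the text abab has only the empty string as a right-maximal substring, with right-extensions a and b, so the search returns positions {1, 2}. For aab the right-extensions are a, b, aa and ab, and the search returns {2, 3}: the prefix aa covers a and aa, and the prefix aab covers b and ab.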

Problem

Research questions and friction points this paper is trying to address.

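Computing the smallest suffixient set of a text, which the paper introducing suffixient sets left open.

Keeping the MEM-finding index small: it uses O(q) words for a suffixient set of cardinality q.

Keeping the working memory of the computation small on repetitive text collections.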

Innovation

Methods, ideas, or system contributions that make the work stand out.

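Three algorithms for the smallest suffixient set: simple quadratic-time, compressed-space, and optimal O(n)-time.

Compressed-space algorithm runs in O(n + r̄|Σ|) time, where r̄ is the number of runs in the BWT of the reversed text.

Implementation of the compressed-space algorithm with a small memory footprint on repetitive text collections.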
