🤖 AI Summary
This work addresses the problem of constructing τ-synchronized sets for strings in optimal time to enable efficient data compression, text indexing, and similarity computation. We present the first algorithm that achieves the theoretical time lower bound in the word-RAM model: for any τ, it constructs a τ-synchronized set in O((n log τ)/(τ log n)) time, with preprocessing requiring only O(n log σ / log n) time. The core innovations include a variable-length encoding framework based on sparse integer sequences, an improved van Emde Boas tree structure, and a deterministic linear-time construction method. The resulting synchronized set can be output either as a sorted list in O(n/τ) time or as a bitmap in O(n/log n) time, supporting O(1)-time select queries and near-optimal rank queries while matching the unconditional cell-probe lower bound.
📝 Abstract
A key principle in string processing is local consistency: using short contexts to handle matching fragments of a string consistently. String synchronizing sets [Kempa, Kociumaka; STOC 2019] are an influential instantiation of this principle. A $\tau$-synchronizing set of a length-$n$ string is a set of $O(n/\tau)$ positions, chosen via their length-$2\tau$ contexts, such that (outside highly periodic regions) at least one position in every length-$\tau$ window is selected. Among their applications are faster algorithms for data compression, text indexing, and string similarity in the word RAM model. We show how to preprocess any string $T \in [0..\sigma)^n$ in $O(n\log\sigma/\log n)$ time so that, for any $\tau\in[1..n]$, a $\tau$-synchronizing set of $T$ can be constructed in $O((n\log\tau)/(\tau\log n))$ time. Both bounds are optimal in the word RAM model with word size $w=\Theta(\log n)$. Previously, the construction time was $O(n/\tau)$, either after an $O(n)$-time preprocessing [Kociumaka, Radoszewski, Rytter, Wale\'n; SICOMP 2024], or without preprocessing if $\tau<0.2\log_\sigma n$ [Kempa, Kociumaka; STOC 2019]. A simple version of our method outputs the set as a sorted list in $O(n/\tau)$ time, or as a bitmask in $O(n/\log n)$ time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in $O(1)$ time and rank queries in $O(\log(\tfrac{\log\tau}{\log\log n}))$ time, matching unconditional cell-probe lower bounds for $\tau\le n^{1-\Omega(1)}$. We achieve this via a new framework for processing sparse integer sequences in a custom variable-length encoding. For rank and select queries, we augment the optimal variant of van Emde Boas trees [P\u{a}tra\c{s}cu, Thorup; STOC 2006] with a deterministic linear-time construction. The above query-time guarantees hold after preprocessing time proportional to the encoding size (in words).