🤖 AI Summary
This paper studies period detection in noisy strings in the streaming model, where repetitions may differ by mismatches, wildcards, or edit operations. For the Hamming distance, it presents a streaming algorithm that is more efficient than that of Ergün et al. and, as a corollary, handles strings with wildcards without requiring the final characters to be wildcard-free. For the edit distance, it gives the first two-pass streaming algorithm. The approach combines and extends three techniques: the Hamming distance sketch of Clifford et al., the structural description of $k$-mismatch occurrences by Charalampopoulos et al., and the grammar decomposition of Bhattacharya and Koucký.
📝 Abstract
We study the problem of detecting periodic trends in strings. Exact periodicity has been studied extensively, but real-world data is often noisy, with small deviations or mismatches between repetitions. This work focuses on a generalized approach to period detection that efficiently handles noise. Given a string $S$ of length $n$, the task is to identify integers $p$ such that the prefix and the suffix of $S$, each of length $n-p+1$, are similar under a given distance measure. Ergün et al. [APPROX-RANDOM 2017] were the first to study this problem in the streaming model under the Hamming distance. We combine, in a non-trivial way, the Hamming distance sketch of Clifford et al. [SODA 2019] and the structural description of the $k$-mismatch occurrences of a pattern in a text by Charalampopoulos et al. [FOCS 2020] to present a more efficient streaming algorithm for period detection under the Hamming distance. As a corollary, we derive a streaming algorithm for detecting periods of strings that may contain wildcards, a special symbol that matches any character of the alphabet. Our algorithm is not only more efficient than that of Ergün et al. [TCS 2020], but it also operates without their assumption that the string must be free of wildcards in its final characters. Additionally, we introduce the first two-pass streaming algorithm for computing periods under the edit distance by leveraging and extending the grammar decomposition technique of Bhattacharya and Koucký [STOC 2023].
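
To make the problem statement concrete, the following is a minimal offline sketch of the definition for the Hamming distance, with optional wildcards. It is a naive quadratic-time baseline that stores the whole string, not the streaming algorithm of the paper. The function names, the threshold parameter `k`, and the choice of `?` as the wildcard symbol are assumptions made for illustration only. Following the convention stated above, $p$ is reported when the prefix and suffix of length $n-p+1$ differ in at most `k` positions, so $p = 1$ is always reported.

```python
from typing import List, Optional


def mismatches(a: str, b: str, wildcard: Optional[str] = None) -> int:
    """Count positions where a and b differ; a wildcard matches any character."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(
        1
        for x, y in zip(a, b)
        if x != y and wildcard not in (x, y)
    )


def k_mismatch_periods(s: str, k: int, wildcard: Optional[str] = None) -> List[int]:
    """Return every p in 1..n whose length-(n-p+1) prefix and suffix
    are within Hamming distance k (naive offline check, O(n^2) time)."""
    n = len(s)
    periods = []
    for p in range(1, n + 1):
        length = n - p + 1
        prefix = s[:length]      # S[1 .. n-p+1] in 1-indexed notation
        suffix = s[p - 1:]       # S[p .. n] in 1-indexed notation
        if mismatches(prefix, suffix, wildcard) <= k:
            periods.append(p)
    return periods


if __name__ == "__main__":
    print(k_mismatch_periods("abcabcabd", 0))
    print(k_mismatch_periods("abcabcabd", 1))
    print(k_mismatch_periods("abcab?abc", 0, wildcard="?"))
```

For the string `abcabcabd`, only the trivial $p = 1$ qualifies with `k = 0`, while `k = 1` additionally admits $p = 4$, $p = 7$, and $p = 9$. A streaming algorithm cannot store both the prefix and the suffix, which is why sketching techniques are needed in place of this direct comparison.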