Faster and Simpler Online Computation of String Net Frequency

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the problem of efficiently computing the net frequency (NF) of repeated substrings in strings over large alphabets (e.g., Chinese text) while the string is processed online. It proposes an improved online algorithm for NF computation based on Weiner's suffix tree construction, in contrast to the earlier algorithm of Guo et al., which is based on Ukkonen's construction. By leveraging dynamic string indexing and a reformulated NF definition, the method answers Single-NF queries in *O*(*m* log *σ*) time and reports all answers to the All-NF problem in output-optimal *O*(|NF⁺(*S*)|) time, where |NF⁺(*S*)| = *O*(*n*). The data structure is maintained in *O*(*n* log *σ*) total time and *O*(*n*) space. A key contribution is the elimination of the *σ*² term present in the prior approach, which dominates when the alphabet is large.

📝 Abstract
An occurrence of a repeated substring $u$ in a string $S$ is called a net occurrence if extending the occurrence to the left or to the right decreases the number of occurrences to 1. The net frequency (NF) of a repeated substring $u$ in a string $S$ is the number of net occurrences of $u$ in $S$. Very recently, Guo et al. [SPIRE 2024] proposed an online $O(n \log \sigma)$-time algorithm that maintains a data structure of $O(n)$ space which answers Single-NF queries in $O(m \log \sigma + \sigma^2)$ time and reports all answers of the All-NF problem in $O(n\sigma^2)$ time. Here, $n$ is the length of the input string $S$, $m$ is the query pattern length, and $\sigma$ is the alphabet size. The $\sigma^2$ term is a major drawback of their method, since computing string net frequencies was originally motivated by Chinese language processing, where $\sigma$ can be in the thousands. This paper presents an improved online $O(n \log \sigma)$-time algorithm, which answers Single-NF queries in $O(m \log \sigma)$ time and reports all answers to the All-NF problem in output-optimal $O(|\mathsf{NF}^+(S)|)$ time, where $\mathsf{NF}^+(S)$ is the set of substrings of $S$ paired with their positive NF values. We note that $|\mathsf{NF}^+(S)| = O(n)$ always holds. In contrast to Guo et al.'s algorithm, which is based on Ukkonen's suffix tree construction, our algorithm is based on Weiner's suffix tree construction.
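
The definition of a net occurrence can be checked directly on small inputs. The following is a minimal brute-force sketch in Python, not the suffix-tree algorithm of the paper. The function names are hypothetical, and the convention that an occurrence touching either end of the string counts as uniquely extended on that side is an assumption made here for illustration.

```python
def count_occurrences(s: str, u: str) -> int:
    """Count occurrences of u in s, including overlapping ones."""
    return sum(1 for i in range(len(s) - len(u) + 1) if s.startswith(u, i))


def net_frequency(s: str, u: str) -> int:
    """Brute-force NF of u in s (illustrative only, far from O(m log sigma))."""
    # Only repeated substrings can have net occurrences.
    if not u or count_occurrences(s, u) < 2:
        return 0
    m = len(u)
    nf = 0
    for i in range(len(s) - m + 1):
        if not s.startswith(u, i):
            continue
        j = i + m  # exclusive end of this occurrence
        # Assumption: at a string boundary the extension is treated as unique.
        left_unique = i == 0 or count_occurrences(s, s[i - 1:j]) == 1
        right_unique = j == len(s) or count_occurrences(s, s[i:j + 1]) == 1
        if left_unique and right_unique:
            nf += 1
    return nf
```

For example, in "abcabd" the substring "ab" occurs twice, and each occurrence becomes unique when extended by one character on either side, so `net_frequency("abcabd", "ab")` is 2. The substring "a" has net frequency 0, because extending either occurrence to the right gives "ab", which still repeats.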
Problem

Research questions and friction points this paper is trying to address.

Improving online computation of string net frequency.
Reducing query time for Single-NF to O(m log σ).
Achieving output-optimal time for All-NF problem.
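
To make the All-NF output concrete, the set NF⁺(S) can be enumerated by brute force on a short string. This is a minimal sketch under stated assumptions: the function name `all_net_frequencies` is hypothetical, string boundaries are treated as unique extensions, and the quadratic number of substrings it inspects has nothing to do with the output-optimal bound claimed by the paper.

```python
from collections import Counter


def all_net_frequencies(s: str) -> dict[str, int]:
    """Return every substring of s with positive NF, mapped to its NF value."""
    n = len(s)
    # freq[u] = number of occurrences of u in s (one per start/end pair).
    freq = Counter(s[i:j] for i in range(n) for j in range(i + 1, n + 1))
    nf = Counter()
    for i in range(n):
        for j in range(i + 1, n + 1):
            if freq[s[i:j]] < 2:
                continue  # not a repeated substring
            # Assumption: at a string boundary the extension is treated as unique.
            left_unique = i == 0 or freq[s[i - 1:j]] == 1
            right_unique = j == n or freq[s[i:j + 1]] == 1
            if left_unique and right_unique:
                nf[s[i:j]] += 1
    return dict(nf)
```

On "abcabd" this returns a single pair, "ab" with NF 2, which is consistent with the bound |NF⁺(S)| = O(n) stated in the abstract.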
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online O(n log σ)-time algorithm
Single-NF queries in O(m log σ) time
All-NF answers in output-optimal O(|NF⁺(S)|) time
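
A rough order-of-magnitude comparison shows why removing the σ² term matters for large alphabets. The sketch below only evaluates the two terms inside the Single-NF bounds and ignores the constants hidden by the O-notation. The function name and the sample values of m and σ are hypothetical and are not taken from the paper.

```python
import math


def single_nf_cost_terms(m: int, sigma: int) -> dict[str, float]:
    """Evaluate the terms m*log2(sigma) and sigma^2 for given m and sigma."""
    return {
        "m_log_sigma": m * math.log2(sigma),  # term kept by the improved bound
        "sigma_squared": float(sigma ** 2),   # extra term in the earlier bound
    }


# Hypothetical setting: a 4-character query over an alphabet of 4096 symbols.
terms = single_nf_cost_terms(m=4, sigma=4096)
ratio = terms["sigma_squared"] / terms["m_log_sigma"]
```

With these sample values the logarithmic term is 48, while σ² is 16,777,216, so the σ² term dominates the earlier bound by several orders of magnitude.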