Sorted Consecutive Occurrence Queries in Substrings

📅 2024-11-18

🏛️ Annual Symposium on Combinatorial Pattern Matching

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies **range consecutive occurrence queries** in string indexing: given a text $T$ of length $n$, a pattern $P$, and a substring interval $[a,b]$, report pairs of *consecutive* occurrences of $P$ within $T[a..b]$. It considers two query types in this range model: (1) the **range top-$k$ consecutive occurrence query**, which returns the $k$ closest pairs of consecutive occurrences in non-descending order of distance; and (2) the **range gap-bounded consecutive occurrence query**, which reports only those consecutive pairs whose gap lies within $[g_1,g_2]$. The work generalizes consecutive occurrence queries from the entire text to arbitrary substrings. The proposed data structures answer top-$k$ queries in $O(|P| + \log\log n + k)$ time using $O(n \log^2 n)$ space, and gap-bounded queries in $O(|P| + \log\log n + \mathit{output})$ time using $O(n \log^{2+\epsilon} n)$ space, where $\epsilon$ is a positive constant and $\mathit{output}$ is the number of reported pairs. As by-products, the paper presents algorithms for geometric problems on weighted horizontal segments in the 2D plane.

📝 Abstract

The string indexing problem is a fundamental computational problem with numerous applications, including information retrieval and bioinformatics. It aims to efficiently solve the pattern matching problem: given a text $T$ of length $n$ for preprocessing and a pattern $P$ of length $m$ as a query, the goal is to report all occurrences of $P$ as substrings of $T$. Navarro and Thankachan [CPM 2015, Theor. Comput. Sci. 2016] introduced a variant of this problem called the gap-bounded consecutive occurrence query, which reports pairs of consecutive occurrences of $P$ in $T$ such that their gaps (i.e., the distances between them) lie within a query-specified range $[g_1, g_2]$. Recently, Bille et al. [FSTTCS 2020, Theor. Comput. Sci. 2022] proposed the top-$k$ close consecutive occurrence query, which reports the $k$ closest consecutive occurrences of $P$ in $T$, sorted in non-descending order of distance. Both problems are optimally solved in query time with $O(n \log n)$-space data structures. In this paper, we generalize these problems to the range query model, which focuses only on occurrences of $P$ in a specified substring $T[a..b]$ of $T$. Our contributions are as follows: (1) We propose an $O(n \log^2 n)$-space data structure that answers the range top-$k$ consecutive occurrence query in $O(|P| + \log\log n + k)$ time. (2) We propose an $O(n \log^{2+\epsilon} n)$-space data structure that answers the range gap-bounded consecutive occurrence query in $O(|P| + \log\log n + \mathit{output})$ time, where $\epsilon$ is a positive constant and $\mathit{output}$ denotes the number of outputs. Additionally, as by-products, we present algorithms for geometric problems involving weighted horizontal segments in a 2D plane, which are of independent interest.
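To make the two range queries concrete, here is a brute-force reference sketch. It is not the paper's data structure and has none of its time or space guarantees; it only spells out what each query is expected to return. The function names, the 0-indexed inclusive interval `[a, b]`, the requirement that an occurrence lie entirely inside `T[a..b]`, and measuring distance as the difference of starting positions are all assumptions made for illustration.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def occurrences_in_range(text: str, pattern: str, a: int, b: int) -> List[int]:
    """Start positions (0-indexed, ascending) of pattern lying fully inside text[a..b]."""
    m = len(pattern)
    if m == 0:
        return []
    lo = max(a, 0)
    hi = min(b, len(text) - 1) - m + 1  # last start position that still fits
    return [i for i in range(lo, hi + 1) if text.startswith(pattern, i)]


def consecutive_pairs(text: str, pattern: str, a: int, b: int) -> List[Pair]:
    """Pairs (i, j) of occurrences in the range with no other occurrence between them."""
    pos = occurrences_in_range(text, pattern, a, b)
    return list(zip(pos, pos[1:]))


def range_top_k(text: str, pattern: str, a: int, b: int, k: int) -> List[Pair]:
    """The k closest consecutive pairs, in non-descending order of distance j - i."""
    pairs = consecutive_pairs(text, pattern, a, b)
    pairs.sort(key=lambda p: p[1] - p[0])  # stable: ties keep left-to-right order
    return pairs[:k]


def range_gap_bounded(text: str, pattern: str, a: int, b: int,
                      g1: int, g2: int) -> List[Pair]:
    """Consecutive pairs whose distance j - i lies within [g1, g2]."""
    return [(i, j) for i, j in consecutive_pairs(text, pattern, a, b)
            if g1 <= j - i <= g2]


if __name__ == "__main__":
    T = "abaababaabaab"
    print(range_top_k(T, "ab", 0, 12, 2))
    print(range_gap_bounded(T, "ab", 3, 9, 3, 3))
```

Each call scans the whole interval, so a query costs time proportional to the interval length times $|P|$; the indexed structures described in the abstract exist precisely to avoid that scan.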

Problem

Research questions and friction points this paper is trying to address.

Generalizing consecutive occurrence queries to range query model

Proposing data structures for range top-k consecutive occurrences

Addressing range gap-bounded consecutive occurrence queries efficiently

Innovation

Methods, ideas, or system contributions that make the work stand out.

Range top-k consecutive occurrence query data structure

Range gap-bounded consecutive occurrence query solution

Geometric algorithms for weighted horizontal segments
