Sorted Consecutive Occurrence Queries in Substrings

📅 2024-11-18
🏛️ Annual Symposium on Combinatorial Pattern Matching
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper studies **range consecutive occurrence queries** in string indexing: given a text $T$, a pattern $P$, and a substring interval $[a,b]$, report *consecutive* occurrences of $P$ within $T[a..b]$. It addresses two query types: (1) the **range top-$k$ consecutive occurrence query**, which returns the $k$ closest consecutive occurrences of $P$ in $T[a..b]$, sorted in non-descending order of distance; and (2) the **range gap-bounded consecutive occurrence query**, which reports only those consecutive occurrences whose gaps lie within $[g_1,g_2]$. The work generalizes consecutive occurrence queries from the entire text to arbitrary substrings, removing the global-text assumption of earlier results. We design a hierarchical index combining suffix arrays, interval trees, and 2D geometric range reporting techniques. Our structure supports top-$k$ queries in $O(|P| + \log\log n + k)$ time using $O(n \log^2 n)$ space, and gap-bounded queries in $O(|P| + \log\log n + \mathit{output})$ time using $O(n \log^{2+\epsilon} n)$ space, where $\mathit{output}$ is the number of reported pairs and $\epsilon$ is a positive constant.

📝 Abstract
The string indexing problem is a fundamental computational problem with numerous applications, including information retrieval and bioinformatics. It aims to efficiently solve the pattern matching problem: given a text $T$ of length $n$ for preprocessing and a pattern $P$ of length $m$ as a query, the goal is to report all occurrences of $P$ as substrings of $T$. Navarro and Thankachan [CPM 2015, Theor. Comput. Sci. 2016] introduced a variant of this problem called the gap-bounded consecutive occurrence query, which reports pairs of consecutive occurrences of $P$ in $T$ such that their gaps (i.e., the distances between them) lie within a query-specified range $[g_1, g_2]$. Recently, Bille et al. [FSTTCS 2020, Theor. Comput. Sci. 2022] proposed the top-$k$ close consecutive occurrence query, which reports the $k$ closest consecutive occurrences of $P$ in $T$, sorted in non-descending order of distance. Both problems are optimally solved in query time with $O(n \log n)$-space data structures. In this paper, we generalize these problems to the range query model, which focuses only on occurrences of $P$ in a specified substring $T[a..b]$ of $T$. Our contributions are as follows: (1) We propose an $O(n \log^2 n)$-space data structure that answers the range top-$k$ consecutive occurrence query in $O(|P| + \log\log n + k)$ time. (2) We propose an $O(n \log^{2+\epsilon} n)$-space data structure that answers the range gap-bounded consecutive occurrence query in $O(|P| + \log\log n + \mathit{output})$ time, where $\epsilon$ is a positive constant and $\mathit{output}$ denotes the number of outputs. Additionally, as by-products, we present algorithms for geometric problems involving weighted horizontal segments in a 2D plane, which are of independent interest.
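To make the two range queries concrete, here is a minimal brute-force sketch in Python. It is only a reference for what the queries return, not the paper's data structure: it scans the text at query time and so does not achieve the stated bounds. The function names are hypothetical, positions are 0-based with an inclusive interval `[a, b]`, the distance of a pair is assumed to be the difference of the two starting positions, and an occurrence is assumed to count only if it lies entirely inside `T[a..b]`.

```python
from typing import List, Tuple

Pair = Tuple[int, int, int]  # (first start, second start, distance)


def occurrences(text: str, pattern: str) -> List[int]:
    """All 0-based starting positions of pattern in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def range_consecutive(text: str, pattern: str, a: int, b: int) -> List[Pair]:
    """Consecutive occurrence pairs lying entirely inside text[a..b] (inclusive)."""
    last_start = b - len(pattern) + 1  # latest start that still fits in [a, b]
    occ = [i for i in occurrences(text, pattern) if a <= i <= last_start]
    # Adjacent entries of the sorted occurrence list are the consecutive pairs.
    return [(i, j, j - i) for i, j in zip(occ, occ[1:])]


def range_top_k(text: str, pattern: str, a: int, b: int, k: int) -> List[Pair]:
    """The k closest consecutive pairs in text[a..b], in non-descending distance."""
    pairs = range_consecutive(text, pattern, a, b)
    return sorted(pairs, key=lambda p: p[2])[:k]


def range_gap_bounded(text: str, pattern: str, a: int, b: int,
                      g1: int, g2: int) -> List[Pair]:
    """Consecutive pairs in text[a..b] whose distance lies within [g1, g2]."""
    return [p for p in range_consecutive(text, pattern, a, b) if g1 <= p[2] <= g2]
```

For example, with the text `"abaababaab"` and the pattern `"ab"`, the occurrences start at 0, 3, 5, and 8, so `range_top_k(text, "ab", 0, 9, 2)` yields the pair at distance 2 first, followed by one pair at distance 3, while restricting the interval to `[1, 8]` leaves only the pair `(3, 5)`.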
Problem

Research questions and friction points this paper is trying to address.

Generalizing consecutive occurrence queries to range query model
Proposing data structures for range top-k consecutive occurrences
Addressing range gap-bounded consecutive occurrence queries efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Range top-k consecutive occurrence query data structure
Range gap-bounded consecutive occurrence query solution
Geometric algorithms for weighted horizontal segments
Waseem Akram
Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
Takuya Mieno
The University of Electro-Communications
Stringology