🤖 AI Summary
This paper studies the Word Break problem on strings compressed via Straight-Line Programs (SLPs): given an SLP of size $g$ representing a string $w$ of uncompressed length $N$, and a dictionary $D$ of size $K$, determine whether $w$ can be fully segmented into words from $D$. We propose the first efficient SLP-based algorithmic framework, integrating matrix multiplication to accelerate dynamic programming, constructing a compressed index supporting fast substring queries, and conducting combinatorial complexity lower-bound analysis. Our algorithm achieves $O(g cdot m^omega + M)$ preprocessing time and $O(m^2 log N)$ time per substring query, where $m$ is the maximum word length in $D$ and $M$ is the index space. Furthermore, under the $k$-Clique conjecture, we establish a tight lower bound of $Omega(g cdot m^{2-varepsilon} + M)$, thereby identifying the fundamental computational bottleneck of Word Break in the compressed setting.
📝 Abstract
Word Break is a prototypical factorization problem in string processing: Given a word $w$ of length $N$ and a dictionary $mathcal{D} = {d_1, d_2, ldots, d_{K}}$ of $K$ strings, determine whether we can partition $w$ into words from $mathcal{D}$. We propose the first algorithm that solves the Word Break problem over the SLP-compressed input text $w$. Specifically, we show that, given the string $w$ represented using an SLP of size $g$, we can solve the Word Break problem in $mathcal{O}(g cdot m^{omega} + M)$ time, where $m = max_{i=1}^{K} |d_i|$, $M = sum_{i=1}^{K} |d_i|$, and $omega geq 2$ is the matrix multiplication exponent. We obtain our algorithm as a simple corollary of a more general result: We show that in $mathcal{O}(g cdot m^{omega} + M)$ time, we can index the input text $w$ so that solving the Word Break problem for any of its substrings takes $mathcal{O}(m^2 log N)$ time (independent of the substring length). Our second contribution is a lower bound: We prove that, unless the Combinatorial $k$-Clique Conjecture fails, there is no combinatorial algorithm for Word Break on SLP-compressed strings running in $mathcal{O}(g cdot m^{2-epsilon} + M)$ time for any $epsilon>0$.