Word Break on SLP-Compressed Texts

📅 2025-03-31
🤖 AI Summary
This paper studies the Word Break problem on strings compressed with straight-line programs (SLPs): given an SLP of size $g$ representing a string $w$ of uncompressed length $N$, and a dictionary $\mathcal{D}$ of $K$ words, determine whether $w$ can be partitioned into words from $\mathcal{D}$. We propose the first algorithm for this compressed setting. Using fast matrix multiplication, it builds an index in $\mathcal{O}(g \cdot m^{\omega} + M)$ time that answers Word Break for any substring of $w$ in $\mathcal{O}(m^2 \log N)$ time, where $m$ is the maximum word length in $\mathcal{D}$, $M$ is the total length of all words in $\mathcal{D}$, and $\omega$ is the matrix multiplication exponent. We also prove a conditional lower bound: unless the Combinatorial $k$-Clique Conjecture fails, no combinatorial algorithm solves Word Break on SLP-compressed strings in $\mathcal{O}(g \cdot m^{2-\epsilon} + M)$ time for any $\epsilon > 0$.
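To make the input model concrete, here is a minimal sketch of an SLP, assuming a simple encoding that is not taken from the paper: each rule is either a single character or a pair of earlier rule names. The rule names `X1`..`X5`, the dictionary layout, and the helper functions `slp_length` and `slp_expand` are all hypothetical. The example shows how $g = 5$ rules can represent a string of length $N = 8$, and how $N$ can be computed without decompressing.

```python
from typing import Dict, Tuple, Union

# Hypothetical encoding (an assumption for illustration): a rule is either
# a single character (terminal) or a pair of rule names (concatenation).
Rule = Union[str, Tuple[str, str]]


def slp_length(rules: Dict[str, Rule], start: str) -> int:
    """Length of the string derived from `start`, computed without expansion."""
    memo: Dict[str, int] = {}

    def length(name: str) -> int:
        if name not in memo:
            rule = rules[name]
            if isinstance(rule, str):
                memo[name] = 1  # terminal rule derives one character
            else:
                memo[name] = length(rule[0]) + length(rule[1])
        return memo[name]

    return length(start)


def slp_expand(rules: Dict[str, Rule], start: str) -> str:
    """Fully decompress the string (takes time proportional to N)."""
    rule = rules[start]
    if isinstance(rule, str):
        return rule
    return slp_expand(rules, rule[0]) + slp_expand(rules, rule[1])


# Five rules (g = 5) deriving "abababab" (N = 8).
rules: Dict[str, Rule] = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),  # ab
    "X4": ("X3", "X3"),  # abab
    "X5": ("X4", "X4"),  # abababab
}

print(slp_length(rules, "X5"))
print(slp_expand(rules, "X5"))
```

Each additional doubling rule doubles $N$ while adding one to $g$, which is why an algorithm whose running time depends on $g$ rather than $N$ can be far faster than one that decompresses first.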

📝 Abstract
Word Break is a prototypical factorization problem in string processing: Given a word $w$ of length $N$ and a dictionary $\mathcal{D} = \{d_1, d_2, \ldots, d_{K}\}$ of $K$ strings, determine whether we can partition $w$ into words from $\mathcal{D}$. We propose the first algorithm that solves the Word Break problem over the SLP-compressed input text $w$. Specifically, we show that, given the string $w$ represented using an SLP of size $g$, we can solve the Word Break problem in $\mathcal{O}(g \cdot m^{\omega} + M)$ time, where $m = \max_{i=1}^{K} |d_i|$, $M = \sum_{i=1}^{K} |d_i|$, and $\omega \geq 2$ is the matrix multiplication exponent. We obtain our algorithm as a simple corollary of a more general result: We show that in $\mathcal{O}(g \cdot m^{\omega} + M)$ time, we can index the input text $w$ so that solving the Word Break problem for any of its substrings takes $\mathcal{O}(m^2 \log N)$ time (independent of the substring length). Our second contribution is a lower bound: We prove that, unless the Combinatorial $k$-Clique Conjecture fails, there is no combinatorial algorithm for Word Break on SLP-compressed strings running in $\mathcal{O}(g \cdot m^{2-\epsilon} + M)$ time for any $\epsilon>0$.
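For reference, the problem statement above can be illustrated with the classic dynamic program on an uncompressed string. This is a baseline sketch only, not the paper's SLP-based algorithm: it reads all $N$ characters of $w$, which is exactly what the compressed algorithm avoids. The function name `word_break` and the hash-set dictionary are assumptions made for this example.

```python
from typing import Iterable


def word_break(w: str, dictionary: Iterable[str]) -> bool:
    """Return True if w can be partitioned into words from the dictionary.

    Textbook DP on the uncompressed string (a baseline for illustration,
    not the SLP-based method): reachable[i] is True when the prefix w[:i]
    can be segmented into dictionary words.
    """
    words = {d for d in dictionary if d}  # ignore empty words
    if not words:
        return w == ""
    m = max(len(d) for d in words)  # maximum word length, as in the abstract
    n = len(w)

    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix is trivially segmentable
    for end in range(1, n + 1):
        # The last word of a segmentation of w[:end] has length at most m,
        # so only the m most recent start positions need to be checked.
        for start in range(max(0, end - m), end):
            if reachable[start] and w[start:end] in words:
                reachable[end] = True
                break
    return reachable[n]


print(word_break("abababab", ["aba", "b"]))  # segmentation: aba|b|aba|b
print(word_break("abababa", ["ab"]))         # odd length, cannot be segmented
```

The inner loop examines at most $m$ candidate start positions per end position, so the table is filled with $O(N \cdot m)$ substring lookups. The dependence on $N$ is what makes this baseline unsuitable when $w$ is given as an SLP with $g \ll N$.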
Problem

Research questions and friction points this paper is trying to address.

Solves Word Break on SLP-compressed texts efficiently
Indexes text for fast substring Word Break queries
Proves a conditional lower bound for combinatorial algorithms on SLP-compressed input
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLP-compressed text algorithm for Word Break
Indexing enables efficient substring Word Break
Lower bound proof for combinatorial algorithms
Rajat De
Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
Dominik Kempa
Assistant Professor, Stony Brook University
Algorithms, Data Structures, String Algorithms, Data Compression