🤖 AI Summary
This work addresses context-dependent deletion errors in nanopore sequencing, in particular deletions triggered by long runs of repeated symbols, by introducing a systematic model of a context-aware deletion channel that overcomes the limitations of traditional independent-deletion models. Using tools from coding theory, information theory, and extremal combinatorics, the study characterizes the role of the run-length threshold and proposes efficient contextual-deletion-correcting codes. For a threshold k = C log n and a constant number t of context-dependent deletions, the minimum redundancy is shown to lie between (1−C)t log n and 2(1−C)t log n, ignoring lower-order terms; for t = 1 and C > 1/2, an explicit, efficiently encodable and decodable construction attains essentially optimal redundancy. For constant k, the work establishes sharp bounds on the maximum achievable rate of the extremal contextual deletion channel, which significantly exceeds the rate attainable by general-purpose deletion-correcting codes.
📝 Abstract
The problem of designing codes for deletion-correction and synchronization has received renewed interest due to applications in DNA-based data storage systems that use nanopore sequencers as readout platforms. In almost all instances, deletions are assumed to be imposed independently of each other and of the sequence context. These assumptions are not valid in practice, since nanopore errors tend to occur within specific contexts. We study contextual nanopore deletion errors through the example setting of deterministic single deletions following (complete) runlengths of length at least $k$. The model critically depends on the runlength threshold $k$, and we examine two regimes for $k$: a) $k=C\log n$ for a constant $C\in(0,1)$; in this case, we study error-correcting codes that can protect from a constant number $t$ of contextual deletions, and show that the minimum redundancy (ignoring lower-order terms) is between $(1-C)t\log n$ and $2(1-C)t\log n$, meaning that it is a $(1-C)$-fraction of that of arbitrary $t$-deletion-correcting codes. To complement our non-constructive redundancy upper bound, we design efficiently encodable and decodable codes for any constant $t$. In particular, for $t=1$ and $C>1/2$ we construct efficient codes with redundancy that essentially matches our non-constructive upper bound; b) $k$ equal to a constant; in this case, we consider the extremal problem where the number of deletions is not bounded and a deletion is imposed after every run of length at least $k$, which we call the extremal contextual deletion channel. This combinatorial setting arises naturally by considering a probabilistic channel that introduces contextual deletions after each run of length at least $k$ with probability $p$ and taking the limit $p\to 1$. We obtain sharp bounds on the maximum achievable rate under the extremal contextual deletion channel for arbitrary constant $k$.
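To make the extremal contextual deletion channel concrete, the following is a minimal Python sketch of one plausible reading of the model: after every maximal run of length at least $k$ in the transmitted sequence, one deletion is imposed, assumed here to remove the symbol immediately following the run. The exact convention for which symbol is deleted, and the function name, are illustrative assumptions, not the paper's definition.

```python
def extremal_contextual_deletion(x: str, k: int) -> str:
    """Sketch of the extremal contextual deletion channel.

    Assumption (illustrative): after every maximal run of length >= k,
    the symbol immediately following the run is deleted. The paper's
    precise convention for which symbol is removed may differ.
    """
    out = []
    i, n = 0, len(x)
    while i < n:
        # Find the maximal run of identical symbols starting at position i.
        j = i
        while j < n and x[j] == x[i]:
            j += 1
        out.append(x[i:j])          # the run itself is transmitted intact
        if j - i >= k and j < n:    # run long enough and a symbol follows:
            j += 1                  # that next symbol is deleted
        i = j
    return "".join(out)
```

For example, with $k=3$ the input `00011` contains the run `000` of length 3, so the following `1` is deleted, while sequences whose runs are all shorter than $k$ pass through unchanged.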