🤖 AI Summary
This work studies codes for in-vivo DNA storage that correct duplication errors, in which a reversed copy (palindromic duplication) or a reversed and complemented copy (reverse-complement duplication) of a length-$k$ substring is inserted adjacent to the original. It first gives an explicit code with a single redundant symbol that corrects an arbitrary number of such duplications, provided that they are disjoint and each has length $k \ge 3\lceil \log_q n \rceil$. It then derives a Gilbert–Varshamov bound for codes that correct one duplication of arbitrary length, showing that the optimal redundancy is at most $2 \log_q n + \log_q \log_q n + O(1)$. Finally, for alphabet size $q \geq 4$, it presents two explicit constructions that correct $t$ length-one reverse-complement duplications, with redundancies of $2t \log_q n + O(\log_q \log_q n)$ and $(2t - 1) \log_q n + O(\log_q \log_q n)$, respectively. The first has encoding complexity $O(n)$ and decoding complexity $O\big(n(\log_2 n)^4\big)$; the second lowers the leading redundancy term by $\log_q n$ at the cost of encoding and decoding complexities of $O\big(n \cdot \mathrm{poly}(\log_2 n)\big)$.
📝 Abstract
Motivated by applications in in-vivo DNA storage, we study codes for correcting duplications. A reverse-complement duplication of length $k$ is the insertion of the reversed and complemented copy of a substring of length $k$ adjacent to its original position, while a palindromic duplication only inserts the reversed copy without complementation. We first construct an explicit code with a single redundant symbol capable of correcting an arbitrary number of reverse-complement duplications (respectively, palindromic duplications), provided that all duplications have length $k \ge 3\lceil \log_q n \rceil$ and are disjoint. Next, we derive a Gilbert-Varshamov bound for codes that can correct a reverse-complement duplication (respectively, palindromic duplication) of arbitrary length, showing that the optimal redundancy is upper bounded by $2\log_q n + \log_q\log_q n + O(1)$. Finally, for $q \ge 4$, we present two explicit constructions of codes that can correct $t$ length-one reverse-complement duplications. The first construction achieves a redundancy of $2t\log_q n + O(\log_q\log_q n)$ with encoding complexity $O(n)$ and decoding complexity $O\big(n(\log_2 n)^4\big)$. The second construction achieves an improved redundancy of $(2t-1)\log_q n + O(\log_q\log_q n)$, but with encoding and decoding complexities of $O\big(n \cdot \mathrm{poly}(\log_2 n)\big)$.
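To make the two error types concrete, the following minimal sketch applies one duplication to a string over the DNA alphabet. It rests on assumptions not fixed by the abstract: the copy is inserted immediately after the original substring (the abstract only says "adjacent"), the complement map is the standard Watson-Crick pairing A↔T, C↔G, and the function names and the 0-based start index `i` are hypothetical choices for illustration.

```python
# Assumed complement map: standard Watson-Crick pairing over {A, C, G, T}.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(s: str) -> str:
    """Reverse s and complement every symbol."""
    return "".join(COMPLEMENT[c] for c in reversed(s))


def _check_window(x: str, i: int, k: int) -> None:
    """Validate that x[i:i+k] is a non-empty substring of x."""
    if k < 1 or i < 0 or i + k > len(x):
        raise ValueError("substring x[i:i+k] must lie inside x and have k >= 1")


def palindromic_duplication(x: str, i: int, k: int) -> str:
    """Insert the reversed copy of x[i:i+k] right after it (assumed placement)."""
    _check_window(x, i, k)
    block = x[i:i + k]
    return x[:i + k] + block[::-1] + x[i + k:]


def reverse_complement_duplication(x: str, i: int, k: int) -> str:
    """Insert the reversed, complemented copy of x[i:i+k] right after it."""
    _check_window(x, i, k)
    block = x[i:i + k]
    return x[:i + k] + reverse_complement(block) + x[i + k:]


if __name__ == "__main__":
    x = "ACGTTG"
    print(palindromic_duplication(x, 1, 3))         # duplicates "CGT" as "TGC"
    print(reverse_complement_duplication(x, 1, 3))  # duplicates "CGT" as "ACG"
    print(reverse_complement_duplication("ACG", 0, 1))  # length-one case
```

Either operation lengthens the string by exactly $k$ symbols. The length-one case in the last call corresponds to the error model of the two explicit constructions, where a single symbol is followed by an inserted copy of its complement.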