DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS Channel and DNA Storage

📅 2023-12-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Problem: DNA storage must correct multiple insertion, deletion, and substitution (IDS) errors, but existing IDS-correcting codes have poor code rates at short codeword lengths, which hurts storage density. Method: The paper proposes DoDo-Code, an IDS-correcting code built on a deep embedding of the Levenshtein distance: (i) sequences are mapped to vectors so that a conventional distance between embeddings represents the Levenshtein distance between the sequences; (ii) codeword search and segment correction are carried out in the embedding space, sidestepping the combinatorial difficulty of constructing such codes directly; and (iii) a preliminary algorithm extends decoding to long sequences. Results: According to the abstract, the code exhibits characteristics of being 'optimal' in terms of redundancy and significantly outperforms the Varshamov-Tenengolts code family in code rate. The authors describe it as the first IDS-correcting code designed with deep learning methods.
📝 Abstract
Recently, DNA storage has emerged as a promising data storage solution, offering significant advantages in storage density, maintenance cost efficiency, and parallel replication capability. Mathematically, the DNA storage pipeline can be viewed as an insertion, deletion, and substitution (IDS) channel. Because of the mathematical terra incognita of the Levenshtein distance, designing an IDS-correcting code is still a challenge. In this paper, we propose an innovative approach that utilizes deep Levenshtein distance embedding to bypass these mathematical challenges. By representing the Levenshtein distance between two sequences as a conventional distance between their corresponding embedding vectors, the inherent structural property of Levenshtein distance is revealed in the friendly embedding space. Leveraging this embedding space, we introduce the DoDo-Code, an IDS-correcting code that incorporates deep embedding of Levenshtein distance, deep embedding-based codeword search, and deep embedding-based segment correcting. To address the requirements of DNA storage, we also present a preliminary algorithm for long sequence decoding. As far as we know, the DoDo-Code is the first IDS-correcting code designed using plausible deep learning methodologies, potentially paving the way for a new direction in error-correcting code research. It is also the first IDS code that exhibits characteristics of being 'optimal' in terms of redundancy, significantly outperforming the mainstream IDS-correcting codes of the Varshamov-Tenengolts code family in code rate.
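For readers unfamiliar with the metric the abstract is built around, the sketch below computes the Levenshtein distance between two sequences with the textbook dynamic program. It is an illustration of the distance itself, not of the paper's learned embedding; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("ACGT", "AGT"))   # one deletion
print(levenshtein("ACGT", "ACCT"))  # one substitution
```

The embedding approach described in the abstract replaces this quadratic-time pairwise computation with a conventional distance between learned vectors.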
Problem

Research questions and friction points this paper is trying to address.

Designing efficient codes for correcting multiple IDS channel errors
Improving poor code rates in short-length IDS-correcting codewords
Bypassing mathematical challenges in Levenshtein distance-based code design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Levenshtein distance embedding for code design
Codeword search and segment correcting in embedding space
High code rate short-length IDS-correcting codes
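To make "codeword search" and "code rate" concrete, here is a minimal brute-force sketch, not the paper's embedding-based method. It greedily collects DNA words whose pairwise Levenshtein distance is at least d, decodes by nearest codeword, and reports the usual rate log_q|C| / n. The names `greedy_code`, `decode`, and `code_rate`, and the parameters n = 4 and d = 3, are assumptions made for this example; it relies on the standard metric fact that a minimum distance of at least 3 allows unique correction of one edit.

```python
import math
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def greedy_code(n: int, d: int, alphabet: str = "ACGT") -> list[str]:
    """Scan all length-n words in lexicographic order and keep each one
    that is at Levenshtein distance >= d from every word kept so far."""
    code: list[str] = []
    for symbols in product(alphabet, repeat=n):
        word = "".join(symbols)
        if all(levenshtein(word, c) >= d for c in code):
            code.append(word)
    return code


def decode(received: str, code: list[str]) -> str:
    """Nearest-codeword decoding under Levenshtein distance."""
    return min(code, key=lambda c: levenshtein(received, c))


def code_rate(code: list[str], n: int, q: int = 4) -> float:
    """Information symbols per transmitted symbol: log_q(|C|) / n."""
    return math.log(len(code), q) / n


# Illustrative parameters only: length-4 words, minimum distance 3,
# which is enough to correct a single insertion, deletion, or substitution.
code = greedy_code(n=4, d=3)
print(len(code), code_rate(code, 4))
print(decode("AAA", code))  # "AAAA" with one symbol deleted
```

Exhaustive search like this scales as 4^n and quickly becomes infeasible, which is the kind of bottleneck the embedding-space search listed above is meant to avoid.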
Alan J.X. Guo
Center for Applied Mathematics, Tianjin Univ.
Combinatorics, Deep Learning
Sihan Sun
Center for Applied Mathematics, Tianjin University, China
Xiang Wei
Center for Applied Mathematics, Tianjin University, China
Mengyi Wei
Ph.D. Candidate, Technical University of Munich
AI Ethics, Data Visualization, Human-Computer Interaction
Xin Chen
Center for Applied Mathematics, Tianjin University, China