🤖 AI Summary
DNA data storage is inherently error-prone: synthesis, amplification, and sequencing introduce base-level insertion, deletion, and substitution (IDS) errors as well as the loss of entire sequences. This work proposes DNA-MGC+, a DNA storage codec designed for reliable, resource-efficient data retrieval under diverse operating conditions. Under explicit reliability constraints, it improves on existing codecs simultaneously in required sequencing depth, read cost, decoding time, storage density, and error-correction capability. The codec is evaluated both in silico and in vitro, including experiments with Illumina and Nanopore sequencing. In synthetic scenarios, DNA-MGC+ decodes reliably at IDS error rates of up to 24%; under electrochemical synthesis, it retrieves data reliably at sequencing depths below 3× with read costs below 3.5 bits/nt on both sequencing platforms.
📝 Abstract
The biochemical processes underlying DNA data storage, including synthesis, amplification, and sequencing, are inherently noisy. Consequently, base-level insertion, deletion, and substitution (IDS) errors, as well as sequence-level dropouts, occur and pose major challenges for reliable data retrieval. Here we introduce DNA-MGC+, a DNA storage codec designed to enable reliable and resource-efficient data retrieval under diverse operating conditions. We evaluate DNA-MGC+ across a wide range of in silico and in vitro settings, including experiments with both Illumina and Nanopore sequencing, and show that it consistently outperforms existing codecs. In particular, DNA-MGC+ achieves simultaneous gains in sequencing depth requirements, read cost, decoding time, storage density, and error-correction capability under explicit reliability constraints. Notable results include reliable decoding under IDS error rates of up to 24% in synthetic scenarios, and reliable retrieval at sequencing depths below 3× with read costs below 3.5 bits/nt under electrochemical synthesis for both Illumina and Nanopore sequencing.
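To make the noise model concrete, the following is a minimal sketch of a synthetic channel that combines base-level IDS errors with sequence-level dropouts at a chosen sequencing depth. It is not the simulator used to evaluate DNA-MGC+: the function names `ids_channel` and `sequence_pool`, the i.i.d. per-base error model, the uniform sampling of reads, and the even 8%/8%/8% split of a 24% total error rate are all assumptions made for illustration.

```python
import random

BASES = "ACGT"


def ids_channel(strand, p_ins, p_del, p_sub, rng):
    """Pass one strand through an i.i.d. IDS channel (illustrative model only).

    At each position exactly one event occurs: insertion of a random base
    before the original base, deletion, substitution, or error-free copy.
    """
    out = []
    for base in strand:
        r = rng.random()
        if r < p_ins:
            out.append(rng.choice(BASES))  # inserted base
            out.append(base)               # original base is kept
        elif r < p_ins + p_del:
            continue                       # base is deleted
        elif r < p_ins + p_del + p_sub:
            # substitute with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)               # error-free copy
    return "".join(out)


def sequence_pool(strands, depth, p_dropout, p_ins, p_del, p_sub, seed=0):
    """Return noisy reads of a strand pool at an average sequencing depth.

    Each strand is lost entirely with probability p_dropout (sequence-level
    dropout). Reads are then drawn uniformly from the surviving strands, and
    the number of reads is depth times the number of stored strands.
    """
    rng = random.Random(seed)
    surviving = [s for s in strands if rng.random() >= p_dropout]
    if not surviving:
        return []
    n_reads = round(depth * len(strands))
    return [
        ids_channel(rng.choice(surviving), p_ins, p_del, p_sub, rng)
        for _ in range(n_reads)
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    pool = ["".join(rng.choice(BASES) for _ in range(100)) for _ in range(10)]
    # Assumed split: 24% total IDS error rate, divided evenly across types.
    reads = sequence_pool(pool, depth=3, p_dropout=0.05,
                          p_ins=0.08, p_del=0.08, p_sub=0.08)
    print(len(reads), "reads; first read length:", len(reads[0]))
```

In this sketch, depth is the number of reads divided by the number of stored strands, so a depth of 3 over 10 strands yields 30 reads. A decoder would have to recover the original pool from those reads even though some strands may never be sampled and each read may differ in length from its source strand.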